4th scanpy session - Clustering and cell type annotation

Understanding Data Normalization and Clustering Techniques

Introduction to Data Normalization and Visualization

  • The session begins with a recap of data normalization, batch effect correction, and various visualization techniques. The aim is to enhance understanding of the data structure for further analysis.

Clustering Cell Types Using Louvain Clustering

  • Introduction to Louvain clustering, a community detection algorithm that groups cells by similarity. It operates on a graph and compares the connections within groups to the connections between groups.
  • Explanation of community detection using graph theory, likening it to social networks like Facebook where nodes represent individuals and edges represent friendships.
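
The "within groups versus between groups" comparison can be made concrete with modularity, the score that Louvain clustering tries to maximise. The following is a minimal sketch on a hypothetical toy graph (two triangles joined by one edge); the graph, the labels, and the function name are illustrative assumptions and are not taken from the session.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    within = defaultdict(int)      # edges with both ends in the same community
    degree_sum = defaultdict(int)  # summed node degrees per community
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            within[community[u]] += 1
    # Observed within-community edge fraction minus the fraction expected by chance
    return sum(within[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Hypothetical toy graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

by_triangle = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
alternating = {0: "a", 1: "b", 2: "a", 3: "b", 4: "a", 5: "b"}

q_by_triangle = modularity(edges, by_triangle)
q_alternating = modularity(edges, alternating)
print(q_by_triangle, q_alternating)
```

The partition that follows the triangles keeps most edges inside groups and scores higher than the alternating partition, which cuts almost every edge.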

Resolution Parameter in Clustering

  • Discussion of how clustering does not require a predefined number of groups; instead, a resolution parameter determines how many cuts are made between clusters, and hence how many clusters result.
  • Demonstration with the PBMC dataset showing how different resolutions affect cluster formation: lower resolution results in fewer clusters, while higher resolution yields more.
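
One common way to express the resolution parameter is as a weight (called `gamma` here) on the chance-expectation term of modularity. The sketch below assumes that formulation and picks the best of three hand-written candidate partitions of a toy graph; a real Louvain run searches over partitions itself, and the graph and function names are hypothetical.

```python
def generalized_modularity(edges, community, gamma):
    """Modularity with a resolution weight gamma on the expected-edges term."""
    m = len(edges)
    within, degree_sum = {}, {}
    for u, v in edges:
        for node in (u, v):
            c = community[node]
            degree_sum[c] = degree_sum.get(c, 0) + 1
        if community[u] == community[v]:
            within[community[u]] = within.get(community[u], 0) + 1
    return sum(
        within.get(c, 0) / m - gamma * (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Hypothetical toy graph: two triangles joined by a single edge
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

# Hand-picked candidate partitions (an assumption made for illustration)
candidates = [
    {n: 0 for n in range(6)},       # everything in one cluster
    {n: n // 3 for n in range(6)},  # one cluster per triangle
    {n: n for n in range(6)},       # every node on its own
]


def preferred_cluster_count(gamma):
    """Number of clusters in the candidate partition with the highest score."""
    best = max(candidates, key=lambda p: generalized_modularity(edges, p, gamma))
    return len(set(best.values()))


for gamma in (0.2, 1.0, 3.0):
    print(gamma, preferred_cluster_count(gamma))
```

In scanpy the corresponding knob is the `resolution` argument of `sc.tl.louvain`; the toy scoring above only illustrates why larger values favour finer partitions.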

Visualizing Gene Expression in Clusters

  • Overview of methods for visualizing gene expression across clusters, highlighting the dot plot as an effective tool where dot color indicates gene expression level and size reflects the fraction of expressing cells.
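
The two quantities encoded in a dot plot can be computed by hand for a single gene. The sketch below uses made-up expression values and cluster labels, and assumes that the colour corresponds to the mean over all cells in the cluster and that a cell counts as "expressing" when its value is above zero.

```python
from collections import defaultdict


def dotplot_stats(expression, clusters):
    """Per cluster: fraction of expressing cells (dot size) and mean expression (dot colour)."""
    grouped = defaultdict(list)
    for value, cluster in zip(expression, clusters):
        grouped[cluster].append(value)
    stats = {}
    for cluster, values in grouped.items():
        n_expressing = sum(1 for v in values if v > 0)
        stats[cluster] = {
            "fraction_expressing": n_expressing / len(values),
            "mean_expression": sum(values) / len(values),
        }
    return stats


# Made-up values for one gene across eight cells in two clusters
expression = [2.0, 3.0, 0.0, 3.0, 0.0, 0.0, 1.0, 0.0]
clusters = ["B", "B", "B", "B", "T", "T", "T", "T"]

stats = dotplot_stats(expression, clusters)
for cluster, values in stats.items():
    print(cluster, values)
```

A gene that is a good marker for a cluster shows both a large fraction and a high mean in that cluster and low values elsewhere. In scanpy the plot itself is produced by `sc.pl.dotplot`.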

Marker Genes Analysis

  • Observations from marker gene plots reveal that some markers are highly expressed in specific clusters while others are dispersed across multiple cell types, complicating cluster identification.
  • Comparison between weak markers (e.g., those for CD8 T cells, which show low expression levels) and strong markers (e.g., those for B cells, which are highly specific), emphasizing the importance of marker specificity for accurate cell type annotation.

Advanced Visualization Techniques

  • Mention of violin plots as another method for visualizing marker distribution during analysis. Future sessions will cover additional visualization tools such as heat maps and matrix plots.

Manual vs Automated Cell Type Annotation

  • Acknowledgment that cell type annotation remains largely manual despite available tools like Garnett, which classify cells based on defined marker gene sets. Caution is advised against over-reliance on automated systems without thorough knowledge of the data context.
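
The idea of labelling a cluster from predefined marker gene sets can be sketched with a simple score. This is not how Garnett works internally; it is a much simpler stand-in that averages the cluster's mean expression over each marker set and picks the highest score. The marker sets and expression values are illustrative assumptions.

```python
def annotate_cluster(mean_expression, marker_sets):
    """Score each cell type by the average mean expression of its marker genes."""
    scores = {
        cell_type: sum(mean_expression.get(gene, 0.0) for gene in genes) / len(genes)
        for cell_type, genes in marker_sets.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Illustrative marker sets (assumed for this example, not taken from the session)
marker_sets = {
    "B cell": ["CD79A", "MS4A1"],
    "CD8 T cell": ["CD8A", "CD8B"],
}

# Made-up mean expression per gene for one cluster
cluster_means = {"CD79A": 2.5, "MS4A1": 1.5, "CD8A": 0.1, "CD8B": 0.0}

label, scores = annotate_cluster(cluster_means, marker_sets)
print(label, scores)
```

A score like this always returns some label, even when no marker set fits well, which is one reason the automated result still needs to be checked against the plots and the biology.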

Pseudotime and Trajectory Inference

Understanding Data Structure and Manifold in High Dimensional Space

Diffusion Pseudotime Process

  • The discussion begins with the aim of uncovering the structure of the data within a high-dimensional space, focusing on drawing paths through the data using the diffusion pseudotime process developed by Laleh Haghverdi.
  • This process is based on random walks, allowing researchers to start from a specific cell and navigate through others to reach an endpoint, effectively creating a metric for ordering cells along pseudotime.
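
How random walks from a root cell can induce an ordering is sketched below. This is a deliberately simplified stand-in, not the diffusion pseudotime distance itself: it simulates walks on a hypothetical five-cell chain and uses the mean step at which each cell is first reached as a crude pseudotime. The graph, function name, and parameters are assumptions.

```python
import random


def mean_first_arrival(neighbors, root, n_walks=200, seed=0):
    """Average step at which random walks started at `root` first reach each cell."""
    rng = random.Random(seed)
    n_cells = len(neighbors)
    totals = [0.0] * n_cells
    for _ in range(n_walks):
        current, step = root, 0
        first_seen = {root: 0}
        # Walk until every cell has been visited at least once
        while len(first_seen) < n_cells:
            current = rng.choice(neighbors[current])
            step += 1
            first_seen.setdefault(current, step)
        for cell, arrival in first_seen.items():
            totals[cell] += arrival
    return [total / n_walks for total in totals]


# Hypothetical neighbour graph: five cells along a single differentiation path
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

pseudotime = mean_first_arrival(chain, root=0)
order = sorted(chain, key=lambda cell: pseudotime[cell])
print(order)
```

On a chain, every walk must pass through the cells in sequence, so the ordering recovers the path from the root to the far end. In scanpy, diffusion pseudotime itself is available as `sc.tl.dpt` once a root cell has been set.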

Gene Dynamics and Bifurcations

  • The concept of bifurcations is introduced, where cells can diverge into different pathways (left or right), enabling analysis of gene dynamics across these branches.
  • Alex's work aims to generalize this approach by connecting various settings such as discrete cell types and continuous phenotypes, which are often seen in developmental processes.

Topological Representations

  • Different topologies are discussed: discrete topology for distinct cell types, line topology for gradual changes between types, tree topology for multiple bifurcations, circle topology related to the cell cycle, and complex topology reflecting spatial positions.
  • Alex successfully unified these diverse topologies using a single-cell graph representation that illustrates connectivity among clusters based on nearest neighbors.
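
The step from a cell-level nearest-neighbour graph to a cluster-level graph can be sketched by counting the neighbour edges that cross cluster boundaries. The actual connectivity measure in the method is more involved than raw counts; the counts, the cell graph, and the cluster names below are assumptions chosen for illustration.

```python
from collections import Counter


def cluster_connectivity(knn_edges, cluster_of):
    """Count nearest-neighbour edges that link cells from two different clusters."""
    counts = Counter()
    for a, b in knn_edges:
        ca, cb = cluster_of[a], cluster_of[b]
        if ca != cb:
            # Sort the pair so (x, y) and (y, x) are counted together
            counts[tuple(sorted((ca, cb)))] += 1
    return counts


# Hypothetical cells and cluster labels
cluster_of = {0: "stem", 1: "stem", 2: "neuron", 3: "neuron", 4: "muscle", 5: "muscle"}

# Hypothetical nearest-neighbour edges between cells
knn_edges = [(0, 1), (2, 3), (4, 5), (0, 2), (1, 2), (1, 4)]

connectivity = cluster_connectivity(knn_edges, cluster_of)
for pair, n_edges in sorted(connectivity.items()):
    print(pair, n_edges)
```

Here the stem cluster is linked to both other clusters while those two share no edges, which gives a small tree rooted at the stem cells. In scanpy the cluster-level graph is computed with `sc.tl.paga` and drawn with `sc.pl.paga`.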

Application to Organismal Tissues

  • The application of this graph-based approach is demonstrated using planaria (flatworms), known for their regenerative capabilities due to a pool of stem cells that differentiate into various tissues.
  • The stability of the PAGA model under subclustering is shown by the consistent connections between stem cells and major tissue clusters.

Visualization and Benchmarking

  • A multitude of differentiation paths complicates assigning starting or ending points within the dataset; however, the graph approach proves effective in visualizing these complexities.

Video description

In the fourth session of the scanpy tutorial, we describe how to annotate a data set based on louvain clustering. We further introduce different plotting options to visualise gene expression patterns. This is the recording of the scanpy tutorial held at Helmholtz Munich in July 2020.