3rd scanpy session - Normalisation, Batch Correction, Highly Variable Genes, Embeddings
Normalization and Batch Effect Correction in Transcriptomics
Understanding Normalization Techniques
- The session begins with a recap of quality control and gene filtering, transitioning into normalization methods for transcriptomic data.
- Short sequencing reads are typically mapped to the genome. Because mature mRNA lacks introns, reads that span exon-exon junctions require gapped (spliced) alignment.
- There are two main protocol families: full-length and 3'-end (3-prime). Full-length protocols capture entire transcripts, while 3'-end protocols capture only the end adjacent to the poly(A) tail.
- In full-length protocols, gene length affects the number of captured molecules: shorter genes yield fewer reads. 3'-end protocols, by contrast, capture one fragment per molecule regardless of gene length.
- Two common normalization methods are TPM (transcripts per million, which additionally divides by gene length) for full-length data and CPM (counts per million) for 3'-end data.
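The CPM step above can be sketched in a few lines of numpy. In scanpy the same operation is `sc.pp.normalize_total(adata, target_sum=1e6)` followed by `sc.pp.log1p(adata)`; the toy matrix below is of course made up for illustration.

```python
import numpy as np

def cpm_normalize(counts):
    """Scale each cell (row) so its counts sum to one million (CPM).

    counts: (n_cells, n_genes) raw count matrix.
    """
    lib_size = counts.sum(axis=1, keepdims=True)
    return counts / lib_size * 1e6

# Toy matrix: 3 cells x 4 genes with very different library sizes.
counts = np.array([[10, 0, 5, 5],
                   [100, 50, 25, 25],
                   [1, 1, 1, 1]], dtype=float)

cpm = cpm_normalize(counts)
log_cpm = np.log1p(cpm)  # log-transform, as sc.pp.log1p would do
```

After CPM, every cell has the same total, so differences in sequencing depth no longer dominate comparisons between cells.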
Cell Type Specific Normalization
- A V-shaped pattern can emerge when plotting library size against the number of detected genes, indicating cell populations with very different expression profiles.
- scran is introduced as a method that first clusters cells and then computes size factors within those populations, improving normalization accuracy for heterogeneous data.
- If no V-shape is present, plain CPM normalization suffices without significant issues.
- Because scran is an R package, it must be called from Python notebooks via rpy2 for effective analysis.
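To illustrate the idea of cluster-aware size factors, here is a deliberately simplified numpy sketch: within each cluster, a cell's factor is its library size relative to the cluster mean. This is NOT the actual scran deconvolution algorithm (which pools cells and solves a linear system); in practice you would call `scran::computeSumFactors` from R via rpy2.

```python
import numpy as np

def cluster_size_factors(counts, clusters):
    """Simplified cluster-aware size factors (not the real scran algorithm):
    within each cluster, a cell's factor is its library size divided by the
    cluster's mean library size, so factors average ~1 per cluster."""
    lib = counts.sum(axis=1).astype(float)
    factors = np.empty_like(lib)
    for c in np.unique(clusters):
        mask = clusters == c
        factors[mask] = lib[mask] / lib[mask].mean()
    return factors

rng = np.random.default_rng(0)
counts = rng.poisson(5, size=(6, 20))        # toy counts: 6 cells x 20 genes
clusters = np.array([0, 0, 0, 1, 1, 1])      # two hypothetical cell types
sf = cluster_size_factors(counts, clusters)
normalized = counts / sf[:, None]            # divide each cell by its factor
```

Computing factors per cluster prevents a highly expressing cell type from distorting the size factors of a lowly expressing one, which is exactly the failure mode the V-shape diagnoses.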
Exploring Batch Effect Correction
- The discussion shifts to batch effect correction, illustrated using a dataset from gut epithelium samples showing distinct banded structures indicative of batch effects.
- Banded structures suggest strong biases where cells from the same batch cluster together; correcting these effects is crucial for accurate analysis.
- Post-correction should ideally result in overlapping batches without losing biological signal integrity; over-correction leads to loss of meaningful clustering based on cell types.
- Various methods exist for addressing batch effects—over 20 identified—each producing different outputs depending on their approach and effectiveness in preserving biological signals.
What Are the Different Outputs of Batch Effect Correction?
Types of Outputs
- There are three main output types from batch effect correction: a corrected feature (gene expression) matrix, a corrected low-dimensional embedding, or a graph object that changes cell neighborhoods without providing per-gene values.
- An embedding allows further analysis while the raw counts are retained, but recovering corrected gene-level output from a graph object is challenging.
Benchmarking Study Insights
- A benchmarking study led by Malte Luecken shows that effective batch effect correction should ideally preserve biological signals, with scores approaching one indicating better performance.
- Most methods tested fall into the top right score regime, indicating good performance; however, none fully reach optimal levels as they tend to lose some biological signal during correction.
Method Performance and Selection
- The study evaluated various methods across different datasets, including those with nearly 1 million cells. BBKNN performed well, but because its output is a graph, downstream use is less straightforward.
- Some methods, such as Seurat and trVAE, failed to produce results on the largest datasets due to their high computational demands.
How to Select Highly Variable Genes?
Gene Selection Techniques
- Selecting highly variable genes improves batch effect correction performance. This can be done through normalization of dispersion and binning based on mean expression.
- The Seurat approach involves sorting genes by mean expression, grouping them into bins, and selecting genes whose dispersion is high relative to their bin.
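The binned-dispersion idea can be sketched as follows. This is a simplified illustration of what scanpy's `sc.pp.highly_variable_genes(adata, flavor='seurat')` does (the real implementation differs in detail); bin count and the toy data are arbitrary choices.

```python
import numpy as np

def normalized_dispersion(expr, n_bins=5):
    """Bin genes by mean expression, then z-score each gene's dispersion
    within its bin, so variability is judged relative to similarly
    expressed genes."""
    mean = expr.mean(axis=0)
    var = expr.var(axis=0)
    dispersion = var / np.maximum(mean, 1e-12)
    # Assign each gene to an equal-width bin of mean expression.
    edges = np.linspace(mean.min(), mean.max(), n_bins + 1)[1:-1]
    bins = np.digitize(mean, edges)
    norm_disp = np.empty_like(dispersion)
    for b in np.unique(bins):
        m = bins == b
        mu, sd = dispersion[m].mean(), dispersion[m].std()
        norm_disp[m] = (dispersion[m] - mu) / (sd if sd > 0 else 1.0)
    return norm_disp

rng = np.random.default_rng(1)
expr = rng.gamma(2.0, 1.0, size=(100, 50))  # toy expression: 100 cells x 50 genes
nd = normalized_dispersion(expr)
top10 = np.argsort(nd)[-10:]  # indices of the 10 most variable genes
```

Normalizing dispersion within bins matters because raw dispersion correlates with mean expression; without the binning, the selection would simply favor highly expressed genes.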
Comparison of Methods
- The Cell Ranger method returns a fixed number of highly variable genes; in some studies it has performed better than the Seurat approach.
- Users can choose between these methods based on their specific needs and outcomes observed in their analyses.
Understanding Embeddings and Data Visualization Techniques
Overview of Embedding Methods
- PCA (Principal Component Analysis) is a linear method that rotates data along the strongest variance axes, often used for noise reduction before further analysis.
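The rotation described above can be implemented in a few lines via the singular value decomposition; scanpy's `sc.pp.pca` does essentially this (plus scaling options). A minimal numpy sketch, with made-up correlated data:

```python
import numpy as np

def pca(X, n_components=2):
    """PCA by SVD: center the data, then project onto the directions
    (right singular vectors) of strongest variance."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]             # rotation axes
    explained_var = S**2 / (X.shape[0] - 1)    # variance along each axis
    return Xc @ components.T, explained_var[:n_components]

rng = np.random.default_rng(2)
# Toy data whose variance is concentrated along one direction.
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0, 0],
                                          [0.5, 1.0, 0],
                                          [0, 0, 0.2]])
scores, var = pca(X, n_components=2)
```

Keeping only the first few components is the noise-reduction step mentioned above: the discarded low-variance directions are dominated by noise.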
Alternative Visualization Techniques
- t-SNE (t-distributed Stochastic Neighbor Embedding) preserves neighborhood relations from high-dimensional space in a low-dimensional map, using a t-distribution in the low-dimensional space, which yields tightly clustered representations.
Limitations of Non-linear Methods
Comparing UMAP and t-SNE for Data Visualization
Advantages of UMAP over t-SNE
- UMAP combines advantages from various methods, allowing for faster computation compared to t-SNE, which can take minutes while UMAP takes seconds.
- It better resolves the global structure of data, making it more effective in visualizing complex datasets.
Embedding Techniques
- UMAP computes neighborhoods and embeds them into a low-dimensional space, similar to other dimensionality reduction techniques.
- The ForceAtlas method uses principles from molecular dynamics: neighboring cells attract each other while all cells repel, producing a 2D embedding in which similar cells cluster together.
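The attract/repel dynamic can be sketched with a toy simulation. This is a much-simplified caricature of the ForceAtlas2 algorithm (scanpy exposes the real one via `sc.tl.draw_graph`); step size, iteration count, and the toy graph are arbitrary:

```python
import numpy as np

def force_layout(adjacency, n_iter=100, seed=0):
    """Toy force-directed 2D layout: connected cells attract,
    all pairs of cells repel (inverse-distance force)."""
    rng = np.random.default_rng(seed)
    n = adjacency.shape[0]
    pos = rng.normal(size=(n, 2))
    for _ in range(n_iter):
        delta = pos[:, None, :] - pos[None, :, :]          # pairwise offsets
        dist2 = (delta**2).sum(-1) + 1e-9
        repulse = (delta / dist2[..., None]).sum(axis=1)   # push everyone apart
        attract = -(adjacency[..., None] * delta).sum(axis=1)  # pull neighbors in
        pos += 0.01 * (repulse + 0.5 * attract)
    return pos

# Two "cell types": cells 0-2 fully connected, cells 3-5 fully connected.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1
pos = force_layout(A)
```

The simulation settles where attraction and repulsion balance, which is why connected (similar) cells end up near each other and disconnected groups drift apart.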
Understanding PCA Scree Plots
Analyzing Variance in PCA
- A scree plot ranks principal components by their variance contribution; the first few components show a steep decline in variance.
- The long tail of slowly decreasing variance indicates noise within the PCA results, suggesting that only the initial components should be focused on for analysis.
Elbow Method Application
- The elbow method is used to distinguish between informative and noisy principal components by identifying a point where the slope changes significantly (the "elbow").
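One simple way to locate the elbow programmatically is to look for the point where the slope of the scree curve changes most abruptly, i.e. the largest second difference. This is just one heuristic among several (in practice the elbow is often chosen by eye), and the toy variances below are invented:

```python
import numpy as np

def elbow_index(explained_variance):
    """Crude elbow heuristic: the component after which the scree curve
    flattens most, found via the largest second difference."""
    v = np.asarray(explained_variance, dtype=float)
    # The second difference peaks where the slope changes most abruptly.
    return int(np.argmax(np.diff(v, 2))) + 1

# Toy scree: steep decline over the first components, then a flat noise tail.
variances = np.array([10.0, 6.0, 3.5, 1.0, 0.9, 0.85, 0.8, 0.78])
k = elbow_index(variances)  # -> 3: keep the first three components
```

Components before the elbow carry signal; the slowly decaying tail after it is treated as noise and discarded before computing neighborhoods or embeddings.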
Preparing Data for Analysis
Steps for Effective Data Normalization