1st scanpy session - overview and experimental considerations
Introduction to Single Cell RNA Sequencing
Overview of the Workshop
- The workshop is led by Van Gutner, a postdoc in Fabian Theis's lab, who focuses on single-cell RNA sequencing data analysis.
- The session includes setting up the environment, loading data, and installing necessary packages in breakout rooms.
- After a break, participants will learn about quality control and data pre-processing before lunch.
- Post-lunch topics include normalization and batch effect correction, crucial for analyzing single-cell data.
Transition from Bulk Genomics to Single Cell Analysis
- Bulk genomics is likened to making a smoothie where different cell types blend together, leading to loss of specific information about rare cells.
- In contrast, single-cell RNA sequencing allows access to all cell types and their characteristics individually.
Understanding Droplet Sequencing Technology
Advantages of Droplet Sequencing
- The droplet sequencing pipeline encapsulates single cells into nanoliter droplets for efficient processing.
- This technology significantly reduces costs while enabling the sequencing of thousands of cells simultaneously.
Growth in Single Cell Experiments
- There has been substantial growth in published experiments from 2014 to 2017, with current projects involving millions of cells.
Experimental Aspects of Single Cell RNA Sequencing
Sample Preparation Process
- The process begins with extracting RNA from samples, followed by converting it into cDNA for sequencing.
- Millions of sequences are generated that need mapping to the genome for further analysis.
Library Preparation Protocols
- Various protocols exist for library preparation; some capture full-length transcripts while others focus on the three-prime end only.
Considerations in Experimental Design
Factors Influencing Protocol Selection
- Researchers must consider how many cells can be processed and what additional information can be obtained during sorting.
Unique Molecular Identifiers (UMIs)
cDNA Synthesis and Amplification Process
Overview of cDNA Creation
- The process begins with the creation of complementary DNA (cDNA), where unique molecular identifiers (UMIs) are added to each cDNA molecule. These UMIs serve as short barcodes, typically consisting of eight to ten base pairs, allowing for the identification of individual molecules.
Amplification Techniques
- To enhance signal strength, cDNA amplification is performed using techniques such as Polymerase Chain Reaction (PCR) or in vitro transcription (IVT). This step is crucial for generating multiple copies of the cDNA.
Importance of UMI in Data Quality
- Amplification creates technical duplicates, and UMIs make it possible to tell them apart from true biological copies: reads that share a UMI come from the same original molecule, whereas distinct UMIs indicate distinct molecules from the cell.
- Without UMIs, it becomes challenging to differentiate between these copies, leading to increased noise in sequencing data. This distinction is essential when evaluating different protocols.
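A minimal sketch of how UMIs separate molecules from PCR duplicates, assuming reads have already been assigned to a cell barcode and a gene. The read tuples, barcodes, and gene names below are made up for illustration, and real pipelines additionally correct sequencing errors in the UMIs, which this sketch ignores.

```python
from collections import defaultdict

# Hypothetical aligned reads as (cell barcode, gene, UMI); all values are invented.
reads = [
    ("CELL_A", "GeneX", "AACGTGAT"),
    ("CELL_A", "GeneX", "AACGTGAT"),  # PCR duplicate of the read above
    ("CELL_A", "GeneX", "AACGTGAT"),  # another PCR duplicate
    ("CELL_A", "GeneX", "TTGCAAGC"),  # a second original GeneX molecule
    ("CELL_A", "GeneY", "AACGTGAT"),  # same UMI, different gene: separate molecule
    ("CELL_B", "GeneX", "AACGTGAT"),  # same UMI, different cell: separate molecule
]


def count_reads(reads):
    """Count every read per (cell, gene), duplicates included."""
    counts = defaultdict(int)
    for cell, gene, _umi in reads:
        counts[(cell, gene)] += 1
    return dict(counts)


def count_molecules(reads):
    """Count distinct UMIs per (cell, gene), collapsing PCR duplicates."""
    umis = defaultdict(set)
    for cell, gene, umi in reads:
        umis[(cell, gene)].add(umi)
    return {key: len(values) for key, values in umis.items()}


read_counts = count_reads(reads)
molecule_counts = count_molecules(reads)
print(read_counts[("CELL_A", "GeneX")], molecule_counts[("CELL_A", "GeneX")])
```

Without the UMI column, only the read counts would be available, and the amplified GeneX molecule would look four times as abundant as it really was.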
Data Mapping and File Management
Library Construction and Sequencing
- After a library has been constructed from the amplified cDNA and sequenced, the next step is to map the resulting reads back to the genome.
File Types and Storage Recommendations
- Sequencing facilities generate BCL files that are converted into FASTQ files. While both file types are considered raw data, public resources typically require submission of FASTQ files due to their size.
- It’s advised not to store large FASTQ files on personal devices; instead, utilize dedicated storage solutions since they may be needed for future data remapping.
Mapping Process Explained
- The mapping process involves aligning short DNA sequences derived from transcripts or cDNAs back to their corresponding genes within the genome. This requires high-performance computing resources rather than standard laptops.
Understanding FASTQ Files
Structure of FASTQ Files
- A typical FASTQ record consists of four lines:
- Line one starts with '@' followed by a sequence identifier.
- Line two contains the raw sequence letters.
- Line three begins with '+' and may include additional descriptions.
- Line four encodes the quality score of each base in line two, one character per base.
- These identifiers facilitate machine communication rather than human readability; thus, users do not need to manually inspect each file despite generating thousands during sequencing processes.
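The four-line record layout can be illustrated with a small reader. This is a sketch, not a replacement for an established parser: it assumes uncompressed text, one sequence line per record, and Phred+33 quality encoding (an assumption that holds for current Illumina output but is not stated in the session). The function names and the example reads are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from a FASTQ text stream."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        if not header.strip():  # tolerate blank lines between records
            continue
        sequence = handle.readline().strip()
        separator = handle.readline().strip()
        quality = handle.readline().strip()
        header = header.strip()
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


def phred33(quality):
    """Decode a quality string into integer scores, assuming Phred+33."""
    return [ord(char) - 33 for char in quality]


# Two invented records, held in memory instead of a file on disk.
example = io.StringIO(
    "@read_1 hypothetical_description\nGATTACA\n+\nIIIIHH#\n"
    "@read_2\nACGT\n+read_2\n!!II\n"
)
records = list(read_fastq(example))
scores = phred33(records[0][2])
```

In practice the same loop would be pointed at an opened file handle, but as noted above there is rarely a reason to inspect these files by hand.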
Sensitivity in Single Cell RNA Sequencing
Molecular Detection Limits Across Protocols
- A comparison across various RNA sequencing protocols reveals significant differences in sensitivity levels. Bulk samples require at least 10,000–100,000 copies for detection while single-cell protocols can detect much lower quantities.
- Some advanced methods can detect down to one molecule but often involve trade-offs regarding throughput and sensitivity. For instance, Smart-seq offers high sensitivity but low throughput compared to other methods like Drop-seq or Chromium from 10x Genomics which provide higher throughput but less sensitivity.
Trade-offs in Protocol Selection
Single Cell RNA Sequencing: Workflow and Analysis
Choosing the Right Protocol
- Shallow-sequencing, high-throughput protocols survey many cells but may not be suitable for studying low-copy-number transcripts such as transcription factors, where a more sensitive protocol like Smart-seq is preferred.
- The choice of protocol depends on the research question; high-throughput platforms like Chromium are favored for capturing rare cell types effectively.
Basic Workflow Overview
- The workflow spans from library preparation to sequencing, resulting in a count matrix of cells by genes (for example, 3,000 cells x 30,000 genes).
- Each entry in the count matrix indicates the number of RNA molecules detected per gene in each cell.
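As a small illustration of what such a matrix looks like, the sketch below builds one with NumPy from invented (cell, gene, molecule count) triples, with cells as rows and genes as columns. The cell and gene names are hypothetical, and real matrices are far larger and usually stored in a sparse format because most entries are zero.

```python
import numpy as np

cells = ["cell_0", "cell_1", "cell_2"]
genes = ["GeneA", "GeneB", "GeneC", "GeneD"]

# Invented molecule counts as (cell, gene, number of molecules detected).
observations = [
    ("cell_0", "GeneA", 3),
    ("cell_0", "GeneC", 1),
    ("cell_1", "GeneB", 5),
    ("cell_2", "GeneA", 2),
    ("cell_2", "GeneD", 7),
]

cell_index = {name: i for i, name in enumerate(cells)}
gene_index = {name: j for j, name in enumerate(genes)}

# Rows are cells, columns are genes; unobserved pairs stay at zero.
counts = np.zeros((len(cells), len(genes)), dtype=np.int64)
for cell, gene, n_molecules in observations:
    counts[cell_index[cell], gene_index[gene]] += n_molecules

total_per_cell = counts.sum(axis=1)         # molecules detected in each cell
genes_per_cell = (counts > 0).sum(axis=1)   # genes with at least one molecule
fraction_zero = float((counts == 0).mean()) # share of empty entries
print(counts)
```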
Data Preprocessing Steps
- Preprocessing involves quality control (QC), normalization for biases, and feature selection focusing on highly variable genes.
- Batch effect correction addresses systematic biases across samples; this preprocessing phase typically consumes about 80% of the total analysis time.
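The QC, normalization, and feature selection steps can be sketched on synthetic data with plain NumPy. Everything here is an illustrative assumption: the simulated counts, the "MT-" prefix used to flag mitochondrial genes, the thresholds, the target sum of 10,000, and the use of raw variance to rank genes. Scanpy provides its own functions for these steps (for example `sc.pp.filter_cells`, `sc.pp.normalize_total`, `sc.pp.log1p`, and `sc.pp.highly_variable_genes`), which would normally be used instead.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 200, 50
counts = rng.poisson(lam=2.0, size=(n_cells, n_genes)).astype(float)
gene_names = [f"MT-G{j}" if j < 5 else f"G{j}" for j in range(n_genes)]

# Make cell 0 look like a damaged cell: only mitochondrial counts remain.
counts[0, :5] = 20
counts[0, 5:] = 0

# Quality control metrics per cell.
is_mito = np.array([name.startswith("MT-") for name in gene_names])
total_counts = counts.sum(axis=1)
genes_detected = (counts > 0).sum(axis=1)
mito_fraction = counts[:, is_mito].sum(axis=1) / np.maximum(total_counts, 1)

# Hypothetical thresholds; real values depend on tissue and protocol.
keep = (genes_detected >= 10) & (mito_fraction < 0.5)
filtered = counts[keep]

# Library-size normalization to a common total, then log transform.
target_sum = 1e4
normalized = filtered / filtered.sum(axis=1, keepdims=True) * target_sum
logged = np.log1p(normalized)

# Feature selection: keep the genes with the highest variance across cells.
n_top = 10
hvg_idx = np.argsort(logged.var(axis=0))[::-1][:n_top]
print(filtered.shape, hvg_idx)
```

Batch effect correction is left out of this sketch because it depends on the experimental design and on the method chosen.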
Dimensionality Reduction and Clustering
- After preprocessing, the data is reduced from roughly 30,000 dimensions (one per gene) to a 2D embedding for easier interpretation.
- Clustering groups cells with similar expression profiles; marker genes are then used to annotate the clusters with known functions or identities.
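A toy version of these two steps, assuming two synthetic cell populations that differ in ten genes. PCA is computed with NumPy's SVD, and a bare-bones two-cluster k-means stands in for the clustering step purely for illustration; Scanpy workflows typically use graph-based methods such as Leiden or Louvain clustering instead. The function name `two_means` and all data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two synthetic populations of 50 cells x 100 genes.
group_a = rng.normal(0.0, 1.0, size=(50, 100))
group_b = rng.normal(0.0, 1.0, size=(50, 100))
group_b[:, :10] += 5.0  # ten "marker genes" are higher in group B
data = np.vstack([group_a, group_b])

# PCA: center each gene, then project onto the top two right singular vectors.
centered = data - data.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
embedding = centered @ vt[:2].T  # shape: (cells, 2)


def two_means(points, n_iter=10):
    """Minimal k-means with k=2, seeded at the extremes of the first axis."""
    order = np.argsort(points[:, 0])
    centers = points[[order[0], order[-1]]].copy()
    for _ in range(n_iter):
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        centers = np.array([points[labels == k].mean(axis=0) for k in range(2)])
    return labels


labels = two_means(embedding)

# The mean shift in the first ten genes is what a marker gene search would find.
marker_shift = data[labels == labels[-1], :10].mean() - data[labels == labels[0], :10].mean()
print(embedding.shape, np.bincount(labels), round(float(marker_shift), 2))
```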
Advanced Analysis Techniques
- Trajectory inference helps understand stem cell differentiation into mature cell types and identifies key genes driving fate decisions.
- Differential expression analysis compares gene changes between conditions (e.g., healthy vs. diseased states), revealing shifts in cell type compositions due to perturbations.
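To make the idea of comparing conditions concrete, here is a rough sketch that computes a log2 fold change per gene between two simulated groups of cells. The data, the pseudocount of 1, and the helper name `log2_fold_change` are assumptions for illustration. A fold change alone is not a statistical test; in Scanpy, `sc.tl.rank_genes_groups` is the usual entry point for ranking genes between groups.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated counts for 100 cells x 4 genes in each condition.
healthy = rng.poisson(5.0, size=(100, 4)).astype(float)
diseased = rng.poisson(5.0, size=(100, 4)).astype(float)
diseased[:, 0] = rng.poisson(20.0, size=100)  # gene 0 is upregulated in disease


def log2_fold_change(reference, condition, pseudocount=1.0):
    """Per-gene log2 ratio of mean expression, condition over reference."""
    ref_mean = reference.mean(axis=0) + pseudocount
    cond_mean = condition.mean(axis=0) + pseudocount
    return np.log2(cond_mean / ref_mean)


lfc = log2_fold_change(healthy, diseased)
top_gene = int(np.argmax(np.abs(lfc)))
print(np.round(lfc, 2), top_gene)
```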
Resources for Further Learning
- For an in-depth understanding of single-cell RNA sequencing workflows, refer to the paper by Malte Luecken and Fabian Theis, "Current best practices in single-cell RNA-seq analysis: a tutorial," which includes practical tutorials.
Introduction to Scanpy Framework
- Scanpy uses an annotated data (AnnData) structure containing a data matrix (cells x genes), along with metadata annotations for the cells, the genes, and other relevant features.
Understanding AnnData Objects in Bioinformatics
Structure of AnnData Objects
- The AnnData object has dimensions cells x genes, in this example 1,000 cells by 12,000 genes. It is crucial for users to understand how to create this object during their work.
- Metadata includes per-cell observations such as gene counts and mitochondrial fractions, along with annotations related to gene variance and cell neighbors.
- The obsm attribute contains low-dimensional representations like PCA, t-SNE, UMAP, diffusion maps, and ForceAtlas graph layouts.
- Layers within the data structure, such as counts, have the same shape as the X matrix but may have different normalization or batch correction applied.
- Users can save normalized versions of their data in layers while retaining access to the raw count matrix.
Practical Considerations for Using Docker Containers
- Going from the raw count matrix to cell type annotation is memory-intensive, so memory usage needs careful consideration when working in Docker containers.
- More complex tutorials necessitate at least six gigabytes of memory and one gigabyte of swap; simpler tutorials require less memory.
- Users should not panic if their computer does not meet these specifications; the memory allocation can be adjusted in the settings.