20. Precision Medicine

Understanding Precision Medicine

Introduction to Precision Medicine

  • Peter Szolovits introduces the concept of precision medicine, noting the lack of a precise definition and emphasizing the importance of understanding disease subtypes.
  • He highlights various data types used for clustering diseases, including demographics, comorbidities, vital signs, medications, procedures, disease trajectories, and image similarities.

The Role of Genetics in Precision Medicine

  • Szolovits discusses the Human Genome Project's hope that genetic insights would lead to more effective therapies for diseases.
  • He references a 2011 National Research Council study titled "Toward Precision Medicine," which outlines new capabilities in compiling molecular data on patients.

Advancements Driving Precision Medicine

  • The cost of sequencing a human genome has dramatically decreased from $3 billion to under $1,000, enabling broader access to genetic information.
  • There is increasing success in using molecular information for diagnosis and treatment improvements; public attitudes are shifting towards accepting genetic data collection due to perceived benefits.

Integrating Diverse Data Sources

  • The report suggests integrating various individual patient data types akin to how Google Maps layers information using coordinates.
  • Szolovits mentions the NIH's "All of Us" project aimed at collecting comprehensive health data from one million diverse participants across the U.S. population.

Historical Context and Future Vision

  • He compares "All of Us" with historical studies like the Framingham Heart Study that collected extensive health data over generations.
  • The goal is to create an information commons that supports biomedical research by identifying significant associations within large datasets.

Implications for Healthcare Outcomes

Sam Johnson and the Concept of Dropsy

Overview of Sam Johnson

  • Sam Johnson was a prominent 18th-century British writer known for his contributions to encyclopedias, dictionaries, poetry, and commentary.
  • He humorously remarked about his health issues, stating: "My diseases are an asthma and a dropsy and, what is less curable, 75 years old."

Understanding Dropsy

  • Audience members recognized "dropsy" from literature, particularly Jane Austen novels and period dramas like Masterpiece Theatre.
  • Dropsy refers to water sickness or edema; it is not a disease itself but a symptom of various underlying conditions such as pulmonary disease or heart failure.
  • The last recorded instance of dropsy as a cause of death in the U.S. was in 1949, indicating its decline in medical terminology.

Precision Medicine and Data Clustering

  • Zak Kohane from Harvard proposed representing patient information as points in a high-dimensional data space so that it can be analyzed effectively.
  • In high-dimensional datasets, data points often cluster on lower-dimensional manifolds rather than being uniformly scattered.

Patient Clusters and Outliers

  • Patients within clusters represent typical cases; those at the edges may indicate unusual conditions warranting further investigation.
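
The idea of typical patients inside clusters and unusual patients at the edges can be sketched in a few lines. This is only an illustration under stated assumptions: the two-feature patient vectors, the function names, and the distance threshold are all hypothetical, and a real analysis would work in far more dimensions with a carefully chosen metric.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Coordinate-wise mean of a list of feature vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def flag_outliers(clusters, patients, threshold):
    """Return the patients whose nearest cluster centroid is farther than threshold."""
    centers = [centroid(c) for c in clusters]
    flagged = []
    for p in patients:
        nearest = min(distance(p, c) for c in centers)
        if nearest > threshold:
            flagged.append(p)
    return flagged

# Hypothetical, already-normalized two-feature patient vectors.
cluster_a = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2]]
cluster_b = [[5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
new_patients = [[0.1, 0.1], [5.0, 5.0], [2.5, 9.0]]

# The threshold of 1.0 is an arbitrary illustrative choice.
outliers = flag_outliers([cluster_a, cluster_b], new_patients, threshold=1.0)
print(outliers)
```

Only the third patient lies far from both centroids, so it is the one flagged for further investigation.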

Case Study: A Sick Puppy

  • A case study involved a child with ulcerative colitis who faced severe complications after initially responding well to treatment.
  • The challenge lies in determining whether patients are outliers based on their recent medical history compared to established clusters.

Treatment Approaches Using Machine Learning

Understanding Time-Varying Data in Genetics

The Challenge of High-Dimensional Data

  • The discussion begins with the complexities of representing time-varying data, emphasizing the need for optimal weighting and normalization across dimensions.
  • Questions arise about whether all dimensions in high-dimensional space are equally significant or if some carry more weight depending on the specific problem context.
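
A small sketch may make the normalization and weighting question concrete. It assumes z-score normalization and a weighted Euclidean distance, which is one common choice rather than the method from the lecture; the features, values, and weights are hypothetical.

```python
import math
import statistics

def zscore_columns(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]

def weighted_distance(a, b, weights):
    """Euclidean distance in which each dimension is scaled by a problem-specific weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

# Hypothetical features: age (years), systolic blood pressure (mmHg), creatinine (mg/dL).
patients = [[30, 120, 0.9], [60, 150, 1.1], [45, 135, 3.0]]
z = zscore_columns(patients)

# Equal weights versus weights that emphasize creatinine (say, for a renal question).
equal = weighted_distance(z[0], z[1], [1.0, 1.0, 1.0])
renal = weighted_distance(z[0], z[1], [0.1, 0.1, 1.0])
print(equal, renal)
```

Without normalization, blood pressure would dominate the distance simply because its numbers are larger; with the renal weights, the first two patients look much closer because they differ mainly in age and blood pressure.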

Introduction to Genetics

  • A brief overview of molecular cell biology is provided, highlighting that biology often defies strict rules, as noted by a biologist's quote: "Biology is the science of exceptions."
  • The speaker recalls an experience from a special class aimed at engineering faculty, where they learned about numerous exceptions in biological theories.

Mendelian Inheritance

  • The concept of inheritance is introduced through Gregor Mendel's work, which identified discrete factors (genes) responsible for traits passed from parents to offspring.
  • Mendel's experiments with pea plants established foundational principles of dominant and recessive inheritance patterns.

Discovery of DNA

  • Johann Friedrich Miescher discovered nuclein (DNA) in 1869, but it wasn't until 1952 that Hershey and Chase confirmed DNA as the carrier of genetic information.
  • Watson, Crick, and Franklin later elucidated DNA's double helix structure and its role in heredity through replication processes.

Defining Genes and Their Functions

  • A gene is defined as a fundamental unit of heredity located on chromosomes that encodes functional products like RNA or proteins.
  • Despite advances, understanding which parts of DNA code for genes remains challenging; much non-coding DNA’s function is still unknown.

Central Dogma and Gene Expression Regulation

  • Francis Crick proposed that nucleic acid specificity derives from base sequences; this was initially controversial but later validated.

Understanding Gene Function and Complexity

The Role of Repressors and Activators

  • Repressors prevent activators from binding or alter the activator to change rate constants, showcasing a mechanism in gene regulation.
  • Only about 1.5% of DNA consists of exons that code for mRNA and proteins; the function of the remaining 98.5% is largely unknown.

The Misconception of "Junk DNA"

  • The term "junk DNA" is misleading as evolution would likely eliminate non-functional DNA; cells expend energy maintaining this DNA.
  • Introns are segments spliced out during RNA processing, while regulatory sequences (about 5%) include promoters, enhancers, and the sites where repressors bind.

Speculations on Non-Coding DNA

  • Hypotheses suggest non-coding regions may serve as a reservoir for useful genetic material in changing environments, though this remains speculative.
  • Gerald Fink's definition expands genes to any transcribed segment of DNA with a function, not limited to those coding for proteins.

Alternative Splicing and Its Implications

  • Eukaryotic genes undergo alternative splicing where introns are removed and sometimes only select exons are retained, leading to diverse protein products.
  • Complexities arise as RNA can convert back into DNA (as seen in viral infections), highlighting intricate cellular mechanisms.

Advances in Genetic Editing Technologies

  • CRISPR-Cas9 represents a breakthrough in gene editing by utilizing bacterial defense mechanisms against viruses; ethical concerns arise from its application in humans.
  • Controversial experiments have been conducted on human genomes to confer resistance against diseases like HIV, raising questions about safety and ethics.

Understanding RNA Functions Beyond Protein Coding

  • Various types of RNA exist beyond mRNA: long non-coding RNAs regulate genes while small RNAs can inhibit translation through interference mechanisms.
  • Post-translational modifications change how quickly different proteins are degraded. Chromatin structure adds another complication: DNA is densely packed, yet it must be accessible in order to be transcribed.

The Future of Genomic Research

Cost of Genome Sequencing and RNA Analysis

Advances in Genome Sequencing

  • The cost of sequencing a genome has dramatically decreased from $3 billion to just a few hundred dollars, with whole exome sequencing available for $299 at 50x coverage.
  • For an additional $100, customers can opt for 100x coverage, highlighting the importance of replicates due to the noisy nature of these techniques.

RNA Sequencing Innovations

  • A new trend is the ability to sequence RNA transcribed from DNA; kits are available for $360 that can analyze RNA from up to 100 individual cells.
  • Cancer care has integrated genomic analysis into routine practice, where samples are sequenced to identify damaged genes and potential drug responses based on genetic variants.

Characterizing Disease Subtypes Using Gene Expression

Methodology Overview

  • The discussion shifts towards technical methods for characterizing disease subtypes using gene expression arrays, referencing a pivotal paper by Alizadeh from 2000.
  • The process involves extracting coding RNA, creating complementary DNA (cDNA), amplifying it, and utilizing microarrays containing numerous DNA fragments.

Microarray Techniques

  • cDNA is labeled with fluorescent dyes before being applied to microarrays; this allows gene expression levels to be measured through fluorescence intensity.
  • An alternative method involves comparing normal tissue against cancerous tissue using dual-color labeling (green for normal and red for cancer), facilitating ratio measurements without calibration issues.
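
The ratio measurement in the dual-color design is usually expressed as a log ratio, so that over- and under-expression are symmetric around zero. The sketch below assumes hypothetical gene names and spot intensities, and the intensity floor is an illustrative guard rather than a standard value.

```python
import math

def log2_ratio(red, green, floor=1.0):
    """log2(tumor / normal) intensity; the floor avoids dividing by or taking the log of zero."""
    return math.log2(max(red, floor) / max(green, floor))

# Hypothetical spot intensities: (gene, red = cancer tissue, green = normal tissue).
spots = [
    ("GENE_A", 4000.0, 1000.0),
    ("GENE_B", 500.0, 2000.0),
    ("GENE_C", 1500.0, 1500.0),
]

ratios = {gene: log2_ratio(red, green) for gene, red, green in spots}
print(ratios)
```

A positive value means the gene is more highly expressed in the cancer sample, a negative value means it is more highly expressed in normal tissue, and zero means no difference. Because both samples share a spot, the ratio does not depend on how much probe happened to be deposited there.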

Clustering Analysis in Breast Cancer Research

Experimental Results

  • In typical microarray experiments, hierarchical clustering reveals patterns in gene expression across breast cancer biopsy specimens.
  • Clustering results indicated distinct groups within tumor samples correlating with specific pathological features, showcasing the effectiveness of this analytical approach.
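
Hierarchical clustering of expression profiles can be sketched as follows. This is a minimal pure-Python version that assumes average linkage and a correlation-based distance; the sample profiles are hypothetical, and real studies would normally rely on an established library and thousands of genes.

```python
import math

def correlation_distance(a, b):
    """1 - Pearson correlation between two expression profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(va * vb)

def agglomerate(profiles, n_clusters):
    """Average-linkage agglomerative clustering; returns lists of sample indices."""
    clusters = [[i] for i in range(len(profiles))]

    def linkage(c1, c2):
        pairs = [(i, j) for i in c1 for j in c2]
        total = sum(correlation_distance(profiles[i], profiles[j]) for i, j in pairs)
        return total / len(pairs)

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda xy: linkage(clusters[xy[0]], clusters[xy[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical log-ratio expression profiles (rows = biopsy samples, columns = genes).
samples = [
    [2.0, 1.8, -1.0, -1.2],
    [1.9, 2.1, -0.8, -1.1],
    [-1.0, -1.3, 2.2, 1.9],
    [-1.2, -0.9, 1.8, 2.0],
]

groups = agglomerate(samples, n_clusters=2)
print(groups)
```

Recording the order of the merges, instead of stopping at a fixed number of clusters, yields the dendrogram that is typically drawn next to the expression heat map.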

Insights on Gene Lists and Survival Rates

  • Questions arise regarding how gene lists are generated; modern studies often utilize extensive lists encompassing thousands of known genes.

Understanding Gene Expression and Clinical Outcomes

Clustering Based on Gene Expression

  • Research indicates that gene expression levels can reveal clusters of patients with different clinical outcomes, independent of their clinical conditions.
  • The Alizadeh paper demonstrated similar findings by analyzing 96 samples of normal and malignant lymphocytes, correlating identified clusters with established lymphoma types.

Phenotypes and Genetic Analysis

  • A phenotype can be a diagnosed disease (e.g., breast cancer), a type of lymphoma, or even traits like weight or eye color. It encompasses any clinically known characteristic.
  • Analyzing phenotypes in relation to genes is done through GWAS (Genome-Wide Association Studies), which identify genetic differences linked to specific phenotypic variations.

Genetic Variations in GWAS

  • GWAS typically focuses on single nucleotide polymorphisms (SNPs), where the genome differs from the reference genome at specific loci.
  • Copy number variations are also significant; for instance, Huntington's disease is associated with the number of times a short DNA sequence is repeated, and healthy individuals have fewer repeats than those who develop the disease.

PheWAS: Exploring Phenotype Associations

  • In contrast to GWAS, PheWAS (Phenome-Wide Association Study) examines how various phenotypes correlate with a particular genetic variant across multiple traits.

Interpreting GWAS Results

  • A typical output from a GWAS is represented as a Manhattan plot, which shows, for each variant along the genome, how strongly it is associated with a specific phenotype when affected and unaffected individuals are compared (usually as -log10 of the p-value).
  • Statistical significance is assessed using methods like Bonferroni correction because many hypotheses are tested at once; only variants exceeding this threshold are considered potential candidates for further study.
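
The Bonferroni correction itself is simple arithmetic: divide the desired significance level by the number of tests. The sketch below uses hypothetical variant identifiers and p-values; a real GWAS tests on the order of a million variants, which is why the thresholds used in practice are so small.

```python
import math

def bonferroni_hits(p_values, alpha=0.05):
    """Return the corrected threshold and the tests whose p-value falls below it."""
    threshold = alpha / len(p_values)
    hits = {name: p for name, p in p_values.items() if p < threshold}
    return threshold, hits

# Hypothetical per-variant p-values from an association test.
p_values = {
    "rs_hypothetical_1": 3e-9,
    "rs_hypothetical_2": 0.02,
    "rs_hypothetical_3": 0.03,
    "rs_hypothetical_4": 0.6,
}

threshold, hits = bonferroni_hits(p_values)

# -log10(p) is the quantity usually drawn on the vertical axis of a Manhattan plot.
heights = {name: -math.log10(p) for name, p in p_values.items()}
print(threshold, sorted(hits))
```

Two of the variants would pass an uncorrected 0.05 cutoff but fail the corrected one, which is exactly the kind of chance finding the correction is meant to screen out.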

Challenges in Genetic Analysis

  • Caution is advised before applying findings directly to treatment; confounding factors may affect results. Researchers often create knockout mice models to validate gene-disease relationships experimentally.

Addressing Statistical Limitations

Understanding Statistical Significance and Genetic Studies

The Debate on P-Values

  • The head of the American Statistical Association published a controversial article arguing that statistical significance does not equate to practical significance, suggesting a shift towards Bayesian reasoning.
  • There is ongoing discussion about the limitations of Genome-Wide Association Studies (GWAS), particularly regarding common variants with small effect sizes.

Challenges in GWAS Findings

  • A study conducted by Zak Kohane and Kat Liao revealed genes with odds ratios between 1.1 and 1.2, which are statistically significant but have little real-world impact compared to much higher odds ratios such as those for smoking-related lung cancer.
  • Kohane advocates for focusing on clinical factors with stronger predictive relationships rather than weak genetic associations.
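
To see why an odds ratio of 1.1 carries so little weight, it helps to compute one from a 2x2 table. All counts below are hypothetical and chosen only to produce a weak and a strong association; they are not taken from the study mentioned above or from any smoking data.

```python
def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Odds of disease among the exposed divided by the odds among the unexposed."""
    return (exposed_cases * unexposed_controls) / (exposed_controls * unexposed_cases)

# Hypothetical weak genetic association: variant carriers versus non-carriers.
weak = odds_ratio(550, 500, 4500, 4500)

# Hypothetical strong clinical association.
strong = odds_ratio(900, 300, 100, 700)

print(weak, strong)
```

With thousands of participants the weak association can still reach statistical significance, yet knowing a patient's carrier status barely changes the estimate of that patient's risk.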

Insights from Mendelian Mutations

  • The expectation that the Human Genome Project would uncover numerous highly penetrant Mendelian mutations was largely unmet; most known diseases were already identified through historical research.
  • Current research emphasizes rare genetic variants with small effects, as they may hold keys to understanding complex conditions.

Unknown Disease Diagnosis Initiatives

  • A challenge involving eight children with undiagnosed conditions utilized their genetic data alongside family histories to identify potential genetic causes.
  • Participants developed diverse analytical pipelines, which converged over time, leading to improved methods for diagnosing unknown diseases.

Recent Advances in Type 2 Diabetes Research

  • A recent genome-wide association study focused on type 2 diabetes examined 94 previously associated loci and various glycemic traits, body metrics, and disease associations.

Matrix Factorization and Its Applications

Understanding Non-Negative Matrix Factorization

  • The technology discussed is non-negative matrix factorization, which addresses the challenge of negative associations by duplicating columns for traits with both positive and negative values.
  • The goal is to approximate a larger matrix (47x94) by the product of two smaller matrices (47xK and Kx94). This is an unsupervised form of dimension reduction, similar in spirit to an auto-encoder.

Error Minimization in Matrix Reconstruction

  • The objective involves minimizing the error between the original matrix X and its reconstruction from W and H, allowing for determination of an optimal value for K.
  • A regularized L2 distance loss function is used, incorporating penalty terms based on the sizes of W and H, along with relevance weights to enhance computational efficiency.
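
The factorization can be sketched with the classic multiplicative update rules for a squared-error loss. This is a simplified stand-in, not the regularized loss with relevance weights described above: it has no penalty terms, it compares a few values of K by reconstruction error rather than learning K, and the small signed matrix is hypothetical.

```python
import numpy as np

def split_signed(Z):
    """Replace each signed column by two non-negative columns: positive part and negative part."""
    return np.hstack([np.maximum(Z, 0.0), np.maximum(-Z, 0.0)])

def nmf(X, k, n_iter=500, eps=1e-9, seed=0):
    """Approximate non-negative X (n x m) as W (n x k) @ H (k x m) by multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Each factor is multiplied by a non-negative ratio, so it stays non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical signed association scores: 6 variants (rows) x 3 traits (columns).
Z = np.array([
    [2.0, -1.0, 0.0],
    [1.0, -0.5, 0.0],
    [3.0, -1.5, 0.0],
    [0.0, 1.0, -2.0],
    [0.0, 2.0, -4.0],
    [0.0, 0.5, -1.0],
])

X = split_signed(Z)  # 6 x 6, all entries non-negative

# Reconstruction error ||X - WH|| for a few candidate values of K.
errors = {}
for k in (1, 2, 3):
    W, H = nmf(X, k)
    errors[k] = float(np.linalg.norm(X - W @ H))
print(errors)
```

The toy matrix contains two groups of variants with different trait profiles, so the error drops sharply from K = 1 to K = 2 and gains little beyond that. Each row of H can then be read as the trait signature of one cluster.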

Findings from Type 2 Diabetes Analysis

  • An analysis involving 17,000 individuals revealed five subtypes of type 2 diabetes that appeared in over 82% of iterations. Two of them relate to beta-cell function and differ in proinsulin levels.
  • The other three subtypes were linked to obesity, lipid metabolism issues, and liver function abnormalities.

Interpretation of Results Through Spider Diagrams

  • Spider diagrams illustrate different influences across clusters; factors are categorized into negative correlations (inner circle), zero correlation (middle), and positive correlations (outer circle).
  • Despite statistically significant findings (e.g., DI contributing significantly at p-value = 6.6 x 10^-37), many effects observed were relatively small in magnitude.

Introduction to PheWAS Methodology

  • PheWAS serves as a reverse GWAS approach introduced by Josh Denny et al. in 2010, utilizing data from Vanderbilt's biobank focusing on European Americans for comparability with existing GWAS data.
  • Researchers selected multiple SNPs associated with various diseases while clustering billing codes into phenotypes relevant for their study.
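
The core loop of a PheWAS can be sketched as one association test per phenotype for a single variant, followed by a multiple-testing correction. The example assumes a plain chi-square test on a 2x2 table with one degree of freedom, and all phenotype names and counts are hypothetical; the published study's exact statistical procedure is not reproduced here.

```python
import math

def chi2_p_value(a, b, c, d):
    """P-value of the 1-degree-of-freedom chi-square test on the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For one degree of freedom, the upper tail probability equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical counts per phenotype for one variant:
# (carrier cases, carrier controls, non-carrier cases, non-carrier controls)
tables = {
    "phenotype_A": (90, 910, 50, 950),
    "phenotype_B": (52, 948, 50, 950),
    "phenotype_C": (60, 940, 50, 950),
}

threshold = 0.05 / len(tables)  # Bonferroni correction across phenotypes
p_values = {name: chi2_p_value(*counts) for name, counts in tables.items()}
significant = [name for name, p in p_values.items() if p < threshold]
print(p_values, significant)
```

A GWAS fixes the phenotype and scans variants; this loop does the reverse, fixing the variant and scanning phenotypes derived from billing codes.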

Data Utilization and Statistical Analysis

Analysis of SNP Associations and Gene Expression

Insights on SNP Analysis

  • The Bonferroni-corrected analysis reveals that only multiple sclerosis is significantly associated with a specific SNP, while other conditions like malignant neoplasm of the rectum and benign digestive tract neoplasms show intriguing but non-significant associations.
  • Audience inquiry about data availability leads to clarification that access to certain hospital data may be restricted, although collaboration opportunities exist for those willing to volunteer at institutions like Vanderbilt.
  • The NCBI's Gene Expression Omnibus (GEO) contains vast amounts of sample data, but it often lacks detailed curation, making it challenging to ascertain the specifics of each sample.

Limitations in Genetic Association Studies

  • Despite expectations, the association between the selected SNP and lupus is not significant (p-value = 0.5), while its link to multiple sclerosis remains statistically significant.
  • A notable finding indicates that an expected association between a SNP related to coronary artery disease and carotid plaque deposition does not hold up (p-value = 0.82).

Exploring Expression Quantitative Trait Loci (eQTL)

  • The concept of Expression Quantitative Trait Loci (eQTL) suggests using gene expression levels as traits instead of focusing solely on disease categories, which could simplify analyses.
  • Understanding gene expression involves complexities beyond mere presence; factors such as activation or repression must also be considered when analyzing genetic variations' effects on RNA expression levels.
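
Treating expression as a quantitative trait means an eQTL test can be as simple as regressing expression level on the number of copies of a variant allele. The sketch below assumes an additive model fitted by ordinary least squares; the genotype counts and expression values are hypothetical.

```python
def linear_fit(x, y):
    """Ordinary least squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical data: minor-allele count per person and expression level of one gene.
genotype = [0, 0, 1, 1, 2, 2]
expression = [5.0, 5.2, 6.1, 5.9, 7.0, 7.2]

slope, intercept = linear_fit(genotype, expression)
print(slope, intercept)
```

The slope estimates how much expression changes per additional copy of the allele. A full analysis would also test whether the slope differs from zero and would adjust for confounders such as tissue type and population.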

Population Variability in Gene Expression

  • Evidence shows significant variability in gene expression among populations (17% variation reported for individuals of African descent and 26% for Asian populations), which points to potential confounding factors such as environment and tissue differences.
  • Despite challenges in correlating expression levels with disease phenotypes, eQTL analyses have identified relationships with conditions such as asthma and Crohn's disease.

Complex Traits and Bayesian Network Models

  • Current research focuses on understanding complex diseases that are not Mendelian by employing advanced models like Bayesian networks to analyze interactions between genetic variants and traits.
  • Various models can illustrate how genetic variants influence both gene expression levels and diseases; these include direct causation and scenarios in which the variables involved are conditionally independent.

Understanding Genome-Phenome Association Studies

Hypothesis Generation and Likelihood Calculation

  • The approach involves generating a large set of hypotheses and calculating the likelihood of data given each hypothesis. The hypothesis with the highest likelihood is considered closest to correct.
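
Scoring hypotheses by the likelihood of the data can be illustrated with a deliberately small example. It assumes each hypothesis is just a candidate probability in a binomial model, which is far simpler than the network models discussed in the lecture; the hypotheses and counts are hypothetical.

```python
import math

def log_likelihood(p, successes, trials):
    """Binomial log-likelihood of the data under probability p (constant term omitted)."""
    return successes * math.log(p) + (trials - successes) * math.log(1.0 - p)

# Hypothetical data: 30 of 100 carriers of a variant show the trait.
observed, total = 30, 100

# Hypothetical candidate hypotheses about the probability of the trait among carriers.
hypotheses = {
    "no effect (p=0.10)": 0.10,
    "moderate (p=0.30)": 0.30,
    "strong (p=0.60)": 0.60,
}

scores = {name: log_likelihood(p, observed, total) for name, p in hypotheses.items()}
best = max(scores, key=scores.get)
print(best)
```

The same pattern carries over to richer hypotheses: write down the probability each one assigns to the observed data, then compare. The omitted binomial coefficient is identical for every hypothesis, so it does not affect the ranking.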

Overview of UK Biobank Data

  • The UK Biobank includes approximately 500,000 de-identified individuals with full exome sequencing, though only about 10% of desired data has been collected. Many participants have behavioral data from 24-hour activity monitors and online questionnaires. About 20% have imaging linked to their electronic health records, providing insights into health outcomes like cancer or hospital episodes.

Recent Findings from Genetic Studies

  • Recent studies highlighted include genetic variants that protect against obesity and type 2 diabetes, risks associated with meat consumption and bowel cancer, as well as genetic causes for poor sleep. These findings illustrate the diverse research being conducted using biobank data.

Heritability Insights

  • Height shows significant heritability (0.46) linked to parental height, while educational attainment appears influenced by parental education rather than genetics alone, indicating social factors play a role. Interestingly, even TV-watching habits show some heritable traits based on genetics.

Gene Set Enrichment Analysis

  • Gene set enrichment analysis emphasizes that genes typically do not act in isolation; understanding metabolic processes requires examining groups of interacting genes rather than individual ones. This method aims to strengthen genome-wide associations by analyzing biologically defined gene sets; the collection at the Broad Institute started with 1,300 sets and has since expanded to 18,000.

Techniques in Gene Association Studies

  • A clever technique ranks all genes by their correlation with a specific trait and then walks down the ranked list, keeping a running sum that rises when a gene belongs to the set of interest and falls otherwise. A set whose members concentrate near the top of the list produces a large excursion, suggesting that the set is associated with the trait. This random-walk procedure helps identify significant associations across various diseases and biological factors.
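
The random-walk statistic can be sketched as a running sum over a ranked gene list. This is a simplified, unweighted version (every hit counts equally), which is an assumption made for brevity; the gene names and sets are hypothetical, and a real analysis would also estimate significance by permuting the phenotype labels.

```python
def enrichment_score(ranked_genes, gene_set):
    """Largest excursion from zero of a running sum taken over the ranked gene list."""
    hits = sum(1 for g in ranked_genes if g in gene_set)
    misses = len(ranked_genes) - hits
    if hits == 0 or misses == 0:
        raise ValueError("gene set must contain some, but not all, of the ranked genes")
    running, best = 0.0, 0.0
    for g in ranked_genes:
        # Step up when the gene is in the set, step down otherwise; the sum ends at zero.
        running += 1.0 / hits if g in gene_set else -1.0 / misses
        if abs(running) > abs(best):
            best = running
    return best

# Hypothetical genes, ordered from most to least correlated with the trait.
ranked = ["G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8"]

es_top = enrichment_score(ranked, {"G1", "G2", "G3"})        # set concentrated at the top
es_scattered = enrichment_score(ranked, {"G2", "G5", "G8"})  # set spread through the list
print(es_top, es_scattered)
```

A set concentrated at the top of the ranking drives the running sum far from zero, while a scattered set keeps it close to zero throughout.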

Current Limitations in Advanced Analytical Methods

Video description

MIT 6.S897 Machine Learning for Healthcare, Spring 2019
Instructor: Peter Szolovits
View the complete course: https://ocw.mit.edu/6-S897S19
YouTube Playlist: https://www.youtube.com/playlist?list=PLUl4u3cNGP60B0PQXVQyGNdCyCTDU1Q5j

Prof. Szolovits gives an introduction to disease subtyping and discusses collecting genome data, precision medicine modality space (PMMS), genotypes and phenotypes, and the history of genetics and genome sequencing. He also discusses the Framingham study.

License: Creative Commons BY-NC-SA
More information at https://ocw.mit.edu/terms
More courses at https://ocw.mit.edu