Model selection, part 2

AIC: Understanding the Akaike Information Criterion

Introduction to AIC

  • The Akaike Information Criterion (AIC) is introduced as a method based on information theory, differing from classical statistics that rely on p-values.
  • AIC aligns more closely with Bayesian statistics, focusing on model probabilities rather than p-values in data analysis.

Probabilistic Models and Their Importance

  • Probabilistic models allow for the computation of probabilities for various outcomes; for example, determining the likelihood of getting a certain number of heads when tossing a coin multiple times.
  • In phylogenetics, having a probabilistic model enables the calculation of probabilities for different alignments based on specific parameters.

Kullback-Leibler Divergence

  • The Kullback-Leibler divergence measures the distance between probability distributions, providing insight into how well one distribution approximates another.
  • For discrete probability distributions, this measure involves comparing probabilities across all possible values in a dataset.
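For discrete distributions, the divergence is a weighted sum of log-ratios of the two distributions' probabilities. A minimal Python sketch (the function name and the coin example are my own, not from the source):

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(P || Q) for two discrete
    probability distributions given as lists over the same outcomes.
    It is zero when P and Q are identical and positive otherwise."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A fair coin (P) approximated by a biased coin (Q):
print(kl_divergence([0.5, 0.5], [0.7, 0.3]))  # small positive number
```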

Model Selection Using AIC

  • To find an effective model that approximates reality, one should minimize the Kullback-Leibler divergence from the true probability distribution.
  • A visual representation illustrates three models attempting to approximate reality; among them, Q2 has the smallest divergence and is thus preferred.

Calculating AIC

  • AIC estimates expected relative Kullback-Leibler distances between models and reality. It cannot provide absolute distances due to unknown true realities but offers comparative insights.
  • The formula for AIC is straightforward: AIC = -2 × (log-likelihood of the fitted model) + 2 × (number of free parameters). Smaller AIC values indicate better models.
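In code, the criterion is a one-liner. A Python sketch with illustrative values (the function name is my own):

```python
def aic(log_likelihood, n_free_params):
    """AIC = -2 * log-likelihood + 2 * number of free parameters.
    Smaller values indicate a better trade-off of fit and complexity."""
    return -2.0 * log_likelihood + 2.0 * n_free_params

# A better fit (higher log-likelihood) can still lose overall
# if it requires many extra free parameters.
print(aic(-2026.2, 2))
```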

Practical Application of AIC

  • When applying AIC in practice, one fits multiple alternative models to data without requiring nested structures or limits on quantity.

Understanding AIC and Model Selection

Introduction to AIC

  • The Akaike Information Criterion (AIC) is calculated as -2 times the log-likelihood plus two times the number of free parameters, allowing models to be compared by their AIC values.
  • The smallest AIC values indicate the best models; in this case, the CVM + I + G model was identified as superior among those tested.

Delta AIC and Model Probabilities

  • To enhance model selection, one can compute Delta AIC values by subtracting the minimal AIC value from each model's AIC value.
  • For example, if the minimal AIC is at the top of a table, its Delta AIC will be 0; subsequent models will have positive Delta values reflecting their relative fit.

Calculating Akaike Weights

  • After computing Delta AIC values, Akaike weights are derived by applying the exponential function to -0.5 times each Delta value and then normalizing so the weights sum to 1.
  • These weights represent the probability that a given model is the best one given the available data; for instance, a weight of 0.45 means a 45% chance that the model is optimal.
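The two steps above (Delta AIC, then weights) can be sketched in Python; the function name is my own, but the recipe is the standard Akaike-weight calculation:

```python
import math

def akaike_weights(aic_values):
    """Turn a list of AIC values into model probabilities.
    Steps: subtract the minimum (Delta AIC), apply exp(-0.5 * Delta),
    then normalize so the weights sum to 1."""
    min_aic = min(aic_values)
    raw = [math.exp(-0.5 * (a - min_aic)) for a in aic_values]
    total = sum(raw)
    return [r / total for r in raw]

weights = akaike_weights([4070.6, 4056.4])
print(weights)  # the second (smaller-AIC) model gets almost all the weight
```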

Bayesian Connection and Scientific Inquiry

  • This probabilistic approach aligns with Bayesian inference principles where uncertainty quantification is crucial in scientific reasoning.
  • Using probabilities allows researchers to assess multiple hypotheses simultaneously rather than relying solely on null hypothesis testing.

Practical Applications of Model Selection

  • Constructing a comprehensive set of plausible alternative models enables effective evidence assessment through computed model probabilities.
  • This method differs significantly from traditional null hypothesis testing by evaluating various plausible models instead of just one.

Multi-model Inference and Parameter Importance

Making Robust Predictions

  • Multi-model inference allows predictions to be made more robustly by averaging predictions across the different models, weighted by their respective model probabilities.

Estimating Parameters Across Models

  • When parameters appear in multiple investigated models (e.g., gamma shape parameter), averaging these estimates enhances reliability using model probabilities as weights.

Assessing Parameter Importance

  • By summing up probabilities from models containing specific parameters (like transitions), researchers can determine their relative importance within a system under study.
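The model-averaging and parameter-importance ideas above can be sketched as follows (Python; the function names and the None-for-absent convention are my own assumptions):

```python
def model_averaged_estimate(weights, estimates):
    """Average a parameter over the models that contain it, using model
    probabilities as weights; estimates holds None for models lacking it."""
    pairs = [(w, e) for w, e in zip(weights, estimates) if e is not None]
    total = sum(w for w, _ in pairs)
    return sum(w * e for w, e in pairs) / total

def parameter_importance(weights, has_parameter):
    """Relative importance of a parameter: the summed probability of
    all models that include it."""
    return sum(w for w, h in zip(weights, has_parameter) if h)

# Three hypothetical models with probabilities 0.5, 0.3, 0.2; the gamma
# shape parameter appears only in the first and third.
print(model_averaged_estimate([0.5, 0.3, 0.2], [0.6, None, 0.8]))
print(parameter_importance([0.5, 0.3, 0.2], [True, False, True]))
```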

Case Study: Comparing Evolutionary Models

Hypotheses Overview

  • Considering two hypotheses regarding sequence evolution:
  • Jukes-Cantor model with uniform substitution rates (one free parameter).
  • Kimura two-parameter model with distinct rates for transitions and transversions (two free parameters).

Model Comparison and AIC Calculation

Log Likelihood Values of Models

  • The Jukes-Cantor model has a log likelihood of -2034.3; log likelihoods are negative because probabilities always lie between 0 and 1, so their logarithms are negative.
  • The Kimura 2-parameter (K2P) model has a larger (less negative) log likelihood of -2026.2, indicating it fits the data better than the Jukes-Cantor model.

AIC Calculation for Model Assessment

  • To assess models, we compute the Akaike Information Criterion (AIC), using the formula: AIC = -2 * log likelihood + 2 * number of parameters.
  • For Jukes-Cantor, AIC is calculated as -2 × (-2034.3) + 2 × 1 = 4070.6; for K2P, it is -2 × (-2026.2) + 2 × 2 = 4056.4.

Delta AIC and Model Probabilities

  • Delta AIC is computed by subtracting the smallest AIC from each model's AIC; K2P has a Delta value of 0 while Jukes-Cantor has a Delta value of 14.2.
  • The exponential function of -0.5 times Delta AIC gives relative likelihoods: for Jukes-Cantor, exp(-0.5 × 14.2) ≈ 0.000825; for K2P, e^0 = 1.

Final Model Probability Calculations

  • The total sums to about 1.000825; dividing each value by this sum yields the model probabilities: Jukes-Cantor at ~0.08% and K2P at ~99.92%.
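The whole case study can be checked numerically. A Python sketch using the log-likelihoods and parameter counts quoted above (Python itself is just an illustration choice):

```python
import math

# (log-likelihood, number of free parameters) from the case study
models = {"JC": (-2034.3, 1), "K2P": (-2026.2, 2)}

aic = {name: -2 * ll + 2 * k for name, (ll, k) in models.items()}
min_aic = min(aic.values())
raw = {name: math.exp(-0.5 * (a - min_aic)) for name, a in aic.items()}
total = sum(raw.values())
probs = {name: r / total for name, r in raw.items()}

for name in models:
    print(name, round(aic[name], 1), round(probs[name], 4))
```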