Experimentation and Causal Inference Debate | Statsig
Introduction of Guests and Topic Overview
Guest Introductions
- Chris is recognized as the godfather of observational causal inference at Amazon, having built a platform for causal inference models.
- John is a prominent data scientist at Meta, known for his work in scaling experimentation analysis and generating significant incremental revenue.
- SWAT has experience working on both Amazon's and Meta's central experimentation teams.
Main Discussion Topic
- The focus of the discussion is on experimentation and observational causal inference, emphasizing that all useful analysis implies something causal.
- The importance of causal analysis is highlighted: it is what allows teams to determine with confidence whether action X is better than action Y.
Misconceptions in Causal Analysis
Common Misconceptions
- Chris discusses misconceptions about experimentation and observational studies that lead to financial losses for companies.
- He emphasizes that model development should be decision-focused; good models are those that drive good decisions, not necessarily the most complex or cutting-edge ones.
Regret in Decision Making
- Chris introduces the concept of "regret" from decision theory, which measures the gap between the value of the decision actually made and the value of the best decision available (see the sketch after this list).
- He argues that improving decision value over time should guide data scientists' activities.
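A minimal sketch of the regret idea, using hypothetical payoff numbers (none of these figures come from the talk):

```python
# Toy regret calculation: hypothetical incremental-profit values per action.
payoffs = {"ship_variant_a": 1.2, "ship_variant_b": 0.8, "hold": 0.0}
decision_made = "ship_variant_b"

# Regret = value of the best available action minus value of the action taken.
regret = max(payoffs.values()) - payoffs[decision_made]
print(f"regret: {regret:.2f}")  # 0.40 in this example
```

Tracking this quantity across decisions, rather than model accuracy alone, is one way to operationalize the decision-focused framing above.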
A/B Testing Methodology
Variability in A/B Testing Approaches
- John points out a common error: applying a single methodology across diverse applications without considering context-specific needs.
- He shares personal experiences from Expedia, where he unsuccessfully attempted to apply a uniform methodology across different teams.
Business Decisions as Driving Force
- Both speakers agree that business decisions should be the primary focus when conducting experiments or observational studies.
Terminology Confusion
Clarifying Causal Inference Terms
- There’s confusion around terminology; many use "causal inference" solely for observational methods when it actually encompasses both experimental and observational approaches.
Matching Methods to Applications
- Emphasis on matching appropriate methods to specific applications; while experiments yield unbiased estimates, observational data has its own advantages, such as being continuously available.
When to Apply Different Approaches
Application Context for Causal Inference vs. Experimentation
- The discussion transitions into determining when to apply observational approaches versus experimental methods.
Experimentation vs. Observational Methods in Research
The Preference for Experimental Approaches
- The speaker emphasizes the importance of conducting experiments over observational methods when ethical considerations allow, suggesting that experiments yield more reliable data.
- However, they acknowledge situations where ethical issues may arise, such as testing sensitive content near brands, which could necessitate reliance on observational methods.
Limitations and Trade-offs of Experiments
- The discussion highlights that while experimental learning is ideal, practical constraints like the number of variables to test can limit experimentation.
- It is noted that experiments have limitations regarding coverage and power but provide unbiased signals about causal impacts.
Hybrid Systems and Reinforcement Learning
- A hybrid approach combining modeling with experimentation is suggested as potentially the most effective method for research.
- In reinforcement learning (RL), model-free approaches are contrasted with model-based ones that rely on parameterized models to navigate complex behaviors; a toy contrast is sketched below.
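A toy contrast under stated assumptions (the environment and reward numbers are invented): model-free learning updates action values directly from observed rewards, with no explicit model of the environment.

```python
import random

# Model-free (epsilon-greedy on a 2-armed bandit): learn action values purely
# from sampled rewards, never modeling the environment itself.
q = {"a": 0.0, "b": 0.0}
alpha, epsilon = 0.1, 0.1

def sample_reward(action):
    return random.gauss(1.0 if action == "a" else 0.5, 0.1)  # unknown to the agent

for _ in range(1000):
    action = random.choice(list(q)) if random.random() < epsilon else max(q, key=q.get)
    q[action] += alpha * (sample_reward(action) - q[action])

# A model-based variant would instead fit a parameterized reward model first and
# plan against it; the "hybrid" framing pairs such a model with live
# experimentation that keeps it calibrated.
```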
Challenges in Observational Methods
- The speaker discusses how observational methods can still yield valuable insights despite challenges like confounding factors that complicate causal inference.
- They stress the need for careful consideration of assignment processes in observational studies to mitigate risks associated with confounding variables.
Practical Applications at Amazon vs. Meta
- Reflecting on their experience at Amazon, the speaker notes a lack of experimental options due to technical limitations or insufficient power for certain questions.
- They advocate for seeking experimental feedback whenever possible rather than defaulting to observational methods out of convenience.
Current Practices in Causal Analysis
- A comparison between Amazon and Meta reveals differing approaches: Amazon often relies on economists and observational analysis due to business constraints, while Meta prioritizes experimentation.
- The speaker mentions that both companies utilize centralized systems for causal analysis but acknowledges varying levels of maturity in technical infrastructure across organizations.
Causal Inference and Decision-Making at Meta
Emphasis on Randomized Experiments
- At Meta, causal information is predominantly derived from randomized experiments, accounting for over 90% of impact claims.
- While there are supported packages for synthetic controls and matching (a toy synthetic-control sketch follows this list), the primary focus remains on conducting experiments.
- Teams strive to adapt non-experimental methods to align with experimental frameworks whenever possible.
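A minimal synthetic-control sketch, assuming toy pre/post metric series (the data here are simulated, not Meta's):

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

# Pre-period metric series: rows = days, columns = untreated control units.
controls_pre = rng.normal(100, 5, size=(30, 8))
treated_pre = controls_pre[:, :3].mean(axis=1) + rng.normal(0, 1, 30)

# Nonnegative weights that best reproduce the treated unit's pre-period path.
weights, _ = nnls(controls_pre, treated_pre)
weights /= weights.sum()

# Post-period counterfactual: the weighted combination of control units,
# to be compared against the treated unit's actual post-period metric.
controls_post = rng.normal(100, 5, size=(10, 8))
counterfactual = controls_post @ weights
```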
Organizational Dynamics in Causal Inference
- A natural experiment example from Instagram highlights a division between user-side and advertiser-side data scientists within the same organizational structure.
- Both sides utilize the same tools but exhibit different interests; advertisers show more interest in observational causal inference due to longer feedback loops.
Value of Surprising Results
- The value of causal studies lies in their ability to reveal surprises; confirming existing decisions does not add significant value.
- Convincing decision-makers often requires presenting evidence that contradicts their initial beliefs or visions.
Bayesian Decision Framework
- The Bayesian decision framework emphasizes estimating the value of information; changing decisions based on new insights leads to better outcomes.
- Organizational dynamics complicate decision-making as executives may have strong prior beliefs that influence their openness to new information.
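A small normal-normal updating sketch (all numbers invented) showing why a tight executive prior barely moves in response to a single contradictory experiment:

```python
prior_mean, prior_sd = 0.05, 0.01   # strongly held belief: +5% lift
est, est_se = -0.02, 0.015          # one experiment: -2% lift

# Conjugate update: posterior mean is a precision-weighted average.
prior_prec, data_prec = 1 / prior_sd**2, 1 / est_se**2
post_mean = (prior_prec * prior_mean + data_prec * est) / (prior_prec + data_prec)
post_sd = (prior_prec + data_prec) ** -0.5
print(round(post_mean, 3), round(post_sd, 3))  # ~0.028: still positive; one
# result rarely flips a tight prior, which is why evidence must accumulate.
```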
Challenges in Influencing Decisions
- Data science should be integrated into decision-making processes, ensuring that data points are considered alongside contextual information for improved outcomes.
- Even trustworthy experimental results can face scrutiny, making it challenging to shift executive opinions without substantial evidence over time.
Measurement in Marketing: Insights and Strategies
Understanding the Role of Measurement in Marketing Decisions
- The speaker emphasizes the importance of recognizing that while stakeholders may have strong convictions about market trends, their understanding may lack precision regarding investment strategies.
- Measurement can validate these convictions but also reveal discrepancies between perceived and actual impacts on metrics like subscriber growth or retention.
- A structured approach to critiquing measurement estimates is essential when results are unfavorable; this includes identifying potential flaws such as noise or short-term focus.
- Systematic evaluation fosters a culture of logical argumentation rather than relying solely on gut feelings, enhancing trust among team members.
- Emphasizing a truth-seeking mindset allows for constructive feedback loops that improve business strategies based on measured outcomes.
Navigating Challenges with Measurement Data
- The speaker discusses how measurement should not be viewed as adversarial; instead, it serves to guide businesses toward better decision-making through data-driven insights.
- A VP requests a summary of impact quantified by budget influence and value unlocked, highlighting the need for clear metrics in evaluating performance.
Real-world Applications and Examples
- One participant reflects on their significant budget involvement but struggles to quantify specific impacts delivered, indicating a common challenge in measuring effectiveness.
- An example is drawn from an early-career experience: preventing an ill-advised expansion of a beloved product line after data analysis indicated diminishing returns.
- The discussion illustrates how proper modeling can prevent costly investments that no longer yield positive ROI, emphasizing the importance of adapting strategies over time.
- Questions arise about the necessity of large marketing expenditures (e.g., Super Bowl ads), reflecting ongoing skepticism within the industry regarding traditional advertising methods.
Experiments and Their Impact on Business Decisions
The Role of Experiments in Business
- The speaker emphasizes the significant impact of running experiments on a large user base, highlighting that the results are directly measurable and can influence business decisions.
- A critical environment for decision-making is discussed, particularly during Meta's "Year of Efficiency," which fostered a culture of critical analysis and decision-making.
Complementary Nature of Experiments and Observational Approaches
- The speaker asserts that experiments and observational approaches are complementary rather than substitutes, enhancing each other's effectiveness.
- A referenced paper analyzes advertising effectiveness by combining experimental data with observational estimates to provide continuous causal insights.
Bridging Experimental Data with Observational Insights
- The discussion includes how hybrid systems can leverage both types of data; observational data provides ongoing insights while experiments help calibrate these observations.
- An example from the predictive incrementality paper illustrates how showing an ad to a customer should ideally lead to increased revenue, emphasizing the importance of measuring true incrementality in advertising.
Attribution Heuristics in Advertising
- Traditional attribution heuristics (e.g., last touch attribution) are critiqued for being biased but still offer valuable predictive signals regarding ad performance.
- The speaker explains that while last touch attribution may overestimate an ad's value, it can still indicate potential incremental effects when analyzing advertising success.
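One way such a hybrid could look in miniature, with invented per-campaign numbers: regress experimentally measured lift on last-touch conversions for the experimented campaigns, then apply the fitted calibration to the rest.

```python
import numpy as np

# Campaigns with both last-touch attributed conversions and experimental lift.
last_touch = np.array([120.0, 80.0, 200.0, 150.0, 60.0])
experimental_lift = np.array([40.0, 30.0, 55.0, 50.0, 25.0])

# Calibrate the biased-but-predictive heuristic: lift ~ a + b * last_touch.
b, a = np.polyfit(last_touch, experimental_lift, 1)

# Score campaigns that were never experimented on.
predicted_lift = a + b * np.array([90.0, 170.0])
```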
Future Directions in Analytical Tools
- There is a call for developing ensemble methods that combine various heuristics to predict experimental outcomes more accurately.
- The speaker anticipates advancements in standard analytical tools over the next five years, aiming for improved accuracy through hybrid models integrating experimentation with observational causal inference.
Analysis of Observational vs. Experimental Studies
The Impact of Causal Inference on Beliefs
- The speaker reflects on a paper that utilized causal inference and observational approaches, concluding that the results were not significantly better than random noise.
- A shift in belief occurs as the speaker transitions from Amazon's observational studies to experimentation at Meta, realizing both methods have their merits.
Understanding Experimental and Observational Models
- Experiments provide ground truth while causal models offer flexibility; combining both can enhance predictive accuracy.
- A specific study compared experimental results with observational data in advertising, revealing significant discrepancies due to missing auction-level data.
Nuanced Interpretations of Causal Models
- The interpretation of findings suggests that not all observational models are flawed; context matters greatly.
- Treatment assignment mechanisms play a crucial role in determining the validity of observational estimates, especially when targeting customer attributes.
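A toy illustration of why the assignment mechanism matters, using simulated data: when assignment depends on an observed confounder, modeling that mechanism (here via inverse propensity weighting) recovers the effect a naive comparison misses.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=n)
treat = rng.binomial(1, 1 / (1 + np.exp(-x)))   # assignment depends on x
y = 1.0 * treat + 2.0 * x + rng.normal(size=n)  # true causal effect = 1.0

# Model the assignment mechanism, then reweight outcomes (inverse propensity).
p = LogisticRegression().fit(x.reshape(-1, 1), treat).predict_proba(x.reshape(-1, 1))[:, 1]
ipw = np.mean(treat * y / p) - np.mean((1 - treat) * y / (1 - p))
naive = y[treat == 1].mean() - y[treat == 0].mean()
print(round(naive, 2), round(ipw, 2))  # naive is confounded upward; IPW is ~1.0
```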
Confounding Factors and Domain Dependency
- Domains vary in confounding levels; upper-funnel activities tend to be more confounded than lower-funnel ones due to selection biases.
- Both observational and experimental methods can yield credible or flawed insights depending on execution quality.
Responsible Use of Observational Methods
- Despite criticisms, observational methods can still provide valuable insights if approached with discipline and risk assessment.
- Sensitivity analyses can demonstrate robustness in findings, allowing for informed decision-making even when ideal experiments aren't feasible.
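One standard sensitivity analysis is the E-value (VanderWeele and Ding); the numbers below are purely illustrative:

```python
import math

# E-value: minimum strength of confounding, on the risk-ratio scale, needed
# to fully explain away an observed association.
rr = 1.8  # observed risk ratio from an observational study (invented)
e_value = rr + math.sqrt(rr * (rr - 1))
print(round(e_value, 2))  # 3.0: a confounder this strong would be required
```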
How General Should Observational Causal Inference Be?
The Applicability of Experimental Methods
- Discussion on the generalizability of observational studies versus experimentation, highlighting that experimental methods generalize across applications more readily than observational ones.
- Emphasis on the complexity of applying basic experimental methods universally across different business contexts; sophisticated teams may require tailored approaches.
Challenges in Observational Causal Inference
- Noted difficulty in applying correct methods for observational causal inference without expert guidance, which can complicate results interpretation.
- Mention of Amazon's DSI model allowing businesses to generate causal impact but recognizing the need for specialists in certain situations.
Risks and Validation in Observational Estimates
- Debate over whether automated systems can effectively warn users about inaccurate observational estimates, stressing the importance of validation.
- Historical context provided where finance teams used rudimentary methods for causal estimates, underscoring the necessity for a robust observational causal model.
Nuances in Interpretation
- Importance of understanding biases in observational studies and how they affect decision-making; practitioners often misinterpret these numbers as equivalent to experimental results.
- Highlighting the challenge analysts face when extracting nuanced interpretations from various observational causal inference (OCI) methods, compared to standardized A/B testing results.
Debating Best Practices: Scale and Application
Experimentation vs. Observational Approaches
- Introduction to a debate focused on identifying bad practices within experimentation and observational methodologies.
- Discussion on starting new businesses with limited historical data; suggests beginning with metrics and logging rather than relying solely on observational methods.
Specific Case Study: Netflix Advertising
- Consideration of both experimental and observational approaches for determining advertising strategies at Netflix, emphasizing careful handling of experiment results.
- Clarification that while experiments can guide ad implementation strategies, they may not definitively answer broader questions like whether ads should be shown at all.
Cold Start Problem and Measurement Strategies
Understanding the Cold Start Problem
- The cold start problem persists, emphasizing the need for disciplined measurement plans to transition out of initial scrappy phases in business.
- Many business challenges are repetitive from a measurement standpoint; influential papers like the "surrogates paper" highlight how customer behaviors can predict long-term outcomes.
Leveraging Surrogate Behaviors
- Driving engagement and purchasing behaviors can serve as proxies for predicting long-term impacts on revenue.
- A combination of observational data (longer horizon) and experimental data (shorter horizon) is beneficial for extrapolating insights about new interventions.
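A stripped-down version of the surrogate-index idea, with simulated data standing in for real logs:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 10_000

# Step 1 (observational, long horizon): learn how short-term surrogates
# (engagement, purchases) map to long-term revenue.
surrogates = rng.normal(size=(n, 2))
long_term = surrogates @ np.array([3.0, 5.0]) + rng.normal(size=n)
index_model = LinearRegression().fit(surrogates, long_term)

# Step 2 (experimental, short horizon): score each arm's surrogates and
# compare predicted long-term outcomes instead of waiting months.
treat_s = rng.normal(0.1, 1, size=(500, 2))
ctrl_s = rng.normal(0.0, 1, size=(500, 2))
lift = index_model.predict(treat_s).mean() - index_model.predict(ctrl_s).mean()
```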
Generative AI and Counterfactual Analysis
Exploring AI Generated Counterfactuals
- Generative AI can create counterfactual scenarios by simulating customer behavior through agentic models, allowing for high-powered A/B testing.
- While these simulations provide semi-mechanistic representations of customer behavior, they may not fully capture real-life complexities due to potential confounding factors.
Analogies with Real-world Applications
- The analogy of self-driving cars illustrates how simulated environments can train algorithms effectively by mapping rewards accurately between simulations and reality.
- Agents navigating e-commerce websites can yield valuable predictions about real-world incrementality based on their interactions with features on the site.
Practical Considerations in Experimentation
Challenges in Mimicking Human Behavior
- Initial thoughts on using model temperature to replay scenarios were deemed impractical; however, leveraging context or system prompts could enhance plausibility in mimicking user behavior.
- By providing behavioral data from various users to an LLM, it may be possible to predict subsequent actions more accurately.
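A minimal sketch of that idea, assuming the OpenAI Python client; the model name, prompt wording, and behavior log are all invented for illustration, not something any panelist described:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical logged behavior for one shopper.
history = "Viewed product page 3x; added to cart; abandoned at shipping cost."
prompt = (
    "Given this shopper's recent behavior, predict their next action if shown "
    f"free shipping over $50:\n{history}\nAnswer with one of: purchase, browse, leave."
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```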
Behavioral Economics and Experiment Replication
Insights from Behavioral Economics Experiments
- A researcher replicated classic behavioral economics experiments, particularly focusing on price gouging as discussed in a Daniel Kahneman paper. The findings indicated that agents' reactions aligned with the original study's results.
- The researcher experimented with agents endowed with different preferences, noting that more libertarian agents were more accepting of price gouging, which was reflected in the data.
- The main inquiry raised was about the utility of these results. The researcher aimed to test various hypotheses to identify viable predictions for future A/B testing.
- Conducting these experiments is cost-effective compared to traditional methods, allowing for a broader range of tests using simulated data rather than relying solely on historical experiments.
Discussion on Reproducibility in Experiments
- An audience member raised concerns regarding reproducibility in academic fields, questioning how often experiments are rerun and what practices ensure consistent results.
- At Meta, it is standard practice to conduct back tests before launching new experiments. This involves comparing pre-experiment conditions with post-experiment outcomes.
- There is variability in rigor among teams when comparing point estimates; some apply statistical methodologies while others may not be as thorough.
- A mental model suggested for understanding replication focuses on whether subsequent runs predict initial outcomes accurately, considering potential structural breaks due to evolving conditions or biases.
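A simple version of that mental model, with made-up estimates: test whether the rerun's point estimate is consistent with the original within their combined noise.

```python
import math

est1, se1 = 0.031, 0.010   # original experiment
est2, se2 = 0.012, 0.011   # rerun / back test

z = (est1 - est2) / math.sqrt(se1**2 + se2**2)
print(round(z, 2))  # |z| well above ~2 suggests a structural break, not noise
```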
Challenges in High-Stakes Decision-Making
- A question arose regarding high-stakes decisions where traditional experimentation or causal inference isn't feasible (e.g., effects of tariffs or immigration policies).
- Suggested approaches include examining existing literature and analyses from similar past scenarios to inform decision-making despite the lack of direct experimental evidence.
How to Convince Executives with Little Data Background?
Bridging the Gap Between Analytics and Business
- The speaker emphasizes the importance of sharing methodologies and experiences to convince business executives who lack a data background. They stress that it's crucial to provide actionable insights rather than debating which analytical model is superior.
- An executive expresses frustration about receiving vague advice, insisting on concrete data-driven recommendations for decision-making, highlighting the need for clarity in communication.
- Building credibility is essential; simply presenting analysis results isn't effective. Engaging with various stakeholders throughout the process helps establish trust and support for conclusions drawn from data.
Establishing Trust and Understanding
- The discussion highlights social proof as a mechanism to build trust within organizations. A network of trusted individuals can help validate scientific findings without needing deep technical understanding from executives.
- It's important to ensure that there are systems in place to audit analyses, preventing misleading or incorrect information from being presented. This provides peace of mind for leaders who may not have time to verify every detail.
Communicating Data Effectively
- Executives should be equipped with basic quantitative literacy, focusing on core concepts rather than overwhelming them with raw data. Presenting easily consumable summaries, like uncertainty intervals, can facilitate better understanding.
- Avoiding overly detailed explanations allows executives to focus on key takeaways instead of getting lost in complex methodologies used by data science teams.
The Importance of Experimentation
- The conversation shifts towards selling experimentation within companies. There’s a significant gap between those who understand its value and those who do not; many people are unaware that their hypotheses may often be incorrect.
- Highlighting research indicating that 80% of hypotheses are wrong underscores the necessity for an experimental culture where mistakes lead to learning and improved decision-making over time.
Addressing Challenges in Experimental Design
- A question arises regarding challenges faced in social network experiments due to inherent assumptions being violated. It’s noted that platforms must control these variables effectively through post-processing techniques.
- Discussion includes how platforms can support various types of experiments by allowing randomization at different levels while adjusting for potential interference among units involved in the study.
Network Experiments and Bayesian Statistics
Understanding Network Experiments
- Network effects can either be ignored or explicitly handled once they are detected; when they matter, targeting at the network level becomes important.
- To run a network experiment, one must first establish a network cluster by analyzing the graph's connectedness and defining clusters based on various parameters.
- A trade-off exists between cluster purity and statistical power: purer clusters reduce interference bias but cut power, while looser clusters regain power at the cost of bias (see the sketch below).
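A sketch of cluster-level randomization under assumptions: a random graph stands in for a real social graph, and networkx's modularity communities serve as one possible clustering.

```python
import random
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Build (or load) the interaction graph; here a random graph is a stand-in.
g = nx.erdos_renyi_graph(n=1000, p=0.01, seed=0)

# Cluster the graph, then randomize treatment at the cluster level so that
# connected users tend to share an assignment, limiting interference.
clusters = list(greedy_modularity_communities(g))
assignment = {}
for cluster in clusters:
    arm = random.choice(["treatment", "control"])
    for node in cluster:
        assignment[node] = arm
```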
The Value of Bayesian Statistics
- Bayesian statistics offer two main functions: regularization through shrinkage for noisy data (sketched after this list) and a clear philosophy for reporting uncertainty.
- Traditional confidence intervals often fail to convey true probabilities; Bayesian methods provide more intuitive insights into uncertainty, aiding communication with executives.
- Bayesian structures allow for expressive modeling of hierarchical effects, enhancing clarity in understanding complex relationships within data.
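A crude empirical-Bayes stand-in for the hierarchical shrinkage described above, using invented lift estimates:

```python
import numpy as np

est = np.array([0.08, -0.03, 0.15, 0.01])  # observed per-experiment lifts
se = np.array([0.05, 0.02, 0.10, 0.01])    # their standard errors

mu = np.average(est, weights=1 / se**2)           # pooled portfolio mean
tau2 = max(np.var(est) - np.mean(se**2), 1e-6)    # crude between-experiment variance

# Precision-weighted shrinkage: noisy estimates move most toward the pool.
shrunk = (est / se**2 + mu / tau2) / (1 / se**2 + 1 / tau2)
```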
Decision Theory and Experimentation
- Integrating uncertainty distributions helps evaluate decision-making processes, such as expected regret and the financial value of additional information from experiments.
- The speaker emphasizes that many critical questions can only be effectively addressed using Bayesian methodologies.
Practical Applications of Bayesian Methods
- The effectiveness of experimentation programs depends on how experimental data is utilized; robust frameworks are essential for accurate impact reporting.
- In some cases, whether to use Bayesian or frequentist approaches may not significantly affect outcomes if strong priors are absent.
Reevaluating Statistical Thresholds
- Recent discussions challenge the arbitrary p-value threshold of 0.05; transitioning to a Bayesian framework allows for better evaluation criteria in shipping decisions.
- By employing shrinkage techniques within a Bayesian decision-making context, organizations can refine their shipping strategies based on experimental outcomes.
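A sketch of such a shipping rule, with an invented posterior and threshold: ship when the expected loss from shipping falls below a risk tolerance the team has calibrated.

```python
import numpy as np

rng = np.random.default_rng(3)
posterior_lift = rng.normal(0.4, 0.5, size=100_000)  # stand-in posterior draws

loss_if_ship = np.mean(np.maximum(-posterior_lift, 0))  # expected downside of shipping
loss_if_hold = np.mean(np.maximum(posterior_lift, 0))   # expected upside forgone

threshold = 0.05  # chosen to match the team's risk profile (an assumption)
decision = "ship" if loss_if_ship < threshold else "hold"
```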
Decision-Making Frameworks and Future Events
Exploring the Bayesian Decision-Making Framework
- The Bayesian decision-making framework allows for experimentation with various thresholds to optimize results, highlighting that traditional fixed values (like the 0.05 threshold) may not suit all scenarios.
- By considering specific risk profiles and costs of missed opportunities, users can calibrate their decisions and policies more effectively within this framework.
- The speaker expresses a fascination with applying a Bayesian perspective at the portfolio level, indicating an interest in evolving decision-making strategies.
Networking and Upcoming Events
- As the session concludes, attendees are encouraged to network while speakers remain available for discussions.
- An upcoming webinar titled "Decision Making in the Age of AI" is announced, featuring Cassie Kozyrkov, Google's first Chief Decision Scientist, as a key speaker alongside Tim, Statsig's head of data.
- Participants will receive a follow-up email containing insights from the event and registration details for the upcoming webinar once approvals are secured.