The ultimate guide to A/B testing | Ronny Kohavi (Airbnb, Microsoft, Amazon)

The Importance of Experimentation

In this section, the speaker emphasizes the importance of experimentation in software development and highlights how even small changes can have unexpected impacts.

Test Everything and Embrace Failure

  • The speaker advocates for testing everything, ensuring that any code change or new feature is introduced through an experiment.
  • Small bug fixes and changes can sometimes have surprising and unexpected impacts.
  • It is crucial to allocate time for high-risk, high-reward ideas, even if they are likely to fail. However, it is important to be prepared for failure as well.

Introduction of Ronny Kohavi

This section introduces Ronny Kohavi, a renowned expert on A/B testing and experimentation. It provides an overview of his background and experience leading experimentation teams at companies like Airbnb, Microsoft, and Amazon.

Ronny Kohavi's Background

  • Ronny Kohavi is considered a world expert on A/B testing and experimentation.
  • He has held positions including Vice President and Technical Fellow at Airbnb, where he led the search relevance team; Technical Fellow and Corporate Vice President at Microsoft, where he led the experimentation platform team; and Director of Data Mining and Personalization at Amazon.
  • He currently works as a full-time advisor and instructor and is the author of the book "Trustworthy Online Controlled Experiments."

A Sneak Peek into the Conversation with Ronny Kohavi

This section previews the topics covered in the conversation with Ronny Kohavi: when companies should start experimenting, how to change company culture to be more experiment-driven, signs of potentially invalid experiments, why trust is central to a successful experimentation program, how to get started running experiments, and understanding p-values and Twyman's law.

Topics Covered in the Conversation

  • When should companies start considering running experiments?
  • How can company culture be changed to embrace experimentation?
  • What are signs that experiments may be invalid?
  • The importance of trust in successful experiment culture and platform.
  • How to get started with running experiments at a company.
  • Understanding p-values and Twyman's Law.
  • Insights and experiences related to Airbnb and experiments.

Introduction to the Podcast Episode

This section introduces the podcast episode featuring Ronny Kohavi. It mentions the focus on creating an experiment-driven culture at companies or fine-tuning an existing one.

Creating an Experiment-Driven Culture

  • The episode aims to provide insights for creating an experiment-driven culture at companies.
  • It is also relevant for those looking to refine their existing experiment-driven cultures.

Sponsorship Message: Mixpanel

This section includes a sponsorship message from Mixpanel, highlighting their ability to provide deep insights into user behavior throughout the funnel.

Mixpanel Sponsorship Message

  • Mixpanel offers deep insights into user behavior at every stage of the funnel.
  • Their pricing scales as your company grows, making it accessible for businesses of all sizes.
  • By capturing website activity data and providing multi-touch attribution, Mixpanel helps improve every aspect of the user funnel.

Sponsorship Message: Round

This section includes a sponsorship message from Round, emphasizing its role in building authentic relationships among tech leaders.

Round Sponsorship Message

  • Round is a private network built by tech leaders for tech leaders.
  • It combines coaching, learning, and authentic relationships to help individuals identify their goals and accelerate their path towards achieving them.
  • Building and managing networks doesn't have to feel like networking when using Round's platform.

Interviewing Ronny Kohavi on A/B Testing and Experimentation

This section marks the beginning of the interview with Ronny Kohavi. The host acknowledges Ronny's expertise in A/B testing and experimentation and expresses excitement about delving deeper into the topic.

Introduction to the Interview

  • Ronny Kohavi is recognized as a leading expert on A/B testing and experimentation.
  • The interview aims to explore the world of experimentation and provide insights for running better experiments.

Most Surprising A/B Test Result

In this section, the host asks Ronny about the most unexpected or surprising result he has encountered from an A/B test.

Unexpected A/B Test Result

  • Ronny shares an example from his book and class involving how ads were displayed on Bing.
  • Moving text from the ad's second line up into its title line had an enormous impact, making it one of the most surprising public examples of an A/B test.

The Impact of a Trivial Change on Revenue

This section discusses the implementation of a trivial change that had a significant impact on revenue at Bing. It highlights the importance of considering return on investment and how small changes can lead to substantial results.

Implementing a Trivial Change

  • A backlog item at Bing, which had been there for months, involved implementing a simple idea.
  • Despite other items being rated higher, this particular idea was considered trivial to implement.
  • Engineers spent a couple of hours implementing the change, resulting in an unexpected outcome.

Surprising Revenue Increase

  • After implementing the trivial change, an alarm indicating an issue with the revenue metric was triggered multiple times.
  • The team was initially skeptical, since previous alarms had been caused by bugs or data problems, but investigation showed nothing was wrong with the implemented change.
  • The simple idea increased revenue by about 12%, amounting to approximately $100 million for Bing at that time.
  • Importantly, this change did not negatively impact user metrics.

Trivial Change with Significant Impact

  • Increasing revenue by displaying more ads is a common approach but often compromises user experience.
  • In this case, shifting two lines in search results led to the biggest revenue impact in Bing's history.
  • Further experiments were conducted to understand the factors contributing to this success.

Unpredictable Experiment Outcomes

This section explores examples where experiment outcomes were unpredictable and yielded surprising results. It emphasizes the need for humility when predicting experiment outcomes and highlights the importance of institutional memory.

Opening Listings in New Tabs

  • At Airbnb, an experiment that opened listings in new tabs instead of navigating away from the search page produced significant gains.
  • The idea was initially debated and faced pushback from designers who questioned why it should be done, but it proved highly beneficial.
  • A similar experiment had been run years earlier, before Airbnb existed, and that learning was applied later at Airbnb.

Institutional Memory and Learning

  • The importance of institutional memory is highlighted, as beneficial experiments can be forgotten over time.
  • Reintroducing successful ideas to teams can lead to further improvements.
  • Acknowledgment is given to a mutual friend who facilitated the conversation about this experiment.

Rarity of Massive Results from Minimal Effort

This section addresses the rarity of achieving massive results with minimal effort in experiments. It emphasizes that such outcomes are infrequent and discusses the typical incremental nature of progress.

Seeking Extraordinary Results

  • Many people hope for extraordinary results from minimal effort in experiments.
  • While there are examples of small efforts leading to significant gains, they are rare occurrences.
  • Most progress is made through gradual improvements rather than one-time breakthroughs.

Inch by Inch Progress

  • Bing ads' revenue per thousand searches graph demonstrates how small improvements contribute to overall progress.
  • Monthly improvements are observed, although occasional setbacks may occur due to legal reasons or other factors.
  • The journey towards success often involves consistent incremental advancements.


Compounding Small Improvements

In this section, the speaker discusses the importance of continuous improvement and shares examples from their experience at Airbnb. They highlight the significance of small improvements adding up over time and mention a metric they aim to improve by two percent each year.

Focusing on Continuous Improvement

  • The speaker mentions that at Airbnb, they have a metric that they aim to improve by two percent every year.
  • Small improvements, even as low as 0.1 or 0.15 percent, can add up over time.
  • The speaker emphasizes that these improvements are not achieved through one big idea but rather through many smaller ideas.
  • They share that out of the experiments conducted at Airbnb, only eight percent were successful in improving the key metrics they were targeting.
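As a rough illustration of how small wins compound toward a yearly goal, the arithmetic above can be sketched in a few lines (the +0.15% per-win lift and +2% target are illustrative figures from the discussion, not exact Airbnb numbers):

```python
# Illustrative arithmetic: how many small winning experiments does it take
# to compound to a +2% yearly goal? (+0.15% per win is an example figure.)
yearly_target = 1.02
per_win_lift = 1.0015

metric = 1.0
wins = 0
while metric < yearly_target:
    metric *= per_win_lift
    wins += 1

print(f"{wins} wins of +0.15% compound to +{(metric - 1) * 100:.2f}%")
```

At an 8% success rate, those winning experiments imply running many times more experiments overall, which is why volume matters.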

What Failure Rates to Expect

In this section, the speaker discusses the failure rate of experiments and provides insights into what companies should expect when running experiments. They share their own experience with failure rates at Microsoft and Bing.

Failure Rates in Experimentation

  • The speaker mentions that overall, about two-thirds (66%) of ideas fail at Microsoft.
  • In Bing, which is a more optimized domain, the failure rate was around 85%.
  • At Airbnb, the speaker observed a high failure rate of 92% for experiments.
  • They note that other companies like Booking.com and Google Ads have also published numbers indicating an 80 to 90% failure rate for experiments.

Why Most Experiments Fail

In this section, the speaker explains why a high failure rate in experiments is common and highlights the importance of learning from failures. They also discuss how implementation issues or lack of consideration can lead to experiment failures.

Understanding Experiment Failure

  • The speaker explains that when running experiments, it's important to consider that not every experiment maps to a new idea.
  • They mention that around 10% of experiments tend to be aborted on the first day due to implementation issues or unforeseen factors.
  • The speaker emphasizes that failure rates of 80 to 92% are humbling but common in experimentation.
  • They stress the importance of documenting and summarizing learnings from experiments for institutional memory and organizational learning.

Patterns and Resources for Improving Success Rates

In this section, the speaker discusses patterns and resources for improving experimentation success rates. They mention a paper called "Rules of Thumb" that analyzes thousands of experiments and extracts patterns. They also recommend a website called goodui.org, which provides insights based on experiment results shared by users.

Resources for Experimentation Success

  • The speaker mentions a paper called "Rules of Thumb" that analyzes thousands of experiments and extracts patterns for improving success rates.
  • They recommend a website called goodui.org, where experiment results are shared, patterns are derived, and insights about their effectiveness are provided.
  • The speaker highlights the value of these resources in helping product managers create roadmaps based on successful experiment patterns.

The Value of Institutional Memory

In this section, the speaker emphasizes the importance of institutional memory in organizations. They discuss the need for documenting successes and failures from experiments and share their efforts in promoting institutional learning through quarterly meetings.

Institutional Learning

  • The speaker stresses the significance of documenting successes and failures from experiments to facilitate organizational learning.
  • They mention conducting quarterly meetings to discuss the most surprising experiments as a way to promote institutional memory.
  • The speaker defines surprising experiments as those with significant differences between expected outcomes and actual results.
  • Learning from unexpected outcomes, whether positive or negative, is considered valuable for future decision-making.


A Surprising Result: The Windows Indexer

In this section, the speaker discusses an example of a surprising result from an experiment conducted at Microsoft to improve the Windows indexer. The experiment showed better indexing and relevance, but it negatively impacted battery life due to increased CPU consumption on laptops.

Example of Surprising Result in Experiment

  • The speaker shares an example of a surprising result from an experiment conducted at Microsoft to improve the Windows indexer.
  • The team was able to show offline that the experiment improved indexing and relevance.
  • However, when the experiment was run in real-time, it was found that it significantly reduced battery life on laptops due to increased CPU consumption.
  • This unexpected outcome highlights the importance of considering all factors and documenting lessons learned for future iterations.

Documenting and Searching Experiment History

In this section, the speaker discusses how to remember surprises and learnings from experiments over time. They emphasize the importance of documentation and having a searchable history of experiments.

Remembering Surprises and Learnings

  • When people leave a project or organization, it is crucial to document successes and failures so that others can learn from them even years later.
  • Maintaining a comprehensive deck internally with records of these surprises can help teams remember important insights.
  • Having a searchable history of experiments allows for easy retrieval of information by using keywords.
  • At Microsoft, tens of thousands of experiments were conducted annually (around 20,000 to 25,000), making searchability essential for efficient knowledge sharing.
  • Regularly conducting meetings to discuss the most successful and interesting experiments helps foster a culture of experimentation and innovation.

Balancing Incremental Changes and Big Bets

In this section, the speaker addresses concerns about running too many experiments and being overly data-driven. They share their perspective on the balance between incremental changes and high-risk, high-reward ideas.

Balancing Experimentation and Innovation

  • The speaker advocates for testing everything, including small bug fixes and feature changes, through experiments to uncover unexpected impacts.
  • Running too many experiments is not the concern; rather, focusing solely on incremental changes can limit innovation.
  • A portfolio approach is recommended, where some experiments are aimed at incremental improvements while others explore high-risk, high-reward ideas.
  • It is important to allocate efforts towards breakthrough possibilities while being prepared for a higher failure rate.
  • Examples of failed experiments integrating social components in search engines like Bing and Netflix are mentioned as learning experiences.

Experiments as an Oracle

In this section, the speaker emphasizes the value of experiments as a source of data-driven insights. They discuss an example of an experiment that failed after significant investment but provided valuable learnings.

Value of Experiments and Learnings

  • Experiments serve as an oracle that provides valuable data-driven insights.
  • Even if an experiment fails, it can still provide valuable learnings and prevent further investment in unsuccessful ideas.
  • The speaker shares an example from Bing where integrating with social media platforms was attempted but ultimately failed after extensive experimentation.
  • Despite the failure, the data collected from hundreds of experiments helped make an informed decision to abort the idea.
  • Similar failures were observed in other companies like Netflix and Airbnb when attempting to incorporate social components into their platforms.


Is there anything that is not worth A/B testing?

In this section, the speaker discusses the necessary ingredients for A/B testing and when it may not be suitable to conduct such tests.

Worthwhile Ingredients for A/B Testing

  • Not every domain is suitable for A/B testing. For example, mergers and acquisitions cannot be tested as they are one-time events.
  • Sufficient units, mostly users, are required for statistical significance in A/B testing. If a company is too small and lacks enough users, it may be too early to conduct A/B tests.
  • Software companies have an advantage in running A/B tests as it is easy to build a platform and the incremental cost of running experiments becomes low over time.

When to Start A/B Testing

  • Startups often question when they should begin conducting A/B tests. The general rule of thumb is to start experimenting when there are tens of thousands of users.
  • To detect significant changes (e.g., 5% improvement) in metrics, such as conversion rates on a retail site, at least 200,000 users are needed.
  • Building a culture of experimentation and integrating it into the platform should start even before reaching 200,000 users.
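The 200,000-user figure can be sanity-checked with a standard two-proportion power calculation. This is a sketch using the usual normal-approximation formula; the 5% baseline conversion rate is an assumed figure for a retail site, not one given in the conversation:

```python
from math import sqrt
from statistics import NormalDist

def users_per_arm(p_base, rel_lift, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-proportion z-test."""
    p_new = p_base * (1 + rel_lift)
    p_bar = (p_base + p_new) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided test
    z_power = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p_base * (1 - p_base) + p_new * (1 - p_new))) ** 2
    return numerator / (p_new - p_base) ** 2

# Detecting a 5% relative lift on an assumed 5% baseline conversion rate:
n = users_per_arm(0.05, 0.05)
print(f"~{n:,.0f} users per arm, ~{2 * n:,.0f} total")
```

With these assumptions the total lands in the low hundreds of thousands, consistent with the "at least 200,000 users" rule of thumb above.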

The trade-off between investing in the platform and running A/B tests

This section explores the trade-off between investing in the platform and running A/B tests. It also highlights how the maturity of a platform affects the ease and cost of conducting experiments.

Reducing Marginal Cost through Platform Investment

  • Once a platform is built, the incremental cost of running experiments should approach zero.
  • At Microsoft, where experimentation was well-established, there was no hesitation about testing everything due to low experiment costs.
  • However, at Airbnb, with a less mature platform requiring more analysts to interpret results, the cost of running experiments was higher.

Reasons to Not Run A/B Tests

  • Apart from being too small to conduct A/B tests, there may be other reasons not to run them.
  • The maturity and complexity of the platform can impact the ease and cost of interpreting results.
  • Startups should consider these factors before deciding whether or not to run A/B tests.

The importance of an overall evaluation criterion in A/B testing

This section emphasizes the significance of having an overall evaluation criterion (OEC) in A/B testing. It explains why optimizing for revenue alone is not sufficient and how trade-offs between revenue and user experience need to be considered.

Optimizing Beyond Revenue

  • Optimizing for revenue alone can lead to actions that harm the user experience.
  • For example, increasing ad placements on a search page may generate more revenue initially but negatively impact long-term user experience.
  • An OEC considers additional metrics like churn rate and time taken by users to find successful results.

Constraint Optimization Problem

  • The OEC can be framed as a constraint optimization problem where increasing revenue is desired within a fixed amount of available space on a page.
  • By setting constraints on ad placements based on pixel count, it becomes possible to balance revenue growth with maintaining a positive user experience.
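The pixel-count constraint described above can be framed as a small optimization problem: choose which ads to show so as to maximize expected revenue without exceeding a fixed vertical pixel budget. A toy 0/1 knapsack sketch, with entirely made-up numbers:

```python
# Toy 0/1 knapsack: choose ads to maximize expected revenue within a fixed
# vertical pixel budget. All numbers are made up for illustration.
ads = [
    # (expected revenue per impression, height in pixels)
    (0.30, 90),
    (0.22, 60),
    (0.15, 50),
    (0.10, 40),
]
PIXEL_BUDGET = 150

def best_revenue(ads, budget):
    """Classic 0/1 knapsack dynamic program over the pixel budget."""
    best = [0.0] * (budget + 1)
    for revenue, pixels in ads:
        for b in range(budget, pixels - 1, -1):
            best[b] = max(best[b], best[b - pixels] + revenue)
    return best[budget]

print(f"max expected revenue within {PIXEL_BUDGET}px: "
      f"{best_revenue(ads, PIXEL_BUDGET):.2f}")
```

The point of the constraint is visible here: the optimizer can only grow revenue by packing the fixed space better, not by taking more space from the user.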

Improving Conversion Rate and User Satisfaction

The speaker discusses the importance of not only converting users to make a purchase but also ensuring their satisfaction with the listing in the long term.

Focusing on Long-Term User Satisfaction

  • It is not enough to convert a user to buy or pay for a listing; it is important to ensure their happiness with the listing when they actually stay there.
  • The challenge lies in predicting the rating that users will give to a listing after their stay, as this data is not available at present.
  • Building a training set that allows for predicting user satisfaction becomes crucial in making decisions such as offering more expensive options for users who prefer nicer places.

Key Metric: Lifetime Value

The speaker emphasizes the importance of defining an OEC (overall evaluation criterion) that is causally predictive of the lifetime value of users. This helps prioritize long-term success over short-term gains.

Defining OEC Based on Lifetime Value

  • The key metric to consider is lifetime value, which requires thinking about actions and strategies that will benefit the business in the long term.
  • Short-term metrics like retention rates and time to achieve tasks can be measured but should be balanced against long-term goals.
  • By incorporating models and forecasts, it becomes possible to understand and optimize for longer-term metrics.

Learning from Long-Term Experiments and Historical Data

The speaker suggests two approaches for gaining insights into long-term metrics: running long-term experiments and utilizing historical data through modeling and data science techniques.

Learning from Long-Term Experiments

  • Running experiments over an extended period can help understand how changes impact key metrics.
  • Increasing or decreasing odds in certain scenarios can provide valuable insights into the effects on metrics.

Utilizing Historical Data and Modeling

  • Building models that leverage background knowledge and historical data can help make predictions and inform decision-making.
  • An example is the email team at Amazon, where understanding the impact of sending recommendations led to modeling the cost of spamming users and optimizing campaigns based on unsubscribe rates.

Uncovering Insights and Making Improvements

The speaker highlights the value of small insights in uncovering fundamental improvements. They discuss how integrating unsubscribe options in emails led to better campaign targeting and increased user satisfaction.

Integrating Unsubscribe Options for Better Campaign Targeting

  • Initially, giving credit to an email team at Amazon based on purchases from email referrals led to spamming users with more emails.
  • By modeling the cost of spamming, considering unsubscribes as a loss in long-term value, campaigns were optimized.
  • Offering default unsubscribe options for specific types of emails reduced negative impacts and improved targeting.

Small Insights Leading to Fundamental Improvements

The speaker emphasizes that small insights can lead to significant breakthroughs in strategies and product development. They discuss how A/B testing may result in incremental changes, but small discoveries can have a profound impact.

Small Insights Driving Fundamental Improvements

  • Redesigning a product entirely rarely yields positive results; teams often need to backtrack and address unintended consequences.
  • Small insights gained through experimentation or data analysis can guide strategic decisions and help identify areas for improvement.
  • These discoveries often lead to new features or approaches that enhance user experience and drive better outcomes.

The Importance of Incremental Changes

In this section, the speaker emphasizes the importance of making incremental changes instead of taking on multiple big changes at once. They discuss how implementing too many changes simultaneously can lead to failure and a lack of flexibility.

Incremental Changes vs. Big Redesigns

  • Making small, incremental changes is more beneficial than attempting multiple big changes at once.
  • Implementing too many changes together increases the likelihood of negative outcomes.
  • It is better to learn from smaller changes and adjust accordingly rather than launching a complete redesign.
  • While big redesigns may be necessary in some cases, they should be approached with caution and readiness for potential failure.

Embracing Failure through Experiments

The speaker discusses the importance of running experiments and embracing failure early on. They share their experience in introducing experimentation at Microsoft and how it helped identify failed ideas.

Running Experiments to Learn from Failure

  • Organizations that run experiments are humbled by the realization that most ideas fail.
  • Introducing an experimentation platform can help identify failed ideas early on.
  • Bing was one of the first teams at Microsoft to implement experimentation at scale, leading to surprising results.
  • Office also started running experiments after realizing that many of their ideas were failing.

Using Data to Convince Teams

The speaker talks about using data to convince teams against large-scale redesigns or rethinking entire processes. They share examples and resources that can help build a case for incremental improvements.

Providing Data as Evidence

  • Sharing data on failed redesigns can help convince teams against starting from scratch or rethinking entire processes.
  • The speaker teaches these concepts in their class and has shared related information on LinkedIn.
  • Data-driven evidence can be instrumental in convincing teams to iterate and learn as they go instead of pursuing drastic changes.

Introducing Eppo for Experimentation

The speaker introduces Eppo, a next-generation A/B testing platform built by Airbnb alums. They highlight the importance of experimentation and how Eppo can help modern growth teams streamline their experiments.

Introducing Eppo for Modern Growth Teams

  • Eppo is a next-generation A/B testing platform used by companies like DraftKings, Zapier, ClickUp, Twitch, and Cameo.
  • Traditional commercial tools often do not integrate well with modern growth teams' stacks.
  • Eppo delivers results quickly, avoids prolonged analytics cycles, and helps identify root causes efficiently.
  • It provides advanced metrics beyond basic click-through rates.


Rethinking and Taking Big Bets

The speaker suggests rethinking strategies and taking big bets to break out of local minima or maxima. They acknowledge that while most big bets may fail, the breakthroughs achieved can be significant.

Considering Big Bets

  • It is important to allocate a percentage of resources to big bets in order to potentially break out of local minima or maxima.
  • While there is a high likelihood of failure (around 80% according to the speaker), the potential breakthroughs make it worthwhile.
  • A rough 80/20 split is a common rule of thumb when deciding how much to invest in known, incremental work versus high-risk, high-reward projects.

Allocation Strategies

  • Organizations typically allocate resources based on percentages, with different areas receiving different proportions.
  • For example, Google famously used a 70/20/10 split: about 70% to the core business of search and ads, 20% to adjacent initiatives, and 10% to entirely new bets.
  • The key consideration should be whether the projects being shipped are positively impacting the business. Flat or negative results should be avoided unless legally required.

Importance of Experimentation

  • Experimentation plays a crucial role in decision-making and product development.
  • Shipping projects that do not add value or complicate code maintenance should be avoided.
  • Legal requirements may necessitate shipping even if the impact is flat or negative, but this should be an exception rather than the norm.

Airbnb's Approach and Lessons Learned

The discussion touches upon Airbnb's approach towards experimentation and its potential impact on success. The speaker shares insights from their experience at Airbnb but notes limitations due to confidentiality restrictions.

Emphasizing Experiments at Airbnb

  • In the search relevance team at Airbnb, everything was A/B tested before launching.
  • While Brian Chesky focused on design aspects, the team ensured that everything related to neural networks and search ranking was thoroughly tested.
  • The speaker believes that if Airbnb had run more controlled experiments, it could have been even more successful.

Airbnb's Direction

  • The speaker is restricted from discussing current details about Airbnb.
  • It is mentioned that there may be an interesting natural experiment happening at Airbnb with a shift away from emphasizing experiments and potentially reducing the role of paid ads in growth strategies.
  • The long-term impact and success of this approach remain uncertain.

Reflections on Airbnb during COVID-19

The conversation briefly touches upon the challenges faced by Airbnb during the COVID-19 pandemic. However, due to confidentiality restrictions, specific details cannot be shared.

Impact of COVID-19 on Airbnb

  • The speaker acknowledges that they are unable to provide detailed insights into their experience at Airbnb during the pandemic.
  • They mention that it was a challenging time for the company, but no specific information can be disclosed.


Staying the Course During Downturns

In this section, the speaker discusses the importance of running A/B tests in uncertain times and emphasizes the need for external generalizability. They also mention how Airbnb's revenue would have increased if they had stayed the course during a downturn.

Importance of A/B Testing in Uncertain Times

  • Running A/B tests becomes even more important during uncertain times to determine if changes are actually beneficial.
  • External generalizability is crucial to understand if changes made during a specific period will work in different circumstances.
  • Replicating experiments after some time may be necessary to assess their effectiveness in changing conditions.

Airbnb's Revenue and Data-driven Decision Making

  • The speaker disagrees with the notion that a company should abandon data-driven decision making when bookings decrease significantly.
  • They believe that if Airbnb had stayed on course and continued with data-driven approaches, their revenue would have still increased.
  • Online Experiences was one investment Airbnb made during a challenging period; despite initially unpromising data, it is considered successful today.

The Book: Trustworthy Online Controlled Experiments

In this section, the speaker talks about their book titled "Trustworthy Online Controlled Experiments" and shares their surprise at its success. They also mention that all proceeds from the book are donated to charities.

Surprising Success of the Book

  • The speaker was pleasantly surprised by the book's sales, which exceeded both his expectations and Cambridge University Press's projections.
  • The book focuses on practical aspects rather than being statistically oriented, making it accessible to a wider audience.
  • It has sold over 20,000 copies in English and has been translated into multiple languages.
  • All proceeds from the book are donated to charities.

Trust as the Foundation of Experimentation

In this section, the speaker explains why trust is crucial in experimentation and highlights the importance of an experimentation platform as a safety net and oracle.

Importance of Trust in Experimentation

  • The experimentation platform serves as a safety net, allowing quick aborting of bad launches and ensuring safe deployments.
  • At the end of an experiment, the platform provides insights into key metrics and builds trust through surrogates, debugging, and guardrail metrics.
  • Trust is essential when presenting experimental results as science to gain organizational trust.
  • Building checks into the experiment process helps maintain trust, unlike some early implementations that lacked such safeguards.

The Peeking Problem in Early Tools

In this section, the speaker discusses the statistical naivety of early implementations like Optimizely and how it led to inflated false positive rates. This affected users' trust in the platform's success claims.

Statistical Naivety of Early Implementations

  • Optimizely's early implementation allowed real-time computation of p-values for stopping experiments based on statistical significance.
  • However, relying solely on real-time p-value monitoring (continuously "peeking" and stopping as soon as significance appears) inflates the false positive rate (Type I error) well above the nominal level.
  • Users who believed they were successful based on initial positive revenue results started questioning the platform when long-term outcomes did not align with expectations.
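The inflation from continuous monitoring can be seen in a small simulation. This is an illustrative sketch, not any platform's actual logic: it runs A/A tests (no true effect) and peeks at the z-statistic after every batch, stopping at the first nominally significant result:

```python
import math
import random

def aa_test_with_peeking(n_peeks, batch, rng):
    """Simulate an A/A test (no real effect), peeking at the z-statistic
    after every batch and stopping at the first |z| > 1.96."""
    diff_sum, n = 0.0, 0
    for _ in range(n_peeks):
        for _ in range(batch):
            diff_sum += rng.choice((-1.0, 1.0))  # zero-mean, unit-variance noise
            n += 1
        z = diff_sum / math.sqrt(n)
        if abs(z) > 1.96:  # "significant" at the nominal 5% level
            return True    # falsely declared a winner
    return False

rng = random.Random(0)
runs = 2000
fp_rate = sum(aa_test_with_peeking(20, 50, rng) for _ in range(runs)) / runs
print(f"false positive rate with 20 peeks: {fp_rate:.1%}")  # well above 5%
```

With 20 peeks, the realized false positive rate is several times the nominal 5%, which is exactly the failure mode described above.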

Sample Ratio Mismatch

In this section, the speaker discusses the importance of statistical significance in experiments and how a sample ratio mismatch can lead to invalid results.

Signs of Invalid Experiments

  • A common issue is a sample ratio mismatch, where the distribution of users between control and treatment groups deviates from the intended ratio.
  • This can be identified by comparing the actual user distribution with the expected ratio using statistical tests.
  • Even a slight deviation from the intended ratio, such as 50.2% vs. 49.8%, can indicate a problem: at large sample sizes, such a split is extremely unlikely to occur by chance.
  • Other factors that may invalidate experiments include bots hitting different parts of the website or issues with data pipelines.
  • It is important to diagnose and understand why a sample ratio mismatch occurs to ensure reliable experiment results.
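A minimal sketch of such a check (the user counts below are illustrative, not from the transcript): a two-sided z-test against the designed 50/50 split shows that even a 50.2% vs. 49.8% split is a clear mismatch at scale:

```python
import math

def srm_p_value(control_users, treatment_users, expected_control_share=0.5):
    """Two-sided z-test: is the observed split consistent with the design ratio?"""
    n = control_users + treatment_users
    p = expected_control_share
    z = (control_users - n * p) / math.sqrt(n * p * (1 - p))
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value

# 50.2% vs. 49.8% on one million users: a tiny deviation, but a clear SRM
p_val = srm_p_value(502_000, 498_000)
print(f"p = {p_val:.2e}")  # far below typical SRM alarm thresholds such as 1e-3
```

Because an SRM indicates a broken assignment or logging pipeline rather than a real treatment effect, a p-value this small means the scorecard should not be trusted until the cause is found.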

Diagnosing and Addressing Sample Ratio Mismatches

The speaker shares insights on diagnosing and addressing sample ratio mismatches in experiments.

Diagnosing Sample Ratio Mismatches

  • Microsoft's research found that around 8% of experiments suffered from sample ratio mismatches.
  • Bots are a common factor causing deviations in user distribution between control and treatment groups.
  • Bots may fail to parse pages correctly, leading to uneven hits on different parts of the website.
  • Data pipeline issues can also skew experiment results, especially when trying to remove bad traffic under certain conditions.
  • Campaigns pushing users from external sources can introduce biases if not accounted for in experiment design.

Addressing Sample Ratio Mismatches

  • Initially, simply displaying a banner warning about sample ratio mismatches was ineffective, as people ignored it.
  • As a compromise, every number in the scorecard was crossed out with a red line, and the results were revealed only after the user clicked an "OK" button acknowledging the mismatch, which encouraged further investigation and debugging.
  • Product managers often have biases towards presenting successful results, so it is crucial to address sample ratio mismatches and ensure the validity of experiment outcomes.

Bias Toward Positive Results

The speaker expands on the natural bias toward presenting successful results and the safeguards needed to counter it.

Addressing Bias

  • Product managers often have a bias toward presenting successful results, even when a sample ratio mismatch is present.
  • Requiring an explicit acknowledgment of the mismatch before results are shown prompts investigation and debugging while keeping potential issues visible.
  • Recognizing these biases and building safeguards into the experimentation platform helps ensure the accuracy and reliability of experiment outcomes.

The Importance of Being Skeptical

In this section, the speaker emphasizes the importance of being skeptical when interpreting experimental results and highlights Twyman's law, which suggests that if a result looks too good to be true, there is likely a flaw in the experiment.

Twyman's Law and Flaws in Experiments

  • Twyman's law, named after Tony Twyman, a UK researcher in radio and media audience measurement, states that if a result looks too good to be true, there is likely a flaw in the experiment.
  • When encountering a seemingly significant improvement in results, it is important to investigate further as there is often a high probability of something being wrong with the outcome.
  • While there may be outliers where significant results are valid, it is crucial to replicate and verify experiments multiple times to ensure accuracy.

Understanding P-values

This section focuses on explaining p-values and clarifying common misconceptions associated with their interpretation.

Misconceptions about P-values

  • Many people mistakenly interpret one minus the p-value as the probability that the treatment is better than the control. This interpretation is incorrect.
  • A p-value assumes the null hypothesis is true and measures the probability of observing data at least as extreme as what was seen. To obtain the probability that the treatment actually beats the control, Bayes' rule must be applied, which requires a prior probability of success.
  • The false positive risk associated with p-values tends to be much higher than the commonly assumed 5%. For example, in a case where the success rate is only 8%, a statistically significant result with a p-value less than 0.05 has a 26% chance of being a false positive.
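The arithmetic behind that kind of false positive risk can be sketched with Bayes' rule. The 80% power and one-sided alpha of 0.025 below are assumptions chosen for illustration, not figures stated in the transcript:

```python
def false_positive_risk(prior_success, alpha, power):
    """P(no real effect | statistically significant win), via Bayes' rule."""
    sig_given_null = alpha * (1 - prior_success)  # false alarms among duds
    sig_given_effect = power * prior_success      # true detections among winners
    return sig_given_null / (sig_given_null + sig_given_effect)

# Assumed: 8% of ideas truly work, one-sided alpha of 0.025, 80% power
fpr = false_positive_risk(0.08, 0.025, 0.8)
print(f"false positive risk: {fpr:.0%}")  # roughly a quarter of "wins" are false
```

Under these assumed inputs the risk comes out near the 26% figure quoted above; with a lower prior success rate or lower power, it climbs even higher.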

Replication and False Positive Rate

This section discusses the importance of replication in reducing false positive rates and highlights how combining experiments can provide more accurate results.

Replication and Combining Experiments

  • When conducting experiments, it is crucial to replicate them to increase confidence in the results.
  • By combining replicated experiments using methods such as Fisher's method or Stouffer's method, a joint p-value can be calculated that is lower than the individual ones, helping reduce the false positive rate.
  • Lowering the p-value threshold and requiring replication at higher levels (e.g., below 0.01) can lead to more successful outcomes and significantly lower false positive rates.
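As a sketch of how Fisher's method combines replicated results (the p-values below are made up for illustration): the statistic -2·Σ ln(p) follows a chi-squared distribution with 2k degrees of freedom under the null, and for even degrees of freedom its survival function has a closed form, so no external libraries are needed:

```python
import math

def fisher_combine(p_values):
    """Fisher's method: X = -2 * sum(ln p_i) ~ chi-squared with 2k degrees
    of freedom under the null; return the combined p-value."""
    k = len(p_values)
    x = -2.0 * sum(math.log(p) for p in p_values)
    half = x / 2.0
    # chi-squared survival function, closed form for even df = 2k
    return math.exp(-half) * sum(half**i / math.factorial(i) for i in range(k))

# Two replications, each only marginally significant on its own
combined = fisher_combine([0.04, 0.04])
print(f"combined p = {combined:.4f}")  # stronger evidence than either alone
```

Two independent replications at p = 0.04 combine to roughly p ≈ 0.012, illustrating why requiring replication drives the false positive rate down.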

Tracking Experiment Failures

This section emphasizes the value of keeping track of experiment failures historically within a company or specific product.

Importance of Tracking Failures

  • Keeping track of experiment failures provides valuable insights into historical performance within a company or specific product.
  • Understanding failure rates helps set realistic expectations for success and informs decision-making processes.
  • Monitoring failure rates allows teams to assess their overall performance and make informed adjustments to improve experimentation outcomes.

The Build vs Buy Question

In this section, the speaker discusses the build versus buy question when it comes to experimentation platforms. They mention that it is usually not a binary choice and that companies often do both - building some components and buying others.

Considerations for Build vs Buy

  • When considering whether to build or buy an experimentation platform, it is important to look at what vendors offer and what agencies say about the topic.
  • It is common for companies to both build and buy components of an experimentation platform.
  • The decision of how much to build versus buy depends on factors such as company size and resources.

Availability of Third-Party Products

The speaker talks about the availability of third-party products for experimentation platforms. They mention that today there are many reliable vendors offering good experimentation platforms, which was not the case in the past.

Third-Party Experimentation Platforms

  • When the speaker started working on building experimentation platforms at Amazon and Microsoft, they had to create their own because there were no existing options available.
  • However, nowadays there are many vendors providing trustworthy experimentation platforms.
  • It is recommended to consider using one of these third-party products if they meet your company's needs.

Shifting Culture towards Experimentation

The speaker shares insights on how to shift a company's culture towards embracing experimentation and A/B testing. They draw from their experience at Microsoft where they successfully implemented a culture of experimentation within Bing.

Shifting Culture Tips

  • Start with a team or department where running experiments is already easy and frequent.
  • Choose a team that launches frequently (e.g., weekly or bi-weekly) rather than one that launches infrequently (e.g., every six months).
  • Ensure that the team's objectives and overall evaluation criterion (OEC) are clear and agreed upon.
  • Cross-pollination between teams can help spread the culture of experimentation within a company.

Benefits of Successful Experimentation

The speaker discusses the positive impact of successful experimentation on shifting company culture. They share their experience at Microsoft, where the success of Bing's experimentation platform led to increased acceptance and adoption by other teams.

Benefits of Successful Experimentation

  • Once Bing's experimentation platform proved successful, more people within Microsoft became open to the idea of running experiments.
  • Cross-pollination between teams played a significant role in spreading awareness about the benefits of experimentation.
  • Successful results from one team can inspire others to consider adopting similar practices.

Choosing the Right Team for Experimentation

The speaker provides advice on selecting the right team or department for starting an experimentation initiative. They emphasize finding a team that launches frequently and has clear optimization goals.

Selecting an Experimental Team

  • Choose a team that launches frequently, such as weekly or bi-weekly.
  • Look for teams that have a clear understanding of their overall evaluation criterion (OEC).
  • Avoid teams with conflicting or unclear objectives, as it becomes challenging to align on experiment outcomes.

Challenges with Optimization Goals

The speaker highlights challenges related to defining optimization goals when running experiments. They share an example from Microsoft.com where different constituencies had conflicting opinions on what should be optimized.

Challenges with Optimization Goals

  • Some projects or websites may have multiple constituencies with different optimization goals.
  • It is crucial to ensure clarity and alignment on what is being optimized before running experiments.
  • In the example of Microsoft.com, there was a disagreement on whether more time spent on the support side was good or bad.

Importance of Building Experimentation Platforms

The speaker emphasizes the importance of building experimentation platforms rather than relying solely on one-off experiments. They discuss the motivation behind building platforms and reducing the marginal cost of experiments.

Building Experimentation Platforms

  • The motivation for building experimentation platforms is to reduce the marginal cost of experiments.
  • Self-service platforms allow users to set up and run experiments with ease.
  • As the number of metrics grows, having a platform becomes essential for efficient experiment management and analysis.

Investing in Platform Automation

The speaker discusses the importance of investing in platform automation to improve experiment analysis. They mention that weak analysis can lead to hiring more data scientists to compensate for platform limitations.

Investing in Platform Automation

  • Weak analysis can result from insufficient platform capabilities.
  • Hiring data scientists may be necessary when the platform lacks automation features.
  • It is recommended to invest in building a robust platform that allows for self-service analysis without heavy reliance on data scientists.

Assessing Progress with Six Axes

The speaker suggests using six axes as a framework for assessing progress in building an experimentation platform. They explain how these axes can guide decision-making and identify areas for improvement.

Assessing Progress with Six Axes

  • Use six axes as a framework to assess progress in building an experimentation platform.
  • Evaluate where your organization stands on each axis (crawl, walk, run, fly).
  • Identify areas that need improvement based on your current position on each axis.

Speeding up Data Analysis and Variance Reduction

In this section, the speaker discusses ways to speed up data analysis and reduce variance in metrics.

Mechanisms for Efficient Data Analysis

  • Variance reduction techniques reduce the variance of metrics, delivering statistically significant results faster or with fewer users.
  • One example is capping metrics to tame skewed distributions. For instance, at Airbnb, a metric like "nights booked" might be capped at 30 nights per month.
  • Another technique is CUPED (Controlled-experiment Using Pre-Experiment Data), which uses pre-experiment data to adjust results, yielding unbiased but lower-variance outcomes.

Lightning Round and Book Recommendations

In this section, the speaker shares book recommendations and talks about a recent TV show they enjoyed.

Book Recommendations

  • "Calling" is a recommended book that offers insightful perspectives on extreme ideas and challenges their validity.
  • "Hard Facts, Dangerous Half-Truths, and Total Nonsense" by Stanford professors explores commonly accepted notions that lack justification.
  • "Mistakes Were Made (But Not by Me)" delves into fallacies we often succumb to and their humbling consequences.

Favorite Recent Movie or TV Show

  • The speaker recommends watching the short series "Chernobyl," which portrays the disaster with artistic liberties while staying true to real events. They mention having a personal connection as they were born near Chernobyl.

Interview Questions and Favorite Product Discovery

In this section, the speaker discusses a favorite interview question they ask candidates and shares a product discovery they love.

Favorite Interview Question

  • When conducting technical interviews, one question the speaker finds revealing is asking about the static qualifier in languages like C++. Surprisingly, more than 50% of engineering job candidates struggle to answer this question correctly.

Favorite Product Discovery

  • The speaker mentions their love for Blink cameras, which are small and long-lasting battery-powered cameras. They highlight the ability to observe unexpected events, such as identifying how a skunk entered their yard through a hole in the fence.

The Impact of Changing the Way Teams Develop Products

In this section, the speaker discusses a minor change in the product development process that had a significant impact on team execution. This change was inspired by their experience at Amazon.

Implementing Structured Narratives Instead of PowerPoint Presentations

  • At Amazon, the speaker learned about using structured narratives instead of PowerPoint presentations for idea development.
  • Instead of starting with a PowerPoint, teams begin with a structured document that outlines the questions they need to answer for their idea.
  • These documents were initially paper-based but are now commonly created using word processors or Google Docs.
  • The impact of this change was remarkable as it allowed for more honest feedback and better retention of information after meetings.

Applying A/B Testing Principles to Life

In this section, the speaker discusses applying A/B testing principles to personal life decisions and emphasizes the importance of evidence hierarchy.

Hierarchy of Evidence and Trust Levels

  • The speaker emphasizes the concept of the hierarchy of evidence when making decisions in personal life.
  • Anecdotal evidence should be treated with caution, while observational studies can be given some trust.
  • As one moves up the hierarchy toward natural experiments and controlled experiments, trust levels should increase.
  • Many people overlook this hierarchy when consuming news or information, leading to potential misinformation.

Sharing Knowledge on Evidence Hierarchy and Controlled Experiments

In this section, the speaker encourages sharing knowledge about the evidence hierarchy and controlled experiments with family, friends, and children.

Importance of Evidence Hierarchy Education

  • It is crucial to educate others about the hierarchy of evidence and how to evaluate information critically.
  • The speaker shares examples of observational studies that were later proven incorrect by controlled experiments.
  • They recommend sharing a book on the topic and teaching a class to promote data-driven decision-making.

Call to Action and Conclusion

In this section, the speaker provides information on how to connect with them online and suggests actions listeners can take.

Connecting Online and Promoting Data-Driven Decisions

  • The speaker can be found on LinkedIn for further communication.
  • Encourages listeners to make data-driven decisions, understand controlled experiments, and use science in decision-making.
  • Mentions their book as a resource for learning more about these topics, with all proceeds going to charity.
  • Offers a discount code for their course available through Maven.

Closing Remarks

In this final section, the speaker thanks the host and audience for their time and concludes the podcast episode.

Gratitude and Farewell

  • Expresses gratitude to the host for inviting them and acknowledges great questions from the audience.
  • Thanks listeners for tuning in and encourages subscribing, rating, or reviewing the podcast.
  • Provides information on accessing past episodes or learning more about the podcast.

Video description

Ronny Kohavi, PhD, is a consultant, teacher, and leading expert on the art and science of A/B testing. Previously, Ronny was Vice President and Technical Fellow at Airbnb, Technical Fellow and Corporate VP at Microsoft (where he led the Experimentation Platform team), and Director of Data Mining and Personalization at Amazon. He was also honored with a lifetime achievement award by the Experimentation Culture Awards in September 2020 and teaches a popular course on experimentation on Maven.

In today’s podcast, we discuss:

  • How to foster a culture of experimentation
  • How to avoid common pitfalls and misconceptions when running experiments
  • His most surprising experiment results
  • The critical role of trust in running successful experiments
  • When not to A/B test something
  • Best practices for helping your tests run faster
  • The future of experimentation

Enroll in Ronny’s Maven class, Accelerating Innovation with A/B Testing, at https://bit.ly/ABClassLenny. Promo code “LENNYAB” will give $500 off the class for the first 10 people to use it.

Brought to you by:

  • Mixpanel—Event analytics that everyone can trust, use, and afford: https://mixpanel.com/startups
  • Round—The private network built by tech leaders for tech leaders: https://www.round.tech/apply?utm_campaign=lennys-letter&utm_medium=email-ad&utm_source=email-marketing&utm_content=send-2-2023-07-27
  • Eppo—Run reliable, impactful experiments: https://www.geteppo.com/

Find the full transcript at: https://www.lennysnewsletter.com/p/the-ultimate-guide-to-ab-testing

Where to find Ronny Kohavi:

  • Twitter: https://twitter.com/ronnyk
  • LinkedIn: https://www.linkedin.com/in/ronnyk/
  • Website: http://ai.stanford.edu/~ronnyk/

Where to find Lenny:

  • Newsletter: https://www.lennysnewsletter.com
  • Twitter: https://twitter.com/lennysan
  • LinkedIn: https://www.linkedin.com/in/lennyrachitsky/

In this episode, we cover:

(00:00) Ronny’s background
(04:29) How one A/B test helped Bing increase revenue by 12%
(09:00) What data says about opening new tabs
(10:34) Small effort, huge gains vs. incremental improvements
(13:16) Typical fail rates
(15:28) UI resources
(16:53) Institutional learning and the importance of documentation and sharing results
(20:44) Testing incrementally and acting on high-risk, high-reward ideas
(22:38) A failed experiment at Bing on integration with social apps
(24:47) When not to A/B test something
(27:59) Overall evaluation criterion (OEC)
(32:41) Long-term experimentation vs. models
(36:29) The problem with redesigns
(39:31) How Ronny implemented testing at Microsoft
(42:54) The stats on redesigns
(45:38) Testing at Airbnb
(48:06) Covid’s impact and why testing is more important during times of upheaval
(50:06) Ronny’s book, Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing
(51:45) The importance of trust
(55:25) Sample ratio mismatch and other signs your experiment is flawed
(1:00:44) Twyman’s law
(1:02:14) P-value
(1:06:27) Getting started running experiments
(1:07:43) How to shift the culture in an org to push for more testing
(1:10:18) Building platforms
(1:12:25) How to improve speed when running experiments
(1:14:09) Lightning round

Referenced:

  • Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing: https://experimentguide.com/
  • Seven rules of thumb for website experimenters: https://exp-platform.com/rules-of-thumb/
  • GoodUI: https://goodui.org
  • Defaults for A/B testing: http://bit.ly/CH2022Kohavi
  • Ronny’s LinkedIn post about A/B testing for startups: https://www.linkedin.com/posts/ronnyk_abtesting-experimentguide-statisticalpower-activity-6982142843297423360-Bc2U
  • Sanchan Saxena on Lenny’s Podcast: https://www.lennyspodcast.com/sanchan-saxena-vp-of-product-at-coinbase-on-the-inside-story-of-how-airbnb-made-it-through-covid-what-he8217s-learned-from-brian-chesky-brian-armstrong-and-kevin-systrom-much-more/
  • Optimizely: https://www.optimizely.com/
  • Optimizely was statistically naive: https://analythical.com/blog/optimizely-got-me-fired
  • SRM: https://www.linkedin.com/posts/ronnyk_seat-belt-wikipedia-activity-6917959519310401536-jV97
  • SRM checker: http://bit.ly/srmCheck
  • Twyman’s law: http://bit.ly/twymanLaw
  • “What’s a p-value” question: http://bit.ly/ABTestingIntuitionBusters
  • Fisher’s method: https://en.wikipedia.org/wiki/Fisher%27s_method
  • Evolving experimentation: https://exp-platform.com/Documents/2017-05%20ICSE2017_EvolutionOfExP.pdf
  • CUPED for variance reduction/increased sensitivity: http://bit.ly/expCUPED
  • Ronny’s recommended books: https://bit.ly/BestBooksRonnyk
  • Chernobyl on HBO: https://www.hbo.com/chernobyl
  • Blink cameras: https://blinkforhome.com/
  • Narrative not PowerPoint: https://exp-platform.com/narrative-not-powerpoint/

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email podcast@lennyrachitsky.com. Lenny may be an investor in the companies discussed.