Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar
Building Effective AI Products Through Evals
Importance of Evals in AI Development
- Building effective evaluations (evals) is crucial for creating successful AI products, as it offers the highest return on investment (ROI).
- The eval process is engaging and addictive for those involved, providing significant learning opportunities.
- Typically, a single well-executed eval can serve as a foundation for future product development.
Misconceptions Surrounding Evals
- There exists considerable controversy and strong opinions regarding evals; some individuals have become skeptical due to past negative experiences.
- A common misconception is that AI can autonomously conduct evals; however, this approach often fails to yield effective results.
Best Practices in Conducting Evals
- Appointing a "benevolent dictator"—a trusted individual with domain expertise—can streamline the eval process and prevent unnecessary complications from committee decisions.
- Product managers often fulfill this role effectively due to their understanding of both the product and its market.
Rise of Evals as a Critical Skill
- Evals have gained prominence over the last two years, becoming essential knowledge for product builders in the AI space.
- Hamel Husain and Shreya Shankar are leading figures in educating others about evals through their popular online course on Maven.
Course Offerings and Insights
- Their course has reached over 2,000 product managers and engineers across 500 companies, including major players like OpenAI and Anthropic.
- This episode serves as an accessible primer on eval concepts, aiming to inspire listeners to engage with the topic actively.
Sponsorship Highlights
Fin - Customer Service AI Agent
- Fin is presented as a top-performing AI agent designed for customer service with impressive resolution rates.
Dscout - Research Platform for Design Teams
What Are Evals and How Do They Improve AI Applications?
Understanding Evals
- Evals are systematic methods to measure and improve AI applications, making data analysis accessible for LLM (large language model) applications.
- At their core, evals involve creating metrics to analyze application performance, enabling iterative improvements through experimentation.
Concrete Example of an Eval in Practice
- For instance, in a real estate assistant application that fails to perform tasks correctly, developers without evals would rely on guesswork or "vibe checks" for troubleshooting.
- As applications scale, relying solely on vibe checks becomes impractical; evals provide measurable feedback signals for confident iteration.
Defining Success with Eval Metrics
- The goal of an eval is to establish tests that assess the accuracy and reliability of the AI agent's responses, identifying errors such as misinformation or poor communication style.
- While unit tests check specific functionalities, they represent only a small part of the broader spectrum of quality measurement in AI applications.
Broader Scope of Evaluation Techniques
- Beyond unit tests, evaluating an AI assistant involves assessing its adaptability to vague user requests and identifying new user cohorts over time.
- Regular data analysis can reveal trends like user satisfaction (e.g., thumbs up), contributing valuable insights into product improvement.
Real-Life Example: Nurture Boss
- A practical example discussed is Nurture Boss—a company providing an AI assistant for property managers handling various operational tasks like customer service and appointment bookings.
Understanding AI Application Development
Introduction to Nurture Boss and Data Utilization
- The speaker discusses the use of customer and property data in developing an AI application called Nurture Boss, emphasizing its comprehensive nature.
- They introduce the concept of error analysis as a crucial first step in building evaluations for the application, focusing on identifying issues within the data.
- Various tools for data loading are mentioned, including Braintrust, Arize Phoenix, and LangSmith, highlighting flexibility in tool choice.
Observability Tools and Data Analysis
- A detailed log from a customer interaction with Nurture Boss is presented as a "trace," which records sequences of events essential for understanding AI performance.
- The importance of traces in AI applications is discussed; they provide insights into how different components interact during user queries.
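A trace like the one described can be sketched as a simple event list. The field names and helper below are hypothetical illustrations, not Nurture Boss's actual schema:

```python
# A minimal sketch of a "trace": one logged interaction, recorded as a
# sequence of events. All field names here are assumptions for illustration.
trace = {
    "trace_id": "t-001",
    "events": [
        {"role": "user", "content": "Do you have any 2-bedroom apartments available?"},
        {"role": "tool", "name": "search_listings", "output": {"available": 0}},
        {"role": "assistant", "content": "We don't have 2-bedroom units right now."},
    ],
}

def user_messages(trace):
    """Pull out just the user turns for quick review during error analysis."""
    return [e["content"] for e in trace["events"] if e["role"] == "user"]

print(user_messages(trace))
```

Observability tools like the ones mentioned above store traces in roughly this shape, which is what makes reviewing them one at a time tractable.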
System Prompts and User Interaction
- The system prompt used by the AI assistant is revealed, showcasing its role in guiding responses to both current and prospective residents.
- The rarity of accessing actual company product prompts is noted, indicating their significance as proprietary information.
Analyzing User Queries
- Specific user interactions are analyzed; for instance, a query about apartment availability leads to an examination of how well the AI responds to such requests.
- The response from the AI regarding apartment availability highlights potential shortcomings in lead management capabilities.
Importance of Product Perspective in Error Analysis
- Emphasis is placed on involving product personnel in error analysis since they understand user experience better than developers alone.
Error Analysis in AI Applications
The Importance of Manual Error Analysis
- The process begins with a manual note-taking approach, emphasizing the need for human oversight in identifying errors within AI applications.
- Sampling data is encouraged; even small samples can yield significant insights, leading to an addictive learning experience for developers.
- Tools are available to facilitate quick note-taking during error analysis, allowing for efficient tracking of issues.
Identifying Different Types of Errors
- An example highlights a technical error in a text messaging application where user messages become garbled, affecting system responses.
- This type of error indicates that the interaction handling is flawed rather than an issue with AI performance itself.
- Noting these errors helps developers understand underlying problems that may not be immediately visible.
Effective Note-Taking Strategies
- Developers should focus on capturing the most apparent upstream error first instead of trying to document every issue at once.
- Initial attempts at this process may be challenging, but proficiency improves with practice and repetition.
Diverse Error Examples and Learning Opportunities
- A scenario illustrates an AI's incorrect response regarding virtual tours, showcasing how hallucinations can mislead users about available services.
- Recognizing such discrepancies provides valuable context for engineers to refine their applications and avoid misinformation.
Limitations of LLM in Error Analysis
- A common question arises about using large language models (LLMs) for automated error analysis; however, they often lack necessary contextual understanding.
- Relying solely on LLM outputs can lead to overlooking critical errors that require human insight and expertise.
Benevolent Dictatorship in Open Coding
The Role of a Benevolent Dictator
- A "benevolent dictator" simplifies the open coding process, preventing teams from getting bogged down by unnecessary committee involvement.
- In smaller organizations, appointing one trusted individual can streamline decision-making and enhance efficiency.
- This role is crucial for maintaining tractability in the coding process, ensuring it remains cost-effective and manageable.
Importance of Domain Expertise
- The benevolent dictator should possess domain expertise relevant to the project, such as legal or mental health knowledge.
- Selecting an expert ensures informed decisions that align with business needs and context.
Emphasizing Progress Over Perfection
- It's essential to focus on making progress rather than achieving perfection; quick signal detection is key to effective work.
- The goal is to gather insights rapidly without incurring excessive costs or delays.
Real-world Examples of AI Interaction
Analyzing User Interactions
- An example illustrates how users interact with AI regarding apartment leasing inquiries, showcasing real-world complexities.
- The assistant abruptly transferring the call highlights areas needing improvement in user experience design.
Documenting Observations Effectively
- When documenting interactions, avoid subjective terms like "jank"; instead, focus on clear observations about what occurred during the interaction.
Determining Sample Size for Analysis
Recommended Practices for Data Collection
- Reviewing roughly 100 examples is suggested as a starting point to ensure comprehensive learning and insight generation.
- There isn't a strict number; the aim is to keep going until additional examples yield no new insights.
Understanding Theoretical Saturation
Conceptual Framework for Data Analysis
- "Theoretical saturation" refers to the point in qualitative analysis when no new concepts emerge from data review.
- Developing intuition around this concept helps determine when enough data has been collected for meaningful analysis.
Next Steps After Data Collection
Understanding Basic Counting in Data Science
The Power of Basic Counting
- Basic counting is highlighted as a fundamental yet powerful analytical technique in data science, often undervalued despite its simplicity.
- The speaker shares their experience using AI tools to categorize notes, emphasizing the straightforward approach of uploading a CSV file for analysis.
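The counting step really can be this simple: a `Counter` over the open-code notes surfaces the most frequent failure modes. The codes below are invented for illustration:

```python
from collections import Counter

# Hypothetical open codes taken while reviewing traces; in practice these
# come from a CSV of annotator notes uploaded for analysis.
open_codes = [
    "asked to transfer, no handoff",
    "hallucinated virtual tour",
    "garbled user message",
    "asked to transfer, no handoff",
    "hallucinated virtual tour",
    "asked to transfer, no handoff",
]

counts = Counter(open_codes)
for code, n in counts.most_common():
    print(f"{n:3d}  {code}")
```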
Categorizing Notes with LLM
- The term "open codes" is introduced, which refers to initial categorizations that help organize data. This terminology is recognized by language models (LLMs).
- Axial codes are explained as categories derived from open codes, aimed at identifying common failure modes within the data.
Purpose and Application of Axial Codes
- Axial codes serve to cluster failure modes, allowing for identification of the most prevalent issues that need addressing.
- The discussion emphasizes synthesizing categories and themes from raw data to facilitate better understanding and problem-solving.
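The open-to-axial step can be sketched as a simple lookup from granular notes to broader clusters, with a fallback bucket for notes that fit no existing category. All code names here are illustrative; in practice an LLM proposes the clusters and a human refines them:

```python
# Hypothetical mapping from granular open codes to broader axial codes.
AXIAL = {
    "asked to transfer, no handoff": "human_handoff",
    "hallucinated virtual tour": "capability_hallucination",
    "garbled user message": "input_handling",
}

def to_axial(open_code: str) -> str:
    # Unmapped notes land in a catch-all bucket, signaling that the
    # axial coding scheme needs another iteration.
    return AXIAL.get(open_code, "none_of_the_above")

print(to_axial("hallucinated virtual tour"))
print(to_axial("wrong rent quoted"))
```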
Generating Actionable Insights
- Examples of axial codes generated include capability limitations and communication quality; however, some categories may require refinement for actionability.
- The importance of specificity in categorization is stressed—broad terms can hinder actionable insights.
Iterative Process in Data Analysis
- It’s noted that while LLMs can synthesize information effectively, they do not propose solutions automatically; this remains the analyst's responsibility.
- Flexibility in prompts allows users to tailor outputs based on specific needs or stages within user stories, enhancing relevance and utility.
Challenges in Tool Development
- There's an acknowledgment that many people are unaware of these qualitative coding practices or how to build tools for them, indicating a gap in knowledge.
- The speakers reflect on their own experiences with error analysis and emphasize the importance of grounding new techniques in established theory rather than inventing entirely new methods.
Conclusion on Error Analysis Techniques
Exploring Error Analysis in Machine Learning
Introduction to Error Analysis
- The speaker expresses enthusiasm for identifying and categorizing problems encountered during a project, emphasizing the enjoyment of the process.
- A video featuring Andrew Ng is referenced, highlighting his contributions to machine learning education and discussing error analysis as a long-standing technique in stochastic systems.
Techniques and Tools for Error Analysis
- The discussion mentions that the techniques being used are not new; they are adaptations of existing machine learning principles applied to current projects.
- The speaker notes that while there is a comprehensive course on this topic, they will share key insights without going through every detail.
Utilizing AI for Categorization
- Various tools can be employed for error analysis, including ChatGPT and Julius AI. Jupyter notebooks are highlighted as popular among product managers for data science tasks.
- After generating axial codes from open codes, the speaker emphasizes refining these codes to make them more specific and actionable.
Importance of Detailed Open Codes
- The process involves reviewing axial codes against open codes to ensure clarity and specificity in categorization.
- Examples of refined axial codes include issues related to scheduling, human handoff errors, formatting errors, and unfulfilled promises in conversational flow.
Automating Categorization with AI
- An example is provided where an AI tool (Gemini) is used to categorize notes into predefined categories automatically.
- Emphasis is placed on the necessity of detailed descriptions in open codes; vague terms like "janky" hinder effective categorization by both humans and AI.
Iteration and Improvement Process
- The importance of iterating on open codes after receiving suggestions from AI is discussed; it’s crucial to assess whether these suggestions align with user understanding.
- Introducing a category labeled "none of the above" allows users to identify gaps in their axial coding system, prompting further refinement or rewording as necessary.
Efficiency Gains Through Repetition
Understanding Conversational Flow Issues in AI Products
Awareness of Problems
- The importance of being aware of issues within a product is emphasized, suggesting that ignorance can lead to significant oversights.
- Acknowledgment that many users are unaware of the problems present in their products, indicating a gap in understanding and communication.
Analyzing Data with Pivot Tables
- Introduction of pivot tables as a tool for analyzing categorized traces to identify issues effectively.
- Discovery of 17 conversational flow issues through data analysis, highlighting the utility of pivot tables for deeper insights.
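The pivot-table step might look like this in pandas, with a hypothetical table of categorized traces (one row per annotated trace); the column names and data are assumptions for illustration:

```python
import pandas as pd

# Hypothetical annotated traces: one row per trace, labeled with the
# channel it arrived on and its axial failure code.
df = pd.DataFrame({
    "channel": ["sms", "sms", "email", "sms", "email"],
    "axial_code": ["conversational_flow", "human_handoff",
                   "conversational_flow", "conversational_flow", "formatting"],
})

# Count failure modes per channel: the pivot-table view that turns
# chaos into a ranked list of problems.
pivot = df.pivot_table(index="axial_code", columns="channel",
                       aggfunc="size", fill_value=0)
print(pivot)
```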
Identifying Key Problems
- Transition from chaos to clarity by identifying major problems such as conversational and human handoff issues.
- Discussion on whether certain errors require evaluations (evals), noting that some may be straightforward fixes not needing formal testing.
Cost-Benefit Analysis of Evaluations
- Emphasis on evaluating whether an issue warrants an eval, whether LLM-judged or code-based, considering the cost-benefit trade-off involved.
- Warning against rushing into evaluations without first grounding oneself in actual errors to avoid missteps.
Types of Evaluations: Code-Based vs. LLM as Judge
- Differentiation between code-based evaluations and using an LLM (large language model) as a judge for more complex failure modes.
- Explanation that code-based evals are akin to unit tests, allowing automated checks for specific failure modes without manual intervention.
Automated Evaluation Strategies
- Suggestion to build automated evaluators for checking failure modes across numerous traces efficiently.
- Clarification that while simple checks can be coded (e.g., JSON format), complex scenarios may necessitate human-like judgment from an LLM.
Testing AI Responses
- Description of how testing involves asking questions and verifying consistent responses through coding methods.
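A code-based eval of the kind described, such as the JSON-format check mentioned above, can be a plain function run over every trace. The required keys here are an assumption for illustration:

```python
import json

def check_json_format(response: str) -> bool:
    """Code-based eval: pass if the model's output is valid JSON with the
    required keys. No LLM judgment is needed for checks like this."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    return {"unit", "available"} <= data.keys()

# A well-formed response passes; free-form prose fails.
assert check_json_format('{"unit": "2BR", "available": true}')
assert not check_json_format("Sure! We have a 2BR open.")
```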
LLM as a Judge: Encoding Human Judgment
The Role of LLMs in Evaluation
- The discussion turns to using large language models (LLMs) as judges that encode human judgment to evaluate AI outputs, emphasizing that this judging task is simpler than the one the original agent performs.
- LLM judges are designed to assess one specific failure mode each, providing binary outputs (pass or fail), which simplifies the evaluation process and enhances reliability.
Testing and Monitoring with LLM Judges
- LLM judges can be utilized not only for unit testing but also for real-time monitoring of production systems, allowing for daily assessments of application quality through sampling traces.
- This approach counters criticisms that LLM evaluations lack real-world applicability by demonstrating their effectiveness in live environments.
Creating Effective Judge Prompts
- An example prompt for an LLM judge is introduced, focusing on a specific failure scenario. The importance of binary decision-making (yes/no) is highlighted to avoid ambiguity in evaluations.
- Simplifying the evaluation criteria prevents confusion over scoring scales, ensuring clear communication about what constitutes acceptable performance.
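A binary judge prompt of the kind described might be sketched as follows. The failure mode, wording, and helper are illustrative assumptions, not the course's actual prompt:

```python
# Sketch of a binary LLM-judge prompt targeting one specific failure mode.
# The scenario and phrasing are made up for illustration.
JUDGE_PROMPT = """You are evaluating a property-management assistant.

Failure mode: the assistant promises a human follow-up that it cannot schedule.

Conversation:
{conversation}

Did this failure occur? Answer with exactly one word: PASS or FAIL."""

def parse_verdict(raw: str) -> bool:
    """Map the judge's raw output to binary pass/fail.
    Anything ambiguous counts as a failure, avoiding fuzzy scoring scales."""
    return raw.strip().upper() == "PASS"

prompt = JUDGE_PROMPT.format(conversation="User: Can someone call me back? ...")
print(parse_verdict("PASS"))
print(parse_verdict("Maybe"))
```

Forcing a one-word binary answer, then parsing it strictly, is what sidesteps the ambiguity of 1-to-5 scoring scales.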
Challenges in Evaluation Metrics
- The conversation touches on issues related to expert-curated content and how arbitrary scoring systems can lead to misinformation and misunderstandings regarding evaluation results.
- Concerns are raised about potential drama within the evaluation space due to varying interpretations of scores, stressing the need for clarity and consistency.
Best Practices for Using LLM Judges
- It’s advised that users should actively engage with the prompts generated by LLMs rather than accepting them blindly; iteration and refinement are crucial steps in developing effective judge prompts.
How to Align LLMs with Human Judgments?
Importance of Alignment in LLM Evaluation
- It is crucial to ensure that the LLM judge aligns with human judgments before deployment. This involves measuring the judge's verdicts against human-labeled examples to assess agreement.
- The evaluation process includes assessing LLM traces in a spreadsheet, where a human applies the judge's rule to decide whether each error actually occurred.
- Manual review of data is necessary, although prior axial coding can supply labels and reduce redundant work; additional review may be needed for more comprehensive coverage.
Understanding Agreement Metrics
- Product managers should be cautious about relying solely on agreement metrics, as high percentages can be misleading if they do not account for infrequent but significant errors.
- A 90% agreement rate might appear favorable but could mask underlying issues if errors occur only 10% of the time, leading to false confidence in the model's performance.
Analyzing Error Types
- A matrix analysis helps visualize discrepancies between human and LLM-judge verdicts regarding error occurrence. Entries off the green diagonal indicate misalignment that needs addressing.
- If significant misalignments exist, product managers should prompt further iterations on prompts used for evaluating the LLM judge to minimize these discrepancies.
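The 90%-agreement trap described above can be reproduced with a small hypothetical example: raw agreement looks strong while the judge catches only a fraction of the real errors. All numbers here are made up for illustration:

```python
# Hypothetical labels on 100 traces: True means an error occurred.
# Errors are rare (10%), and the judge misses most of them.
human = [False] * 90 + [True] * 10
judge = [False] * 90 + [True] * 2 + [False] * 8

agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
tp = sum(h and j for h, j in zip(human, judge))          # errors the judge caught
fn = sum(h and not j for h, j in zip(human, judge))      # errors the judge missed
recall_on_errors = tp / (tp + fn)

print(f"raw agreement: {agreement:.0%}")                 # 92%
print(f"errors the judge caught: {recall_on_errors:.0%}")  # 20%
```

This is why per-class agreement (the confusion matrix), not the overall percentage, is the number to watch when errors are infrequent.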
Evolving Product Requirements through Data Insights
- The concept of using eval prompts as dynamic product requirements documents (PRDs) is emphasized; they continuously inform how agents should respond based on real-time data.
- Traditional PRDs remain important for outlining initial expectations; however, ongoing evaluations can reveal new insights that enhance these documents over time.
Flexibility and Continuous Improvement
- It's essential for product teams to remain adaptable as they uncover new failure modes and expectations through iterative testing with LLMs.
Understanding Criteria Drift in LLM Development
Importance of Criteria Drift
- The discussion emphasizes the significance of "criteria drift" in evaluating outputs from large language models (LLMs), highlighting its relevance in user studies.
- A study conducted with developers revealed that traditional evaluation methods struggle due to evolving perceptions of what constitutes good or bad output as users review more examples.
Challenges in Evaluation
- Developers noted that they could only identify failure modes after reviewing multiple outputs, indicating a limitation in pre-defined rubrics for evaluation.
- This challenge is particularly pronounced among experienced developers who have previously built LLM pipelines, suggesting a need for adaptive evaluation strategies.
Product Development Insights
- The conversation shifts to product development, emphasizing that while powerful tools can assist in ensuring correctness, they do not replace existing processes like peer reviews.
- Developers typically generate between four and seven evaluative prompts based on their experiences, focusing on fixing specific failure modes rather than exhaustive evaluations.
Prioritization in Testing
- It’s suggested that developers should prioritize evaluations for the most problematic areas rather than attempting to cover every potential issue.
- The focus should be on high-risk scenarios that could significantly impact business outcomes, such as inappropriate or harmful outputs.
Data Analysis Techniques
- Emphasizing the power of data analysis, it is noted that even basic techniques like counting can lead to significant improvements when applied effectively.
- Developers are encouraged to explore various ways to analyze conversational flow issues and other problems within their data sets.
Utilizing LLM Judges for Continuous Improvement
Integration into Development Processes
- After creating an LLM judge, developers often integrate it into unit tests and online monitoring systems to track performance over time.
- Successful products leverage these judges for continuous improvement and maintain a competitive edge by keeping insights proprietary.
Systematic Approach to AI Development
Understanding Evaluation in Product Management
The Importance of Evaluation
- The speaker emphasizes the necessity of systematic evaluation in product management, highlighting a step-by-step approach to identify and prioritize issues effectively.
- Product managers are empowered to build profitable products through the practice of evaluation skills, which can be learned over time.
Debate on Evaluation Practices
- A discussion arises regarding the controversy surrounding evaluation (eval), with strong opinions on its importance and value within the community.
- Misconceptions about eval stem from rigid definitions; some view it solely as unit tests or data analysis without considering broader monitoring metrics.
Challenges and Misunderstandings
- Past negative experiences with eval have led to skepticism; for instance, failed attempts at using LLM judges resulted in distrust towards eval processes.
- Social media platforms like X (formerly Twitter) amplify misunderstandings about eval, leading to polarized opinions based on individual experiences rather than nuanced discussions.
The Role of Evals in AI Development
- Successful applications often rely on robust eval processes; coding agents exemplify this by building upon well-evaluated foundational models.
- Some developers claim they do not use formal evaluations but instead rely on intuition ("vibes"), which may overlook systematic error analysis that is still implicitly occurring.
Nuances in Different Domains
- Developers likely engage in informal evaluations through user interactions and feedback loops, even if they don't label it as such.
- Coding agents differ from other AI products due to their unique development environment where domain experts actively use and refine the tools being created.
Conclusion: Understanding Contextual Differences
- It's crucial to recognize that different domains require tailored evaluation approaches; for example, medical professionals may not have the same tolerance for errors as software developers do.
Discussion on Evaluation Methods in AI
The Role of Human Evaluations
- The conversation begins with the importance of human evaluations within the broader context of evaluation methods, emphasizing that while manual evaluations can be time-consuming, they are essential for quality assessment.
- It is noted that individuals at Anthropic possess high-level skills in data analysis and software engineering, which contribute to effective evaluations. However, not everyone has these skills readily available.
Challenges with Dogfooding
- A cautionary note is raised about "dogfooding," where companies claim to use their products internally but may not engage deeply enough to provide meaningful feedback.
- The discussion highlights a perceived debate between traditional evaluation methods and A/B testing, questioning whether A/B tests alone suffice for comprehensive product assessments.
Understanding A/B Testing
- A/B testing is described as a systematic method involving two experimental conditions compared against a success metric; however, it requires prior evaluation metrics to be effective.
- Concerns are expressed regarding premature A/B testing without adequate error analysis, suggesting that assumptions about product requirements may lead to misguided tests.
Importance of Data Science in Evaluations
- Emphasis is placed on grounding hypotheses in actual data rather than hypothetical scenarios when conducting evaluations or tests.
- The acquisition of Statsig by OpenAI raises questions about its implications for future evaluation practices and whether it signifies a shift toward prioritizing evaluations over traditional methods.
Clarifying the Concept of Evaluation
- There’s an assertion that the term "eval" might create confusion as it attempts to differentiate itself from established data science practices; fundamentally, both involve similar analytical approaches.
- The speaker suggests reframing discussions around evaluations as simply applying data science techniques to understand product performance better rather than introducing new terminology.
Strategic Implications of Acquisitions
- The acquisition of Statsig by OpenAI is viewed as potentially strategic but remains uncertain; it's suggested that competitors likely utilize similar tools already.
Understanding Evaluation in AI Development
Importance of Evaluation Metrics
- The discussion emphasizes the critical role of evaluation (eval) metrics in AI development, highlighting that many have yet to fully grasp their significance.
- Current focus among major labs has been on general benchmarks like MMLU and HumanEval, which are essential for foundation models but not necessarily aligned with product-specific evaluations.
- Existing evaluation products often lack error analysis and rely on generic tools, which do not effectively address specific application needs.
- There is a call for more structured thinking around application-specific evaluations, suggesting that current efforts are insufficient and need broader adoption within the community.
Misconceptions About Evaluations
- A prevalent misconception is the belief that one can simply purchase a tool to automate evaluations without needing human oversight; this approach is flawed as human input remains crucial.
- Another common issue is neglecting data analysis; many fail to realize the insights gained from examining individual traces can significantly inform problem-solving strategies.
- It's important to understand there isn't a single correct method for conducting evaluations. Various approaches exist depending on product maturity and available resources.
Tips for Effective Evaluations
- One key tip is to embrace data examination rather than fear it; structured processes can help navigate uncertainties during evaluation efforts.
Using AI Effectively in Data Analysis
Embracing AI Tools
- The discussion emphasizes the importance of utilizing AI tools to enhance data presentation without replacing human input.
- A recurring theme is the necessity of analyzing data, with a suggestion to create personalized tools for easier access and understanding.
Custom Tool Development
- Many professionals are encouraged to develop their own tools, which has become more accessible due to advancements in AI technology that can assist in creating simple web applications.
- An example is provided where a team created an application that simplifies data viewing by organizing different communication channels and automating error tracking.
Importance of Data Engagement
- Engaging with data is highlighted as a high-return-on-investment (ROI) activity, crucial for improving products and business success.
- The goal of using these evaluations is not merely to identify issues but actively fix them to enhance user experience with AI products.
Time Investment in Data Analysis
- Initial time investment for error analysis typically spans three to four days, followed by minimal ongoing maintenance (about 30 minutes weekly).
- This upfront effort leads to significant long-term benefits, alleviating concerns about continuous time commitment.
Fun in Data Review Process
- The process of reviewing and annotating data can be enjoyable; it allows for critical thinking about product effectiveness.
How to Master Error Analysis and Application Improvement
Overview of the Course Content
- The course covers a comprehensive syllabus including error analysis, automated evaluators, and strategies for improving applications. It aims to create a self-sustaining improvement cycle for students.
- Unique topics include building custom interfaces for error analysis, with live coding sessions demonstrating practical applications using tools like Claude Code and Cursor.
- Cost optimization is also discussed, focusing on how to maintain quality while reducing expenses by substituting expensive models (e.g., GPT-5) with more affordable alternatives (e.g., GPT-4o mini).
Student Perks and Resources
- A meticulously crafted 60-page book is provided, detailing the entire evaluation process so students can focus on learning without extensive note-taking.
- An AI assistant similar to Lennybot has been developed, containing all course-related content such as lessons, office hours discussions, and public resources. Students receive 10 months of free access.
Community Engagement
- Students gain access to an active Discord community comprising all past participants of the class. This platform facilitates ongoing interaction and support among learners.
Lightning Round Insights
Recommended Books
- Shreya recommends "Pachinko" by Min Jin Lee for fiction and "Apple in China," which explores Apple's manufacturing processes in Asia.
Discussion on TV Shows and Personal Interests
Parenting and Entertainment Choices
- The speaker shares their experience as a father of two, noting that they primarily watch children's content, specifically mentioning watching "Frozen" multiple times in a week.
Discovering Classic Series
- Another participant mentions watching "The Wire" for the first time, highlighting its reputation and quality as a classic series.
AI-Assisted Tools in Research
Introduction to Cursor and Claude Code
- Shreya discusses her enthusiasm for using Cursor and Claude Code, emphasizing their utility in research work involving writing papers and coding.
- She expresses excitement about AI-assisted coding tools that allow her to be more ambitious in her projects.
User Experience Insights
- Hamel praises Claude Code's user experience (UX), stating it reflects significant effort and design quality.
Life Mottos and Philosophical Perspectives
Personal Mantras
- Hamel shares his life motto: "Keep learning and think like a beginner," which emphasizes continuous growth.
- Shreya advocates for understanding opposing viewpoints during debates, promoting empathy over conflict.
Vision for Collaboration
- Shreya articulates her vision for evals, focusing on collective success rather than individual wealth accumulation.
Mutual Appreciation Among Guests
Compliments Shared
- Hamel describes Shreya as one of the wisest individuals he knows despite their age difference.
- In return, Shreya admires Hamel's energy and momentum, attributing her motivation to him.
Closing Remarks and Contact Information
How to Connect with the Speakers
- Shreya provides details on how to reach her via email through her website; she encourages sharing successes related to AI eval implementations.
Invitation for Community Engagement
- Hamel invites others to contribute by teaching eval concepts or writing blog posts about their experiences with AI tools.
Final Thoughts