Stop Fixing Your Claude Skills. Autoresearch Does It For You

Uploaded: 2026-03-13T23:50:39.000Z
Duration: 32 min 32 s

How to Improve Claude Code Skills with Auto Research

Introduction to Claude Code Skills

The speaker expresses enthusiasm for Claude Code skills but notes that they are unreliable, producing the intended output only 70% of the time.

The goal is to combine Claude Code skills with a new AI development called Auto Research for improved reliability and accuracy.

Overview of Auto Research

Auto Research was introduced by Andrej Karpathy, formerly of OpenAI. It lets agents autonomously optimize a process.

In this context, the focus is on improving skills over time by refining prompts using the Auto Research methodology.

Key Components of the Auto Research Repo

The relevant files in the GitHub repo are prepare.py, train.py, and program.md.

While prepare.py deals with machine learning specifics, train.py and program.md are crucial for skill improvement.

Implementing Skill Improvement

Users will provide a prompt in program.md instructing an agent to enhance the skill based on Auto Research methods.

Evaluation criteria (eval metrics) will be established to measure skill performance improvements over time.

Real-world Application Example

The speaker shares a personal experience of using Auto Research on an old app, where iterative testing reduced load time from 1100ms to 67ms.

This method resulted in an 81.3% improvement in performance metrics, showcasing potential gains for skill accuracy as well.

Essential Ingredients for Successful Implementation

Three key components are necessary:

An objective metric (e.g., evaluation pass rate).

A reliable measurement tool that operates without human intervention.

Something tangible to change (e.g., skill instructions or prompts).

Conclusion and Future Implications

By leveraging these strategies, users can continuously improve their skills while also generating valuable research data that can inform future iterations or models.

Skill Improvement Through Evaluation

Setting Up the Skill Evaluation Process

The agent is instructed to evaluate its performance against a suite of tests and to attempt an improvement every five minutes.

Understanding Prompts and Their Variability

Skills are defined as prompts, which can yield different results each time they are run due to their inherent noise. A standardized approach is necessary for consistent quality improvement.

Importance of Repeated Testing

To assess skill outputs effectively, multiple runs are required to identify the mode (most frequent result) and the median (middle result), because AI outputs are samples from a distribution rather than a single fixed answer.
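As a rough illustration of why repeated runs matter, the sketch below summarizes ten runs of the same prompt. The scores are invented for the example and are not from the video; only the standard-library statistics module is used.

```python
from statistics import median, mode

# Hypothetical data: criteria passed (out of 4) in ten runs of one prompt.
run_scores = [4, 3, 4, 4, 2, 4, 3, 4, 4, 3]

most_common = mode(run_scores)   # most frequent score across runs
middle = median(run_scores)      # middle value once the scores are sorted
# Share of runs that passed every criterion.
pass_rate = sum(score == 4 for score in run_scores) / len(run_scores)

print(f"mode={most_common} median={middle} pass_rate={pass_rate:.0%}")
```

A single run of this prompt could have returned a 2 or a 4, so judging the prompt on one sample would be misleading either way.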

Benchmarking Performance

Just like academic testing assesses knowledge, skills must be benchmarked using binary questions (yes/no) to evaluate their effectiveness systematically.

Criteria for High-Quality Diagrams

Four criteria have been established for evaluating diagrams:

Legibility and grammatical correctness of text.

Adherence to a defined color palette (pastel colors).

Linear structure (left-to-right or top-to-bottom orientation).

Absence of numbers or ordinals in the design.
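One way to make the four criteria above machine-checkable is to phrase each as a yes/no question and parse the judge's answers into booleans. This is only a sketch: the criterion keys, the question wording, and the "key: yes/no" answer format are all assumptions made for illustration, not the format used in the video.

```python
# Hypothetical criterion keys mapped to yes/no questions for a judge model.
QUESTIONS = {
    "legible_text": "Is all text legible and grammatically correct?",
    "pastel_palette": "Does the diagram use only the pastel color palette?",
    "linear_layout": "Is the layout linear (left-to-right or top-to-bottom)?",
    "no_numbers": "Is the diagram free of numbers and ordinals?",
}


def parse_answers(raw: str) -> dict[str, bool]:
    """Parse judge output with one 'key: yes' or 'key: no' line per criterion."""
    answers: dict[str, bool] = {}
    for line in raw.strip().splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip().lower()
        if key not in QUESTIONS or value not in ("yes", "no"):
            raise ValueError(f"unexpected judge line: {line!r}")
        answers[key] = value == "yes"
    missing = QUESTIONS.keys() - answers.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return answers


# Assumed judge output for one diagram.
sample = """
legible_text: yes
pastel_palette: yes
linear_layout: no
no_numbers: yes
"""
result = parse_answers(sample)
print(result)
```

Rejecting anything other than a plain yes or no keeps the measurement free of human interpretation, which is the point of binary criteria.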

Creating an Automated Evaluation System

Initial Setup Requirements

Communication with Claude Code is essential; this example uses an Antigravity window with the Claude Code extension.

Utilizing External Resources

Andrej Karpathy's Auto Research repository needs to be accessed and integrated into the evaluation process.

Defining the Evaluation Test Suite

A voice transcription tool called Wispr Flow will be used to instruct the system on building a self-improving skill system based on predefined constraints.

Execution Plan for Diagram Generation

Every two minutes, ten diagrams will be generated based on specific functions. These will undergo evaluation through the test suite, adjusting prompts as needed until optimal results are achieved.

Overview of Diagram Generator Skill

The diagram generator skill focuses on creating clean hand-drawn style diagrams from natural language inputs, emphasizing clarity and professionalism in design.

How to Optimize Skills Using Auto Research

Overview of the Process

The output is designed to resemble a whiteboard sketch with pastel colors and simple icons. The process involves sending requests to Nano Banana Pro 2, which generates content that can be pasted into Excalidraw.

Each generation costs approximately 2 cents, leading to an estimated total of $10 for optimizing skills over 50 tests, which is a positive return on investment given potential ad revenue from YouTube videos.
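The cost estimate can be checked with a few lines of arithmetic. The sketch assumes ten images per test, which is the batch size mentioned in the scoring discussion; the per-image price is the approximate figure quoted above. Amounts are kept in whole cents to avoid floating-point rounding.

```python
COST_PER_IMAGE_CENTS = 2   # approximate price per generation, from the video
IMAGES_PER_TEST = 10       # assumption: one batch of ten diagrams per test
TESTS = 50

total_cents = COST_PER_IMAGE_CENTS * IMAGES_PER_TEST * TESTS
total_usd = total_cents / 100

print(f"Estimated spend: ${total_usd:.2f}")
```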

Scoring Mechanism

The speaker clarifies their scoring mechanism: generating 10 images evaluated against four criteria, resulting in a maximum score of 40.

A real-time dashboard displays results, showing initial scores and improvements over iterations. For example, one experiment improved from a score of 32 to 37.
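A minimal sketch of that scoring scheme follows: each of ten images earns one point per criterion it passes, for a maximum of 40. The criterion names and the sample batch are hypothetical and chosen only to show the arithmetic.

```python
# Hypothetical criterion names; the video defines four yes/no checks.
CRITERIA = ("legible_text", "pastel_palette", "linear_layout", "no_numbers")


def score_batch(batch: list[dict[str, bool]]) -> tuple[int, dict[str, int]]:
    """Return the total score and the number of passes per criterion."""
    per_criterion = {
        name: sum(image[name] for image in batch) for name in CRITERIA
    }
    return sum(per_criterion.values()), per_criterion


# Invented batch: ten images, with three single-criterion failures.
batch = [dict.fromkeys(CRITERIA, True) for _ in range(10)]
batch[0]["linear_layout"] = False
batch[3]["no_numbers"] = False
batch[7]["no_numbers"] = False

total, breakdown = score_batch(batch)
max_score = len(batch) * len(CRITERIA)
print(f"{total}/{max_score}", breakdown)
```

The per-criterion breakdown is worth keeping alongside the total, since it shows which rule the prompt is still failing and therefore what the next prompt change should target.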

Importance of Evaluation Criteria

Different users may have varying definitions of "good," emphasizing the need for personalized evaluation metrics. Time invested in running evaluations significantly impacts outcomes.

Recommendations include defining simple yes/no evaluation criteria to streamline the assessment process and improve efficiency.

Automation and Iteration

The system autonomously runs evaluations every two minutes while mutating prompts based on previous results. This method can be applied across various skills for optimization.

The speaker plans to create a meta skill that optimizes all skills in their repository by leveraging this automated research approach.
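The evaluate-and-mutate loop in this section can be sketched as a simple hill climb. This is not the code from the video or from the Auto Research repository: the keep-only-if-better rule is an assumption, and fake_evaluate and fake_mutate are stand-ins for the real image generation, test suite, and agent-written prompt edits.

```python
import random
from typing import Callable


def optimize_prompt(
    prompt: str,
    evaluate: Callable[[str], int],
    mutate: Callable[[str], str],
    rounds: int,
) -> tuple[str, int, list[int]]:
    """Keep a mutated prompt only when it scores higher than the current best."""
    best_prompt, best_score = prompt, evaluate(prompt)
    history = [best_score]
    for _ in range(rounds):
        candidate = mutate(best_prompt)
        score = evaluate(candidate)
        if score > best_score:
            best_prompt, best_score = candidate, score
        history.append(best_score)
    return best_prompt, best_score, history


# Stand-ins for illustration only.
RULES = [
    "use pastel colors",
    "lay out left to right",
    "no numbers",
    "check spelling",
]


def fake_evaluate(prompt: str) -> int:
    # Pretend each rule present in the prompt is worth 10 of the 40 points.
    return 10 * sum(rule in prompt for rule in RULES)


def fake_mutate(prompt: str, rng: random.Random) -> str:
    # Pretend the agent appends one instruction per round.
    return prompt + "\n- " + rng.choice(RULES)


rng = random.Random(0)
best_prompt, best_score, history = optimize_prompt(
    "Draw a hand-drawn style diagram.",
    fake_evaluate,
    lambda p: fake_mutate(p, rng),
    rounds=20,
)
print(best_score, history)
```

Because a candidate is discarded unless it beats the current best, the recorded score never goes down, which is what makes it safe to leave the loop running unattended.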

Tips for Effective Evaluation

Successful runs produce high-quality outputs with minimal errors (e.g., achieving a score of 39 out of 40). Simple binary evaluations are recommended for clarity.

Avoid overly strict criteria that could lead models to optimize for irrelevant factors rather than quality content.

Conclusion and Resources

Users are encouraged to adopt these strategies without barriers such as email sign-ups or gatekeeping.

Emphasizing simplicity in evaluation will yield better results; complex scoring systems may introduce variability that complicates outcomes.

Auto Research Applications

Exploring the Versatility of Auto Research

Auto research can be applied to a multitude of areas beyond just skills and prompts, including websites and landing pages.

It is useful for split testing various elements such as titles, thumbnails, and emails.

The speaker emphasizes the flexibility of auto research, suggesting it can be utilized in virtually any context desired by users.

There is an ongoing evolution within the ecosystem as individuals discover more effective methods for implementing auto research over time.

For those who find certain aspects confusing, particularly the Claude Code portions, additional resources are recommended for further understanding.