This AI Coder Is On Another Level (Pythagora Tutorial)

Building a Full Stack LLM Benchmarking Application

Introduction to the Project

  • The speaker introduces the goal of building a full stack LLM benchmarking application using Pythagora, aiming for 1,600 lines of code in under two hours.
  • The setup includes installing the Pythagora extension in VS Code and ensuring Node.js is installed as part of the framework.

Project Overview

  • The project is named "Benchmark," focusing on creating a legitimate full stack application rather than simple toy applications like to-do lists.
  • The benchmarking application will allow users to input test questions for various large language models (LLMs), facilitating automatic testing against these benchmarks.

Detailed Application Description

  • A comprehensive prompt has been prepared that outlines how the benchmarking application should function, which will be shared in the video description.
  • Key features include:
  • A homepage with a welcome message and published tests list.
  • Test view pages accessible by users.
  • User authentication, including admin roles and an admin dashboard.

Development Process with Pythagora

  • After defining the project details, Pythagora's agents begin working on different aspects of development.
  • The spec writer checks prompt complexity while the architect agent plans out technology choices, confirming usage of Node.js and Express.

Configuration and API Integration

  • Developers are prompted to configure environment variables such as API keys for OpenAI and Anthropic; these keys will be revoked post-video publication for security reasons.
  • Interaction with Pythagora involves iterative cycles where it performs tasks autonomously but requires user input when necessary, such as debugging or providing custom configurations.

Task Management within Development

  • As development progresses, Pythagora creates a detailed development plan through its tech lead agent.

Developer Agent in Action

Overview of the Development Process

  • The developer agent initiates code writing, displaying real-time updates and diffs of the new code being generated.
  • A progress bar is available to track application building stages, alongside a logs tab for monitoring front-end and back-end activities.
  • Instructions are provided for human testing, including starting the server using npm start or a convenient button.

Verifying Application Functionality

  • After starting the server, confirmation is achieved by opening localhost in a browser to ensure the application is running correctly.
  • Users are prompted to register; successful registration leads to verification through login functionality.
  • The system generates significant amounts of code automatically, visible under various files such as user.js.

Database Interaction

  • Users must check their MongoDB client or Atlas to verify database integrity, specifically looking for a users collection containing the correct user documents.
  • Confirming user creation includes checking that passwords are stored hashed and that roles are set correctly in the database.

Admin User Creation Script

  • A task is initiated to create an admin user script; upon execution, it updates the relevant code files while requiring input for admin credentials.
  • The process involves updating environment variables in the .env file, providing details such as username and password before proceeding.

Testing and Iterative Development

  • After creating an admin user via npm command, further tests are required to validate app functionality and access control features.
  • Successful login leads to navigating towards an admin dashboard; however, initial attempts may result in error pages, which is expected at this stage of testing.

Implementing New Features

  • A new task begins focused on developing an admin dashboard page; ongoing coding efforts are documented with explanations detailing each step taken by the developer agent.
  • Real-time updates reflect changes made in code as tasks progress iteratively through defined steps.

Admin Dashboard Functionality Implementation

Initial Setup and Navigation

  • The admin dashboard is accessed by logging in with admin credentials, revealing a simple interface that displays usernames and roles.
  • The development process involves multiple epics, indicating structured progress through various application features.

Role Management Features

  • A new task is introduced to allow admins to change user roles between "viewer" and "creator," showcasing the evolving functionality of the application.
  • Demonstration of changing a user's role from viewer to creator, followed by refreshing the dashboard to confirm successful updates.

Access Control Testing

  • After updating roles, testing access control ensures non-admin users are denied access to the admin page, confirming security measures are effective.
  • Successful re-login as an admin confirms that all functionalities are operational before proceeding with further tasks.
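The access-control behaviour tested above — non-admins denied, admins allowed — is typically implemented as Express-style middleware. A minimal sketch, testable without Express itself (the middleware name and exact implementation are assumptions):

```javascript
// Sketch of a role-check middleware with the Express (req, res, next)
// signature. Non-matching roles get a 403; matching roles proceed.
function requireRole(role) {
  return function (req, res, next) {
    if (req.user && req.user.role === role) {
      next(); // authorized: continue to the route handler
    } else {
      res.statusCode = 403; // forbidden: stop the request here
      res.body = 'Forbidden';
    }
  };
}

// Mock request/response objects demonstrating the denied case.
const res = { statusCode: 200, body: null };
let allowed = false;
requireRole('admin')({ user: { role: 'viewer' } }, res, () => { allowed = true; });
```

In the real app, a route like the admin dashboard would mount this ahead of its handler, which matches the observed behaviour: viewers hit an error page, admins see the dashboard.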

New Page Creation for Tests

  • A new task involves creating a page that lists tests with their names and creation dates; this enhances the application's usability.
  • Logging in as a regular user shows no tests available initially, validating that the test listing feature is functioning correctly.

Database Population and Deletion Features

  • Implementing a script to populate the database with sample test data is confirmed successful after running necessary commands.
  • Introduction of delete buttons on the test page allows admins to remove tests easily; deletion actions are demonstrated successfully.

Pagination Implementation

  • Pagination is discussed as an upcoming feature; it aims to improve navigation when there are numerous tests listed on the page.
  • Testing pagination includes verifying controls appear when more than ten tests exist, ensuring users can navigate through pages effectively.
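The pagination rule described above (ten tests per page, controls only when more than one page exists) can be sketched as a small helper. The function and field names are assumptions; in a MongoDB-backed app the `skip`/`limit` values would feed the query:

```javascript
// Sketch of pagination math: 10 items per page by default,
// controls shown only when more than one page exists.
function paginate(totalItems, page, perPage = 10) {
  const totalPages = Math.max(1, Math.ceil(totalItems / perPage));
  const current = Math.min(Math.max(1, page), totalPages); // clamp to range
  return {
    skip: (current - 1) * perPage, // would be passed to the query's .skip()
    limit: perPage,                // would be passed to .limit()
    totalPages,
    showControls: totalPages > 1,
  };
}
```

With 25 tests, page 2 skips the first 10 and shows controls; with 7 tests, everything fits on one page and the controls stay hidden, matching the behaviour verified in the video.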

Issue Resolution in Test Data Population

Test Creation and Debugging Process

Running Tests and Pagination

  • The existing test is deleted, allowing for a rewrite of the code. After restarting the app and refreshing the page, new tests appear successfully.
  • Pagination functionality is confirmed as working correctly with 10 tests displayed per page after adding more tests.

Creating New Test Page

  • A new page for test creation is initiated, featuring fields for user messages, review messages, request numbers, and dynamic sections for LLM providers and models.
  • Emphasis on the complexity of building a full-stack application; users should expect to invest significant time in development rather than quick results.

Testing Form Functionality

  • Steps outlined to test the form include starting the server, logging in as an admin or creator, navigating to create a test link, and ensuring all fields are displayed correctly.
  • The process involves filling out the form with specific inputs (e.g., user message "What is 2 plus 2") and testing various LLM options.

Submitting Tests to Database

  • Upon clicking "Create Test," nothing happens initially; however, inspecting the console logs confirms that the form is submitting correctly.
  • Backend functionality implementation begins to handle test creation effectively while verifying results through database checks.

Identifying Bugs in Test Names

  • An issue arises where all test names are deleted during creation; this bug needs investigation as names appear blank in entries.
  • Logging added to help identify issues; attempts made to recreate the problem by submitting forms without names.

Resolving Issues with Test Names

  • The problem identified relates to missing name fields in sample data required by the model. Steps taken include refreshing pages after running scripts.
  • After troubleshooting, it’s confirmed that names now appear correctly. However, adjustments are needed so that users are prompted for a name when creating tests.

Final Adjustments and Feedback Loop

Test Creation and Execution Process

Overview of Test Creation

  • The process begins with the creation of a test, including fields for test name and user messages. A successful test creation is confirmed.
  • The next step involves updating the test list page to display newly created tests, which has already been completed.

Backend Logic Implementation

  • Emphasis is placed on implementing backend logic for executing tests, marking it as an important task in the development process.
  • Real-time progress tracking for test execution is identified as the next significant feature to implement.

Testing Procedures

  • The testing procedure includes logging in as an admin, creating a new test, and navigating to view all tests.
  • Upon executing a newly created test, immediate completion is noted; however, there are concerns about visibility of progress indicators.

Debugging and Feedback

  • Issues arise regarding the absence of a visible progress bar during execution. This prompts further debugging to ensure that the logic for sending tests to the LLMs (large language models) functions correctly.
  • A new test involving multiple LLM requests is created to verify if this affects the visibility of progress updates.

Progress Tracking Confirmation

  • Observations confirm that when using multiple LLM requests, the progress bar appears; however, whether the execution is actually calling these models remains uncertain at this point.
  • Subsequent tests show improved functionality with visible progress bars and confirmation that results are being stored in the database effectively.
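The progress indicator discussed above maps completed requests to a percentage. A minimal sketch, assuming the app tracks completed vs. total requests per test run (field names are assumptions):

```javascript
// Sketch of per-run progress tracking: with N total LLM requests,
// progress advances as each request completes.
function progressPercent(completedRequests, totalRequests) {
  if (totalRequests === 0) return 100; // nothing to do counts as done
  return Math.round((completedRequests / totalRequests) * 100);
}
```

This also explains why a single-request test appeared to complete instantly: progress jumps from 0 to 100 in one step, so the bar is barely visible.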

Evaluation Process Using Anthropic Models

Implementing Evaluation Features

  • Following successful testing procedures, focus shifts towards implementing evaluation processes using Anthropic models alongside OpenAI API keys.

Final Testing Steps

Testing LLMs with GPT-3.5 Turbo and Claude 3

Initial Setup and Execution

  • The process begins with testing GPT-3.5 Turbo alongside another LLM, Claude 3 Sonnet, by creating a test and executing it.
  • Steps to start the app include logging in, navigating to the test page, executing the test, and checking results.

Test Creation Process

  • A new test is created titled "6 + 6" with user input expecting an answer of 12.
  • An error occurs when navigating to the /tests page after creating a new test; the same issue also appears on initial access to that page.

Debugging Errors

  • The error is logged for further analysis; backend logs are copied as there are no frontend logs available.
  • The speaker appreciates how easily they can pinpoint errors due to comprehensive logging integrated into their workflow.

Progressing Through Tests

  • After fixing the previous issue, tests are executed again successfully, showing linked names and completed tasks.
  • Results from both models indicate correct evaluations (GPT response: true; Claude response: true), showcasing effective benchmarking capabilities.
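The true/false evaluations above can be illustrated with a deliberately simplified check. In the real app a review LLM judges each response; here the judgment is reduced to a substring match on the expected answer, purely to show the shape of the result records (all names are assumptions):

```javascript
// Simplified sketch of the evaluation step: the real app asks a review
// LLM whether a response is correct; this stand-in just checks that the
// expected answer appears in the model's response text.
function evaluateResponse(response, expectedAnswer) {
  return response.toLowerCase().includes(String(expectedAnswer).toLowerCase());
}

// Result records shaped like the "6 + 6" test described above.
const results = [
  { model: 'gpt-3.5-turbo', response: '6 + 6 equals 12.' },
  { model: 'claude-3-sonnet', response: 'The answer is 12.' },
].map(r => ({ ...r, correct: evaluateResponse(r.response, 12) }));
```

Both records come back `correct: true`, mirroring the "GPT response: true; Claude response: true" outcome shown in the video.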

Enhancements and Features

  • Plans for implementing a bar chart feature are discussed along with self-reflection on code quality improvements.
  • The application updates its routing files automatically as part of ongoing development efforts.

Publishing Test Results

  • New functionality allows publishing of test results for sharing purposes; users can revert or add features easily using Pythagora tools.
  • Upon attempting to publish results, another error arises which requires reproduction for debugging.

Final Debugging Steps

  • The debugging process involves adding logs again while reproducing errors encountered during publishing attempts.
  • Despite repeated issues during testing phases, the speaker finds value in seamless debugging processes that allow quick identification of problems.

Debugging the Publishing Process

Encountering Errors During Publishing

  • The user experiences a bug while publishing a test, receiving an error message despite the status indicating that it was published successfully.
  • Upon attempting to publish another test, the same error occurs; however, backend logs confirm that the publishing function is operational.
  • The need for additional logging is identified to better understand why errors are being reported during the publishing process.

Analyzing Frontend Logs

  • The user inspects browser developer tools to analyze frontend logs and identifies issues that may be causing errors in the application.
  • After gathering more information from both backend and frontend logs, they proceed with sending this data for further analysis.

Fixing Bugs and Testing Functionality

  • With sufficient log data collected, efforts are made to fix the identified bugs. A successful test of the publishing function confirms that everything is now working correctly.
  • The next step involves implementing backend logic for publishing tests and ensuring that users can see published tests without needing authentication.

Finalizing Application Features

Updating Homepage Display

  • The final task includes updating the homepage to display a list of published tests accessible by non-authenticated users.
  • Users can view published tests on the homepage but cannot access detailed results yet; progress tracking indicates upcoming features.
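The homepage rule above — anonymous visitors see only published tests — amounts to a simple visibility filter. A sketch, where the `published` flag name is an assumption:

```javascript
// Sketch of homepage test visibility: authenticated users see every
// test, anonymous visitors see only those flagged as published.
function visibleTests(tests, isAuthenticated) {
  return isAuthenticated ? tests : tests.filter(t => t.published);
}

const allTests = [
  { name: 'Math benchmark', published: true },
  { name: 'Draft test', published: false },
];
const publicView = visibleTests(allTests, false);
```

In practice this filtering would happen server-side in the MongoDB query (e.g. matching on the published flag) so unpublished tests never leave the backend.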

Completing Application Setup

  • After confirming functionality, steps are taken to create documentation (README file), detailing setup instructions and reproduction steps for future reference.
  • Navigation links are already tested and confirmed functional; thus, this step is skipped in favor of moving forward with deployment preparations.

Deploying the Application

Deployment Process

  • The user expresses satisfaction with app performance before proceeding to deploy it with a single click.
  • Upon successful deployment, they share a link for others to access the live application online.

Reflection on Development Experience

  • Acknowledgment of creating a full-stack application without writing any code highlights efficiency achieved through collaboration with Pythagora.
Video description

Let's build an LLM Benchmarking application with 1600 lines of code without writing any code ourselves.

  • Get early access to Pythagora: https://pythagora.ai/v1?r=mb
  • Download the OpenSource code: https://github.com/Pythagora-io/gpt-pilot
  • Here's the app I built: https://31711443-d24d-42f3-a04a-9e96e6978317.api.pythagora.ai/tests/66f7043c28678c9fb6e582b8
  • Prompt Used in the App: https://docs.google.com/document/d/1shIP8cIy2ReCkFQKed5zy44orhr8cgv1ZSwpM1OcQMg/edit
  • Total cost to build = $33

Join My Newsletter for Regular AI Updates 👇🏼 https://forwardfuture.ai

My Links 🔗
  👉🏻 Main Channel: https://www.youtube.com/@matthew_berman
  👉🏻 Twitter: https://twitter.com/matthewberman
  👉🏻 Discord: https://discord.gg/xxysSXBxFW
  👉🏻 Patreon: https://patreon.com/MatthewBerman
  👉🏻 Instagram: https://www.instagram.com/matthewberman_ai
  👉🏻 Threads: https://www.threads.net/@matthewberman_ai
  👉🏻 LinkedIn: https://www.linkedin.com/company/forward-future-ai

Media/Sponsorship Inquiries ✅ https://bit.ly/44TC45V