RepoSwarm - Giving AI Agents Architecture Context Across All Your Repos - Roy Osherove
Introduction to AI Tools and Architectures
Audience Engagement
- The speaker begins by asking the audience about their experience with AI coding tools such as Cursor and Claude Code.
- Queries are posed regarding the size of organizations represented, specifically those with over 100, 500, or 1,000 GitHub repositories.
Speaker Background
- Roy Osherove introduces himself as chief architect at Verbit AI and author of several books on unit testing and leadership.
- He mentions his blog, robotpaper.ai, which covers various topics including unit testing and AI.
The Role of AI in Software Development
Benefits of Writing Books
- Roy explains that having written books lets him prompt AI tools to review code in the style those books teach.
Importance of Readable Code
- He emphasizes the significance of creating readable, maintainable, and trustworthy code.
Refactoring Legacy Code
- Roy shares a recommended prompt for refactoring legacy code: ask the AI to apply the techniques from Michael Feathers' book Working Effectively with Legacy Code.
Transforming Workflows with AI
New Working Methods
- The talk focuses on how AI can enable new workflows rather than relying solely on traditional methods.
Information Condensation
- Roy highlights that AI can transform large amounts of information into concise formats for future use.
Utilizing NotebookLM for Presentations
Slide Generation Experiment
- Approximately 30% to 40% of the presentation slides were generated using NotebookLM from the blog posts and source code Roy provided.
Effectiveness of New Tools
- The speaker notes that he retained about 60% of what NotebookLM generated because of its high quality.
Challenges in Managing Multiple Repositories
Chief Architect Responsibilities
- As chief architect at Verbit, Roy oversees compliance across approximately 400 repositories while ensuring good practices are followed.
Documentation Issues
- He points out that documentation quality is often poor across repositories, making it difficult to manage effectively.
Real-world Scenario: Consolidating Monitoring Tools
Example Challenge
- Roy describes a scenario in which consolidating monitoring tools across repositories could take days or weeks without a tool like RepoSwarm.
How to Ensure Accurate Data Handling in HIPAA-Compliant Applications
Importance of Up-to-Date Documentation
- Organizations often lack current diagrams showing how data is handled across their applications, which is crucial for HIPAA compliance.
- Many existing diagrams are outdated and irrelevant, necessitating time-consuming recreations that may not be completed effectively.
Challenges with Developer Resources
- Developers frequently struggle to find documentation on new services or event schemas, leading to inefficient manual processes during code reviews.
- The introduction of AI coding agents requires context about feature usage and component standards, which is often missing from existing documentation.
Liabilities in Current Documentation Practices
- Standard README files in repositories often lack comprehensive onboarding information, making it difficult for new developers to get up to speed quickly.
- Architecture documentation tends to be static and poorly formatted for AI readability, resulting in a loss of valuable insights over time.
Introduction of the RepoSwarm Tool
- To address these issues, the RepoSwarm tool was developed during a company hackathon. It aims to automate the documentation process using AI.
- The tool compiles snapshots of all repositories into a centralized repository called "Architecture Hub," containing markdown files that are easy for AI agents to read.
Features and Benefits of Architecture Hub
- Each repository's markdown file includes sections like high-level overviews, dependencies, and security analyses generated through specific prompts.
- The architecture hub ensures continuous updates by regenerating content daily without needing backward compatibility adjustments when new components are added.
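A minimal sketch of how one per-repository markdown file might be assembled from prompt results. The function name, the section titles, and the header layout are assumptions made for illustration; the talk only states that each file contains sections such as overviews, dependencies, and security analyses.

```python
from datetime import date


def render_arch_markdown(repo_name: str, sections: dict[str, str], generated_on: date) -> str:
    """Join prompt results into one markdown document (hypothetical layout)."""
    lines = [f"# {repo_name}", "", f"_Generated on {generated_on.isoformat()}_", ""]
    # One "##" section per prompt result, in insertion order.
    for title, body in sections.items():
        lines += [f"## {title}", "", body.strip(), ""]
    return "\n".join(lines)


# Placeholder section bodies; in practice each body would be a model's answer.
doc = render_arch_markdown(
    "billing-service",
    {
        "High-Level Overview": "Handles invoicing.",
        "Dependencies": "- example-http-client",
        "Security Analysis": "No findings recorded.",
    },
    date(2025, 1, 1),
)
print(doc)
```

Because the whole file is rebuilt on every run, adding a new section only requires adding one more entry to the dictionary.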
Standardizing Deployment Processes
Importance of Standardization in Deployment
- The speaker emphasizes the need for standardizing deployment processes across various teams within the company, as there are currently multiple types of deployments for front-end and back-end applications that lack uniformity.
Challenges in Achieving Standardization
- The process of gathering information from different teams to understand their deployment practices is time-consuming. The speaker suggests that while it may be difficult to achieve full standardization initially, significant progress can be made by focusing on key areas like authentication.
Authentication and Authorization Issues
- Identifying common authentication libraries and authorization models used across teams is crucial. There may be inconsistencies where different teams utilize varying models, which could benefit from a centralized approach.
Data Mapping and Security Checks
- Understanding data flow, storage schemas, and security checks is essential. The speaker mentions involving security experts to prioritize top vulnerabilities in repositories for timely fixes.
Documentation Practices
- Comprehensive documentation of dependencies on machine learning services (e.g., Gemini, Claude), feature flags, and prompt security checks is necessary. This ensures all relevant information is captured systematically.
Utilizing AI for Documentation
Benefits of AI in Documentation
- A single markdown file per repository can serve as a comprehensive summary that stays up to date through daily commits. This allows for easy querying and retrieval of information across multiple repositories.
Advantages Over Traditional Documentation
- Unlike traditional documentation that can become stale, this method ensures continuous updates and readability by AI tools, enhancing accessibility to critical information.
Querying Information Efficiently
- Users can leverage tools like Cursor to quickly access answers about monitoring tools or other queries within minutes, achieving a high level of trust in the responses provided by AI systems.
Exploring Repository Structures
Overview of Repository Contents
- The structure includes details such as repository names, project purposes, major dependencies from package.json files, and directory structures presented clearly in markdown format.
Utilizing Advanced Tools for Queries
- Tools equipped with augmented search capabilities index content effectively. This enables users to find accurate answers without needing extensive context input manually.
Practical Applications with Cursor
Demonstrating Cursor's Capabilities
- The speaker shares experiences using Cursor for generating reports based on user queries about programming languages or other topics relevant to the repositories being documented.
Ensuring Consistency in Responses
- Initial experiments revealed inconsistencies in AI-generated answers; thus, employing CLI tools alongside well-crafted prompts became essential for improving accuracy and efficiency when generating reports.
Generating Reports with CLI Tools
Overview of Commands and Rules
- The process begins with a set of commands outlined in a markdown file, which serves as a prompt for generating reports.
- Specific instructions include reading all arch.md files while excluding the prompts, diagrams, and reports folders to avoid false positives from previous reports.
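The exclusion rule above can be sketched as a small file collector. The `.arch.md` suffix and the exact folder names are assumptions based on the description; the real prompt is a markdown instruction file, not Python.

```python
from pathlib import Path

# Folder names assumed from the talk summary: prompts, diagrams, reports.
EXCLUDED_DIRS = {"prompts", "diagrams", "reports"}


def collect_arch_files(hub_root: Path) -> list[Path]:
    """Return architecture files under hub_root, skipping excluded folders."""
    found = []
    for path in sorted(hub_root.rglob("*.arch.md")):
        # Check only the directories between hub_root and the file itself.
        parent_parts = path.relative_to(hub_root).parts[:-1]
        if EXCLUDED_DIRS.isdisjoint(parent_parts):
            found.append(path)
    return found
```

Skipping the reports folder matters most: an earlier report that mentions a monitoring tool would otherwise be counted as evidence that a repository uses it.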
Report Generation Process
- Report generation applies Cursor rules and requires that only factual information be used, to avoid hallucinations.
- A new directory is created for storing reports in memory, and the system starts planning its next moves based on the provided prompt.
Execution and Results
- The tool utilizes various CLI tools to search for monitoring types within repositories; however, there are doubts about finding substantial results.
- It’s noted that running the same prompt multiple times may yield different results due to the non-deterministic nature of models.
Quality Assurance and Feedback
- After generating reports, they are sent to tech leads for review to ensure completeness and identify any missed critical information.
- Continuous improvement of prompts has been emphasized as essential for better outcomes over time.
Insights on Report Content
- Each report includes an overview summarizing key findings such as technologies used (e.g., OpenTelemetry), along with architecture diagrams.
- Generating a report takes approximately five minutes even across multiple repositories; adjustments can be made based on specific needs or contexts.
Model Selection Considerations
- Different models can be tested for better results; options like Codex or Grok may provide enhanced outputs but at higher costs.
- An example report generated today shows repository distributions by language within an architecture hub context.
Repository Management Practices
- Only repositories with commits in the past 12 months are indexed to maintain relevance; this ensures focus on active projects.
Advanced Reporting Techniques in Software Development
Overview of Advanced Reporting
- The speaker introduces a more advanced example of reporting, emphasizing the potential for extensive analysis and insights once the output is generated.
- Time constraints are acknowledged, with 15 minutes remaining for the discussion. The speaker highlights various applications such as onboarding new developers, conducting architecture reviews, and performing compliance and security checks.
Real-World Application Scenarios
- A common scenario discussed involves deprecating a service within a company, necessitating identification of all dependencies across teams to ensure proper communication and planning.
- Manual processes can be time-consuming; however, utilizing advanced reporting tools can significantly reduce workload by automating up to 80-90% of tasks related to service updates.
Monitoring Service Analysis
- The speaker describes challenges in consolidating API gateways when transitioning to new systems like Kong while managing existing ones.
- Continuous monitoring allows for daily updates on architectural documents based on new commits across repositories, creating an evolving snapshot of architecture history.
Architectural Decision Records (ADRs)
- The concept of architecture history enables teams to ask insightful questions about past decisions and changes over time.
- ADRs serve as documentation for significant architectural decisions; they help understand project evolution and rationale behind current states.
Utilizing AI for Enhanced Reporting
- The analogy of tree rings illustrates how historical data can reveal patterns in architectural changes over time.
- AI tools were employed to generate creative reports from existing data, including prompts that facilitate the creation of ADR records based on git history sampling.
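The talk does not say how git history is sampled. One plausible approach, shown here purely as an assumption, is to pick evenly spaced commits so that the oldest and newest states are always included, and then hand those snapshots to an ADR-writing prompt.

```python
def sample_commits(commits: list[str], sample_size: int) -> list[str]:
    """Pick evenly spaced commits, always keeping the first and the last.

    `commits` is assumed to be ordered oldest to newest.
    """
    if sample_size <= 0 or not commits:
        return []
    if sample_size >= len(commits):
        return list(commits)
    if sample_size == 1:
        return [commits[-1]]
    last = len(commits) - 1
    # Integer arithmetic keeps indices exact and strictly increasing.
    indices = [(i * last) // (sample_size - 1) for i in range(sample_size)]
    return [commits[i] for i in indices]


history = [f"c{i}" for i in range(10)]  # placeholder commit ids
print(sample_commits(history, 4))
```

Comparing the architecture summary at each sampled commit would then expose the points where a significant decision was made.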
Examples from Real Projects
- Specific examples include decisions made regarding serverless implementations and database integrations that were not previously documented but are now captured through enhanced reporting methods.
Documenting Git History and Tech Adoption
Importance of Documenting Git History
- The speaker emphasizes the significance of documenting git history, which is often overlooked. This documentation serves as a valuable resource for tracking changes and decisions made within projects.
Custom Commands for Tech Adoption
- A discussion on tech adoption and deprecation highlights the use of custom commands in project management. These commands can be tailored to specific needs, enhancing workflow efficiency.
Overview of Repository Architecture
- The speaker describes the architecture behind RepoSwarm, mentioning that it runs on Temporal, a workflow orchestration engine designed for durability and error handling.
Repository Discovery and Management
GitHub Integration
- The system integrates with GitHub using a special token to discover repositories automatically based on a predefined JSON file (repos.json). This allows for easy updates without manual intervention.
Filtering Repositories
- Repositories are filtered by age: those with no commits in the past year are excluded. Specific repositories can also be skipped when they are not relevant or useful for documentation purposes.
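A minimal sketch of this filter, assuming each repository record carries a name and the timestamp of its last commit. The field names and the `skip_names` parameter are hypothetical; the actual layout of repos.json is not described in the talk.

```python
from datetime import datetime, timedelta, timezone


def select_active_repos(repos, now, skip_names=frozenset(), max_age_days=365):
    """Keep repos with a recent commit that are not explicitly skipped."""
    cutoff = now - timedelta(days=max_age_days)
    return [
        repo["name"]
        for repo in repos
        if repo["name"] not in skip_names and repo["last_commit"] >= cutoff
    ]


# Placeholder data for illustration only.
utc = timezone.utc
now = datetime(2025, 6, 1, tzinfo=utc)
repos = [
    {"name": "active-api", "last_commit": datetime(2025, 3, 1, tzinfo=utc)},
    {"name": "stale-tool", "last_commit": datetime(2023, 1, 15, tzinfo=utc)},
    {"name": "sandbox", "last_commit": datetime(2025, 5, 1, tzinfo=utc)},
]
print(select_active_repos(repos, now, skip_names={"sandbox"}))
```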
Workflow Execution and Prompting
Fan-Out Process
- Once repositories are identified, a fan-out process initiates new workflows to analyze each repository's structure, execute prompts, store results, and perform cleanup tasks efficiently.
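The fan-out pattern can be illustrated with plain `asyncio` as a stand-in for the workflow engine. This is not the tool's orchestration code; the step names simply mirror the bullet above, and `max_parallel` is a hypothetical knob.

```python
import asyncio


async def analyze_repository(repo_name: str) -> dict:
    """Stand-in for one per-repository workflow."""
    steps = []
    try:
        steps.append("analyze_structure")
        steps.append("execute_prompts")
        steps.append("store_results")
    finally:
        # Cleanup runs even if an earlier step raises.
        steps.append("cleanup")
    return {"repo": repo_name, "steps": steps}


async def fan_out(repo_names: list[str], max_parallel: int = 4) -> list[dict]:
    """Start one analysis per repository, bounded by max_parallel."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def bounded(name: str) -> dict:
        async with semaphore:
            return await analyze_repository(name)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(name) for name in repo_names))


results = asyncio.run(fan_out(["repo-a", "repo-b", "repo-c"]))
print([result["repo"] for result in results])
```

A durable workflow engine adds what this sketch lacks: retries, persisted state, and the ability to resume after a crash.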
Role of Prompts in Analysis
- Prompts are central to the product's functionality; they determine how effectively data is interrogated across different repositories. Base prompts apply universally while specialized prompts cater to specific repository types.
Enhancing Prompt Efficiency
Contextual Injection Between Prompts
- The system allows results from higher-level prompts to inform lower-level ones, enhancing context and improving overall analysis quality. This sequential prompting strategy optimizes information retrieval.
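A rough sketch of sequential prompting with context injection. The prompt dictionary keys (`name`, `text`, `context`) and the `ask_model` callable are assumptions for illustration, not the tool's actual schema.

```python
from typing import Callable


def run_prompt_chain(prompts: list[dict], ask_model: Callable[[str], str]) -> dict[str, str]:
    """Run prompts in order, prepending earlier answers named in 'context'."""
    results: dict[str, str] = {}
    for prompt in prompts:
        context_blocks = [
            f"## Context from {name}\n{results[name]}"
            for name in prompt.get("context", [])
            if name in results
        ]
        full_prompt = "\n\n".join(context_blocks + [prompt["text"]])
        results[prompt["name"]] = ask_model(full_prompt)
    return results


# A fake model that records what it was asked, so the chaining is visible.
seen_prompts: list[str] = []


def fake_model(prompt_text: str) -> str:
    seen_prompts.append(prompt_text)
    return f"answer-{len(seen_prompts)}"


chain = [
    {"name": "overview", "text": "Summarize the repository."},
    {"name": "security", "text": "List security risks.", "context": ["overview"]},
]
answers = run_prompt_chain(chain, fake_model)
print(seen_prompts[1])
```

The second prompt sees the overview's answer before its own question, so the lower-level prompt does not have to rediscover what the repository is.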
Structure of Prompt JSON
- A detailed overview of the prompt JSON structure reveals its role in organizing shared repository information. The core directory contains essential base prompts crucial for effective operation.
Understanding Prompt Management and Optimization
Importance of Versions in Prompts
- The discussion highlights the significance of prompt versions, noting that changes to a prompt version can trigger a workflow run even without new commits.
- This optimization is designed to save tokens by only executing prompts that have changed or when code modifications occur.
Caching Mechanisms
- The system utilizes caching mechanisms such as DynamoDB or local files to enhance efficiency, running only specific prompts if they have been altered.
- There is mention of prompt inheritance and tagging based on file types, which aids in context chaining for more effective security checks.
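One way to express "run only when the code or the prompt changed" is a cache key built from the repository, its latest commit, and the prompt version. This is a sketch under that assumption; the in-memory set stands in for DynamoDB or a local file, and the key format is hypothetical.

```python
import hashlib


def cache_key(repo: str, commit_sha: str, prompt_name: str, prompt_version: int) -> str:
    """Hash everything that should invalidate a cached prompt result."""
    raw = "|".join([repo, commit_sha, prompt_name, str(prompt_version)])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def prompts_to_run(repo: str, commit_sha: str, prompts: list[dict], cache: set[str]) -> list[str]:
    """Return names of prompts with no cached result for this commit and version."""
    return [
        prompt["name"]
        for prompt in prompts
        if cache_key(repo, commit_sha, prompt["name"], prompt["version"]) not in cache
    ]


prompts = [{"name": "overview", "version": 1}, {"name": "security", "version": 1}]
cache = {cache_key("repo-a", "abc123", p["name"], p["version"]) for p in prompts}

# Same commit, one prompt bumped to version 2: only that prompt reruns.
bumped = [{"name": "overview", "version": 1}, {"name": "security", "version": 2}]
print(prompts_to_run("repo-a", "abc123", bumped, cache))
```

A new commit changes every key for that repository, so all prompts rerun; a version bump changes one key, so a single prompt reruns even without new commits.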
Investigation Process
- Investigating a repository typically takes between 10 and 15 minutes, with shorter times for smaller repositories. Costs range from $0.30 to $1.00 per repository depending on the model used.
- Plans are underway to improve logging during investigations, enhancing transparency and tracking.
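As a back-of-the-envelope check, the per-repository cost range can be scaled to a full run. The repository count of 400 is borrowed from the figure mentioned earlier in the talk and is only illustrative, since stale repositories are not indexed.

```python
def estimate_run_cost(repo_count: int, low_per_repo: float = 0.30, high_per_repo: float = 1.00):
    """Return (low, high) dollar estimates for investigating repo_count repos."""
    return (round(repo_count * low_per_repo, 2), round(repo_count * high_per_repo, 2))


low, high = estimate_run_cost(400)
print(f"Full run: ${low:.2f} to ${high:.2f}")
```

This is the worst case in which every repository is investigated from scratch; runs that skip unchanged repositories and unchanged prompts would cost less.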
Future Developments
- Future enhancements include integrating Slack notifications for critical updates and automating documentation indexing for easier querying without direct access.
- Cross-repository reasoning aims to allow investigations across multiple repositories simultaneously, verifying dependencies and facilitating future agent interactions.