How to Get Your Data Ready for AI Agents (Docs, PDFs, Websites)
Building AI Agents with Open Source Document Extraction
Introduction to AI Agents and Data Access
- The initial step in building AI agents involves providing them access to various data types, such as documents, PDFs, and websites, to enhance their knowledge about specific problems or companies.
- While many online tools for document parsing exist, they often require API keys and are closed source. However, open-source alternatives can achieve similar results without these limitations.
Overview of the Document Extraction Pipeline
- The video will demonstrate how to create a fully open-source document extraction pipeline using Python and a library called Docling.
- Key techniques covered include extraction, parsing, chunking, embedding, and retrieval—essential for developing a knowledge system for AI agents.
Setting Up the Environment
- To follow along with the tutorial, viewers need to set up an environment and install required packages listed in a requirements.txt file. An OpenAI API key is also necessary for creating embeddings.
- Although OpenAI's services are used in this example, users can opt for entirely open-source models if preferred.
Steps in the Document Processing Workflow
- The process consists of five main steps: extracting document content, performing chunking, creating embeddings for a vector database, testing search functionality, and integrating everything into a chat application.
Extracting Content Using Docling
- The tutorial begins by demonstrating how to use Docling—a highly regarded open-source document extraction library from IBM—to extract content from PDF files.
- Users will learn how to convert PDFs into structured data models that facilitate further processing.
Advantages of Using Docling
- After running the conversion process on a PDF file using Docling’s capabilities (which may take time due to model downloads), users receive a structured object containing extracted information.
- One significant advantage of Docling is its ability to handle various data formats (PDFs, PowerPoints, etc.) uniformly through its specialized data model.
Exporting Data Formats
- Once documents are processed into Docling objects, users can export them into different formats like Markdown or JSON for easier manipulation and visualization.
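The export step can be pictured with a toy model of this pattern. The classes below are illustrative stand-ins, not Docling's actual API — they only show why a structured document model makes multiple export formats easy:

```python
import json
from dataclasses import dataclass, field

# Hypothetical stand-ins for the kind of structured document model a
# parser like Docling produces (these are NOT Docling's real classes).

@dataclass
class Section:
    heading: str
    text: str

@dataclass
class ParsedDocument:
    title: str
    sections: list = field(default_factory=list)

    def export_to_markdown(self) -> str:
        # Markdown export: title as H1, each section as H2 plus body text.
        lines = [f"# {self.title}"]
        for s in self.sections:
            lines.append(f"## {s.heading}")
            lines.append(s.text)
        return "\n\n".join(lines)

    def export_to_json(self) -> str:
        # JSON export keeps the full structure for programmatic use.
        return json.dumps({
            "title": self.title,
            "sections": [{"heading": s.heading, "text": s.text} for s in self.sections],
        })

doc = ParsedDocument("Report", [Section("Intro", "Hello world.")])
```

Because the structure is captured once, each output format is just a different serialization of the same object.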
Table Extraction Capabilities
PDF and HTML Data Extraction Techniques
Introduction to Document Parsing
- The speaker discusses the effectiveness of parsing PDFs, highlighting a simple example that demonstrates successful extraction.
- Transitioning from PDF to HTML, the speaker explains how the same converter can process web pages by simply inputting a URL instead of a PDF.
Extracting Data from Web Pages
- To extract data from an entire website, the speaker introduces using sitemap.xml, which most websites provide to list all their URLs.
- A helper function named `get_sitemap_urls` is created to fetch and return all URLs found in the site's XML file.
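A sketch of what such a helper might look like, using only the standard library (the function names and exact behavior are assumptions, not the video's code):

```python
import urllib.request
import xml.etree.ElementTree as ET

# Sitemap files declare this XML namespace, so we include it when
# looking for <loc> elements.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> URL found in a sitemap.xml payload."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")]

def get_sitemap_urls(base_url: str) -> list[str]:
    """Fetch <base_url>/sitemap.xml and return the listed URLs.

    Hypothetical helper mirroring the one described in the video.
    """
    with urllib.request.urlopen(base_url.rstrip("/") + "/sitemap.xml") as resp:
        return parse_sitemap(resp.read().decode("utf-8"))
```

Splitting the parsing into its own function keeps the network call separate from the XML handling, which makes the logic easy to test on a static string.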
Looping Through Site Map URLs
- By plugging in a URL, users can retrieve all URLs listed in the site map, allowing for systematic extraction of each page's content.
- The Docling library offers a method called `convert_all`, enabling users to loop through multiple pages and gather documents efficiently.
Custom Extraction Parameters
- The ability to handle various document types (PDFs, web pages, etc.) forms the foundation for building knowledge extraction systems.
- The speaker encourages viewers to explore additional resources available for developers looking to enhance their skills or find freelance opportunities.
Chunking Data for AI Systems
- After extracting data, chunking is introduced as a method of dividing documents into smaller parts for better querying within AI systems.
- Chunking isn't merely about splitting text; it involves creating logical segments that fit well together based on context.
Advanced Chunking Techniques
- The Docling library provides two methods for chunking: a hierarchical chunker and a hybrid chunker. These help organize data into meaningful groups automatically.
- Hybrid chunkers address both oversized chunks (by splitting them) and undersized chunks (by combining smaller pieces), ensuring compatibility with the embedding models used later in processing.
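The split-then-merge idea can be sketched in plain Python. This illustrates the concept only — it is not Docling's implementation — and uses naive whitespace counting as a stand-in for a real tokenizer:

```python
# Conceptual sketch of hybrid chunking: split oversized chunks, then
# merge adjacent undersized ones, so every chunk fits the token budget.

def count_tokens(text: str) -> int:
    # Naive stand-in for a real tokenizer's token count.
    return len(text.split())

def split_oversized(text: str, max_tokens: int) -> list[str]:
    """Split a chunk exceeding max_tokens into max_tokens-sized pieces."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

def hybrid_chunk(sections: list[str], max_tokens: int) -> list[str]:
    """Split oversized sections, then merge adjacent undersized ones."""
    pieces: list[str] = []
    for section in sections:
        pieces.extend(split_oversized(section, max_tokens))
    merged: list[str] = []
    for piece in pieces:
        if merged and count_tokens(merged[-1]) + count_tokens(piece) <= max_tokens:
            merged[-1] = merged[-1] + " " + piece  # combine small neighbors
        else:
            merged.append(piece)
    return merged

chunks = hybrid_chunk(["one two three four five six", "a b", "c d"], max_tokens=4)
```

Every resulting chunk stays within the token budget, which is exactly the guarantee the embedding model needs.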
Embedding Models Considerations
- Different embedding models have specific maximum input limits; thus, it's crucial to keep chunks within these constraints when preparing data for analysis.
Creating an OpenAI Tokenizer Wrapper
Overview of the Tokenizer Wrapper
- The speaker developed a simple OpenAI tokenizer wrapper to work with Docling's chunker, which by default expects an open-source tokenizer from Hugging Face. The wrapper adheres to the API specifications the chunker needs to function.
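A minimal sketch of the wrapper idea — the class name and methods here are assumptions, and a real version would delegate to OpenAI's tiktoken library rather than the whitespace tokenizer used below:

```python
# Hypothetical tokenizer wrapper: expose the small interface a chunker
# calls (tokenize / count), backed by whatever tokenizer you have.
# Whitespace splitting stands in for tiktoken here.

class TokenizerWrapper:
    def __init__(self, model_max_length: int = 8191):
        # 8191 is the commonly cited input limit for OpenAI's
        # text-embedding-3-large (verify against current OpenAI docs).
        self.model_max_length = model_max_length

    def tokenize(self, text: str) -> list[str]:
        """Return the token list for text (naive stand-in)."""
        return text.split()

    def count_tokens(self, text: str) -> int:
        return len(self.tokenize(text))

tok = TokenizerWrapper()
```

The point of the wrapper is adaptation: the chunker never needs to know which tokenizer library sits behind the interface.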
Running the PDF Parsing
- The same PDF file is processed again for simplicity, repeating the parsing step shown earlier. The result confirms the PDF has been successfully parsed into a document ready for chunking.
Implementing Hybrid Chunker
- The hybrid chunker from the Docling library is imported and configured with:
- The custom OpenAI tokenizer wrapper.
- Maximum tokens set to match the embedding model's input limit.
- The peer-merging option (enabled by default) to combine smaller neighboring chunks when needed.
- After executing this setup, a total of 36 text chunks are generated from the entire document, ensuring compatibility with the embedding model used.
Embedding Models and Database Integration
Preparing for Embedding
- With the chunks ready, they can be sent to an embedding model to generate vectors, which are then stored in a vector database.
- LanceDB is chosen for its ease of use compared to other databases like PostgreSQL: it offers persistent storage similar to SQLite and a user-friendly API.
Setting Up LansDB
- A new Python file (embedding.py) begins by re-running the previous code while adding steps specific to working with LanceDB.
- A LanceDB helper function allows specifying an embedding model (OpenAI's text-embedding-3-large) and defining table structures using Pydantic models. This streamlines sending and retrieving embeddings without manual intervention.
Defining Data Schema
Schema Structure
- The main schema includes:
- A text field containing extracted content.
- A vector field utilized for search operations.
- Metadata fields capturing essential information such as file name, page numbers of chunks, and document title.
This structure ensures that relevant metadata accompanies each text chunk for future reference or analysis purposes.
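The schema described above might look like the following sketch. Plain dataclasses stand in for the LanceDB Pydantic models the video uses, and the field names are illustrative assumptions:

```python
from dataclasses import dataclass

# Sketch of the table schema: text + vector + metadata per chunk.
# (The video defines this with LanceDB's Pydantic models instead.)

@dataclass
class ChunkMetadata:
    filename: str
    page_numbers: list[int]
    title: str

@dataclass
class Chunk:
    text: str               # the extracted content
    vector: list[float]     # embedding used for similarity search
    metadata: ChunkMetadata

row = Chunk(
    text="Example chunk text.",
    vector=[0.0] * 4,  # real embeddings have the model's dimensionality
    metadata=ChunkMetadata("doc.pdf", [1, 2], "Introduction"),
)
```

Keeping metadata alongside each vector means a search result can immediately tell the user which file and page it came from.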
Processing Chunks into Database
- Each of the 36 chunks undergoes processing where:
- Text is extracted alongside associated metadata (file name, page numbers).
- Results are formatted according to defined chunk models before being sent to LanceDB.
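The per-chunk transformation can be sketched as a small mapping function. The attribute names below are hypothetical, not Docling's actual chunk fields:

```python
from types import SimpleNamespace

# Illustrative sketch: flatten a parsed chunk into the plain record
# shape a vector table expects (attribute names are assumptions).

def chunk_to_record(chunk) -> dict:
    """Extract text plus metadata from a chunk for database insertion."""
    pages = sorted({p for p in getattr(chunk, "pages", [])})
    return {
        "text": chunk.text,
        "metadata": {
            "filename": chunk.filename,
            "page_numbers": pages,
            "title": chunk.title,
        },
    }

demo = SimpleNamespace(text="hi", pages=[2, 1, 2], filename="doc.pdf", title="Intro")
record = chunk_to_record(demo)
```

Deduplicating and sorting the page numbers keeps the metadata stable regardless of how many sub-elements of a page landed in the chunk.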
Understanding the LanceDB API Integration
Handling Errors and Preparing Data
- The speaker discusses encountering bugs in the code, which took about an hour to resolve. This highlights the importance of debugging in programming.
- The LanceDB API simplifies the process of adding chunks to a database by automatically handling embeddings, reducing manual work for developers.
Data Verification and Compatibility
- A method is demonstrated to preview the first 10 rows of the table and confirm that all 36 records were inserted, emphasizing data integrity checks.
- The speaker notes that while they used LanceDB for this example, similar principles can be applied using other databases like PostgreSQL.
Querying with AI Systems
- An interactive session is initiated to connect to the vector database and perform searches based on user queries, showcasing a practical application of embedding technology.
- The search functionality allows limiting results (e.g., setting a limit to five), demonstrating flexibility in retrieving relevant information from large datasets.
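Under the hood, a vector search of this kind ranks stored rows by similarity to the query embedding and keeps the top results. The pure-Python sketch below illustrates the idea; the database handles this internally:

```python
import math

# Conceptual sketch of vector search: rank rows by cosine similarity
# to the query vector and return the top `limit` matches.

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(rows: list[dict], query_vec: list[float], limit: int = 5) -> list[dict]:
    """Return the `limit` rows most similar to the query vector."""
    return sorted(rows, key=lambda r: cosine(r["vector"], query_vec), reverse=True)[:limit]

rows = [
    {"text": "cats", "vector": [1.0, 0.0]},
    {"text": "dogs", "vector": [0.9, 0.1]},
    {"text": "stocks", "vector": [0.0, 1.0]},
]
top = search(rows, [1.0, 0.0], limit=2)
```

A real vector database uses approximate-nearest-neighbor indexes to avoid scanning every row, but the ranking principle is the same.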
Building an Interactive Application
- Introduction of a Streamlit application setup for creating an interactive chat interface. This illustrates how easy it is to develop applications using Python libraries.
- The application connects with the database and includes functions for searching and displaying chat messages, emphasizing user interaction capabilities.
Running and Testing the Application
- Instructions are provided on how to run the Streamlit app locally, ensuring users have all necessary components installed before execution.