How to Get Your Data Ready for AI Agents (Docs, PDFs, Websites)
Building AI Agents with Open Source Document Extraction
Introduction to AI Agents and Data Access
- The initial step in building AI agents involves providing them access to various data types, such as documents, PDFs, and websites, to enhance their knowledge about specific problems or companies.
- While many online tools for document parsing exist, they often require API keys and are closed source. However, open-source alternatives can achieve similar results without these limitations.
Overview of the Document Extraction Pipeline
- The video will demonstrate how to create a fully open-source document extraction pipeline using Python and a library called Docling.
- Key techniques covered include extraction, parsing, chunking, embedding, and retrieval—essential for developing a knowledge system for AI agents.
Setting Up the Environment
- To follow along with the tutorial, viewers need to set up an environment and install required packages listed in a requirements.txt file. An OpenAI API key is also necessary for creating embeddings.
- Although OpenAI's services are used in this example, users can opt for entirely open-source models if preferred.
Steps in the Document Processing Workflow
- The process consists of five main steps: extracting document content, performing chunking, creating embeddings for a vector database, testing search functionality, and integrating everything into a chat application.
Extracting Content Using Docling
- The tutorial begins by demonstrating how to use Docling—a highly regarded open-source document extraction library from IBM—to extract content from PDF files.
- Users will learn how to convert PDFs into structured data models that facilitate further processing.
Advantages of Using Docling
- After running the conversion process on a PDF file using Docling’s capabilities (which may take time due to model downloads), users receive a structured object containing extracted information.
- One significant advantage of Docling is its ability to handle various data formats (PDFs, PowerPoints, etc.) uniformly through its specialized data model.
Exporting Data Formats
- Once documents are processed into Docling objects, users can export them into different formats like Markdown or JSON for easier manipulation and visualization.
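The export step can be pictured with a toy model of this pattern. The classes below are illustrative stand-ins, not Docling's actual API — they only show why a structured document model makes multiple export formats easy:

```python
import json
from dataclasses import dataclass, field

# Hypothetical stand-ins for the kind of structured document model a
# parser like Docling produces (these are NOT Docling's real classes).

@dataclass
class Section:
    heading: str
    text: str

@dataclass
class ParsedDocument:
    title: str
    sections: list = field(default_factory=list)

    def export_to_markdown(self) -> str:
        # Markdown export: title as H1, each section as H2 plus body text.
        lines = [f"# {self.title}"]
        for s in self.sections:
            lines.append(f"## {s.heading}")
            lines.append(s.text)
        return "\n\n".join(lines)

    def export_to_json(self) -> str:
        # JSON export keeps the full structure for programmatic use.
        return json.dumps({
            "title": self.title,
            "sections": [{"heading": s.heading, "text": s.text} for s in self.sections],
        })

doc = ParsedDocument("Report", [Section("Intro", "Hello world.")])
```

Because the structure is captured once, each output format is just a different serialization of the same object.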
Table Extraction Capabilities
PDF and HTML Data Extraction Techniques
Introduction to Document Parsing
- The speaker discusses the effectiveness of parsing PDFs, highlighting a simple example that demonstrates successful extraction.
- Transitioning from PDF to HTML, the speaker explains how the same converter can process web pages by simply inputting a URL instead of a PDF.
Extracting Data from Web Pages
- To extract data from an entire website, the speaker introduces using sitemap.xml, which most websites provide to list all their URLs.
- A helper function named `get_sitemap_urls` is created to fetch and return all URLs found in the site's XML file.
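A sketch of what such a helper might look like, using only the standard library (the function names and exact behavior are assumptions, not the video's code):

```python
import urllib.request
import xml.etree.ElementTree as ET

# Sitemap files declare this XML namespace, so we include it when
# looking for <loc> elements.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> URL found in a sitemap.xml payload."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")]

def get_sitemap_urls(base_url: str) -> list[str]:
    """Fetch <base_url>/sitemap.xml and return the listed URLs.

    Hypothetical helper mirroring the one described in the video.
    """
    with urllib.request.urlopen(base_url.rstrip("/") + "/sitemap.xml") as resp:
        return parse_sitemap(resp.read().decode("utf-8"))
```

Splitting the parsing into its own function keeps the network call separate from the XML handling, which makes the logic easy to test on a static string.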
Looping Through Site Map URLs
- By plugging in a URL, users can retrieve all URLs listed in the site map, allowing for systematic extraction of each page's content.
- The Docling library offers a method called `convert_all`, enabling users to loop through multiple pages and gather documents efficiently.
Custom Extraction Parameters
- The ability to handle various document types (PDFs, web pages, etc.) forms the foundation for building knowledge extraction systems.
- The speaker encourages viewers to explore additional resources available for developers looking to enhance their skills or find freelance opportunities.
Chunking Data for AI Systems
- After extracting data, chunking is introduced as a method of dividing documents into smaller parts for better querying within AI systems.
- Chunking isn't merely about splitting text; it involves creating logical segments that fit well together based on context.
Advanced Chunking Techniques
- The Docling library provides two methods for chunking: a hierarchical chunker and a hybrid chunker. These help organize data into meaningful groups automatically.
- Hybrid chunkers address both oversized chunks (by splitting them) and undersized chunks (by combining smaller pieces), ensuring compatibility with the embedding models used later in processing.
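The split-then-merge idea can be sketched in plain Python. This illustrates the concept only — it is not Docling's implementation — and uses naive whitespace counting as a stand-in for a real tokenizer:

```python
# Conceptual sketch of hybrid chunking: split oversized chunks, then
# merge adjacent undersized ones, so every chunk fits the token budget.

def count_tokens(text: str) -> int:
    # Naive stand-in for a real tokenizer's token count.
    return len(text.split())

def split_oversized(text: str, max_tokens: int) -> list[str]:
    """Split a chunk exceeding max_tokens into max_tokens-sized pieces."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

def hybrid_chunk(sections: list[str], max_tokens: int) -> list[str]:
    """Split oversized sections, then merge adjacent undersized ones."""
    pieces: list[str] = []
    for section in sections:
        pieces.extend(split_oversized(section, max_tokens))
    merged: list[str] = []
    for piece in pieces:
        if merged and count_tokens(merged[-1]) + count_tokens(piece) <= max_tokens:
            merged[-1] = merged[-1] + " " + piece  # combine small neighbors
        else:
            merged.append(piece)
    return merged

chunks = hybrid_chunk(["one two three four five six", "a b", "c d"], max_tokens=4)
```

Every resulting chunk stays within the token budget, which is exactly the guarantee the embedding model needs.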
Embedding Models Considerations
- Different embedding models have specific maximum input limits; thus, it's crucial to keep chunks within these constraints when preparing data for analysis.
Creating an OpenAI Tokenizer Wrapper
Overview of the Tokenizer Wrapper
- The speaker developed a simple OpenAI tokenizer wrapper to work with Docling's chunker, which by default expects an open-source tokenizer from Hugging Face. The wrapper adheres to the API specifications the chunker needs to function.
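A minimal sketch of the wrapper idea — the class name and methods here are assumptions, and a real version would delegate to OpenAI's tiktoken library rather than the whitespace tokenizer used below:

```python
# Hypothetical tokenizer wrapper: expose the small interface a chunker
# calls (tokenize / count), backed by whatever tokenizer you have.
# Whitespace splitting stands in for tiktoken here.

class TokenizerWrapper:
    def __init__(self, model_max_length: int = 8191):
        # 8191 is the commonly cited input limit for OpenAI's
        # text-embedding-3-large (verify against current OpenAI docs).
        self.model_max_length = model_max_length

    def tokenize(self, text: str) -> list[str]:
        """Return the token list for text (naive stand-in)."""
        return text.split()

    def count_tokens(self, text: str) -> int:
        return len(self.tokenize(text))

tok = TokenizerWrapper()
```

The point of the wrapper is adaptation: the chunker never needs to know which tokenizer library sits behind the interface.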
Running the PDF Parsing
- The same PDF file is processed again for simplicity, repeating the parsing step shown earlier. The result confirms the PDF has been successfully parsed into a document ready for chunking.
Implementing Hybrid Chunker
- The hybrid chunker from the Docling library is imported and configured with:
- The custom OpenAI tokenizer wrapper.
- Maximum tokens set to match the embedding model's input limit.
- The peer-merging option (enabled by default) to combine smaller neighboring chunks when needed.
- After executing this setup, a total of 36 text chunks are generated from the entire document, ensuring compatibility with the embedding model used.
Embedding Models and Database Integration
Preparing for Embedding
- With the chunks ready, they can be sent to an embedding model to generate vectors, which are then stored in a vector database.
- LanceDB is chosen for its ease of use compared to other databases like PostgreSQL: it offers persistent storage similar to SQLite and a user-friendly API.
Setting Up LansDB
- A new Python file (embedding.py) begins by re-running the previous code while adding steps specific to working with LanceDB.
- A LanceDB helper function allows specifying an embedding model (OpenAI's text-embedding-3-large) and defining table structures using Pydantic models. This streamlines sending and retrieving embeddings without manual intervention.
Defining Data Schema
Schema Structure
- The main schema includes:
- A text field containing extracted content.
- A vector field utilized for search operations.
- Metadata fields capturing essential information such as file name, page numbers of chunks, and document title.
This structure ensures that relevant metadata accompanies each text chunk for future reference or analysis purposes.
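The schema described above might look like the following sketch. Plain dataclasses stand in for the LanceDB Pydantic models the video uses, and the field names are illustrative assumptions:

```python
from dataclasses import dataclass

# Sketch of the table schema: text + vector + metadata per chunk.
# (The video defines this with LanceDB's Pydantic models instead.)

@dataclass
class ChunkMetadata:
    filename: str
    page_numbers: list[int]
    title: str

@dataclass
class Chunk:
    text: str               # the extracted content
    vector: list[float]     # embedding used for similarity search
    metadata: ChunkMetadata

row = Chunk(
    text="Example chunk text.",
    vector=[0.0] * 4,  # real embeddings have the model's dimensionality
    metadata=ChunkMetadata("doc.pdf", [1, 2], "Introduction"),
)
```

Keeping metadata alongside each vector means a search result can immediately tell the user which file and page it came from.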
Processing Chunks into Database
- Each of the 36 chunks undergoes processing where:
- Text is extracted alongside associated metadata (file name, page numbers).
- Results are formatted according to defined chunk models before being sent to LanceDB.
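The per-chunk transformation can be sketched as a small mapping function. The attribute names below are hypothetical, not Docling's actual chunk fields:

```python
from types import SimpleNamespace

# Illustrative sketch: flatten a parsed chunk into the plain record
# shape a vector table expects (attribute names are assumptions).

def chunk_to_record(chunk) -> dict:
    """Extract text plus metadata from a chunk for database insertion."""
    pages = sorted({p for p in getattr(chunk, "pages", [])})
    return {
        "text": chunk.text,
        "metadata": {
            "filename": chunk.filename,
            "page_numbers": pages,
            "title": chunk.title,
        },
    }

demo = SimpleNamespace(text="hi", pages=[2, 1, 2], filename="doc.pdf", title="Intro")
record = chunk_to_record(demo)
```

Deduplicating and sorting the page numbers keeps the metadata stable regardless of how many sub-elements of a page landed in the chunk.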
Understanding the LanceDB API Integration
Handling Errors and Preparing Data
- The speaker discusses encountering bugs in the code, which took about an hour to resolve. This highlights the importance of debugging in programming.
- The LanceDB API simplifies the process of adding chunks to a database by automatically handling embeddings, reducing manual work for developers.
Data Verification and Compatibility
- A method is demonstrated to preview the first 10 rows of the table and confirm that all 36 records were inserted, emphasizing data integrity checks.
- The speaker notes that while they used LanceDB for this example, similar principles can be applied using other databases like PostgreSQL.
Querying with AI Systems
- An interactive session is initiated to connect to the vector database and perform searches based on user queries, showcasing a practical application of embedding technology.
- The search functionality allows limiting results (e.g., setting a limit to five), demonstrating flexibility in retrieving relevant information from large datasets.
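Under the hood, a vector search of this kind ranks stored rows by similarity to the query embedding and keeps the top results. The pure-Python sketch below illustrates the idea; the database handles this internally:

```python
import math

# Conceptual sketch of vector search: rank rows by cosine similarity
# to the query vector and return the top `limit` matches.

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(rows: list[dict], query_vec: list[float], limit: int = 5) -> list[dict]:
    """Return the `limit` rows most similar to the query vector."""
    return sorted(rows, key=lambda r: cosine(r["vector"], query_vec), reverse=True)[:limit]

rows = [
    {"text": "cats", "vector": [1.0, 0.0]},
    {"text": "dogs", "vector": [0.9, 0.1]},
    {"text": "stocks", "vector": [0.0, 1.0]},
]
top = search(rows, [1.0, 0.0], limit=2)
```

A real vector database uses approximate-nearest-neighbor indexes to avoid scanning every row, but the ranking principle is the same.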
Building an Interactive Application
- Introduction of a Streamlit application setup for creating an interactive chat interface. This illustrates how easy it is to develop applications using Python libraries.
- The application connects with the database and includes functions for searching and displaying chat messages, emphasizing user interaction capabilities.
Running and Testing the Application
- Instructions are provided on how to run the Streamlit app locally, ensuring users have all necessary components installed before execution.