LangChain Data Loaders, Tokenizers, Chunking, and Datasets - Data Prep 101

Chunking Text for Large Language Models

In this video, the speaker discusses how to chunk text for large language models. They provide a rule of thumb for chunking text and walk through an example using the LangChain docs.

Using the LangChain Docs

  • The speaker uses the LangChain documentation as the example corpus for chunking text for large language models.
  • They use LangChain's document loaders, specifically the ReadTheDocs loader, to process the HTML pages.
  • All of the .html files are downloaded from the LangChain docs site and saved into a directory.
  • The ReadTheDocs loader then loads those docs, and their total count is printed out.
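The loading step can be approximated with the standard library alone (the video itself uses LangChain's ReadTheDocs loader); the directory name `rtdocs` is an assumption:

```python
from pathlib import Path

def load_html_docs(directory: str) -> list[str]:
    """Read every .html file under `directory` (recursively) and
    return the raw page contents as a list of strings."""
    pages = []
    for path in sorted(Path(directory).rglob("*.html")):
        pages.append(path.read_text(encoding="utf-8", errors="ignore"))
    return pages

# Hypothetical directory holding the downloaded LangChain docs.
docs = load_html_docs("rtdocs")
print(len(docs))  # number of pages loaded
```

The real loader additionally strips the HTML down to readable text; this sketch only gathers the raw files.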

Splitting Pages into Chunks

  • The speaker provides a rule of thumb for chunking text before feeding it into a large language model.
  • They explain that each page from the LangChain docs will be split into more reasonably sized chunks.
  • Python libraries, namely LangChain and the tiktoken tokenizer, are used to split the documentation into chunks.
  • Messy parts of the documentation can be handled easily by large language models, so there is no need to clean them up further.

Chunking for Retrieval Augmentation

In this section, the speaker discusses chunking and retrieval augmentation for question answering with a large language model. They explain how relevant information is fed into the model and discuss the model's token limit.

Chunking Considerations

  • The first consideration is how many tokens the large language model can handle.
  • Relevant text chunks are retrieved from a vector database and passed alongside the original query to the large language model.
  • With a budget of 2,000 tokens shared across five contexts, the maximum chunk size works out to 400 tokens per context.
  • The minimum chunk size is simply enough tokens for the context to make sense to a human reader.

Token Limit

  • The token limit for gpt-3.5-turbo is 4,096, covering both input and output tokens.
  • Assuming roughly 2,000 of those tokens are allocated to retrieved contexts, dividing by five contexts leaves about 400 tokens per context.
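The arithmetic above is easy to verify directly; the 4,096 figure is the gpt-3.5-turbo context window, and the 2,000-token context budget is the speaker's working assumption:

```python
model_token_limit = 4096   # gpt-3.5-turbo: input + output tokens combined
context_budget = 2000      # tokens reserved for retrieved contexts
num_contexts = 5           # contexts passed alongside each query

tokens_per_context = context_budget // num_contexts
print(tokens_per_context)  # 400

# Whatever is left covers the prompt template, the question, and the answer.
remaining = model_token_limit - context_budget
print(remaining)  # 2096
```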

Calculating Chunk Size

In this section, the speaker discusses how to calculate the size of chunks based on token length using a specific tokenizer.

Tokenizing Text with tiktoken

  • The speaker explains that they will be using the tiktoken tokenizer to tokenize text.
  • They provide a link to the GitHub repository for more information on the tokenizer and explain that they will be using the gpt-3.5-turbo model, which uses the cl100k_base encoding.
  • The speaker notes that most recent models use cl100k_base, except for the text-davinci-003 model, which uses p50k_base.

Creating a tiktoken Length Function

  • The speaker creates a function called "tiktoken_len", which calculates the length of a text in tiktoken tokens.

Visualizing Document Length Distribution

  • Before discussing chunking, the speaker visualizes document length distribution by calculating token counts and showing minimum, maximum, and average number of tokens.
  • They note that most documents have around 1.3k tokens.
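Computing those summary statistics is straightforward once each document has a token count (the counts below are hypothetical stand-ins for what a tiktoken-based length function would produce):

```python
# Hypothetical per-document token counts.
token_counts = [45, 310, 1280, 1350, 1420, 4200]

print(min(token_counts))                      # shortest document
print(max(token_counts))                      # longest document
print(sum(token_counts) / len(token_counts))  # average length
```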

Chunking with the LangChain Splitter

  • The speaker explains that they will be using LangChain's RecursiveCharacterTextSplitter to split text into chunks based on a specified chunk size and a list of separators.
  • They note that consecutive chunks also overlap by 20 tokens.

Text Chunking

In this section, the speaker discusses text chunking and how to avoid missing important information when splitting text into chunks.

Chunk Overlap

  • The speaker explains that when splitting text into chunks, it is important not to lose information by splitting right in the middle of two related pieces of information.
  • To avoid this, the speaker suggests using a chunk overlap: a portion of the previous chunk is included at the start of the next chunk, so that connections between chunks are not missed.
  • The speaker illustrates how chunk overlap works with four chunks of 400 characters each.
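A fixed-size sliding window shows the idea; the real splitter breaks on separators rather than at exact positions, but the overlap logic is the same (sizes here are in characters, and the sample text is made up):

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split `text` into windows of `chunk_size` characters, each starting
    `chunk_size - chunk_overlap` characters after the previous one."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "abcdefghij" * 160  # 1,600 characters of stand-in text
chunks = chunk_with_overlap(text, chunk_size=400, chunk_overlap=20)

# The last 20 characters of each chunk reappear at the start of the next.
assert chunks[0][-20:] == chunks[1][:20]
```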

Splitting Text

  • To split text, the speaker uses the split_text function with parameters for chunk size and chunk overlap.
  • Using these parameters and the tokenizer, the split_text function returns multiple chunks for each page.
  • Each chunk's length is optimized around the specified separators rather than being exactly 400 tokens long.

Creating Unique IDs

  • The final format for each page includes an ID, text content, and source.
  • To create a unique ID for each page and chunk combination, the URL is hashed using hashlib md5.
  • A count of the number of chunks is added to create a unique identifier for each individual chunk.
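A sketch of that ID scheme; the URL and chunk texts are hypothetical:

```python
import hashlib

url = "https://langchain.readthedocs.io/en/latest/index.html"  # hypothetical page URL
chunks = ["first chunk of text...", "second chunk of text..."]  # hypothetical chunks

# Hash the URL so every chunk of the same page shares a stable prefix.
uid = hashlib.md5(url.encode("utf-8")).hexdigest()[:12]

documents = [
    {"id": f"{uid}-{i}", "text": chunk, "source": url}
    for i, chunk in enumerate(chunks)
]
print(documents[0]["id"])
```

Appending the chunk index to the URL hash makes each chunk's ID unique while still letting all chunks of one page be grouped by prefix.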

Chunking Text for Large Language Models

In this section, the speaker explains how to chunk text and process it for large language models. They also show how to store data on Hugging Face datasets.

Creating a JSON Lines File

  • To repeat the same logic across the entire dataset: take the URL, create a unique ID, split the text into chunks with the text splitter, and append them all to a documents list.
  • Save the documents as a JSON Lines (.jsonl) file.
  • Load the JSON Lines file back iteratively, line by line.
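Saving and reloading the documents list as JSON Lines needs only the standard library; the file name and records here are assumptions:

```python
import json

documents = [
    {"id": "abc123-0", "text": "first chunk", "source": "https://example.com/page"},
    {"id": "abc123-1", "text": "second chunk", "source": "https://example.com/page"},
]  # hypothetical records

# One JSON object per line.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for doc in documents:
        f.write(json.dumps(doc) + "\n")

# Read it back line by line rather than loading the whole file at once.
loaded = []
with open("train.jsonl", encoding="utf-8") as f:
    for line in f:
        loaded.append(json.loads(line))

print(len(loaded))  # 2
```

The line-by-line format is what makes iterative loading possible: each record can be parsed independently without reading the whole file into memory.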

Storing Data on Hugging Face Datasets

  • Go to huggingface.co and sign up or log in.
  • Create a new dataset by giving it a name and choosing whether to make it private or public.
  • Upload your Json lines file by dragging it into the files section of your dataset page.
  • Install the Hugging Face datasets library (pip install datasets).
  • Load your dataset with load_dataset().
  • Extract information from your dataset.

Conclusion

The speaker emphasizes that chunking text and processing it for large language models is an important part of natural language processing that is often overlooked. They recommend storing data on Hugging Face datasets because it makes sharing and accessing data easier.