Fine-Tuning LayoutLMv3 for Document Understanding with Custom Datasets | Step-by-Step Tutorial

Introduction to Document AI and LayoutLMv3

Overview of Document Understanding

  • The speaker introduces the topic of document AI, focusing on fine-tuning the LayoutLMv3 model on a custom dataset.
  • Emphasizes the importance of document understanding in AI, highlighting its application in companies like Bank of America for extracting information from various documents such as PDFs and forms.

Features of LayoutLMv3

  • The LayoutLMv3 model is described as a state-of-the-art model from Microsoft that incorporates visual information, making it effective for extracting data from complex documents.
  • The session covers the data preparation, model training, and evaluation steps needed to fine-tune LayoutLMv3.

Understanding Layout LM Architecture

Key Components and Functionality

  • Explanation of how LayoutLM works: bounding boxes are drawn around text elements in document images to extract key information such as names and dates.
  • Discusses the two main steps: preparing data (using tools like Label Studio to tag text with bounding boxes) and training the model.

Token Classification Process

  • Describes token classification as the broader problem: images (including PDFs converted to images) are processed to extract and label relevant tokens.
  • Highlights how LayoutLM extends the traditional transformer architecture by integrating layout information into the input embeddings.

Technical Aspects of LayoutLMv3

Embedding Types

  • Details the embedding types used in LayoutLMv3: text embeddings, layout (bounding-box) embeddings, visual embeddings, and position embeddings.
  • Explains how these components feed into a multi-layer transformer encoder that captures relationships between tokens while considering both textual and spatial information.

Data Set Requirements

  • Notes that LayoutLMv3 requires datasets in a specific format; discusses importing datasets from the Hugging Face Hub.

Dataset Structure and Sample Analysis

Dataset Composition

  • Outlines what is included in the dataset: images being processed along with tagged tokens representing various labels (e.g., addresses or names).
  • Mentions the coordinates associated with each bounding box, which locate a token within the image.
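
Bounding-box coordinates are typically stored in pixel space, while LayoutLM-family models expect them scaled to a 0–1000 grid. A minimal sketch of that scaling (the helper name `normalize_box` is illustrative, not taken from the video):

```python
def normalize_box(box, width, height):
    """Scale a pixel-space box [x0, y0, x1, y1] to LayoutLM's 0-1000 grid."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]

# a box on a 600x400 image
print(normalize_box([150, 75, 300, 100], width=600, height=400))
# → [250, 187, 500, 250]
```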

Sample Examination

  • Provides insight into the sample data structure, including IDs, tokens, bounding boxes (bbox), NER tags, and the images available for training.

Understanding Tokenization and Model Fine-Tuning

Introduction to Tokenization

  • The process begins with obtaining tokens, each paired with a bounding box and a numerical tag ID.
  • The model used is Microsoft's layoutlmv3-base checkpoint, known for its unique visual representation of tokens.

Model Preparation

  • The AutoProcessor is loaded with apply_ocr set to false, meaning the processor will not run Optical Character Recognition (the dataset already supplies words and bounding boxes).
  • The dataset is divided into training and testing sets; focus is on extracting features from the training dataset.

Feature Extraction

  • Features are accessed through a class structure resembling a dictionary, allowing retrieval of various attributes such as ID, token bounding boxes, and images.
  • Class labels are identified by their index numbers (0 to 6), which correspond to specific text classifications like headers or questions.

Mapping IDs to Class Levels

  • Two dictionaries are created: one mapping IDs to label names (id2label) and another mapping label names back to IDs (label2id) for easy reference.
  • The feature names attribute provides the list of label names, which can be indexed for further processing.
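
The two mappings can be built in one pass over the label list. The seven label strings below follow the common FUNSD-style NER tag set, which matches the 0–6 index range mentioned above but is an assumption about the exact dataset used:

```python
# FUNSD-style tag set: "O" for untagged tokens, B-/I- prefixes for entity spans
labels = ["O", "B-HEADER", "I-HEADER", "B-QUESTION", "I-QUESTION", "B-ANSWER", "I-ANSWER"]

# index -> name for decoding predictions, name -> index for encoding tags
id2label = {i: name for i, name in enumerate(labels)}
label2id = {name: i for i, name in enumerate(labels)}

print(id2label[3])    # → B-QUESTION
print(label2id["O"])  # → 0
```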

Data Preparation Methodology

  • A method called prepare_examples is introduced to process the dataset in batches instead of handling samples individually.
  • Images and the corresponding data columns (words, bounding boxes, labels) are prepared for processing, while ensuring the token length does not exceed 512 due to the encoder's limit.
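
The 512-token ceiling means long documents must be truncated and short ones padded. A pure-Python sketch of that behavior (the pad token ID of 1 matches the RoBERTa tokenizer that LayoutLMv3 builds on, but treat it as an assumption; in practice the processor handles this internally):

```python
MAX_LEN = 512  # hard limit of the transformer encoder

def pad_or_truncate(input_ids, pad_token_id=1):
    """Clip a token-ID sequence to MAX_LEN, then right-pad to exactly MAX_LEN."""
    ids = input_ids[:MAX_LEN]
    return ids + [pad_token_id] * (MAX_LEN - len(ids))

short = pad_or_truncate([5, 10, 15])
long = pad_or_truncate(list(range(600)))
print(len(short), len(long))  # → 512 512
```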

Handling Multi-page Documents

Understanding Token Processing and Dataset Preparation in Machine Learning

Overview of Feature Creation

  • The process begins with handling the 512-token limitation and initializing the base model. Features such as pixel values, input IDs, attention masks, bounding boxes, and labels are created.
  • Pixel values are represented as a three-dimensional array for RGB images, resized to 224 × 224 pixels. Attention masks and bounding boxes are structured as two-dimensional arrays.

Dataset Selection and Mapping

  • A selection of 60 training examples and 40 validation examples is made from the dataset. The mapping function is utilized to create embeddings from images, words, and bounding boxes.
  • The mapping function processes samples in batches rather than individually. Unnecessary columns like ID are removed from the dataset to streamline processing.
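
Conceptually, batched mapping works like the pure-Python sketch below: Hugging Face datasets.map with batched=True passes column-wise batches to the function and drops the columns listed in remove_columns. The helper here only imitates that contract; it is not the library's implementation:

```python
def map_batched(dataset, fn, batch_size=2, remove_columns=()):
    """Apply fn to column-wise batches; drop unneeded columns such as 'id'."""
    out = []
    for start in range(0, len(dataset), batch_size):
        rows = dataset[start:start + batch_size]
        # turn the list of rows into a dict of columns, as datasets.map does
        batch = {k: [r[k] for r in rows] for k in rows[0] if k not in remove_columns}
        processed = fn(batch)
        # flatten the processed columns back into rows
        n = len(next(iter(processed.values())))
        out.extend({k: v[i] for k, v in processed.items()} for i in range(n))
    return out

data = [{"id": 0, "words": ["name:"]}, {"id": 1, "words": ["date:"]}]
rows = map_batched(data, lambda b: {"n_words": [len(w) for w in b["words"]]},
                   remove_columns=("id",))
print(rows)  # → [{'n_words': 1}, {'n_words': 1}]
```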

Data Preparation Steps

  • After processing through the base model, only the essential features remain for both X (input data) and Y (target labels).
  • The features class is formatted appropriately to ensure compatibility with subsequent processing steps.

Training Dataset Configuration

  • The training dataset is prepared from the selected examples. The new features include pixel values, input IDs, attention masks, bounding boxes, and labels.
  • Sample images from the training dataset are examined to extract input IDs. Padding tokens may be added if sequences do not reach the maximum length of 512.

Handling Pixel Values

  • Pixel values are normalized floats ranging roughly from 0 (background) to 1 (where visual information is present), so unlike input IDs they cannot be converted back into strings.

Conversion to Torch Format

  • The training dataset must be converted into torch tensors using a specific method called set_format. This ensures compatibility with PyTorch operations.
  • Shapes and data types of various components like input IDs and bounding boxes can be printed for verification after conversion.
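
When printing shapes for verification, the tensors should have the dimensions below (shown for a batch of 2; these follow LayoutLMv3's documented input layout, not output from the actual notebook):

```python
# expected tensor shapes after set_format("torch"), for a batch of 2 examples
expected_shapes = {
    "input_ids":      (2, 512),           # one token ID per position
    "attention_mask": (2, 512),           # 1 for real tokens, 0 for padding
    "bbox":           (2, 512, 4),        # one [x0, y0, x1, y1] box per token
    "pixel_values":   (2, 3, 224, 224),   # RGB channels, 224x224 resized image
    "labels":         (2, 512),           # one class index per token
}
for name, shape in expected_shapes.items():
    print(name, shape)
```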

Troubleshooting Dataset Loading Issues

  • If issues arise while loading metrics due to a datasets version incompatibility (e.g., version 3.0.1, which removed the old metric-loading API), downgrading to an earlier 2.x release resolves the problem.

Metrics Calculation Methodology

  • A method for computing evaluation metrics such as precision, recall, F1 score, and accuracy is established for token classification tasks using this framework.

Model Training and Token Classification with Layout LM

Loading the Model for Token Classification

  • The process begins by loading a base model designed for token classification, specifically the LayoutLMv3ForTokenClassification class from the Transformers library.
  • An id2label dictionary is passed to the model to facilitate inference, mapping label IDs to human-readable label names.

Training Process Overview

  • Various methods exist for training models, including manual gradient descent and backpropagation techniques. However, these can be complex and time-consuming.
  • The Hugging Face library offers a Trainer class that simplifies model training by managing various aspects of the training process.

Setting Up Training Arguments

  • Key training arguments are initialized via the TrainingArguments class and passed to the Trainer, including the output directory and the maximum number of training steps.
  • Recommended batch sizes follow powers of two (e.g., 4, 8, 16), optimizing CPU/GPU architecture utilization during processing.

Evaluation Strategy and Model Saving

  • An evaluation is run every 250 steps to monitor performance, and the best model (by accuracy) is retained when training concludes.
  • After initializing training arguments, datasets are provided along with necessary components like tokenizers and metrics computation methods.
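
Putting this setup together might look like the configuration sketch below. It assumes the objects defined earlier in the notebook (id2label, label2id, train_dataset, eval_dataset, processor, compute_metrics); values such as max_steps, the batch size, and the output directory name are illustrative, while eval_steps=250 reflects the cadence mentioned above. Note that newer Transformers releases rename evaluation_strategy to eval_strategy:

```python
from transformers import LayoutLMv3ForTokenClassification, Trainer, TrainingArguments

model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", id2label=id2label, label2id=label2id)

args = TrainingArguments(
    output_dir="layoutlmv3-finetune",    # illustrative folder name
    max_steps=1000,                      # raise for better accuracy
    per_device_train_batch_size=4,       # powers of two suit GPU architectures
    evaluation_strategy="steps",
    eval_steps=250,                      # evaluate every 250 steps
    save_strategy="steps",
    save_steps=250,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",    # keep the most accurate checkpoint
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,         # the prepared torch-format datasets
    eval_dataset=eval_dataset,
    tokenizer=processor,                 # the LayoutLMv3 processor
    compute_metrics=compute_metrics,
)
trainer.train()
```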

Monitoring Training Progress

  • Upon executing the train method in the Trainer object, GPU usage is monitored as training progresses. Initial loss values indicate a downward trend but suggest more epochs may be needed.
  • Post-training results show validation loss at 1.97 with precision at 86%, recall at 90%, F1 score at 88%, and overall accuracy at 74%. Recommendations include increasing epoch count for improved accuracy.

Finalizing Model Training

  • A dedicated output folder (named along the lines of "layoutlmv3-fine-tune") contains the checkpoints and artifacts generated during training.
  • The trained model is saved using a specific naming convention to ensure clarity regarding its purpose; processor information must also be preserved for future inference tasks.

Inference Preparation

  • After fine-tuning completes, both the custom-trained model and the base processor are loaded for inference applications.
  • Test data samples are processed without labels to prepare them for evaluation against trained models.

Labeling Tokens in Image Processing

Overview of Token Labeling

  • The speaker discusses the process of drawing tokens for a given image, emphasizing the importance of labeling in understanding visual data.
  • A clear distinction is made between questions and answers within the context of token labeling, highlighting how each element serves a specific purpose.
  • Regions of the image with no meaningful label (the "other" class) are also explored; they carry little information but still require attention during labeling.
  • The speaker notes that the performance in this labeling task was commendable, indicating successful execution and understanding of the process.
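
Drawing predicted tokens on the image requires mapping the model's 0–1000 normalized boxes back to pixel coordinates. A small sketch (the helper name `unnormalize_box` is illustrative, not from the video):

```python
def unnormalize_box(box, width, height):
    """Map a 0-1000 normalized box back to pixel coordinates for drawing."""
    return [
        width * box[0] / 1000,
        height * box[1] / 1000,
        width * box[2] / 1000,
        height * box[3] / 1000,
    ]

# a normalized box on a 600x400 page
print(unnormalize_box([250, 187, 500, 250], width=600, height=400))
# → [150.0, 74.8, 300.0, 100.0]
```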
Video description

šŸ“„ Unlock the Power of Document AI with LayoutLMv3! In this comprehensive tutorial, I guide you through the entire process of fine-tuning Microsoft's LayoutLMv3 model using a custom dataset. Whether you're dealing with invoices, receipts, forms, or any documents with complex layouts, this video will equip you with the skills to harness LayoutLMv3 for your specific needs.

šŸ”„ What You'll Learn:
  • Introduction to LayoutLMv3: Understand how LayoutLMv3 integrates text, layout, and visual information for superior document understanding.
  • Setting Up the Environment: Install necessary libraries and prepare your workspace.
  • Dataset Preparation: Format your custom dataset, including images, tokens, bounding boxes, and NER tags.
  • Initializing the Processor and Model: Set up the LayoutLMv3Processor and configure the model for your labels.
  • Data Preprocessing: Write functions to process and encode your data for training.
  • Training the Model: Define training arguments and train your model using the Hugging Face Trainer API.
  • Evaluation and Saving: Evaluate model performance and save your fine-tuned model.
  • Inference and Visualization: Run inference on new documents and visualize the results with bounding boxes and labels.

šŸ› ļø Resources:
  • šŸ“„ LayoutLMv3 Paper: https://arxiv.org/abs/2204.08387
  • šŸ“š Hugging Face Transformers Docs: https://huggingface.co/docs/transformers

šŸ”— Try the Google Colab Notebook: [Click Here](https://colab.research.google.com/drive/1laRdh8CMWtaMClX9BA0F6nWn8aFauYQl)

šŸ“š Relevant Tutorials:
  • Introduction to Hugging Face Transformers
  • Fine-Tuning Transformers for NLP Tasks
  • Custom Dataset Preparation for Machine Learning

šŸ’¬ Join the Conversation: If you have any questions or need further clarification, drop a comment below! I'm here to help you navigate any challenges you might face.

šŸ‘ If you find this video helpful, please give it a thumbs up and subscribe for more AI and machine learning tutorials!