CatBoost Part 2: Building and Using Trees

Introduction to CatBoost

  • Josh Starmer introduces the topic of CatBoost, specifically focusing on building and using trees in this second part of the series.
  • The session encourages curiosity with a playful reminder: "A" always, "B" be, "C" curious. It assumes prior knowledge from Part 1 regarding Ordered Target Encoding and cosine similarity.

Data Preparation for Tree Building

  • A simple dataset is introduced where 'Favorite Color' predicts 'Height', serving as an example for demonstrating tree creation in CatBoost.
  • CatBoost randomizes training data rows and applies Ordered Target Encoding to discrete columns with more than two options; binary options are replaced with 1's and 0's.

Understanding Ordered Target Encoding

  • Unlike previous examples where the target was categorical, here 'Height' is continuous. Thus, it is binned into discrete categories for encoding purposes.
  • Two bins are created based on height values: the smallest values go into bin 0, while larger values go into bin 1. This allows for effective encoding similar to Part 1.
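The binning step above can be sketched as follows, using hypothetical height values and assuming a simple median split (the exact bin boundaries CatBoost uses depend on its quantization settings):

```python
heights = [180, 160, 195, 150, 175]  # hypothetical continuous target values (cm)

# Split at the median: smaller values go to bin 0, larger values to bin 1
median = sorted(heights)[len(heights) // 2]
bins = [0 if h < median else 1 for h in heights]
```

With these numbers the median is 175, so `bins` comes out as `[1, 0, 1, 0, 1]`, giving a discrete target that can be encoded just like the categorical target in Part 1.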

Avoiding Leakage in Predictions

  • Ordered Target Encoding processes data sequentially to prevent leakage—where a row’s target value could influence its own encoding.
  • More data would allow for additional bins and different equations for encoding; details can be found on the CatBoost website.
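A minimal sketch of the ordered encoding, using the (countInClass + prior) / (totalCount + 1) formula from the CatBoost documentation; the prior of 0.05 and the category/target values below are illustrative assumptions:

```python
def ordered_target_encode(categories, target_bins, prior=0.05):
    """Encode each row using only the rows that came before it,
    so a row's own target value never leaks into its encoding."""
    encoded = []
    for i, cat in enumerate(categories):
        # Target bins of EARLIER rows that share this row's category
        prior_rows = [t for c, t in zip(categories[:i], target_bins[:i]) if c == cat]
        in_class = sum(prior_rows)  # how many of those landed in bin 1
        encoded.append((in_class + prior) / (len(prior_rows) + 1))
    return encoded

encoded = ordered_target_encode(["Blue", "Green", "Blue", "Blue"], [1, 0, 0, 1])
```

Here the third 'Blue' row is encoded as (1 + 0.05) / 3 = 0.35, because only the two earlier 'Blue' rows contribute to its count.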

Initializing Model Predictions

  • After applying Ordered Target Encoding, initial model predictions are set to zero across all rows, followed by calculating residuals (differences between observed heights and predicted heights).
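Because every initial prediction is zero, the first round of residuals is simply the observed heights; a sketch with hypothetical values:

```python
observed_heights = [180, 160, 195]  # hypothetical observed target values

# All predictions start at 0, so residual = observed - predicted = observed
predictions = [0.0] * len(observed_heights)
residuals = [obs - pred for obs, pred in zip(observed_heights, predictions)]
```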

Building the First Tree

  • The process begins by identifying thresholds for 'Favorite Color' to establish the root of the tree; potential thresholds are calculated based on sorted values.
  • For simplicity, a stump (a small decision tree with one split) is built instead of a full tree. The first threshold tested is set at 0.04.
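Candidate thresholds can be sketched as the midpoints between adjacent sorted unique encoded values; the encoded 'Favorite Color' numbers below are hypothetical, chosen so the first midpoint matches the 0.04 threshold mentioned above:

```python
encoded = [0.05, 0.03, 0.52, 0.35]  # hypothetical encoded 'Favorite Color' values

# Candidate thresholds sit halfway between adjacent sorted unique values
values = sorted(set(encoded))
thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
```

Four distinct values yield three candidate thresholds, the first being (0.03 + 0.05) / 2 = 0.04.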

Running Rows Through the Tree

  • Each row's residual is placed in the corresponding leaf depending on whether its encoded 'Favorite Color' value falls below or exceeds the threshold; leaf outputs are updated accordingly.
  • As each row passes through, average residual values update leaf outputs progressively until all rows have been processed.
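The row-at-a-time leaf updates can be sketched as follows; the key point is that a row's recorded output comes only from residuals that entered its leaf before it did (all values below are hypothetical):

```python
def ordered_stump_outputs(encoded, residuals, threshold):
    """Run rows through a stump one at a time. Each row's recorded
    leaf output is the mean of residuals of EARLIER rows in that leaf,
    so a row's own residual never influences its own output."""
    leaves = {False: [], True: []}   # residuals seen so far in each leaf
    outputs = []
    for x, r in zip(encoded, residuals):
        seen = leaves[x < threshold]
        outputs.append(sum(seen) / len(seen) if seen else 0.0)  # 0 before any data
        seen.append(r)  # the leaf's running average now includes this row
    return outputs

outputs = ordered_stump_outputs([0.05, 0.03, 0.52], [2.0, 4.0, 6.0], 0.04)
```

The first row into each leaf gets the initial output of 0; the third row (encoded 0.52) lands in the same leaf as the first, so its output is the average of that leaf's one earlier residual, 2.0.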

Evaluating Threshold Effectiveness

  • To assess how well predictions align with actual outcomes at each threshold, cosine similarity between leaf output values and residuals is computed.
  • For the first threshold tested, 'Favorite Color' less than 0.04, the cosine similarity comes out to 0.71, quantifying how well the leaf outputs match the residuals.
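Cosine similarity itself is straightforward to compute; a small self-contained version:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|).
    Values close to 1 mean the leaf outputs point in the same direction
    as the residuals, i.e. the threshold gives useful predictions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

For example, identical vectors score 1.0, orthogonal vectors score 0.0, and a 45-degree angle scores about 0.71 — the score reported for the first threshold.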

This structured approach provides clarity on how CatBoost builds trees using ordered target encoding while avoiding common pitfalls like leakage during model training.

How CatBoost Determines Thresholds for Trees

Cosine Similarity in Tree Building

  • The cosine similarity between the residuals and leaf output values is calculated, yielding a value of 0.79 for the second threshold compared to 0.71 for the first.
  • When building larger trees, each split is evaluated by comparing cosine similarities of possible thresholds; this process helps determine which threshold to use.
  • Initializing leaf output with zero can skew calculations since it does not reflect actual data; thus, including these values in cosine similarity calculations may be misleading.

Handling Data and Predictions

  • In practice, when sufficient data is available, CatBoost ignores initial rows during cosine similarity calculations to avoid bias from initialization values.
  • After constructing the first tree, predictions are updated by adding scaled leaf output values (using a learning rate of 0.1), improving upon initial predictions that were all set to zero.
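The prediction update is one line per row; this sketch uses the learning rate of 0.1 mentioned above:

```python
LEARNING_RATE = 0.1  # the learning rate used in the video

def update_predictions(predictions, leaf_outputs, lr=LEARNING_RATE):
    """New prediction = old prediction + learning_rate * leaf output value."""
    return [p + lr * o for p, o in zip(predictions, leaf_outputs)]

updated = update_predictions([0.0, 0.0], [2.0, 4.0])
```

Starting from all-zero predictions, leaf outputs of 2.0 and 4.0 move the predictions to 0.2 and 0.4, a small step toward the observed values that later trees will refine.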

Sequential Data Processing

  • CatBoost processes data sequentially, ensuring that the residual of a row does not influence its own prediction or leaf output calculation—this prevents leakage similar to Ordered Target Encoding.
  • Residuals are updated by subtracting the new predicted values from the observed values; the original categorical values are then restored, and the rows re-randomized, before the next tree is built.

Building Additional Trees

  • With only one threshold identified (0.29), only one additional tree is built after updating predictions and residuals; this iterative process continues until satisfactory predictions are achieved.
  • Typically, many trees would be created iteratively to refine predictions; however, only two trees are considered here for simplicity.

Understanding Symmetric Decision Trees in CatBoost

Characteristics of Symmetric Decision Trees

  • Larger trees in CatBoost utilize symmetric decision trees where identical thresholds apply at each node on the same level (e.g., 'Age' less than 12).
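A symmetric (oblivious) tree can be evaluated with one comparison per level, turning the yes/no answers into the bits of a leaf index. This is a sketch of the idea under assumed data structures, not CatBoost's actual implementation:

```python
def symmetric_tree_predict(row, splits, leaf_values):
    """Predict with a symmetric tree: every node on the same level asks
    the same question, so finding a leaf takes one comparison per level
    instead of walking a branching path of different questions."""
    index = 0
    for feature, threshold in splits:  # one (feature, threshold) per level
        index = (index << 1) | (row[feature] >= threshold)
    return leaf_values[index]

splits = [("Age", 12), ("Height", 150)]   # hypothetical two-level tree
leaf_values = [10, 20, 30, 40]            # one output per leaf
prediction = symmetric_tree_predict({"Age": 10, "Height": 160}, splits, leaf_values)
```

Because the comparisons are independent of the path taken, the leaf index can be computed with simple bit operations, which is what makes predictions across large datasets so fast.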

Trade-offs in Prediction Accuracy

  • While symmetric decision trees may reduce predictive accuracy as they represent weaker learners, they enhance computational efficiency during prediction processes.

Efficiency vs. Effectiveness

  • The design choice prioritizes speed over precision: while normal decision trees might have varied thresholds leading to more accurate predictions, symmetric structures allow faster computations across large datasets.

Understanding CatBoost's Unique Approach

Key Features of CatBoost

  • CatBoost employs a method where each level asks the same questions, simplifying the tracking process and ensuring consistency in decision-making across different paths through the tree.
  • The algorithm processes data sequentially, treating it as if arriving one row at a time. This approach helps avoid leakage during Target Encoding and when calculating output values from trees.
  • CatBoost constructs symmetrical trees, which are less complex and allow for quicker computation of output values, enhancing overall efficiency.

Additional Resources

  • For further learning on statistics and machine learning, resources such as PDF study guides and "The StatQuest Illustrated Guide to Machine Learning" are available at statquest.org.
Video description

Just like we saw in CatBoost Part 1, Ordered Target Encoding, we're going to use the training data one row at a time to build and calculate the output values from trees. This is part of CatBoost's determined effort to avoid leakage like there is no tomorrow. We'll also learn how CatBoost makes predictions once the trees are made.

NOTE: This StatQuest is based on the original CatBoost manuscript... https://arxiv.org/abs/1706.09516 ...and an example provided in the CatBoost documentation... https://catboost.ai/en/docs/concepts/algorithm-main-stages_cat-to-numberic

This video has been dubbed (English, Spanish, and Portuguese) using an artificial voice via https://aloud.area120.google.com to increase accessibility. You can change the audio track language in the Settings menu.

For a complete index of all the StatQuest videos, check out: https://statquest.org/video-index/

If you'd like to support StatQuest, please consider...
Patreon: https://www.patreon.com/statquest
...or...
YouTube Membership: https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw/join
...buying one of my books, a study guide, a t-shirt or hoodie, or a song from the StatQuest store... https://statquest.org/statquest-store/
...or just donating to StatQuest! https://www.paypal.me/statquest

Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter: https://twitter.com/joshuastarmer

0:00 Awesome song and introduction
1:10 Building the first tree
6:05 Quantifying the effectiveness of the first threshold
6:56 Testing a second threshold
9:05 Building the second tree
10:21 The main idea of how CatBoost works
12:15 Making predictions
13:02 Symmetric Decision Trees
14:56 Summary of the main ideas

Corrections:
2:05 Red should have gone into bin 0 instead of bin 1.
7:23 I should have said that the cosine similarity was 0.71.

#StatQuest #CatBoost #DubbedWithAloud