CatBoost Part 2: Building and Using Trees
Introduction to CatBoost
- Josh Starmer introduces the topic of CatBoost, specifically focusing on building and using trees in this second part of the series.
- The session encourages curiosity with a playful reminder: "A" always, "B" be, "C" curious. It assumes prior knowledge from Part 1 regarding Ordered Target Encoding and cosine similarity.
Data Preparation for Tree Building
- A simple dataset is introduced where 'Favorite Color' predicts 'Height', serving as an example for demonstrating tree creation in CatBoost.
- CatBoost first randomizes the order of the training rows, then applies Ordered Target Encoding to discrete columns with more than two categories; columns with only two categories are simply converted to 1's and 0's.
Understanding Ordered Target Encoding
- Unlike previous examples where the target was categorical, here 'Height' is continuous. Thus, it is binned into discrete categories for encoding purposes.
- Two bins are created based on height values: the smallest values go into bin 0, while larger values go into bin 1. This allows for effective encoding similar to Part 1.
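The binning step above can be sketched as follows. The heights here are hypothetical stand-ins (the video's exact values are not reproduced), and splitting at the midpoint of the range is one simple way to form two bins:

```python
import numpy as np

# Hypothetical continuous target values (not the video's actual data).
heights = np.array([1.62, 1.88, 1.71, 1.95, 1.56, 1.80])

# Split the continuous target into 2 discrete bins at the midpoint of
# its range: smaller values go to bin 0, larger values to bin 1.
midpoint = (heights.min() + heights.max()) / 2
bins = (heights > midpoint).astype(int)
print(bins)  # -> [0 1 0 1 0 1]
```

With the target discretized this way, the same Ordered Target Encoding scheme from Part 1 can be applied.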
Avoiding Leakage in Predictions
- Ordered Target Encoding processes data sequentially to prevent leakage—where a row’s target value could influence its own encoding.
- More data would allow for additional bins and different equations for encoding; details can be found on the CatBoost website.
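The sequential, leakage-free encoding described above can be sketched like this. The prior of 0.05 and the exact formula follow the StatQuest Part 1 example; CatBoost's production implementation offers more configurable variants (see the CatBoost documentation):

```python
def ordered_target_encode(categories, binned_target, prior=0.05):
    """Encode a categorical column one row at a time so that each row
    only 'sees' the targets of earlier rows (prevents leakage).

    Encoding per row: (sum of earlier binned targets for this category + prior)
                      / (count of earlier rows with this category + 1)
    """
    sums, counts, encoded = {}, {}, []
    for cat, t in zip(categories, binned_target):
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        encoded.append((s + prior) / (n + 1))
        # Update AFTER encoding, so a row's own target value
        # never influences its own encoding.
        sums[cat] = s + t
        counts[cat] = n + 1
    return encoded
```

For example, the third 'Blue' row is encoded using only the two 'Blue' rows that came before it.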
Initializing Model Predictions
- After applying Ordered Target Encoding, initial model predictions are set to zero across all rows, followed by calculating residuals (differences between observed heights and predicted heights).
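Because every initial prediction is zero, the first round of residuals is just the observed heights themselves. A minimal sketch, with hypothetical heights:

```python
heights = [1.62, 1.88, 1.71]          # hypothetical observed values
predictions = [0.0] * len(heights)    # initial predictions are all 0
residuals = [h - p for h, p in zip(heights, predictions)]
# residuals equal the observed heights since predictions start at 0
```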
Building the First Tree
- The process begins by identifying thresholds for 'Favorite Color' to establish the root of the tree; potential thresholds are calculated based on sorted values.
- For simplicity, a stump (a small decision tree with one split) is built instead of a full tree. The first threshold tested is set at 0.04.
Running Rows Through the Tree
- Each row's residual is placed in a leaf according to whether its encoded 'Favorite Color' value falls below the threshold, and the leaf's output is updated accordingly.
- As each row passes through, the leaf outputs are updated to the running average of the residuals seen so far, continuing until all rows have been processed.
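The row-by-row leaf updates can be sketched as below. The key detail (made explicit later in the notes) is that each row is assigned the leaf's output *before* its own residual is added, mirroring the anti-leakage idea of Ordered Target Encoding; the data values here are hypothetical:

```python
def stump_leaf_outputs(encoded_color, residuals, threshold):
    """Run rows through a stump one at a time.

    Each row gets the current output of its leaf -- the mean of the
    residuals of EARLIER rows in that leaf (0 if none) -- and only then
    is its own residual added, so no row influences its own output.
    """
    leaf_sums = [0.0, 0.0]
    leaf_counts = [0, 0]
    outputs = []
    for x, r in zip(encoded_color, residuals):
        leaf = 0 if x < threshold else 1
        out = leaf_sums[leaf] / leaf_counts[leaf] if leaf_counts[leaf] else 0.0
        outputs.append(out)
        leaf_sums[leaf] += r
        leaf_counts[leaf] += 1
    return outputs
```

For instance, the first row to reach a leaf always sees an output of 0, the second sees the first row's residual, and so on.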
Evaluating Threshold Effectiveness
- To assess how well predictions align with actual outcomes at each threshold, cosine similarity between leaf output values and residuals is computed.
- For the threshold 'Favorite Color' less than 0.04, the cosine similarity between the leaf output values and the residuals is 0.71; higher values indicate that the leaf outputs point in the same direction as the residuals.
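Cosine similarity itself is just the dot product of the two vectors divided by the product of their lengths. A minimal implementation, demonstrated on hypothetical vectors (not the video's actual leaf outputs and residuals):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical example: vectors at 45 degrees give ~0.71.
print(round(cosine_similarity([1.0, 0.0], [1.0, 1.0]), 2))  # -> 0.71
```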
This structured approach provides clarity on how CatBoost builds trees using ordered target encoding while avoiding common pitfalls like leakage during model training.
How CatBoost Determines Thresholds for Trees
Cosine Similarity in Tree Building
- The cosine similarity between the residuals and leaf output values is calculated for each candidate, yielding 0.79 for the second threshold compared to 0.71 for the first; the candidate with the higher similarity is chosen.
- When building larger trees, each split is evaluated by comparing cosine similarities of possible thresholds; this process helps determine which threshold to use.
- Initializing leaf output with zero can skew calculations since it does not reflect actual data; thus, including these values in cosine similarity calculations may be misleading.
Handling Data and Predictions
- In practice, when sufficient data is available, CatBoost ignores initial rows during cosine similarity calculations to avoid bias from initialization values.
- After constructing the first tree, predictions are updated by adding scaled leaf output values (using a learning rate of 0.1), improving upon initial predictions that were all set to zero.
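The prediction update is a standard gradient-boosting step: new prediction = old prediction + learning rate × leaf output. A sketch with hypothetical leaf outputs:

```python
learning_rate = 0.1          # the value used in the video
predictions = [0.0, 0.0, 0.0]
leaf_outputs = [1.0, 1.0, 3.0]   # hypothetical per-row outputs from the first tree

predictions = [p + learning_rate * o
               for p, o in zip(predictions, leaf_outputs)]
# predictions are now roughly [0.1, 0.1, 0.3]

# Residuals are then recomputed against the updated predictions:
heights = [1.62, 1.88, 1.71]     # hypothetical observed values
residuals = [h - p for h, p in zip(heights, predictions)]
```

Scaling by a small learning rate means each tree nudges the predictions only slightly, so many trees together converge gradually toward the observed values.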
Sequential Data Processing
- CatBoost processes data sequentially, ensuring that the residual of a row does not influence its own prediction or leaf output calculation—this prevents leakage similar to Ordered Target Encoding.
- Residuals are then updated by subtracting the new predictions from the observed values; the encoded columns are also restored to their original categorical values so the rows can be re-randomized and re-encoded before the next tree is built.
Building Additional Trees
- With only one threshold identified (0.29), only one additional tree is built after updating predictions and residuals; this iterative process continues until satisfactory predictions are achieved.
- Typically, many trees would be created iteratively to refine predictions; however, only two trees are considered here for simplicity.
Understanding Symmetric Decision Trees in CatBoost
Characteristics of Symmetric Decision Trees
- Larger trees in CatBoost utilize symmetric decision trees where identical thresholds apply at each node on the same level (e.g., 'Age' less than 12).
Trade-offs in Prediction Accuracy
- While symmetric decision trees may reduce predictive accuracy as they represent weaker learners, they enhance computational efficiency during prediction processes.
Efficiency vs. Effectiveness
- The design choice prioritizes speed over precision: while normal decision trees might have varied thresholds leading to more accurate predictions, symmetric structures allow faster computations across large datasets.
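Because every node on a level asks the same question, finding a row's leaf in a symmetric (oblivious) tree never requires walking node by node: the yes/no answers simply become the bits of the leaf index. A sketch of this idea (the feature names and thresholds are hypothetical):

```python
def symmetric_tree_leaf(row, level_splits):
    """Locate a row's leaf in a symmetric (oblivious) tree.

    level_splits: one (feature_index, threshold) pair per level, since
    every node on a level shares the same question. The answers form
    the bits of the leaf index, so lookup is just a few comparisons.
    """
    idx = 0
    for feature, threshold in level_splits:
        idx = (idx << 1) | (1 if row[feature] > threshold else 0)
    return idx

# Hypothetical 2-level tree: level 1 asks "Age > 12?", level 2 asks
# "encoded Favorite Color > 0.3?" -- 4 leaves, indexed 0..3.
splits = [(0, 12), (1, 0.3)]
print(symmetric_tree_leaf([10, 0.5], splits))  # -> 1 (no, yes -> 0b01)
```

This bit-twiddling lookup is what makes predictions on large datasets so fast, at the cost of each individual tree being a weaker learner.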
Understanding CatBoost's Unique Approach
Key Features of CatBoost
- CatBoost employs a method where each level asks the same questions, simplifying the tracking process and ensuring consistency in decision-making across different paths through the tree.
- The algorithm processes data sequentially, treating it as if arriving one row at a time. This approach helps avoid leakage during Target Encoding and when calculating output values from trees.
- CatBoost constructs symmetrical trees, which are less complex and allow for quicker computation of output values, enhancing overall efficiency.
Additional Resources
- For further learning on statistics and machine learning, resources such as PDF study guides and "The StatQuest Illustrated Guide to Machine Learning" are available at statquest.org.