The evolution of tree-based models — from robustness to optimization to scalability to better categorical feature handling
If you work with tabular data — the kind of structured data found in business analytics, finance, marketing, customer analysis, product analytics, operations, or surveys — you’ve probably encountered several popular machine learning algorithms:
Random Forest
XGBoost
LightGBM
CatBoost
All of them rely on decision trees, and all of them can produce strong predictive models.
But they were designed with different priorities in mind.
Random Forest emphasizes simplicity and robustness.
XGBoost focuses on highly optimized gradient boosting.
LightGBM was created to make boosting faster and more scalable.
CatBoost was created to make boosting work especially well with categorical variables.
Understanding why CatBoost was created and how it works makes it easier to decide when it is the right model for your data.
Many real-world datasets are not purely numeric.
They often contain many categorical variables such as:
customer segment
country
region
product category
industry
job title
subscription plan
marketing channel
survey answer
device type
store location
campaign ID
product ID
ZIP code
For example, a customer churn dataset might look like this.
| Customer Type | Region | Plan | Industry | Monthly Revenue | Churn |
|---|---|---|---|---|---|
| SMB | West | Starter | Retail | 120 | TRUE |
| Enterprise | East | Business | Finance | 2,500 | FALSE |
| Mid-Market | South | Pro | SaaS | 850 | FALSE |
Some columns are numeric.
But many important columns are categorical.
This creates a practical problem for machine learning.
Most tree-based models cannot simply use text values like:
Starter
Business
Retail
Finance
West
East
These values need to be represented in a way the model can use.
There are several ways to do this.
Common approaches include:
one-hot encoding
label encoding
native categorical split handling
target encoding
CatBoost’s ordered target statistics
CatBoost was designed around this problem from the beginning.
But to understand what makes CatBoost different, we first need to understand how Random Forest, XGBoost, and LightGBM approach tree-based modeling.
Random Forest builds many trees independently.
Each tree:
samples the dataset randomly
selects features randomly when splitting
produces its own prediction
The final prediction is the average for regression or the majority vote for classification.
The key idea is that many independent trees reduce variance and improve stability compared to a single decision tree.
But the trees do not learn from each other.
Each tree is trained separately.
This makes Random Forest relatively simple, robust, and easy to use, but it may not reach the same level of predictive accuracy as modern boosting algorithms.
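The sampling and aggregation steps described above can be sketched in plain Python. This is only an illustration of the idea, not how any particular library implements it. The helper names (`bootstrap_sample`, `aggregate_regression`, `aggregate_classification`) and the per-tree predictions are hypothetical.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as one tree's training sample."""
    return [rng.choice(rows) for _ in rows]


def aggregate_regression(tree_predictions):
    """Regression: average the predictions of all trees."""
    return mean(tree_predictions)


def aggregate_classification(tree_predictions):
    """Classification: return the most frequently predicted class."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(42)
rows = ["row1", "row2", "row3", "row4", "row5"]
sample = bootstrap_sample(rows, rng)  # same size as rows, rows may repeat

# Hypothetical predictions from five independently trained trees.
regression_result = aggregate_regression([118.0, 125.0, 121.0, 130.0, 116.0])
classification_result = aggregate_classification(
    ["churn", "stay", "churn", "churn", "stay"]
)
print(regression_result)      # 122.0
print(classification_result)  # churn
```

Each tree sees a slightly different sample, so individual trees disagree, and the averaging or voting step is what smooths out that disagreement.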
XGBoost uses a technique called gradient boosting.
Instead of building independent trees, it builds trees sequentially to correct the errors made by the previous trees.
Gradient boosting can be viewed as a form of gradient descent in function space: each new tree is one step that reduces the loss.
For example, in a regression problem, the model starts with an initial prediction and calculates the error.
| Row | Actual | Prediction | Gradient / Residual |
|---|---|---|---|
| 1 | 10 | 7 | 3 |
| 2 | 15 | 14 | 1 |
| 3 | 8 | 9 | -1 |
The next tree is built to predict these residuals, not the original target values. (For squared-error loss, the residual is the negative gradient of the loss.)
Why?
Because the negative gradient tells the model which direction to move each prediction to reduce the loss.
The model updates the prediction like this:
prediction_new = prediction_previous + learning_rate × prediction_from_new_tree
Conceptually, the model evolves like this:
Tree 1 → initial prediction
Tree 2 → fix errors from Tree 1
Tree 3 → fix remaining errors
Tree 4 → continue improving
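Here is one boosting step worked through in plain Python, using the three rows from the table above. It is a minimal sketch under two assumptions that are not in the original example: the loss is squared error, and the new tree reproduces the residuals exactly. The learning rate of 0.1 is also just an illustrative choice.

```python
actual = [10.0, 15.0, 8.0]
prediction = [7.0, 14.0, 9.0]
learning_rate = 0.1  # illustrative value


def mse(y_true, y_pred):
    """Mean squared error between two equal-length lists."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


# For squared-error loss, the negative gradient is simply the residual.
residuals = [y - p for y, p in zip(actual, prediction)]

# Assumption: the new tree predicts the residuals perfectly.
tree_output = residuals

# prediction_new = prediction_previous + learning_rate * prediction_from_new_tree
updated = [p + learning_rate * t for p, t in zip(prediction, tree_output)]

print(residuals)                       # [3.0, 1.0, -1.0]
print([round(u, 2) for u in updated])  # [7.3, 14.1, 8.9]
print(mse(actual, updated) < mse(actual, prediction))  # True
```

Each prediction moves only a small step toward its target. That is why boosting needs many trees, and why the learning rate and the number of trees are tuned together.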
This process often leads to more accurate models than Random Forest.
However, when categorical variables are included in the data, users need to decide how to represent them.
Traditionally, XGBoost has often been used with one-hot encoded categorical variables, though newer versions also support native categorical handling depending on the interface and configuration.
LightGBM also uses gradient boosting.
But LightGBM focuses heavily on making boosting faster and more scalable.
It optimizes:
how trees grow
how rows are sampled
how split candidates are evaluated
how high-dimensional sparse features are handled
Its major innovations include:
leaf-wise tree growth
histogram-based splitting
GOSS, or Gradient-based One-Side Sampling
EFB, or Exclusive Feature Bundling
LightGBM is especially useful when:
the dataset is large
there are many rows
there are many features
the data is sparse
training speed matters
LightGBM also supports native categorical feature handling.
When LightGBM is told that a variable is categorical, it does not simply create one-hot encoded columns. Instead, it keeps the categorical variable as one feature and tries to find useful ways to split its categories into groups.
For example, suppose we have this categorical variable:
Plan = Starter, Pro, Business, Enterprise
LightGBM may find a split like this:
Plan in {Starter, Pro}
vs.
Plan in {Business, Enterprise}
This is different from one-hot encoding, where each category becomes a separate binary column.
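The grouped split above can be pictured as a simple set-membership test. This sketch only shows how rows would be routed once such a split exists. It does not show how LightGBM searches for the grouping, and the `left_group` set and `goes_left` helper are hypothetical.

```python
# Hypothetical grouping chosen for one tree node.
left_group = {"Starter", "Pro"}


def goes_left(plan, group):
    """Route a row to the left child if its category is in the group."""
    return plan in group


plans = ["Starter", "Pro", "Business", "Enterprise"]
routing = {
    plan: ("left" if goes_left(plan, left_group) else "right") for plan in plans
}
print(routing)
# {'Starter': 'left', 'Pro': 'left', 'Business': 'right', 'Enterprise': 'right'}
```

A single split on a one-hot column can only separate one category from all the others. A grouped split can separate several categories at once.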
So LightGBM’s categorical handling is not the same as naive target encoding, and it does not automatically create the target leakage problem that we will discuss later.
Still, CatBoost takes a different approach to categorical variables.
CatBoost also uses gradient boosting.
It builds trees sequentially, just like XGBoost and LightGBM.
Tree 1 → initial prediction
Tree 2 → fix errors from Tree 1
Tree 3 → fix remaining errors
Tree 4 → continue improving
But CatBoost was designed especially to work well with categorical variables.
Its key ideas include:
built-in categorical feature handling
target-based statistics for categorical variables
ordered target statistics to reduce leakage
ordered boosting to reduce prediction shift
symmetric trees
strong default settings
The main point is not that XGBoost and LightGBM always have leakage problems with categorical variables.
That would be incorrect.
If categorical variables are converted with one-hot encoding, there is no target leakage from the encoding itself because one-hot encoding does not use the target variable.
LightGBM’s native categorical handling also does not simply use naive target encoding.
The more accurate point is this:
CatBoost was designed to safely use target-based information from categorical variables, especially high-cardinality categorical variables, while reducing the leakage and bias problems that can happen with naive target encoding.
Let’s unpack this.
One common way to convert categorical variables into numeric features is one-hot encoding.
For example, this column:
| Plan |
|---|
| Starter |
| Pro |
| Business |
can be converted into multiple columns:
| Plan_Starter | Plan_Pro | Plan_Business |
|---|---|---|
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
This approach is simple and safe.
It does not use the target variable.
So if we are predicting churn, the one-hot encoded columns are created only from the values in the Plan column.
They are not calculated from the Churn column.
That means one-hot encoding itself does not create target leakage.
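A minimal one-hot encoder in plain Python makes the point concrete. The `one_hot_encode` helper is a hypothetical name for illustration. Notice that it never receives the target column, so it cannot leak it.

```python
def one_hot_encode(values, prefix):
    """Return (column_names, rows) with one 0/1 column per distinct value."""
    categories = list(dict.fromkeys(values))  # distinct values, first-seen order
    columns = [f"{prefix}_{category}" for category in categories]
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return columns, rows


columns, rows = one_hot_encode(["Starter", "Pro", "Business"], "Plan")
print(columns)  # ['Plan_Starter', 'Plan_Pro', 'Plan_Business']
for row in rows:
    print(row)
# [1, 0, 0]
# [0, 1, 0]
# [0, 0, 1]
```

The number of output columns equals the number of distinct categories, which is fine for three plans and much less comfortable for tens of thousands of IDs.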
For categorical variables with a small number of unique values, one-hot encoding often works well.
For example:
Plan = Starter / Pro / Business
Region = East / West / South / North
Device = Desktop / Mobile / Tablet
These variables have only a few categories, so one-hot encoding creates only a few extra columns.
But one-hot encoding becomes less attractive when a categorical variable has many unique values.
For example:
Product ID: 50,000 categories
ZIP Code: 30,000 categories
Campaign ID: 10,000 categories
Search Keyword: 500,000 categories
Company Name: 1,000,000 categories
One-hot encoding these variables can create a huge number of sparse columns.
This can cause several problems:
too many columns
higher memory usage
slower training
sparse data
overfitting on rare categories
harder model interpretation
So one-hot encoding avoids leakage, but it can become inefficient or less effective for high-cardinality categorical variables.
This is where target-based encoding becomes attractive.
Another way to represent categorical variables is target encoding.
The idea is simple:
Replace each category with a statistic calculated from the target variable.
For example, suppose we are predicting customer churn.
Our data looks like this:
| Customer | Plan | Churn |
|---|---|---|
| A | Starter | 1 |
| B | Starter | 0 |
| C | Starter | 1 |
| D | Business | 0 |
| E | Business | 0 |
The target variable is:
Churn
And one explanatory variable is:
Plan
With target encoding, we calculate the average churn rate for each plan.
| Plan | Average Churn |
|---|---|
| Starter | 0.67 |
| Business | 0.00 |
Then we replace the plan value with that average.
| Customer | Plan | Churn | Encoded Plan |
|---|---|---|---|
| A | Starter | 1 | 0.67 |
| B | Starter | 0 | 0.67 |
| C | Starter | 1 | 0.67 |
| D | Business | 0 | 0.00 |
| E | Business | 0 | 0.00 |
This is useful because it captures information like:
Starter customers tend to churn more than Business customers.
Target encoding can be very powerful, especially for high-cardinality categorical variables.
Instead of creating thousands of one-hot columns, it can represent each category with one numeric value.
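The calculation in the tables above can be reproduced with a few lines of plain Python. This is a sketch of naive target encoding only. The `naive_target_encode` helper is a hypothetical name, and the data is the five-customer example from this section.

```python
from collections import defaultdict


def naive_target_encode(categories, targets):
    """Replace each category with the mean target over ALL rows of that category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    means = {category: sums[category] / counts[category] for category in sums}
    return [means[category] for category in categories], means


plans = ["Starter", "Starter", "Starter", "Business", "Business"]
churn = [1, 0, 1, 0, 0]

encoded, means = naive_target_encode(plans, churn)
print({plan: round(value, 2) for plan, value in means.items()})
# {'Starter': 0.67, 'Business': 0.0}
print([round(value, 2) for value in encoded])
# [0.67, 0.67, 0.67, 0.0, 0.0]
```

One numeric column replaces the whole Plan variable, no matter how many distinct plans exist.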
But there is a hidden problem.
If target encoding is done naively, it can leak information from the target variable into the input features.
Look at customer A.
| Customer | Plan | Churn | Encoded Plan |
|---|---|---|---|
| A | Starter | 1 | 0.67 |
The encoded value 0.67 was calculated using all Starter customers:
| Customer | Plan | Churn |
|---|---|---|
| A | Starter | 1 |
| B | Starter | 0 |
| C | Starter | 1 |
This means customer A’s own churn value was used to create customer A’s input feature.
That is the leakage.
The model is not directly seeing the target column, but the target value has influenced the feature value for the same row.
In simple terms:
The answer has partly leaked into the input feature.
This can make the model look better during training than it really is.
The problem becomes much worse with rare categories.
For example:
| Customer | Product ID | Churn |
|---|---|---|
| X | Product_999 | 1 |
If Product_999 appears only once, naive target encoding gives:
Product_999 churn rate = 1.00
So the encoded value becomes:
| Customer | Product ID | Churn | Encoded Product ID |
|---|---|---|---|
| X | Product_999 | 1 | 1.00 |
This is almost the same as copying the target value into the feature.
Any model can learn from this leaked information.
The model may look highly accurate on the training data, but perform poorly on new data.
This problem is not caused by XGBoost or LightGBM themselves.
It is caused by the preprocessing step.
If we create a leaked target-encoded feature and give it to XGBoost or LightGBM, those models will use it.
They do not automatically know that the feature was created using the target value from the same row.
So the risky workflow looks like this:
Raw categorical variable
↓
Naive target encoding using the full training data
↓
XGBoost / LightGBM / any other model
↓
Over-optimistic model performance
This is the leakage problem CatBoost was designed to reduce.
CatBoost uses target-based statistics for categorical variables, but it calculates them in an ordered way.
This technique is called ordered target statistics.
The idea is:
For each row, calculate the category statistic using only rows that came before it in a random ordering.
Let’s use a similar example, this time with four Starter customers.
| Customer | Plan | Churn |
|---|---|---|
| A | Starter | 1 |
| B | Starter | 0 |
| C | Starter | 1 |
| D | Starter | 1 |
Suppose the data is randomly ordered like this:
| Order | Customer | Plan | Churn |
|---|---|---|---|
| 1 | A | Starter | 1 |
| 2 | B | Starter | 0 |
| 3 | C | Starter | 1 |
| 4 | D | Starter | 1 |
For simplicity, suppose the prior value is the overall churn rate, say 0.50.
CatBoost-style ordered encoding would look like this:
| Order | Customer | Plan | Churn | Encoded Plan Used for This Row |
|---|---|---|---|---|
| 1 | A | Starter | 1 | 0.50 |
| 2 | B | Starter | 0 | 1.00 |
| 3 | C | Starter | 1 | 0.50 |
| 4 | D | Starter | 1 | 0.67 |
Here is what is happening.
For customer A, there are no previous Starter rows, so CatBoost uses a prior value.
A → no previous Starter rows → use prior = 0.50
For customer B, only customer A came before it.
B → previous Starter rows: A
→ average churn = 1 / 1 = 1.00
For customer C, customers A and B came before it.
C → previous Starter rows: A, B
→ average churn = (1 + 0) / 2 = 0.50
For customer D, customers A, B, and C came before it.
D → previous Starter rows: A, B, C
→ average churn = (1 + 0 + 1) / 3 = 0.67
The important point is this:
The encoded value for each row does not use that row’s own target value.
Customer C’s encoded value uses A and B, but not C.
Customer D’s encoded value uses A, B, and C, but not D.
So CatBoost can still learn that Starter customers tend to churn more, but it avoids directly leaking each row’s own answer into its feature value.
With naive target encoding:
Customer A's encoded value
= average churn of all Starter customers
= uses A's own churn value
= leakage
With CatBoost's ordered target statistics:
Customer A's encoded value
= average churn of previous Starter customers only
= does not use A's own churn value
= leakage is reduced
This is one of the most important ideas behind CatBoost.
CatBoost does not just save preprocessing work.
It provides a built-in way to use target-related categorical information more safely.
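The ordered walkthrough above can be reproduced in plain Python. This is a simplified sketch, not CatBoost's actual implementation. As far as the CatBoost documentation describes it, the real statistic also blends the prior into every row's value and is computed over random permutations of the data. The `ordered_target_stats` helper and its `prior_weight` argument are hypothetical names for illustration.

```python
def ordered_target_stats(categories, targets, prior, prior_weight=0.0):
    """Encode each row using only the targets of earlier rows in its category.

    prior_weight is a hypothetical smoothing knob: 0 reproduces the plain
    running average from the walkthrough, larger values pull every
    statistic toward the prior.
    """
    sums = {}
    counts = {}
    encoded = []
    for category, target in zip(categories, targets):
        previous_sum = sums.get(category, 0.0)
        previous_count = counts.get(category, 0)
        denominator = previous_count + prior_weight
        if denominator == 0:
            encoded.append(prior)  # no earlier rows: fall back to the prior
        else:
            encoded.append((previous_sum + prior_weight * prior) / denominator)
        # Only now does this row's own target enter the running totals.
        sums[category] = previous_sum + target
        counts[category] = previous_count + 1
    return encoded


plans = ["Starter", "Starter", "Starter", "Starter"]  # customers A, B, C, D
churn = [1, 0, 1, 1]

plain = ordered_target_stats(plans, churn, prior=0.50)
print([round(value, 2) for value in plain])     # [0.5, 1.0, 0.5, 0.67]

smoothed = ordered_target_stats(plans, churn, prior=0.50, prior_weight=1.0)
print([round(value, 3) for value in smoothed])  # [0.5, 0.75, 0.5, 0.625]
```

The first call matches the table in the walkthrough. The second call shows how a smoothing weight makes the early rows less jumpy, since a single previous row no longer produces an encoded value of exactly 1.00.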
The difference becomes even clearer with high-cardinality variables.
Imagine you accidentally include Customer ID as a feature.
| Customer ID | Churn |
|---|---|
| C001 | 1 |
| C002 | 0 |
| C003 | 1 |
| C004 | 0 |
If each customer ID appears only once, naive target encoding creates this:
| Customer ID | Churn | Encoded Customer ID |
|---|---|---|
| C001 | 1 | 1.00 |
| C002 | 0 | 0.00 |
| C003 | 1 | 1.00 |
| C004 | 0 | 0.00 |
This is basically the target variable copied into the feature column.
With CatBoost’s ordered approach, the first time each customer ID appears, there are no previous rows for that ID.
So CatBoost uses a prior value instead.
| Customer ID | Churn | CatBoost-style Encoded Value |
|---|---|---|
| C001 | 1 | prior |
| C002 | 0 | prior |
| C003 | 1 | prior |
| C004 | 0 | prior |
This prevents the model from memorizing the target through unique IDs.
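The contrast between the two tables above is easy to check in plain Python. This sketch uses the four-customer example, hypothetical helper names (`naive_encode`, `ordered_encode`), and an assumed prior of 0.50. The ordered version is the simplified running average, not CatBoost's exact formula.

```python
from collections import Counter


def naive_encode(categories, targets):
    """Mean target per category, computed over all rows (own row included)."""
    totals, counts = Counter(), Counter()
    for category, target in zip(categories, targets):
        totals[category] += target
        counts[category] += 1
    return [totals[category] / counts[category] for category in categories]


def ordered_encode(categories, targets, prior):
    """Mean target of earlier rows in the same category, else the prior."""
    totals, counts = Counter(), Counter()
    encoded = []
    for category, target in zip(categories, targets):
        if counts[category]:
            encoded.append(totals[category] / counts[category])
        else:
            encoded.append(prior)
        totals[category] += target
        counts[category] += 1
    return encoded


customer_ids = ["C001", "C002", "C003", "C004"]
churn = [1, 0, 1, 0]

naive = naive_encode(customer_ids, churn)
ordered = ordered_encode(customer_ids, churn, prior=0.50)

print(naive)           # [1.0, 0.0, 1.0, 0.0]
print(naive == churn)  # True: the feature is a copy of the target
print(ordered)         # [0.5, 0.5, 0.5, 0.5]
```

The naive column predicts the training target perfectly and carries no information for new customers. The ordered column is constant, so a tree has nothing to memorize.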
Of course, this does not mean that pure ID columns are good predictive features.
Usually, you should still remove pure identifier columns such as customer ID, transaction ID, or row ID.
But CatBoost’s encoding approach is safer than naive target encoding.
CatBoost also uses another technique called ordered boosting.
In standard gradient boosting, the gradients for each training row are calculated from a model that was itself trained on that same row.
This can create a subtle bias called prediction shift.
The simple explanation is:
During training, the model may see a slightly easier version of the problem than it will see when predicting new data.
CatBoost uses an ordered training procedure to reduce this bias.
You do not need to understand all the mathematical details to use CatBoost.
The practical message is:
CatBoost tries to make the training process closer to the real prediction situation, where the model must predict unseen rows without using their target values.
Together, ordered target statistics and ordered boosting are two of the main reasons CatBoost often works well on datasets with many categorical variables.
Another distinctive feature of CatBoost is that it usually builds symmetric trees, also called oblivious trees.
A symmetric tree uses the same split condition for every node at a given level of the tree.
For example:
              Root
            /      \
      Split A      Split A
      /    \       /    \
Split B  Split B  Split B  Split B
This is different from many other tree algorithms, where each branch can use different split conditions.
Symmetric trees are more constrained, but they have several advantages.
They can be:
faster to apply to new data
easier to regularize
more stable
less likely to overfit in some situations
The trade-off is that symmetric trees may sometimes be less flexible than fully asymmetric trees.
But in practice, CatBoost often performs very well with this structure.
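One way to see why symmetric trees are fast to apply: because every level asks the same question, a prediction is just a short list of comparisons that together form a leaf index. The sketch below is a plain-Python illustration, not CatBoost's internal code. The function name, the `tenure_months` feature, the thresholds, and the leaf values are all hypothetical.

```python
def oblivious_tree_predict(row, splits, leaf_values):
    """Predict with a symmetric tree.

    splits: one (feature_name, threshold) pair per level.
    leaf_values: 2 ** len(splits) values, indexed by the comparison bits.
    """
    index = 0
    for feature, threshold in splits:
        bit = 1 if row[feature] > threshold else 0
        index = (index << 1) | bit  # append this level's answer to the index
    return leaf_values[index]


# Depth-2 tree: the same condition is used by every node on a level.
splits = [("monthly_revenue", 500.0), ("tenure_months", 12.0)]
leaf_values = [0.30, 0.10, 0.05, -0.20]  # hypothetical leaf outputs

row_a = {"monthly_revenue": 850.0, "tenure_months": 6.0}   # bits 1, 0 -> leaf 2
row_b = {"monthly_revenue": 120.0, "tenure_months": 24.0}  # bits 0, 1 -> leaf 1

print(oblivious_tree_predict(row_a, splits, leaf_values))  # 0.05
print(oblivious_tree_predict(row_b, splits, leaf_values))  # 0.1
```

There is no branching structure to walk through. A tree of depth d needs d comparisons and one table lookup for every row.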
CatBoost reduces the leakage problem caused by naive target encoding for categorical variables.
But it does not magically prevent all types of data leakage.
You can still create leakage if your dataset includes variables that would not be available at prediction time.
For example, if you are predicting customer churn, these variables may leak future information:
Cancellation Date
Refund Amount After Cancellation
Support Ticket After Churn
Final Invoice Status
Reason for Leaving
These variables may only become known after the customer has already churned.
No model can automatically fix this kind of leakage.
You still need to make sure that all explanatory variables are available before the prediction point.
So the correct understanding is:
CatBoost helps reduce leakage related to target-based categorical encoding, but users still need to avoid general data leakage in the dataset.
LightGBM and CatBoost can both handle categorical variables, but they do so differently.
LightGBM’s native categorical handling keeps a categorical variable as one feature and finds useful splits between groups of categories.
For example:
Plan in {Starter, Pro}
vs.
Plan in {Business, Enterprise}
CatBoost uses target-based statistics in an ordered way.
For example:
Starter for this row
→ average churn of previous Starter rows only
So the difference can be summarized like this:
| Model / Approach | How categorical variables are handled |
|---|---|
| One-hot encoding | Converts each category into a separate binary column |
| LightGBM native categorical handling | Splits categories into groups inside the tree |
| CatBoost | Uses ordered target statistics and ordered boosting |
| Naive target encoding | Replaces category with target average, but can leak target information |
The important point is not that LightGBM is unsafe.
The important point is that CatBoost takes a different approach.
LightGBM focuses strongly on efficient tree growth and scalable training.
CatBoost focuses strongly on categorical variables and leakage-reducing target statistics.
CatBoost has many parameters, but you do not need to start with all of them.
For most users, the most important ones are:
number of trees
learning rate
tree depth
L2 regularization
random strength
bagging temperature
early stopping
Let’s look at them briefly.
In CatBoost, the number of trees is often controlled by a parameter called iterations.
More trees give the model more chances to learn patterns.
iterations = 1000
But more trees also mean longer training time and a higher risk of overfitting.
A good practical approach is to use a large number of trees together with early stopping.
iterations = 3000
early_stopping_rounds = 50
This means:
Train up to 3,000 trees, but stop if validation performance no longer improves.
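The stopping rule can be sketched as a small loop over validation losses. This is a generic illustration of patience-based early stopping, not CatBoost's internal logic. The function name and the loss values are made up, and a patience of 3 is used instead of 50 to keep the example short.

```python
def early_stopping(validation_losses, patience):
    """Return (best_iteration, stopped_at) for a patience-based stopping rule.

    Training stops once `patience` consecutive iterations fail to improve
    on the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_iteration = -1
    for iteration, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            best_iteration = iteration
        elif iteration - best_iteration >= patience:
            return best_iteration, iteration
    return best_iteration, len(validation_losses) - 1


# Hypothetical validation loss after each tree is added.
losses = [0.60, 0.50, 0.45, 0.44, 0.46, 0.47, 0.48, 0.43]

best_iteration, stopped_at = early_stopping(losses, patience=3)
print(best_iteration, stopped_at)  # 3 6
```

Training stops at iteration 6 and keeps the model from iteration 3. The later loss of 0.43 is never reached. That is the price of a short patience, and the reason values like 50 are common.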
The learning rate controls how much each new tree contributes to the final prediction.
prediction_new = prediction_previous + learning_rate × prediction_from_new_tree
A large learning rate learns quickly.
A small learning rate learns slowly and carefully.
| Learning Rate | Meaning |
|---|---|
| 0.1 | faster, more aggressive |
| 0.05 | balanced |
| 0.03 | slower, often more stable |
| 0.01 | very slow, may need many trees |
The common trade-off is:
higher learning rate → fewer trees
lower learning rate → more trees
For example:
learning_rate = 0.1
iterations = 500
can be changed to:
learning_rate = 0.03
iterations = 2000
for a slower but often more reliable model.
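The trade-off between learning rate and number of trees can be seen in a toy calculation. The sketch below assumes an idealized setting where every tree predicts the remaining residual perfectly, so the residual shrinks by a factor of (1 - learning_rate) per tree. Real trees only approximate the residuals, so treat the counts as an illustration of the direction of the effect. It says nothing about which setting generalizes better. The function name and tolerance are hypothetical.

```python
def trees_needed(initial_residual, learning_rate, tolerance):
    """Count idealized boosting steps until the residual is within tolerance."""
    residual = initial_residual
    trees = 0
    while abs(residual) > tolerance:
        # Each idealized tree predicts the full residual; only a fraction is applied.
        residual -= learning_rate * residual
        trees += 1
    return trees


fast = trees_needed(initial_residual=3.0, learning_rate=0.1, tolerance=0.01)
slow = trees_needed(initial_residual=3.0, learning_rate=0.03, tolerance=0.01)

print(fast, slow)
print(slow > fast)  # True: a smaller learning rate needs more trees
```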
Tree depth controls how complex each tree can become.
depth = 6
A smaller depth creates simpler trees.
A larger depth allows the model to capture more complex interactions.
| Depth | Meaning |
|---|---|
| 4 | simpler model |
| 6 | common starting point |
| 8 | more complex |
| 10 | very complex |
If the model is underfitting, increasing depth may help.
If the model is overfitting, reducing depth may help.
L2 regularization controls how strongly the model penalizes large leaf values.
In simple terms:
It prevents the model from becoming too aggressive.
A common parameter is:
l2_leaf_reg = 3
Larger values make the model more conservative.
This can help reduce overfitting.
Random strength adds randomness when CatBoost chooses splits.
This can help prevent the model from relying too heavily on obvious splits that may not generalize well.
random_strength = 1
Increasing this value can reduce overfitting, but too much randomness may weaken the model.
Bagging temperature controls how much randomness is used when rows are weighted for each tree. It applies to CatBoost's Bayesian bootstrap.
bagging_temperature = 1
A higher value increases randomness.
This can help reduce overfitting, especially when the model is too sensitive to the training data.
Early stopping is one of the most useful tools for boosting models.
Instead of guessing the perfect number of trees, you let the validation data decide.
iterations = 3000
early_stopping_rounds = 50
If validation performance does not improve for 50 rounds, training stops.
This helps prevent overfitting and saves time.
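Putting the parameters from this section together, a starting configuration might look like the dictionary below. The values are the ones used as examples in this article, not tuned recommendations. The `bootstrap_type` entry is an addition of mine, made on the understanding that `bagging_temperature` applies to the Bayesian bootstrap. The commented usage assumes the `catboost` Python package is installed and that the data variables are your own.

```python
catboost_params = {
    "iterations": 3000,            # upper bound on the number of trees
    "learning_rate": 0.03,         # smaller steps, paired with more trees
    "depth": 6,                    # common starting point
    "l2_leaf_reg": 3,              # L2 penalty on leaf values
    "random_strength": 1,          # randomness added when scoring splits
    "bootstrap_type": "Bayesian",  # assumption: needed for bagging_temperature
    "bagging_temperature": 1,      # randomness of the row weights
    "early_stopping_rounds": 50,   # stop after 50 rounds without improvement
}

# Hedged usage sketch. X_train, y_train, X_valid, y_valid, and cat_columns
# are placeholders for your own data and categorical column names:
#
#   from catboost import CatBoostClassifier
#   model = CatBoostClassifier(**catboost_params)
#   model.fit(
#       X_train, y_train,
#       cat_features=cat_columns,
#       eval_set=(X_valid, y_valid),
#   )

for name, value in catboost_params.items():
    print(f"{name} = {value}")
```

Early stopping only has an effect when a validation set is supplied, which is what `eval_set` is for in the commented call.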
Each algorithm has strengths.
Random Forest is good when:
you want a simple baseline
the dataset is relatively small
minimal tuning is preferred
stability is more important than maximum accuracy
Random Forest is often a good first model because it is easy to understand and relatively robust.
XGBoost is good when:
datasets are moderate in size
you want strong predictive performance
you want detailed tuning options
you need a mature and widely used boosting algorithm
XGBoost is a strong general-purpose gradient boosting model.
LightGBM is good when:
datasets are large
there are many rows
there are many features
the data is sparse
training speed matters
LightGBM is often a great choice when performance and scalability are important.
CatBoost is good when:
the dataset has many categorical variables
you want to reduce manual preprocessing
one-hot encoding would create too many columns
you have high-cardinality categorical variables
you want strong default performance
you are working with customer, marketing, survey, product, or business data
CatBoost is especially attractive when your dataset contains many character or factor columns.
A common workflow in machine learning projects is:
Start with Random Forest as a baseline.
Try XGBoost or LightGBM to improve performance.
Prefer LightGBM when datasets are large or training time becomes a bottleneck.
Prefer CatBoost when the dataset has many categorical variables, especially high-cardinality categorical variables.
A simple rule of thumb is:
Mostly numeric and large data → LightGBM
Many categorical variables → CatBoost
Need a simple robust baseline → Random Forest
Need mature boosting flexibility → XGBoost
Of course, the best model depends on the actual data.
But this rule can help you choose where to start.
Random Forest, XGBoost, LightGBM, and CatBoost all rely on decision trees, but they represent different philosophies.
Random Forest focuses on robust ensembles
XGBoost focuses on optimized gradient boosting
LightGBM focuses on efficient and scalable boosting
CatBoost focuses on boosting with better categorical feature handling
The main advantage of CatBoost is not that other models always suffer from leakage.
That would be too broad.
One-hot encoding is safe from target leakage, and LightGBM has its own native categorical split handling.
CatBoost’s unique contribution is that it provides a built-in way to use target-based categorical information more safely through ordered target statistics and ordered boosting.
This makes CatBoost especially useful for real-world tabular datasets that contain many categorical variables, such as customer data, marketing data, product data, survey data, and business operations data.
You can try CatBoost with Exploratory.
Go to Analytics view.
Select CatBoost.
Select a Target Variable.
Select Explanatory Variables, or Features.
Click the Run button.
CatBoost can be especially useful when your data includes categorical variables such as customer segment, region, product category, plan type, industry, campaign ID, product ID, ZIP code, or survey answers.
You can start using CatBoost in the latest version of Exploratory.
👉 Download Exploratory
https://exploratory.io/download
If you don’t have an account yet, sign up here to start your 30-day free trial.
If your trial has expired but you’d like to try the new features, simply launch the latest version and use the Extend Trial option.
If you have questions or feedback, feel free to contact me at kan@exploratory.io.
We’d love to hear how you’re using Exploratory to uncover insights in your data.
Kan Nishida
CEO, Exploratory