XGBoost
What is XGBoost?
XGBoost is an open-source ensemble machine learning library built around a high-performance implementation of gradient-boosted decision trees. An underlying C++ codebase combined with a Python interface sitting on top makes XGBoost a very fast, scalable, and highly usable library.
XGBoost is an algorithm that has recently been dominating applied machine learning (ML) and Kaggle competitions for structured or tabular data. The XGBoost algorithm was developed as a research project at the University of Washington and presented at the SIGKDD Conference in 2016. Since its introduction, this algorithm has been credited not only with winning numerous Kaggle competitions but also with powering several cutting-edge industry applications. As a result, there is a strong community of data scientists contributing to the XGBoost open source project, with ~500 contributors and ~5000 commits on GitHub.
To understand how XGBoost works, you need to understand decision trees, ensembles, boosting, and gradient boosting.
[Figure: Decision tree illustration of Titanic survival rates. Source: Vighnesh Tiwari]
Decision trees are a type of ML model with a tree-like graph structure used for prediction. Each internal node of the tree tests an attribute, and each leaf corresponds to a target class. The popular electronic game 20Q and the Alexa skill Twenty Questions are both examples of a decision tree in action.
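To make that structure concrete, here is a minimal sketch using scikit-learn's DecisionTreeClassifier on the Iris dataset (an illustrative choice, separate from XGBoost itself); the printed rules show internal nodes testing feature thresholds and leaves assigning classes.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a small decision tree on the Iris dataset (illustrative choice)
data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# Each internal node tests an attribute (a feature threshold);
# each leaf corresponds to a predicted target class.
print(export_text(tree, feature_names=list(data.feature_names)))
```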
In general, an ensemble is a group of things acting together. Ensembling is an ML technique that combines several machine learning models into a single, more powerful predictive model that performs better than any of the individual models alone.
Boosting is an ensemble-based learning technique in which ML models are trained sequentially, and in each iteration the new model is trained to correct the errors made by the previous ones.
Gradient boosting is a boosting technique in which, at each iteration, the new model is trained to predict the residuals (i.e., the errors) of the models built so far, as sketched below.
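The following is a minimal from-scratch sketch of gradient boosting for regression, assuming a squared-error loss (so the residuals are the negative gradients) and using scikit-learn regression trees on toy data; all names and values are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data: y roughly follows x^2 (illustrative)
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.3, size=200)

n_rounds, learning_rate = 50, 0.1
prediction = np.zeros_like(y)   # start the ensemble from a zero prediction
trees = []

for _ in range(n_rounds):
    residuals = y - prediction                        # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)     # add a damped correction
    trees.append(tree)

def ensemble_predict(X_new):
    # Sum the (shrunken) contributions of all the trees
    return sum(learning_rate * t.predict(X_new) for t in trees)
```

Each round, the ensemble's remaining error shrinks a little further; this is the same behavior XGBoost implements, only far more efficiently and at scale.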
XGBoost improves on plain gradient boosting by using a second-order Taylor approximation of the loss (i.e., second-order gradients) and built-in regularization. It seeks to minimize the following objective function:
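The regularized objective from the paper can be written as

$$\mathcal{L}(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^{2}$$

where $l$ is a differentiable convex loss comparing the prediction $\hat{y}_i$ with the target $y_i$, $f_k$ is the $k$-th tree in the ensemble, $T$ is the number of leaves in a tree, $w$ are its leaf weights, and $\gamma$ and $\lambda$ control the strength of the regularization.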
Read the original paper, XGBoost: A Scalable Tree Boosting System, for more details.
Implementing XGBoost
XGBoost can be used to solve regression, classification, ranking, and user-defined prediction problems. It runs on Windows, Linux, and macOS, and provides interfaces for major programming languages, including Python, R, Java, Scala, and Julia. The distributed version of XGBoost is designed to be portable to various environments and can be ported to any cluster that supports Rabit, XGBoost's allreduce library. You can run XGBoost directly on YARN and Kubernetes; Mesos and other resource allocation engines can be supported as well.
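As a starting point, here is a minimal sketch of training a classifier through the Python interface's scikit-learn-style wrapper; the dataset and hyperparameter values are illustrative, not recommendations.

```python
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy binary classification problem (illustrative choice of dataset)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Gradient-boosted tree classifier via the scikit-learn-compatible wrapper
model = xgb.XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```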
Due to the nature of an ensemble model, you have to be careful about overfitting. The eta parameter in XGBoost gives you a way to counter this. Intuitively, eta can be thought of as a learning rate: rather than adding the predictions of each new tree to the ensemble with full weight, they are multiplied by eta so that each tree's correction is shrunk. This reduces the influence of any single tree on the overall model and lowers the likelihood of overfitting. It is common to use small eta values in the range of 0.1 to 0.3.
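For illustration, here is a minimal sketch of setting eta through the native training API, paired with early stopping on a validation set so that a smaller learning rate can be compensated by more boosting rounds; the dataset and parameter values are assumptions for the example.

```python
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)

dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)

# A smaller eta shrinks each tree's contribution; combine it with more
# rounds and early stopping so boosting halts once validation loss stalls.
params = {"objective": "binary:logistic", "eta": 0.1,
          "max_depth": 3, "eval_metric": "logloss"}
booster = xgb.train(params, dtrain, num_boost_round=500,
                    evals=[(dvalid, "valid")], early_stopping_rounds=20)
```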