XGBoost is a library designed and optimized for generalized gradient boosting. It delivers state-of-the-art performance on typical supervised machine learning problems, powers more than half of the machine learning challenges on Kaggle, and has attracted many users from industry.
Although XGBoost performs better than other gradient-boosting implementations, training an XGBoost model is still time-consuming, and it usually takes extensive parameter tuning to get a highly accurate model. This creates a strong need to speed up the whole process. There are two ways to do so: one is to use powerful hardware such as GPUs; the other is to leverage a distributed computation framework such as Apache Spark. The latest version of XGBoost supports parallel tree construction algorithms on GPUs, which can significantly improve training performance. In addition, XGBoost integrates seamlessly with Spark, making it possible to build a unified machine learning pipeline on massive data with optimized parallel parameter tuning.
In this talk, we will cover the implementation and performance improvements of the GPU-based XGBoost algorithm, summarize our model tuning experience and best practices, share insights on how to build a heterogeneous data analytics and machine learning pipeline based on Spark in a GPU-equipped YARN cluster, and show how to push a model into production.