http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=C55jiang&feedformat=atomstatwiki - User contributions [US]2021-11-27T20:52:00ZUser contributionsMediaWiki 1.28.3http://wiki.math.uwaterloo.ca/statwiki/index.php?title=summary&diff=34542summary2018-03-17T17:28:16Z<p>C55jiang: /* Weighted Quantile Sketch */</p>
<hr />
<div>'''''XGBoost: A Scalable Tree Boosting System'''''<br />
<br />
== Presented by ==<br />
<br />
Jiang, Cong<br />
<br />
Song, Ziwei<br />
<br />
Ye, Zhaoshan<br />
<br />
Zhang, Wenling<br />
<br />
= Introduction =<br />
<br />
Extreme Gradient Boosting, better known as XGBoost, has been the leading model dominating Kaggle competitions as well as other machine learning and data mining challenges. The main reason for its success is that this tree boosting system is highly scalable, which means it can be used to solve a variety of problems with significantly less time and fewer resources. <br />
<br />
This paper is written by Tianqi Chen, the creator of XGBoost. It gives an overview of the algorithmic optimizations and systems-level designs behind XGBoost, organized as follows:<br />
<br />
1. Building on the gradient tree boosting algorithm, Chen modifies the regularized objective function and introduces an end-to-end tree boosting system. He also applies shrinkage and feature subsampling to prevent overfitting. <br />
<br />
2. The system supports existing split finding algorithms, including the exact greedy and approximate algorithms. For weighted data, Chen proposes a distributed weighted quantile sketch algorithm with provable guarantees; for sparse data, he develops a novel sparsity-aware algorithm that is much faster than a naive implementation.<br />
<br />
3. Storing the data in ''blocks'' is Chen's method to reduce sorting cost. The cache-aware block structure is highly effective for out-of-core tree learning.<br />
<br />
4. XGBoost is evaluated through designed experiments on 4 datasets. Chen compares it under adjusted settings as well as against other systems to test its performance.<br />
<br />
= Tree Boosting=<br />
<br />
<br />
=== Regularized Learning Objective ===<br />
Suppose the given data set has m features and n examples. A tree ensemble model uses K additive functions to predict the output.<br />
[[File:eq1.png|center|eq1.png]]<br />
where F is the space of regression trees (CART).<br />
The author suggests the following regularized objective function as an improvement over standard gradient boosting:<br />
[[File:eq2.png|center|eq2.png]]<br />
A regression tree contains a continuous score on each leaf, which differs from the decision trees we have seen in class. The following two figures show how the decision rules in the trees classify an example into leaves, and how the final prediction is calculated by summing up the scores in the corresponding leaves.<br />
<br />
[[File:Picture2.1.png|500px| thumb| center| Figure 0: Prediction Score of Tree Structure.]]<br />
<br />
<br />
[[File:Picture2.2.png|500px|thumb|center|Figure 1: Tree Ensemble Model. The final prediction for a given example is the sum of predictions from each tree. ]]<br />
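As a toy illustration of the ensemble in Figure 1 (the trees below are hypothetical, hard-coded stand-ins for learned regression trees), a prediction is just the sum of one leaf score per tree:<br />

```python
# Toy illustration: the ensemble prediction is the sum of leaf scores,
# one per tree. These hard-coded trees are hypothetical, not learned.

def tree1(example):
    # First tree: splits on age, then on gender.
    if example["age"] < 15:
        return 2.0 if example["is_male"] else 0.1
    return -1.0

def tree2(example):
    # Second tree: splits on daily computer use.
    return 0.9 if example["uses_computer_daily"] else -0.9

def ensemble_predict(example, trees):
    # Final prediction = sum of the scores from every tree.
    return sum(tree(example) for tree in trees)

boy = {"age": 10, "is_male": True, "uses_computer_daily": True}
print(ensemble_predict(boy, [tree1, tree2]))  # 2.0 + 0.9 = 2.9
```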
<br />
=== Gradient Tree Boosting ===<br />
The tree ensemble model (Equation 2) is trained in an additive manner: at each step we add a new function <math>f</math> to our objective.<br />
[[File:eq2.2.png|center|eq2.2.png]]<br />
After applying the second-order Taylor expansion to the above formula and removing constant terms, we can finally have<br />
[[File:eq3.png|center|eq3.png]]<br />
Define <math>I_{j} = \{i \mid q(x_{i}) = j \}</math><br />
as the instance set of leaf <math>j</math>. We can rewrite Equation (3) as<br />
[[File:eq4.png|center|eq4.png]]<br />
By computing the optimal weight of each leaf and the corresponding optimal objective value, we obtain the following equation as a scoring function that measures the quality of a tree structure. A lower score means a better tree structure.<br />
[[File:eq5.png|center|eq5.png]]<br />
The following figure shows an example of how this score is calculated.<br />
[[File:Picture2.3.png|500px| thumb|center|Figure 2: Structure Score Calculation. We only need to sum up the gradient and second order gradient statistics on each leaf, then apply the scoring formula to get the quality score. ]]<br />
<br />
<br />
<br />
Normally it is impossible to enumerate all possible tree structures, so a greedy algorithm grows the tree one split at a time. Assume <math>I_{L}</math> and <math>I_{R}</math> are the instance sets of the left and right nodes after a split, and <math> I = I_{L} \cup I_{R}</math>. Then the loss reduction after the split is given by<br />
[[File:eq6.png|center]]<br />
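This loss reduction can be sketched directly from Equation (6); `lam` and `gamma` below stand for the regularization parameters λ and γ:<br />

```python
# Sketch of the loss reduction (Equation 6) for a candidate split, given
# the gradient sums G_L, G_R and Hessian sums H_L, H_R of the left and
# right instance sets; lam and gamma are the regularization parameters.

def leaf_quality(G, H, lam):
    # G^2 / (H + lam): the structure-score contribution of one leaf.
    return G * G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    return 0.5 * (leaf_quality(G_L, H_L, lam)
                  + leaf_quality(G_R, H_R, lam)
                  - leaf_quality(G_L + G_R, H_L + H_R, lam)) - gamma

# A split is only worth taking when this gain is positive.
print(split_gain(-2.0, 3.0, 4.0, 5.0))
```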
<br />
=== Shrinkage and Column Subsampling ===<br />
In addition to the regularized objective, the paper also introduces two techniques to prevent over-fitting. <br />
<br />
Method 1: Shrinkage. After each step of tree boosting, newly added weights are scaled by a shrinkage factor η, which reduces the influence of each individual tree and leaves room for future trees to improve the model. <br />
<br />
Method 2: Column (feature) subsampling. Each individual tree is built with only a randomly chosen subset of the predictors. This procedure encourages diversity among the trees (i.e., no two identical trees occur). As implemented in the open-source package, this method prevents over-fitting even compared with traditional row subsampling, and speeds up the computations of the parallel algorithm as well.<br />
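A minimal sketch of both techniques in a toy boosting loop (the helper `fit_stump` is a hypothetical stand-in for fitting a tree to the current residuals):<br />

```python
import random

# Toy boosting loop sketching shrinkage and column subsampling.
# fit_stump is a hypothetical stand-in for fitting a regression tree.

def fit_stump(residuals, cols):
    # A real tree fitter would restrict its splits to the features in
    # `cols`; this stand-in just predicts the mean residual.
    return sum(residuals) / len(residuals)

def boost(y, rounds=50, eta=0.3, colsample=0.5, n_features=10):
    pred = [0.0] * len(y)
    for _ in range(rounds):
        # Column subsampling: each tree sees only a random feature subset.
        cols = random.sample(range(n_features), int(colsample * n_features))
        step = fit_stump([yi - pi for yi, pi in zip(y, pred)], cols)
        # Shrinkage: scale the new tree's contribution by eta.
        pred = [pi + eta * step for pi in pred]
    return pred

# Predictions approach the target as (1 - eta)^rounds shrinks toward 0.
print(boost([1.0, 1.0, 1.0], rounds=20)[0])
```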
<br />
= Split Finding Algorithms =<br />
There are two main methods for finding the best split. <br />
<br />
Basic exact greedy algorithm: enumerate all possible splits on all features, which is computationally demanding. The data is first sorted by feature value, then visited in order to greedily accumulate the gradient statistics. <br />
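This sweep can be sketched in a simplified single-feature form (not the paper's full Algorithm 1):<br />

```python
# Simplified single-feature sketch of exact greedy split finding: sort
# instances by feature value, then sweep once, accumulating gradient
# statistics to score every possible split position.

def best_split(values, grads, hess, lam=1.0):
    order = sorted(range(len(values)), key=lambda i: values[i])
    G, H = sum(grads), sum(hess)
    G_L = H_L = 0.0
    best_gain, best_value = 0.0, None
    for i in order[:-1]:          # a split after the last value is no split
        G_L += grads[i]
        H_L += hess[i]
        G_R, H_R = G - G_L, H - H_L
        gain = (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
                - G**2 / (H + lam))
        if gain > best_gain:
            best_gain, best_value = gain, values[i]
    return best_value, best_gain  # split goes just after best_value

print(best_split([3, 1, 2, 10], [1, 1, 1, -3], [1, 1, 1, 1]))  # (3, 6.75)
```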
<br />
[[File:exact_greedy.png|500px| thumb|center ]]<br />
[[File:appro_greedy.png|500px| thumb|center]]<br />
<br />
Approximate algorithm: to make gradient tree boosting efficient on large data, the author introduces an approximate algorithm. It first proposes candidate splitting points according to percentiles of the feature distribution, maps the continuous features into buckets defined by these candidates, and then finds the best solution among the proposals based on the aggregated statistics. <br />
The algorithm has two variants, which differ in when the proposals are made. <br />
<br />
The global variant: all candidate splits are proposed at the initial phase of tree construction and are used for all split finding. It requires fewer proposal steps. <br />
<br />
The local variant: the candidate splits are refined after each split. It requires fewer candidate points, so it is potentially more appropriate for deeper trees. <br />
The variant can be chosen according to the user's needs. <br />
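The proposal-and-bucketing step can be sketched with plain unweighted percentiles (the paper's actual proposal uses the weighted quantile sketch of the next subsection):<br />

```python
import bisect

# Sketch of the proposal step, using plain (unweighted) percentiles of a
# feature to propose candidates, then bucketing continuous values.

def propose_candidates(values, n_candidates):
    s = sorted(values)
    # Evenly spaced percentiles of the feature's empirical distribution.
    return [s[int(q * (len(s) - 1) / n_candidates)]
            for q in range(1, n_candidates + 1)]

def bucket(value, candidates):
    # Index of the bucket a continuous value falls into.
    return bisect.bisect_left(candidates, value)

cands = propose_candidates(list(range(100)), 4)
print(cands)              # [24, 49, 74, 99]
print(bucket(37, cands))  # 1
```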
<br />
[[File:auc.png|500px| thumb|center]]<br />
<br />
=== Weighted Quantile Sketch ===<br />
Motivation: one key step in the approximate algorithm is proposing candidate split points. Usually, percentiles of a feature are used so that candidates are distributed evenly over the data.<br />
<br />
Let the multi-set<br />
<math><br />
D_k = \{ (x_{1k}, h_1), (x_{2k}, h_2) , \dots, (x_{nk}, h_n) \}<br />
</math><br />
represent the k-th feature values and second-order gradient statistics of the training instances. <br />
Then define a rank function <math><br />
r_k : \mathbb{R} \rightarrow [0, \infty)<br />
</math>:<br />
[[File:For4.png|500px| thumb|center ]]<br />
which represents the proportion of instances whose k-th feature value is smaller than <math> z </math>. The goal is to find candidate split points <math> \{ s_{k1}, s_{k2}, \dots, s_{kl} \} </math> such that<br />
[[File:For5.png|500px| thumb|center ]]<br />
Here <math> \epsilon </math> is an approximation factor, so there are roughly <math> 1/\epsilon </math> candidate points. <br />
Each data point is weighted by <math> h_i </math>; this is justified by rewriting Eq (3) as<br />
[[File:For3.png|500px| thumb|center ]]<br />
we have <br />
[[File:For6.png|500px| thumb|center ]]<br />
which is exactly a weighted squared loss with labels <math> g_i/ h_i </math> and weights <math> h_i </math>.<br />
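The rank function itself can be computed naively as follows (an exact computation; the paper's contribution is a sketch data structure that approximates it in a distributed, mergeable way):<br />

```python
# Naive exact computation of the rank function r_k(z): the fraction of
# total second-order gradient weight carried by instances whose k-th
# feature value is below z.

def weighted_rank(pairs, z):
    # pairs: the multi-set D_k of (feature value x_ik, weight h_i).
    total = sum(h for _, h in pairs)
    below = sum(h for x, h in pairs if x < z)
    return below / total

D_k = [(1.0, 0.5), (2.0, 1.5), (3.0, 2.0)]
print(weighted_rank(D_k, 2.5))  # (0.5 + 1.5) / 4.0 = 0.5
```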
<br />
=== Sparsity-aware Split Finding ===<br />
<br />
In real-world problems, it is quite common for the input <math> X </math> to be sparse. Common causes of sparsity include: 1. the presence of missing values in the data; 2. frequent zero entries in the statistics; and 3. artifacts of feature engineering such as one-hot encoding.<br />
<br />
[[File:Figure 4.png|500px| thumb|center | Figure 4. Tree structure with default directions. An example will be classified into the default direction when the feature needed for the split is missing]]<br />
This paper proposes adding a default direction to each tree node, as shown in Figure 4. When a value is missing in the sparse matrix <math> X </math>, the instance is classified into the default direction. There are two possible default directions at each branch, and the optimal one is learned from the data. Algorithm 3 shows the sparsity-aware split finding procedure. <br />
<br />
[[File:Algorithm 3.png|500px| thumb|center ]]<br />
The key improvement is to visit only the non-missing entries <math> I_k</math>. The presented algorithm treats non-presence as a missing value and learns the best direction to handle missing values.<br />
XGBoost handles all sparsity patterns in a unified way. More importantly, this paper exploits the sparsity to make the computational complexity linear in the number of non-missing entries in the input. <br />
[[File:Figure 5.png|500px| thumb|center | Figure 5. Impact of the sparsity aware algorithm on Allstate-10K. The dataset is sparse mainly due to one-hot encoding. The sparsity aware algorithm is more than 50 times faster than the naive version that does not take sparsity into consideration. ]]<br />
Figure 5 compares the sparsity-aware algorithm with a naive implementation on the Allstate-10K dataset. The results show that the sparsity-aware algorithm runs more than 50 times faster than the naive version, confirming its importance.<br />
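The idea of learning a default direction can be sketched as scoring each candidate split twice, once per direction for the missing instances, while visiting only the non-missing entries (an illustrative sketch with hypothetical names, not the paper's Algorithm 3):<br />

```python
# Illustrative sketch (not the paper's Algorithm 3): pick the default
# direction for one candidate split by scoring it twice, sending all
# missing-value instances left or right, visiting only non-missing entries.

def gain(G_L, H_L, G_R, H_R, lam=1.0):
    G, H = G_L + G_R, H_L + H_R
    return G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam) - G**2 / (H + lam)

def choose_default(split, present, grads, hess):
    # present: (value, index) pairs for the feature's non-missing entries.
    G, H = sum(grads), sum(hess)                  # totals over all instances
    G_L = sum(grads[i] for v, i in present if v < split)
    H_L = sum(hess[i] for v, i in present if v < split)
    G_P = sum(grads[i] for _, i in present)       # non-missing totals
    H_P = sum(hess[i] for _, i in present)
    gain_right = gain(G_L, H_L, G - G_L, H - H_L)           # missing -> right
    gain_left = gain(G_L + G - G_P, H_L + H - H_P,
                     G_P - G_L, H_P - H_L)                  # missing -> left
    return "left" if gain_left > gain_right else "right"

# Instance 2 has a missing feature value; its gradient resembles instance 1's,
# which lands on the right, so "right" is learned as the default direction.
print(choose_default(3.0, [(0.0, 0), (5.0, 1)], [1.0, -1.0, -2.0], [1.0, 1.0, 1.0]))
```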
<br />
= System Design =<br />
<br />
=== Column Blocks and Parallelization ===<br />
<br />
[[File:Figure6.png|500px| thumb|center ]]<br />
1. Feature values are sorted.<br />
2. A block contains one or more feature values.<br />
3. Instance indices are stored in blocks.<br />
4. Missing features are not stored.<br />
5. With column blocks, a parallel split finding algorithm is easy to design.<br />
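The block layout above can be sketched as a sorted, index-carrying column store (a simplified in-memory analogue of the compressed column (CSC) format the paper uses):<br />

```python
# Sketch of the column block layout: each feature column is stored once,
# pre-sorted by value, alongside each entry's instance index, and missing
# entries are simply not stored (an in-memory analogue of CSC format).

def build_column_blocks(X):
    # X: list of rows; None marks a missing feature value.
    blocks = []
    for k in range(len(X[0])):
        col = [(row[k], i) for i, row in enumerate(X) if row[k] is not None]
        col.sort()  # the sorting cost is paid once, before boosting starts
        blocks.append(col)
    return blocks

X = [[3.0, None], [1.0, 7.0], [2.0, 5.0]]
print(build_column_blocks(X))
# [[(1.0, 1), (2.0, 2), (3.0, 0)], [(5.0, 2), (7.0, 1)]]
```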
<br />
=== Cache Aware Access ===<br />
A thread pre-fetches data from non-contiguous memory into a contiguous buffer.<br />
The main thread then accumulates the gradient statistics from this contiguous buffer.<br />
<br />
=Evaluations=<br />
===System Implementation===<br />
Implemented as an open source package, XGBoost is flexible, portable and reusable. It also supports various languages and ecosystems.<br />
<br />
===Dataset and Setup===<br />
Table 2 summarizes the 4 datasets used in the experiments.<br />
[[File:t2.png|500px| thumb|center ]]<br />
<br />
<br />
The first three datasets are used in the single-machine parallel setting, run on a Dell PowerEdge R420 with two 8-core Intel Xeon processors and 64GB of memory. An AWS c3.8xlarge machine is used for the distributed and out-of-core experiments, which use the Criteo dataset. All available cores of the machines are devoted to the experiments unless otherwise specified.<br />
<br />
=== Experiments===<br />
The classification experiment uses the Higgs boson dataset to classify whether an event corresponds to the Higgs boson. The goal is to evaluate XGBoost using the exact greedy method and compare it with column-sampling XGBoost and two other well-known systems. Table 3 below summarizes the results.<br />
[[File:t3.png|500px| thumb|center ]]<br />
Overall, both XGBoost and scikit-learn achieve about 25% higher test AUC than R.gbm. In terms of time, however, XGBoost runs significantly faster than both scikit-learn and R.gbm. Note that XGBoost with column sampling performs slightly worse because the number of important features is limited. We can conclude that XGBoost using the exact greedy algorithm without column sampling works best in this scenario.<br />
<br />
The learning-to-rank experiment compares XGBoost (using the exact greedy algorithm) with pGBRT (using the approximate algorithm) to evaluate the system's performance on ranking problems.<br />
<br />
[[File:t4.png|500px| thumb|center ]]<br />
[[File:f10.png|500px| thumb|center ]]<br />
According to Table 4, XGBoost with column sampling performs almost as well as pGBRT while using only 1/5 of the time. Note that pGBRT was the best previously published system for this ranking problem. <br />
<br />
The out-of-core and distributed experiments demonstrate XGBoost's ability to handle large data at high speed.<br />
<br />
= Conclusion=<br />
The author, Tianqi Chen, proposed a scalable end-to-end tree boosting system built on cache-aware access patterns, a block structure, and other essential elements. He also introduced two novel split finding algorithms: the sparsity-aware algorithm and the weighted quantile sketch, with proofs. Compared to other popular systems, XGBoost gives state-of-the-art results across different experimental settings, which explains why it is considered one of the best and fastest algorithms for structured or tabular data.<br />
<br />
= Source =</div>C55jianghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=summary&diff=34541summary2018-03-17T17:27:04Z<p>C55jiang: /* Weighted Quantile Sketch */</p>
<hr />
<div>'''''XGBoost: A Scalable Tree Boosting System'''''<br />
<br />
== Presented by ==<br />
<br />
Jiang, Cong<br />
<br />
Song, Ziwei<br />
<br />
Ye, Zhaoshan<br />
<br />
Zhang, Wenling<br />
<br />
= Introduction =<br />
<br />
The Extreme Gradiant Boosting, also known as XGBoost, has been the leading model that dominates Kaggle, as well as other machine learning and data mining challenges. The main reason to its success is that this tree boosting system is highly scalable , which means it could be used to solve various problems with significantly less time and fewer resources. <br />
<br />
This paper is written by the father of XGBoost, Tianqi Chen. It gives us an overview on how he used algorithmic optimizations as well as some important systems to develop XGBoost. He explained it in the following manner:<br />
<br />
1. Based on gradient tree boosting algorithm, Chen makes some modifications to the regularized objective function and introduces an end-to-end tree boosting system. He also applies shrinkage and feature subsampling to prevent overfitting, <br />
<br />
2. The system supports existing split finding algorithms including exact greedy and approximate algorithms. For weighted data, Chen proposes a distributed weighted quantile sketch algorithm which is provable; while for sparse data, he develops a novel sparsity-aware algorithm which is much faster than the naive model.<br />
<br />
3. Storing the data in ''blocks'' is Chen's method to reduce sorting cost. The cache-aware block structure is highly effective for out-of-core tree learning.<br />
<br />
4. XGBoost is evaluated by designed experiments with 4 datasets. Chen compares it with adjusted setting as well as other systems to test its performance.<br />
<br />
= Tree Boosting=<br />
<br />
<br />
=== Regularized Learning Objective ===<br />
Suppose the given data set has m features and n examples. A tree ensemble model uses K additive functions to predict the output.<br />
[[File:eq1.png|center|eq1.png]]<br />
where F is the space of regression trees(CART).<br />
The author suggests the following regularized objective function to make improvements comparing to gradient boosting<br />
[[File:eq2.png|center|eq2.png]]<br />
Regression tree contains continuous scores on each leaf, which is different with decision trees we have learned in class. The following 2 figures represent how to use the decision rules in the trees to classify it into the leaves and calculate the final prediction by summing up the score in the corresponding leaves.<br />
<br />
[[File:Picture2.1.png|500px| thumb| center| Figure 0: Prediction Score of Tree Structure.]]<br />
<br />
<br />
[[File:Picture2.2.png|500px|thumb|center|Figure 1: Tree Ensemble Model. The final prediction for a given example is the sum of predictions from each tree. ]]<br />
<br />
=== Gradient Tree Boosting ===<br />
The tree ensemble model (Equation 2.) is trained in an additive manner. We need to add a new <math>f</math> to our objective.<br />
[[File:eq2.2.png|center|eq2.2.png]]<br />
After applying the second-order Taylor expansion to the above formula and removing constant terms, we can finally have<br />
[[File:eq3.png|center|eq3.png]]<br />
Define <math>I_{j} = \{i| q(X_{i}) \}</math><br />
As the instance set of leaf j. We can rewrite Equation (3) as<br />
[[File:eq4.png|center|eq4.png]]<br />
By computing the optimal weight and corresponding optimal value, we can use following equation as the scoring function which can measure the quality of a tree structure. Lower score means better tree structure.<br />
[[File:eq5.png|center|eq5.png]]<br />
The following figure shows an example how this score is calculated.<br />
[[File:Picture2.3.png|500px| thumb|center|Figure 2: Structure Score Calculation. We only need to sum up the gradient and second order gradient statistics on each leaf, then apply the scoring formula to get the quality score. ]]<br />
<br />
<br />
<br />
Normally it is impossible to enumerate all the possible tree structures. Assume <math>I_{L}</math> and <math>I_{R}</math> are the instance sets of left and right nodes after the split, and <math> I = I_{L} + I_{R}</math>. Then the loss reduction after the split is given by<br />
[[File:eq6.png|center]]<br />
<br />
=== Shrinkage and Column Subsampling ===<br />
In addition to the regularized objective, the paper also introduces two techniques to prevent over-fitting. <br />
<br />
Method 1: Shrinkage. After each step of tree boosting, we scale newly added weights by a shrinkage factor η, which reduces the influence of every individual tree and leaves the space for future trees. <br />
<br />
Method 2: Column (feature) subsampling. We try to build each individual tree with only a portion of predictors chosen randomly. This procedure encourages the variance between the trees (i.e. no same tree occurs). Implemented in an opensource packages, this method prevents over-fitting, compared with the traditional row subsampling, and speeds up the computations of parallel algorithm as well.<br />
<br />
= Split Finding Algorithms =<br />
The ways to find the best split can be realized by two main methods. <br />
<br />
Basic exact greedy algorithm: enumerate over all the possible splits on all the features, which is computationally demanding. Sort the data first, and visit the data to accumulate the gradient statistics greedily. <br />
<br />
[[File:exact_greedy.png|500px| thumb|center ]]<br />
[[File:appro_greedy.png|500px| thumb|center]]<br />
<br />
Approximate algorithm: To achieve an effective gradient tree boosting model, The author introduces an approximate algorithm. The algorithm first proposes candidate splitting points by the percentiles of feature distribution, maps the continuous features into buckets split and then finds the best solution among the proposals by the aggregated statistics. <br />
The algorithm can be varied by the time the proposal is given. <br />
<br />
The global variant: all candidate splits are proposed at the initial phase of tree construction and are used for all split finding. It requires fewer proposal steps. <br />
<br />
The local variant: the candidate splits are refined after each split. It requires fewer candidate points, so it is potentially more appropriate for deeper trees. <br />
Methods can be chosen according to users’ needs. <br />
<br />
[[File:auc.png|500px| thumb|center]]<br />
<br />
=== Weighted Quantile Sketch ===<br />
Motivation: One key step in the approximate algorithm is to propose candidate split points. Usually, percentiles of feature are used to make candidates distribute evenly on the data.<br />
<br />
let multi-set<br />
<math><br />
\mathbf{D}_k = { (\mathbf{x}_{1k}, \mathbf{h}_1), (\mathbf{x}_{2k}, \mathbf{h}_2) , ... (\mathbf{x}_{nk}, \mathbf{h}_n) }<br />
</math><br />
represent the k-th feature values and second order gradient statistics of each training instances. <br />
The define a rank function <math><br />
\mathbf{r}_k : \mathbf{R} \rightarrow [0, \infty)<br />
</math>:<br />
[[File:For4.png|500px| thumb|center ]]<br />
which represents the proportion of instances whose feature value <math> k </math> is smaller than <math> z </math>. The goal is to find candidate split points <math> ({\mathbf{s}_{k1}, \mathbf{s}_{k2}, ... \mathbf{s}_{kl} }) </math><br />
[[File:For5.png|500px| thumb|center ]]<br />
Here <math> \epson </math> is an approximation factor. This means that there is roughly <math> </math> candidate points. <br />
Here each data point is weighted by <math> h_i </math>, since rewriting Eq (3)<br />
[[File:For3.png|500px| thumb|center ]]<br />
we have <br />
[[File:For6.png|500px| thumb|center ]]<br />
which is exactly weighted squared loss with labels <math> g_i/ h_i </math><br />
<br />
=== Sparsity--aware Split Finding ===<br />
<br />
In real world problem, it is quite common for the input <math> X </math> to be sparse. The reasons that causes for sparsity are following: 1. presence of missing values in the data; 2.frequent zero entries in the statistics; and, 3.artifacts of feature engineering such as one-hot encoding.<br />
<br />
[[File:Figure 4.png|500px| thumb|center | Figure 4. Tree structure with default directions. An example will be classified into the default direction when the feature needed for the split is missing]]<br />
This paper propose to add a default direction in each tree node, which is shown in Figure 4. When a value is missing in the sparse matrix <math> X </math>, the instance is classified into the default direction. There are two choices of default direction in each branch. The optimal default direction is learn from data. Algorithm 3 shows Sparsity aware Split Finding. <br />
<br />
[[File:Algorithm 3.png|500px| thumb|center ]]<br />
The important improvement is to only visit the non-missing entries <math> I_k</math>. The presented algorithm treats the non-presence as a missing value and learns the best direction to handle missing values.<br />
XGBoost handles all sparsity patterns in a unified way. More importantly, this paper exploits the sparsity to make computation complexity linear to number of non-missing entries in the input. <br />
[[File:Figure 5.png|500px| thumb|center | Figure 5. Impact of the sparsity aware algorithm on Allstate-10K. The dataset is sparse mainly due to one-hot encoding. The sparsity aware algorithm is more than 50 times faster than the naive version that does not take sparsity into consideration. ]]<br />
Figure 5 shows the comparison of sparsity aware and a naive implementation on an Allstate-10K dataset. The results shows that sparsity aware algorithm runs 50 times faster than the naive version. This confirms the importance of the sparsity aware algorithm.<br />
<br />
= System Design =<br />
<br />
=== Column Blocks and Parallelization ===<br />
<br />
[[File:Figure6.png|500px| thumb|center ]]<br />
1. Feature values are sorted.<br />
2. A block contains one or more feature values.<br />
3. Instance indices are stored in blocks.<br />
4. Missing features are not stored.<br />
5. With column blocks, a parallel split finding algorithm is easy to design.<br />
<br />
=== Cache Aware Access ===<br />
A thread pre-fetches data from non-continuous memory into a continuous buffer.<br />
The main thread accumulates gradients statistics in the continuous buffer.<br />
<br />
=Evaluations=<br />
===System Implementation===<br />
Implemented as an open source package, XGBoost is flexible, portable and reusable. It also supports various languages and ecosystems.<br />
<br />
===Dataset and Setup===<br />
Table 2 summarizes the 4 datasets used in the experiments.<br />
[[File:t2.png|500px| thumb|center ]]<br />
<br />
<br />
The first three datasets are used for the single machine parallel setting, which is run on a Dell PowerEdge R420 with two 8-core Intel Xeon and memory 64GB. Meanwhile, AWS c3.8xlarge machine is used for experiment in distributed and the out-of-core setting, which uses the criteo dataset. All available cores in the machines are devoted into the experiments, unless otherwise specified.<br />
<br />
=== Experiments===<br />
The classification experiment uses Higgs Boson dataset to classify whether an event corresponds to the Higgs boson. The goal is to evaluate XGBoost using Exact Greedy Methods and compare it with column sampling XGBoost and two other well-known systems. Table 3 below summarizes the results.<br />
[[File:t3.png|500px| thumb|center ]]<br />
Overall, both XGBoost and scikit-learn performs better than R.gbm by 25% of Test AUC. In terms of time, however, XGBoost runs significantly faster than both scikit-learn and R.gbm. Note that XGBoost with column sampling performs slightly worse because the number of key features is limited. We could conclude that XGBoost using the exact greedy algorithm without column sampling works the best under this scenario.<br />
<br />
The learning to rank experiment compares XGBoost(using exact greedy algorithm) with pGBRT(using approximate algorithm) in order to evaluate the system's performance on rank problem.<br />
<br />
[[File:t4.png|500px| thumb|center ]]<br />
[[File:f10.png|500px| thumb|center ]]<br />
According to Table 4, XGBoost with column sampling performs almost as well as pGBRT, while using only 1/5 of the time. Note that pGBRT is the best previously published system to solve this rank problem. <br />
<br />
The out-of-score and distributed experiments demonstrate XGBoost's ability to handle large data with a fast speed.<br />
<br />
= Conclusion=<br />
The writer, Tianqi Chen, proposed a scalable end-to-end tree boosting system with cache access patterns, block structure and other essential elements. He also introduced two novel split finding algorithms, one is sparsity-aware algorithm and the other is weighted quantile sketch with proofs. Compared to other popular systems, XGBoost gives state-of-the-art results in different experiment settings, which explains why it it considered one of the best and fastest algorithm for structured or tabular data.<br />
<br />
= Source =</div>C55jianghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=summary&diff=34540summary2018-03-17T17:23:05Z<p>C55jiang: /* Weighted Quantile Sketch */</p>
<hr />
<div>'''''XGBoost: A Scalable Tree Boosting System'''''<br />
<br />
== Presented by ==<br />
<br />
Jiang, Cong<br />
<br />
Song, Ziwei<br />
<br />
Ye, Zhaoshan<br />
<br />
Zhang, Wenling<br />
<br />
= Introduction =<br />
<br />
The Extreme Gradiant Boosting, also known as XGBoost, has been the leading model that dominates Kaggle, as well as other machine learning and data mining challenges. The main reason to its success is that this tree boosting system is highly scalable , which means it could be used to solve various problems with significantly less time and fewer resources. <br />
<br />
This paper is written by the father of XGBoost, Tianqi Chen. It gives us an overview on how he used algorithmic optimizations as well as some important systems to develop XGBoost. He explained it in the following manner:<br />
<br />
1. Based on gradient tree boosting algorithm, Chen makes some modifications to the regularized objective function and introduces an end-to-end tree boosting system. He also applies shrinkage and feature subsampling to prevent overfitting, <br />
<br />
2. The system supports existing split finding algorithms including exact greedy and approximate algorithms. For weighted data, Chen proposes a distributed weighted quantile sketch algorithm which is provable; while for sparse data, he develops a novel sparsity-aware algorithm which is much faster than the naive model.<br />
<br />
3. Storing the data in ''blocks'' is Chen's method to reduce sorting cost. The cache-aware block structure is highly effective for out-of-core tree learning.<br />
<br />
4. XGBoost is evaluated by designed experiments with 4 datasets. Chen compares it with adjusted setting as well as other systems to test its performance.<br />
<br />
= Tree Boosting=<br />
<br />
<br />
=== Regularized Learning Objective ===<br />
Suppose the given data set has m features and n examples. A tree ensemble model uses K additive functions to predict the output.<br />
[[File:eq1.png|center|eq1.png]]<br />
where F is the space of regression trees(CART).<br />
The author suggests the following regularized objective function to make improvements comparing to gradient boosting<br />
[[File:eq2.png|center|eq2.png]]<br />
Regression tree contains continuous scores on each leaf, which is different with decision trees we have learned in class. The following 2 figures represent how to use the decision rules in the trees to classify it into the leaves and calculate the final prediction by summing up the score in the corresponding leaves.<br />
<br />
[[File:Picture2.1.png|500px| thumb| center| Figure 0: Prediction Score of Tree Structure.]]<br />
<br />
<br />
[[File:Picture2.2.png|500px|thumb|center|Figure 1: Tree Ensemble Model. The final prediction for a given example is the sum of predictions from each tree. ]]<br />
<br />
=== Gradient Tree Boosting ===<br />
The tree ensemble model (Equation 2.) is trained in an additive manner. We need to add a new <math>f</math> to our objective.<br />
[[File:eq2.2.png|center|eq2.2.png]]<br />
After applying the second-order Taylor expansion to the above formula and removing constant terms, we can finally have<br />
[[File:eq3.png|center|eq3.png]]<br />
Define <math>I_{j} = \{i| q(X_{i}) \}</math><br />
As the instance set of leaf j. We can rewrite Equation (3) as<br />
[[File:eq4.png|center|eq4.png]]<br />
By computing the optimal weight and corresponding optimal value, we can use following equation as the scoring function which can measure the quality of a tree structure. Lower score means better tree structure.<br />
[[File:eq5.png|center|eq5.png]]<br />
The following figure shows an example how this score is calculated.<br />
[[File:Picture2.3.png|500px| thumb|center|Figure 2: Structure Score Calculation. We only need to sum up the gradient and second order gradient statistics on each leaf, then apply the scoring formula to get the quality score. ]]<br />
<br />
<br />
<br />
Normally it is impossible to enumerate all the possible tree structures. Assume <math>I_{L}</math> and <math>I_{R}</math> are the instance sets of left and right nodes after the split, and <math> I = I_{L} + I_{R}</math>. Then the loss reduction after the split is given by<br />
[[File:eq6.png|center]]<br />
<br />
=== Shrinkage and Column Subsampling ===<br />
In addition to the regularized objective, the paper also introduces two techniques to prevent over-fitting. <br />
<br />
Method 1: Shrinkage. After each step of tree boosting, we scale newly added weights by a shrinkage factor η, which reduces the influence of every individual tree and leaves the space for future trees. <br />
<br />
Method 2: Column (feature) subsampling. We try to build each individual tree with only a portion of predictors chosen randomly. This procedure encourages the variance between the trees (i.e. no same tree occurs). Implemented in an opensource packages, this method prevents over-fitting, compared with the traditional row subsampling, and speeds up the computations of parallel algorithm as well.<br />
<br />
= Split Finding Algorithms =<br />
The ways to find the best split can be realized by two main methods. <br />
<br />
Basic exact greedy algorithm: enumerate all possible splits on all features, which is computationally demanding. The data is first sorted by feature value and then visited in order to accumulate the gradient statistics greedily. <br />
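A minimal sketch of this exact greedy search for a single feature, assuming the per-instance gradients g and hessians h have already been computed (illustrative names, not the actual implementation):<br />

```python
# Exact greedy split search on one feature: sort by value, sweep left to
# right accumulating left-side statistics, derive the right side by
# subtraction, and score each boundary with the gain formula.
def best_split(x, g, h, lam=1.0):
    order = sorted(range(len(x)), key=lambda i: x[i])
    G, H = sum(g), sum(h)                    # totals over all instances
    GL = HL = 0.0
    best_gain, best_thresh = 0.0, None
    for pos in range(len(order) - 1):
        i = order[pos]
        GL += g[i]; HL += h[i]               # accumulate left statistics
        GR, HR = G - GL, H - HL              # right side by subtraction
        if x[order[pos]] == x[order[pos + 1]]:
            continue                         # cannot split between equal values
        gain = 0.5 * (GL*GL/(HL+lam) + GR*GR/(HR+lam) - G*G/(H+lam))
        if gain > best_gain:
            best_gain = gain
            best_thresh = (x[order[pos]] + x[order[pos + 1]]) / 2
    return best_gain, best_thresh

x = [1.0, 2.0, 10.0, 11.0]
g = [1.0, 1.0, -1.0, -1.0]   # gradients suggest two clusters
h = [1.0, 1.0, 1.0, 1.0]
print(best_split(x, g, h))   # threshold 6.0, between the two clusters
```

The subtraction trick (GR = G − GL) is what makes one sorted pass sufficient; the cost per feature is linear after sorting.<br />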
<br />
[[File:exact_greedy.png|500px| thumb|center ]]<br />
[[File:appro_greedy.png|500px| thumb|center]]<br />
<br />
Approximate algorithm: when the exact greedy algorithm is too demanding, the author introduces an approximate algorithm. It first proposes candidate splitting points according to percentiles of the feature distribution, maps the continuous features into buckets split by these candidate points, and then finds the best solution among the proposals based on the aggregated statistics. <br />
The algorithm has two variants, which differ in when the proposal is given. <br />
<br />
The global variant: all candidate splits are proposed at the initial phase of tree construction and are used for all split finding. It requires fewer proposal steps. <br />
<br />
The local variant: the candidate splits are refined after each split. It requires fewer candidate points, so it is potentially more appropriate for deeper trees. <br />
Methods can be chosen according to users’ needs. <br />
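The proposal step of the (unweighted) global variant can be sketched as follows; the function names are made up for illustration:<br />

```python
# Sketch of the proposal step of the approximate algorithm: candidate
# split points are taken at evenly spaced percentiles of a feature, and
# continuous values are then mapped into the resulting buckets. This is
# the unweighted version; the weighted quantile sketch below refines it.
def propose_candidates(values, num_candidates):
    s = sorted(values)
    n = len(s)
    # one candidate at each interior percentile boundary
    return [s[(n * k) // num_candidates] for k in range(1, num_candidates)]

def bucketize(value, candidates):
    """Index of the bucket that `value` falls into."""
    return sum(value > c for c in candidates)

values = list(range(100))          # a toy feature column: 0..99
cands = propose_candidates(values, num_candidates=4)
print(cands)                       # -> [25, 50, 75], the quartile boundaries
print(bucketize(60, cands))        # -> 2
```

In the global variant this proposal runs once per tree; in the local variant it reruns on the instances of each node after every split.<br />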
<br />
[[File:auc.png|500px| thumb|center]]<br />
<br />
=== Weighted Quantile Sketch ===<br />
Motivation: one key step in the approximate algorithm is proposing candidate split points. Usually, percentiles of a feature are used so that candidates are distributed evenly over the data.<br />
<br />
Let the multi-set<br />
<math><br />
\mathbf{D}_k = \{ (x_{1k}, h_1), (x_{2k}, h_2), \ldots, (x_{nk}, h_n) \}<br />
</math><br />
represent the <math>k</math>-th feature values and second-order gradient statistics of the training instances. <br />
Then define a rank function <math><br />
\mathbf{r}_k : \mathbb{R} \rightarrow [0, \infty)<br />
</math>:<br />
[[File:For4.png|500px| thumb|center ]]<br />
which represents the proportion of instances, weighted by <math>h</math>, whose value on the <math>k</math>-th feature is smaller than <math>z</math>. The goal is to find candidate split points <math> \{\mathbf{s}_{k1}, \mathbf{s}_{k2}, \ldots, \mathbf{s}_{kl}\} </math> such that<br />
[[File:For5.png|500px| thumb|center ]]<br />
<br />
<br />
[[File:For3.png|500px| thumb|center ]]<br />
<br />
[[File:For6.png|500px| thumb|center ]]<br />
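A minimal sketch of the rank function above, with each instance weighted by its second-order gradient <math>h_i</math> (illustrative code, not the paper's sketch data structure):<br />

```python
# Sketch of the rank function r_k used by the weighted quantile sketch:
# each instance is weighted by its second-order gradient h_i, so r_k(z)
# is the h-weighted fraction of instances with feature value below z.
def weighted_rank(pairs, z):
    """pairs: list of (feature_value, h) tuples for feature k."""
    total = sum(h for _, h in pairs)
    below = sum(h for x, h in pairs if x < z)
    return below / total

D_k = [(1.0, 0.1), (2.0, 0.4), (3.0, 0.4), (4.0, 0.1)]
print(weighted_rank(D_k, 2.5))   # h-mass strictly below z = 2.5, i.e. 0.5
```

Candidates are then chosen so that the ranks of adjacent candidates differ by less than the approximation factor ε, i.e. heavily weighted regions get more candidate points.<br />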
<br />
=== Sparsity-aware Split Finding ===<br />
<br />
In real-world problems, it is quite common for the input <math> X </math> to be sparse. Common causes of sparsity include: 1. the presence of missing values in the data; 2. frequent zero entries in the statistics; and 3. artifacts of feature engineering such as one-hot encoding.<br />
<br />
[[File:Figure 4.png|500px| thumb|center | Figure 4. Tree structure with default directions. An example will be classified into the default direction when the feature needed for the split is missing]]<br />
The paper proposes adding a default direction to each tree node, as shown in Figure 4. When a value is missing in the sparse matrix <math> X </math>, the instance is classified into the default direction. There are two choices of default direction in each branch, and the optimal default direction is learned from the data. Algorithm 3 shows the sparsity-aware split finding procedure. <br />
<br />
[[File:Algorithm 3.png|500px| thumb|center ]]<br />
The key improvement is to visit only the non-missing entries <math> I_k</math>. The presented algorithm treats non-presence as a missing value and learns the best direction to handle missing values.<br />
XGBoost handles all sparsity patterns in a unified way. More importantly, the paper exploits the sparsity to make the computational complexity linear in the number of non-missing entries in the input. <br />
[[File:Figure 5.png|500px| thumb|center | Figure 5. Impact of the sparsity aware algorithm on Allstate-10K. The dataset is sparse mainly due to one-hot encoding. The sparsity aware algorithm is more than 50 times faster than the naive version that does not take sparsity into consideration. ]]<br />
Figure 5 compares the sparsity-aware algorithm with a naive implementation on the Allstate-10K dataset. The results show that the sparsity-aware algorithm runs more than 50 times faster than the naive version, which confirms its importance.<br />
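The learning of the default direction can be sketched as follows: the statistics of the missing-value instances are added entirely to the left or entirely to the right child, and whichever direction yields the larger gain becomes the default (a simplified illustration, not Algorithm 3 itself):<br />

```python
# Sketch of the default-direction idea: gradient statistics of the
# missing-value instances (G_miss, H_miss) are sent entirely left or
# entirely right, and the direction with the larger gain is learned
# as the default for that node.
def gain(GL, HL, GR, HR, lam=1.0):
    G, H = GL + GR, HL + HR
    return 0.5 * (GL*GL/(HL+lam) + GR*GR/(HR+lam) - G*G/(H+lam))

def choose_default(GL, HL, GR, HR, G_miss, H_miss):
    left  = gain(GL + G_miss, HL + H_miss, GR, HR)
    right = gain(GL, HL, GR + G_miss, HR + H_miss)
    return ("left", left) if left >= right else ("right", right)

# The non-missing right branch has negative gradients, and so do the
# missing instances, so sending them right gives the larger gain.
print(choose_default(GL=2.0, HL=2.0, GR=-2.0, HR=2.0, G_miss=-1.0, H_miss=1.0))
```

Because only the non-missing entries are scanned and the missing mass is handled in aggregate, the cost stays linear in the non-missing entries.<br />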
<br />
= System Design =<br />
<br />
=== Column Blocks and Parallelization ===<br />
<br />
[[File:Figure6.png|500px| thumb|center ]]<br />
1. Feature values are sorted.<br />
2. A block contains one or more feature values.<br />
3. Instance indices are stored in blocks.<br />
4. Missing features are not stored.<br />
5. With column blocks, a parallel split finding algorithm is easy to design.<br />
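The points above can be illustrated with a simplified sketch of the column-block idea (illustrative only, not XGBoost's compressed column format):<br />

```python
# Sketch of the column-block layout: each feature is stored once as a
# list of (feature_value, instance_index) pairs sorted by value, so a
# linear scan over a block visits instances in sorted order without
# re-sorting at every node. Missing entries are simply absent.
def build_column_blocks(rows):
    """rows: list of dicts mapping feature name -> value (absent keys
    are missing values). Returns {feature: [(value, row_index), ...]}."""
    blocks = {}
    for idx, row in enumerate(rows):
        for feat, val in row.items():
            blocks.setdefault(feat, []).append((val, idx))
    for feat in blocks:
        blocks[feat].sort()              # sort once, reuse for every split
    return blocks

rows = [{"age": 30, "income": 5.0},
        {"age": 25},                     # income missing: not stored
        {"age": 40, "income": 2.0}]
blocks = build_column_blocks(rows)
print(blocks["age"])      # instances in age order
print(blocks["income"])   # only the two non-missing entries
```

Since each feature's block is independent, split finding over different features can be parallelized across threads.<br />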
<br />
=== Cache Aware Access ===<br />
A thread pre-fetches data from non-contiguous memory into a contiguous buffer.<br />
The main thread then accumulates the gradient statistics in the contiguous buffer.<br />
<br />
=Evaluations=<br />
===System Implementation===<br />
Implemented as an open source package, XGBoost is flexible, portable and reusable. It also supports various languages and ecosystems.<br />
<br />
===Dataset and Setup===<br />
Table 2 summarizes the 4 datasets used in the experiments.<br />
[[File:t2.png|500px| thumb|center ]]<br />
<br />
<br />
The first three datasets are used in the single-machine parallel setting, which runs on a Dell PowerEdge R420 with two 8-core Intel Xeon processors and 64GB of memory. Meanwhile, AWS c3.8xlarge machines are used for the distributed and out-of-core experiments, which use the criteo dataset. All available cores on the machines are devoted to the experiments, unless otherwise specified.<br />
<br />
=== Experiments===<br />
The classification experiment uses the Higgs boson dataset to classify whether an event corresponds to the Higgs boson. The goal is to evaluate XGBoost using the exact greedy method and compare it with column-sampling XGBoost and two other well-known systems. Table 3 below summarizes the results.<br />
[[File:t3.png|500px| thumb|center ]]<br />
Overall, both XGBoost and scikit-learn perform better than R.gbm by about 25% in test AUC. In terms of time, however, XGBoost runs significantly faster than both scikit-learn and R.gbm. Note that XGBoost with column sampling performs slightly worse because the number of important features is limited. We can conclude that XGBoost using the exact greedy algorithm without column sampling works best in this scenario.<br />
<br />
The learning-to-rank experiment compares XGBoost (using the exact greedy algorithm) with pGBRT (using the approximate algorithm) in order to evaluate the system's performance on ranking problems.<br />
<br />
[[File:t4.png|500px| thumb|center ]]<br />
[[File:f10.png|500px| thumb|center ]]<br />
According to Table 4, XGBoost with column sampling performs almost as well as pGBRT, while using only 1/5 of the time. Note that pGBRT was the best previously published system for this ranking problem. <br />
<br />
The out-of-core and distributed experiments demonstrate XGBoost's ability to handle large data at high speed.<br />
<br />
= Conclusion=<br />
The author, Tianqi Chen, proposed a scalable end-to-end tree boosting system with cache-aware access patterns, a block structure and other essential elements. He also introduced two novel split finding algorithms: a sparsity-aware algorithm and a weighted quantile sketch with provable guarantees. Compared to other popular systems, XGBoost gives state-of-the-art results in different experiment settings, which explains why it is considered one of the best and fastest algorithms for structured or tabular data.<br />
<br />
= Source =</div>
<hr />
<div>'''''XGBoost: A Scalable Tree Boosting System'''''<br />
<br />
== Presented by ==<br />
<br />
Jiang, Cong<br />
<br />
Song, Ziwei<br />
<br />
Ye, Zhaoshan<br />
<br />
Zhang, Wenling<br />
<br />
= Introduction =<br />
<br />
The Extreme Gradiant Boosting, also known as XGBoost, has been the leading model that dominates Kaggle, as well as other machine learning and data mining challenges. The main reason to its success is that this tree boosting system is highly scalable , which means it could be used to solve various problems with significantly less time and fewer resources. <br />
<br />
This paper is written by the father of XGBoost, Tianqi Chen. It gives us an overview on how he used algorithmic optimizations as well as some important systems to develop XGBoost. He explained it in the following manner:<br />
<br />
1. Based on gradient tree boosting algorithm, Chen makes some modifications to the regularized objective function and introduces an end-to-end tree boosting system. He also applies shrinkage and feature subsampling to prevent overfitting, <br />
<br />
2. The system supports existing split finding algorithms including exact greedy and approximate algorithms. For weighted data, Chen proposes a distributed weighted quantile sketch algorithm which is provable; while for sparse data, he develops a novel sparsity-aware algorithm which is much faster than the naive model.<br />
<br />
3. Storing the data in ''blocks'' is Chen's method to reduce sorting cost. The cache-aware block structure is highly effective for out-of-core tree learning.<br />
<br />
4. XGBoost is evaluated by designed experiments with 4 datasets. Chen compares it with adjusted setting as well as other systems to test its performance.<br />
<br />
= Tree Boosting=<br />
<br />
<br />
=== Regularized Learning Objective ===<br />
Suppose the given data set has m features and n examples. A tree ensemble model uses K additive functions to predict the output.<br />
[[File:eq1.png|center|eq1.png]]<br />
where F is the space of regression trees(CART).<br />
The author suggests the following regularized objective function to make improvements comparing to gradient boosting<br />
[[File:eq2.png|center|eq2.png]]<br />
Regression tree contains continuous scores on each leaf, which is different with decision trees we have learned in class. The following 2 figures represent how to use the decision rules in the trees to classify it into the leaves and calculate the final prediction by summing up the score in the corresponding leaves.<br />
<br />
[[File:Picture2.1.png|500px| thumb| center| Figure 0: Prediction Score of Tree Structure.]]<br />
<br />
<br />
[[File:Picture2.2.png|500px|thumb|center|Figure 1: Tree Ensemble Model. The final prediction for a given example is the sum of predictions from each tree. ]]<br />
<br />
=== Gradient Tree Boosting ===<br />
The tree ensemble model (Equation 2.) is trained in an additive manner. We need to add a new <math>f</math> to our objective.<br />
[[File:eq2.2.png|center|eq2.2.png]]<br />
After applying the second-order Taylor expansion to the above formula and removing constant terms, we can finally have<br />
[[File:eq3.png|center|eq3.png]]<br />
Define <math>I_{j} = \{i| q(X_{i}) \}</math><br />
As the instance set of leaf j. We can rewrite Equation (3) as<br />
[[File:eq4.png|center|eq4.png]]<br />
By computing the optimal weight and corresponding optimal value, we can use following equation as the scoring function which can measure the quality of a tree structure. Lower score means better tree structure.<br />
[[File:eq5.png|center|eq5.png]]<br />
The following figure shows an example how this score is calculated.<br />
[[File:Picture2.3.png|500px| thumb|center|Figure 2: Structure Score Calculation. We only need to sum up the gradient and second order gradient statistics on each leaf, then apply the scoring formula to get the quality score. ]]<br />
<br />
<br />
<br />
Normally it is impossible to enumerate all the possible tree structures. Assume <math>I_{L}</math> and <math>I_{R}</math> are the instance sets of left and right nodes after the split, and <math> I = I_{L} + I_{R}</math>. Then the loss reduction after the split is given by<br />
[[File:eq6.png|center]]<br />
<br />
=== Shrinkage and Column Subsampling ===<br />
In addition to the regularized objective, the paper also introduces two techniques to prevent over-fitting. <br />
<br />
Method 1: Shrinkage. After each step of tree boosting, we scale newly added weights by a shrinkage factor η, which reduces the influence of every individual tree and leaves the space for future trees. <br />
<br />
Method 2: Column (feature) subsampling. We try to build each individual tree with only a portion of predictors chosen randomly. This procedure encourages the variance between the trees (i.e. no same tree occurs). Implemented in an opensource packages, this method prevents over-fitting, compared with the traditional row subsampling, and speeds up the computations of parallel algorithm as well.<br />
<br />
= Split Finding Algorithms =<br />
The ways to find the best split can be realized by two main methods. <br />
<br />
Basic exact greedy algorithm: enumerate over all the possible splits on all the features, which is computationally demanding. Sort the data first, and visit the data to accumulate the gradient statistics greedily. <br />
<br />
[[File:exact_greedy.png|500px| thumb|center ]]<br />
[[File:appro_greedy.png|500px| thumb|center]]<br />
<br />
Approximate algorithm: To achieve an effective gradient tree boosting model, The author introduces an approximate algorithm. The algorithm first proposes candidate splitting points by the percentiles of feature distribution, maps the continuous features into buckets split and then finds the best solution among the proposals by the aggregated statistics. <br />
The algorithm can be varied by the time the proposal is given. <br />
<br />
The global variant: all candidate splits are proposed at the initial phase of tree construction and are used for all split finding. It requires fewer proposal steps. <br />
<br />
The local variant: the candidate splits are refined after each split. It requires fewer candidate points, so it is potentially more appropriate for deeper trees. <br />
Methods can be chosen according to users’ needs. <br />
<br />
[[File:auc.png|500px| thumb|center]]<br />
<br />
=== Weighted Quantile Sketch ===<br />
Motivation: One key step in the approximate algorithm is to propose candidate split points. Usually, percentiles of feature are used to make candidates distribute evenly on the data.<br />
<br />
let multi-set<br />
<math><br />
\mathbf{D}_k = { (\mathbf{x}_{1k}, \mathbf{h}_1), (\mathbf{x}_{2k}, \mathbf{h}_2) , ... (\mathbf{x}_{nk}, \mathbf{h}_n) }<br />
</math><br />
represent the k-th feature values and second order gradient statistics of each training instances. <br />
The define a rank function <math><br />
\mathbf{r}_k : \mathbf{R} \rightarrow [0, \infty)<br />
</math>:<br />
<br />
<br />
<br />
[[File:For4.png|500px| thumb|center ]]<br />
<br />
[[File:For5.png|500px| thumb|center ]]<br />
<br />
<br />
[[File:For3.png|500px| thumb|center ]]<br />
<br />
[[File:For6.png|500px| thumb|center ]]<br />
<br />
=== Sparsity--aware Split Finding ===<br />
<br />
In real world problem, it is quite common for the input <math> X </math> to be sparse. The reasons that causes for sparsity are following: 1. presence of missing values in the data; 2.frequent zero entries in the statistics; and, 3.artifacts of feature engineering such as one-hot encoding.<br />
<br />
[[File:Figure 4.png|500px| thumb|center | Figure 4. Tree structure with default directions. An example will be classified into the default direction when the feature needed for the split is missing]]<br />
This paper propose to add a default direction in each tree node, which is shown in Figure 4. When a value is missing in the sparse matrix <math> X </math>, the instance is classified into the default direction. There are two choices of default direction in each branch. The optimal default direction is learn from data. Algorithm 3 shows Sparsity aware Split Finding. <br />
<br />
[[File:Algorithm 3.png|500px| thumb|center ]]<br />
The important improvement is to only visit the non-missing entries <math> I_k</math>. The presented algorithm treats the non-presence as a missing value and learns the best direction to handle missing values.<br />
XGBoost handles all sparsity patterns in a unified way. More importantly, this paper exploits the sparsity to make computation complexity linear to number of non-missing entries in the input. <br />
[[File:Figure 5.png|500px| thumb|center | Figure 5. Impact of the sparsity aware algorithm on Allstate-10K. The dataset is sparse mainly due to one-hot encoding. The sparsity aware algorithm is more than 50 times faster than the naive version that does not take sparsity into consideration. ]]<br />
Figure 5 shows the comparison of sparsity aware and a naive implementation on an Allstate-10K dataset. The results shows that sparsity aware algorithm runs 50 times faster than the naive version. This confirms the importance of the sparsity aware algorithm.<br />
<br />
= System Design =<br />
<br />
=== Column Blocks and Parallelization ===<br />
<br />
[[File:Figure6.png|500px| thumb|center ]]<br />
1. Feature values are sorted.<br />
2. A block contains one or more feature values.<br />
3. Instance indices are stored in blocks.<br />
4. Missing features are not stored.<br />
5. With column blocks, a parallel split finding algorithm is easy to design.<br />
<br />
=== Cache Aware Access ===<br />
A thread pre-fetches data from non-continuous memory into a continuous buffer.<br />
The main thread accumulates gradients statistics in the continuous buffer.<br />
<br />
=Evaluations=<br />
===System Implementation===<br />
Implemented as an open source package, XGBoost is flexible, portable and reusable. It also supports various languages and ecosystems.<br />
<br />
===Dataset and Setup===<br />
Table 2 summarizes the 4 datasets used in the experiments.<br />
[[File:t2.png|500px| thumb|center ]]<br />
<br />
<br />
The first three datasets are used for the single machine parallel setting, which is run on a Dell PowerEdge R420 with two 8-core Intel Xeon and memory 64GB. Meanwhile, AWS c3.8xlarge machine is used for experiment in distributed and the out-of-core setting, which uses the criteo dataset. All available cores in the machines are devoted into the experiments, unless otherwise specified.<br />
<br />
=== Experiments===<br />
The classification experiment uses Higgs Boson dataset to classify whether an event corresponds to the Higgs boson. The goal is to evaluate XGBoost using Exact Greedy Methods and compare it with column sampling XGBoost and two other well-known systems. Table 3 below summarizes the results.<br />
[[File:t3.png|500px| thumb|center ]]<br />
Overall, both XGBoost and scikit-learn performs better than R.gbm by 25% of Test AUC. In terms of time, however, XGBoost runs significantly faster than both scikit-learn and R.gbm. Note that XGBoost with column sampling performs slightly worse because the number of key features is limited. We could conclude that XGBoost using the exact greedy algorithm without column sampling works the best under this scenario.<br />
<br />
The learning-to-rank experiment compares XGBoost (using the exact greedy algorithm) with pGBRT (using the approximate algorithm) in order to evaluate the system's performance on ranking problems.<br />
<br />
[[File:t4.png|500px| thumb|center ]]<br />
[[File:f10.png|500px| thumb|center ]]<br />
According to Table 4, XGBoost with column sampling performs almost as well as pGBRT while using only about 1/5 of the time. Note that pGBRT was the best previously published system for this ranking task. <br />
<br />
The out-of-core and distributed experiments demonstrate XGBoost's ability to handle large datasets at high speed.<br />
<br />
= Conclusion=<br />
The author, Tianqi Chen, proposed a scalable end-to-end tree boosting system built on cache-aware access patterns, a block structure, and other essential elements. He also introduced two novel split finding techniques: a sparsity-aware algorithm and a weighted quantile sketch with provable guarantees. Compared to other popular systems, XGBoost gives state-of-the-art results across different experimental settings, which explains why it is considered one of the best and fastest algorithms for structured or tabular data.<br />
<br />
= Source =</div>
<hr />
<div>'''''XGBoost: A Scalable Tree Boosting System'''''<br />
<br />
== Presented by ==<br />
<br />
Jiang, Cong<br />
<br />
Song, Ziwei<br />
<br />
Ye, Zhaoshan<br />
<br />
Zhang, Wenling<br />
<br />
= Introduction =<br />
<br />
The Extreme Gradiant Boosting, also known as XGBoost, has been the leading model that dominates Kaggle, as well as other machine learning and data mining challenges. The main reason to its success is that this tree boosting system is highly scalable , which means it could be used to solve various problems with significantly less time and fewer resources. <br />
<br />
This paper is written by the father of XGBoost, Tianqi Chen. It gives us an overview on how he used algorithmic optimizations as well as some important systems to develop XGBoost. He explained it in the following manner:<br />
<br />
1. Based on gradient tree boosting algorithm, Chen makes some modifications to the regularized objective function and introduces an end-to-end tree boosting system. He also applies shrinkage and feature subsampling to prevent overfitting, <br />
<br />
2. The system supports existing split finding algorithms including exact greedy and approximate algorithms. For weighted data, Chen proposes a distributed weighted quantile sketch algorithm which is provable; while for sparse data, he develops a novel sparsity-aware algorithm which is much faster than the naive model.<br />
<br />
3. Storing the data in ''blocks'' is Chen's method to reduce sorting cost. The cache-aware block structure is highly effective for out-of-core tree learning.<br />
<br />
4. XGBoost is evaluated by designed experiments with 4 datasets. Chen compares it with adjusted setting as well as other systems to test its performance.<br />
<br />
= Tree Boosting=<br />
<br />
<br />
=== Regularized Learning Objective ===<br />
Suppose the given data set has m features and n examples. A tree ensemble model uses K additive functions to predict the output.<br />
[[File:eq1.png|center|eq1.png]]<br />
where F is the space of regression trees(CART).<br />
The author suggests the following regularized objective function to make improvements comparing to gradient boosting<br />
[[File:eq2.png|center|eq2.png]]<br />
Regression tree contains continuous scores on each leaf, which is different with decision trees we have learned in class. The following 2 figures represent how to use the decision rules in the trees to classify it into the leaves and calculate the final prediction by summing up the score in the corresponding leaves.<br />
<br />
[[File:Picture2.1.png|500px| thumb| center| Figure 0: Prediction Score of Tree Structure.]]<br />
<br />
<br />
[[File:Picture2.2.png|500px|thumb|center|Figure 1: Tree Ensemble Model. The final prediction for a given example is the sum of predictions from each tree. ]]<br />
<br />
=== Gradient Tree Boosting ===<br />
The tree ensemble model (Equation 2.) is trained in an additive manner. We need to add a new <math>f</math> to our objective.<br />
[[File:eq2.2.png|center|eq2.2.png]]<br />
After applying the second-order Taylor expansion to the above formula and removing constant terms, we can finally have<br />
[[File:eq3.png|center|eq3.png]]<br />
Define <math>I_{j} = \{i| q(X_{i}) \}</math><br />
As the instance set of leaf j. We can rewrite Equation (3) as<br />
[[File:eq4.png|center|eq4.png]]<br />
By computing the optimal weight and corresponding optimal value, we can use following equation as the scoring function which can measure the quality of a tree structure. Lower score means better tree structure.<br />
[[File:eq5.png|center|eq5.png]]<br />
The following figure shows an example how this score is calculated.<br />
[[File:Picture2.3.png|500px| thumb|center|Figure 2: Structure Score Calculation. We only need to sum up the gradient and second order gradient statistics on each leaf, then apply the scoring formula to get the quality score. ]]<br />
<br />
<br />
<br />
Normally it is impossible to enumerate all the possible tree structures. Assume <math>I_{L}</math> and <math>I_{R}</math> are the instance sets of left and right nodes after the split, and <math> I = I_{L} + I_{R}</math>. Then the loss reduction after the split is given by<br />
[[File:eq6.png|center]]<br />
<br />
=== Shrinkage and Column Subsampling ===<br />
In addition to the regularized objective, the paper also introduces two techniques to prevent over-fitting. <br />
<br />
Method 1: Shrinkage. After each step of tree boosting, we scale newly added weights by a shrinkage factor η, which reduces the influence of every individual tree and leaves the space for future trees. <br />
<br />
Method 2: Column (feature) subsampling. We try to build each individual tree with only a portion of predictors chosen randomly. This procedure encourages the variance between the trees (i.e. no same tree occurs). Implemented in an opensource packages, this method prevents over-fitting, compared with the traditional row subsampling, and speeds up the computations of parallel algorithm as well.<br />
<br />
= Split Finding Algorithms =<br />
The ways to find the best split can be realized by two main methods. <br />
<br />
Basic exact greedy algorithm: enumerate over all the possible splits on all the features, which is computationally demanding. Sort the data first, and visit the data to accumulate the gradient statistics greedily. <br />
<br />
[[File:exact_greedy.png|500px| thumb|center ]]<br />
[[File:appro_greedy.png|500px| thumb|center]]<br />
<br />
Approximate algorithm: To achieve an effective gradient tree boosting model, The author introduces an approximate algorithm. The algorithm first proposes candidate splitting points by the percentiles of feature distribution, maps the continuous features into buckets split and then finds the best solution among the proposals by the aggregated statistics. <br />
The algorithm can be varied by the time the proposal is given. <br />
<br />
The global variant: all candidate splits are proposed at the initial phase of tree construction and are used for all split finding. It requires fewer proposal steps. <br />
<br />
The local variant: the candidate splits are refined after each split. It requires fewer candidate points, so it is potentially more appropriate for deeper trees. <br />
Methods can be chosen according to users’ needs. <br />
<br />
[[File:auc.png|500px| thumb|center]]<br />
<br />
=== Weighted Quantile Sketch ===<br />
Motivation: One key step in the approximate algorithm is to propose candidate split points. Usually, percentiles of feature are used to make candidates distribute evenly on the data.<br />
<br />
let multi-set<br />
<math><br />
\mathbf{D}_k = \{ (\mathbf{x}_{1k}, \mathbf{h}_1), (\mathbf{x}_{2k}, \mathbf{h}_2) , ... (\mathbf{x}_{nk}, \mathbf{h}_n) \}<br />
</math><br />
represent the k-th feature values and second order gradient statistics of each training instances. <br />
The define a rank function <math><br />
\mathbf{r}_k : \mathbf{R} \rightarrow [0, \infty)<br />
</math>:<br />
<br />
<br />
<br />
[[File:For4.png|500px| thumb|center ]]<br />
<br />
[[File:For5.png|500px| thumb|center ]]<br />
<br />
<br />
[[File:For3.png|500px| thumb|center ]]<br />
<br />
[[File:For6.png|500px| thumb|center ]]<br />
<br />
=== Sparsity--aware Split Finding ===<br />
<br />
In real world problem, it is quite common for the input <math> X </math> to be sparse. The reasons that causes for sparsity are following: 1. presence of missing values in the data; 2.frequent zero entries in the statistics; and, 3.artifacts of feature engineering such as one-hot encoding.<br />
<br />
[[File:Figure 4.png|500px| thumb|center | Figure 4. Tree structure with default directions. An example will be classified into the default direction when the feature needed for the split is missing]]<br />
This paper propose to add a default direction in each tree node, which is shown in Figure 4. When a value is missing in the sparse matrix <math> X </math>, the instance is classified into the default direction. There are two choices of default direction in each branch. The optimal default direction is learn from data. Algorithm 3 shows Sparsity aware Split Finding. <br />
<br />
[[File:Algorithm 3.png|500px| thumb|center ]]<br />
The important improvement is to only visit the non-missing entries <math> I_k</math>. The presented algorithm treats the non-presence as a missing value and learns the best direction to handle missing values.<br />
XGBoost handles all sparsity patterns in a unified way. More importantly, this paper exploits the sparsity to make computation complexity linear to number of non-missing entries in the input. <br />
[[File:Figure 5.png|500px| thumb|center | Figure 5. Impact of the sparsity aware algorithm on Allstate-10K. The dataset is sparse mainly due to one-hot encoding. The sparsity aware algorithm is more than 50 times faster than the naive version that does not take sparsity into consideration. ]]<br />
Figure 5 shows the comparison of sparsity aware and a naive implementation on an Allstate-10K dataset. The results shows that sparsity aware algorithm runs 50 times faster than the naive version. This confirms the importance of the sparsity aware algorithm.<br />
<br />
= System Design =<br />
<br />
=== Column Blocks and Parallelization ===<br />
<br />
[[File:Figure6.png|500px| thumb|center ]]<br />
1. Feature values are sorted.<br />
2. A block contains one or more feature values.<br />
3. Instance indices are stored in blocks.<br />
4. Missing features are not stored.<br />
5. With column blocks, a parallel split finding algorithm is easy to design.<br />
<br />
=== Cache Aware Access ===<br />
A thread pre-fetches data from non-continuous memory into a continuous buffer.<br />
The main thread accumulates gradients statistics in the continuous buffer.<br />
<br />
=Evaluations=<br />
===System Implementation===<br />
Implemented as an open source package, XGBoost is flexible, portable and reusable. It also supports various languages and ecosystems.<br />
<br />
===Dataset and Setup===<br />
Table 2 summarizes the 4 datasets used in the experiments.<br />
[[File:t2.png|500px| thumb|center ]]<br />
<br />
<br />
The first three datasets are used for the single machine parallel setting, which is run on a Dell PowerEdge R420 with two 8-core Intel Xeon and memory 64GB. Meanwhile, AWS c3.8xlarge machine is used for experiment in distributed and the out-of-core setting, which uses the criteo dataset. All available cores in the machines are devoted into the experiments, unless otherwise specified.<br />
<br />
=== Experiments===<br />
The classification experiment uses Higgs Boson dataset to classify whether an event corresponds to the Higgs boson. The goal is to evaluate XGBoost using Exact Greedy Methods and compare it with column sampling XGBoost and two other well-known systems. Table 3 below summarizes the results.<br />
[[File:t3.png|500px| thumb|center ]]<br />
Overall, both XGBoost and scikit-learn performs better than R.gbm by 25% of Test AUC. In terms of time, however, XGBoost runs significantly faster than both scikit-learn and R.gbm. Note that XGBoost with column sampling performs slightly worse because the number of key features is limited. We could conclude that XGBoost using the exact greedy algorithm without column sampling works the best under this scenario.<br />
<br />
The learning to rank experiment compares XGBoost(using exact greedy algorithm) with pGBRT(using approximate algorithm) in order to evaluate the system's performance on rank problem.<br />
<br />
[[File:t4.png|500px| thumb|center ]]<br />
[[File:f10.png|500px| thumb|center ]]<br />
According to Table 4, XGBoost with column sampling performs almost as well as pGBRT, while using only 1/5 of the time. Note that pGBRT is the best previously published system to solve this rank problem. <br />
<br />
The out-of-score and distributed experiments demonstrate XGBoost's ability to handle large data with a fast speed.<br />
<br />
= Conclusion=<br />
The writer, Tianqi Chen, proposed a scalable end-to-end tree boosting system with cache access patterns, block structure and other essential elements. He also introduced two novel split finding algorithms, one is sparsity-aware algorithm and the other is weighted quantile sketch with proofs. Compared to other popular systems, XGBoost gives state-of-the-art results in different experiment settings, which explains why it it considered one of the best and fastest algorithm for structured or tabular data.<br />
<br />
= Source =</div>C55jianghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=summary&diff=34536summary2018-03-17T17:17:40Z<p>C55jiang: /* Weighted Quantile Sketch */</p>
<hr />
<div>'''''XGBoost: A Scalable Tree Boosting System'''''<br />
<br />
== Presented by ==<br />
<br />
Jiang, Cong<br />
<br />
Song, Ziwei<br />
<br />
Ye, Zhaoshan<br />
<br />
Zhang, Wenling<br />
<br />
= Introduction =<br />
<br />
The Extreme Gradiant Boosting, also known as XGBoost, has been the leading model that dominates Kaggle, as well as other machine learning and data mining challenges. The main reason to its success is that this tree boosting system is highly scalable , which means it could be used to solve various problems with significantly less time and fewer resources. <br />
<br />
This paper is written by the father of XGBoost, Tianqi Chen. It gives us an overview on how he used algorithmic optimizations as well as some important systems to develop XGBoost. He explained it in the following manner:<br />
<br />
1. Based on gradient tree boosting algorithm, Chen makes some modifications to the regularized objective function and introduces an end-to-end tree boosting system. He also applies shrinkage and feature subsampling to prevent overfitting, <br />
<br />
2. The system supports existing split finding algorithms including exact greedy and approximate algorithms. For weighted data, Chen proposes a distributed weighted quantile sketch algorithm which is provable; while for sparse data, he develops a novel sparsity-aware algorithm which is much faster than the naive model.<br />
<br />
3. Storing the data in ''blocks'' is Chen's method to reduce sorting cost. The cache-aware block structure is highly effective for out-of-core tree learning.<br />
<br />
4. XGBoost is evaluated by designed experiments with 4 datasets. Chen compares it with adjusted setting as well as other systems to test its performance.<br />
<br />
= Tree Boosting=<br />
<br />
<br />
=== Regularized Learning Objective ===<br />
Suppose the given data set has m features and n examples. A tree ensemble model uses K additive functions to predict the output.<br />
[[File:eq1.png|center|eq1.png]]<br />
where F is the space of regression trees(CART).<br />
The author suggests the following regularized objective function to make improvements comparing to gradient boosting<br />
[[File:eq2.png|center|eq2.png]]<br />
Regression tree contains continuous scores on each leaf, which is different with decision trees we have learned in class. The following 2 figures represent how to use the decision rules in the trees to classify it into the leaves and calculate the final prediction by summing up the score in the corresponding leaves.<br />
<br />
[[File:Picture2.1.png|500px| thumb| center| Figure 0: Prediction Score of Tree Structure.]]<br />
<br />
<br />
[[File:Picture2.2.png|500px|thumb|center|Figure 1: Tree Ensemble Model. The final prediction for a given example is the sum of predictions from each tree. ]]<br />
<br />
=== Gradient Tree Boosting ===<br />
The tree ensemble model (Equation 2.) is trained in an additive manner. We need to add a new <math>f</math> to our objective.<br />
[[File:eq2.2.png|center|eq2.2.png]]<br />
After applying the second-order Taylor expansion to the above formula and removing constant terms, we can finally have<br />
[[File:eq3.png|center|eq3.png]]<br />
Define <math>I_{j} = \{i| q(X_{i}) \}</math><br />
As the instance set of leaf j. We can rewrite Equation (3) as<br />
[[File:eq4.png|center|eq4.png]]<br />
By computing the optimal weight and corresponding optimal value, we can use following equation as the scoring function which can measure the quality of a tree structure. Lower score means better tree structure.<br />
[[File:eq5.png|center|eq5.png]]<br />
The following figure shows an example how this score is calculated.<br />
[[File:Picture2.3.png|500px| thumb|center|Figure 2: Structure Score Calculation. We only need to sum up the gradient and second order gradient statistics on each leaf, then apply the scoring formula to get the quality score. ]]<br />
<br />
<br />
<br />
Normally it is impossible to enumerate all the possible tree structures. Assume <math>I_{L}</math> and <math>I_{R}</math> are the instance sets of left and right nodes after the split, and <math> I = I_{L} + I_{R}</math>. Then the loss reduction after the split is given by<br />
[[File:eq6.png|center]]<br />
<br />
=== Shrinkage and Column Subsampling ===<br />
In addition to the regularized objective, the paper also introduces two techniques to prevent over-fitting. <br />
<br />
Method 1: Shrinkage. After each step of tree boosting, we scale newly added weights by a shrinkage factor η, which reduces the influence of every individual tree and leaves the space for future trees. <br />
<br />
Method 2: Column (feature) subsampling. We try to build each individual tree with only a portion of predictors chosen randomly. This procedure encourages the variance between the trees (i.e. no same tree occurs). Implemented in an opensource packages, this method prevents over-fitting, compared with the traditional row subsampling, and speeds up the computations of parallel algorithm as well.<br />
<br />
= Split Finding Algorithms =<br />
The ways to find the best split can be realized by two main methods. <br />
<br />
Basic exact greedy algorithm: enumerate over all the possible splits on all the features, which is computationally demanding. Sort the data first, and visit the data to accumulate the gradient statistics greedily. <br />
<br />
[[File:exact_greedy.png|500px| thumb|center ]]<br />
[[File:appro_greedy.png|500px| thumb|center]]<br />
<br />
Approximate algorithm: To achieve an effective gradient tree boosting model, The author introduces an approximate algorithm. The algorithm first proposes candidate splitting points by the percentiles of feature distribution, maps the continuous features into buckets split and then finds the best solution among the proposals by the aggregated statistics. <br />
The algorithm can be varied by the time the proposal is given. <br />
<br />
The global variant: all candidate splits are proposed at the initial phase of tree construction and are used for all split finding. It requires fewer proposal steps. <br />
<br />
The local variant: the candidate splits are refined after each split. It requires fewer candidate points, so it is potentially more appropriate for deeper trees. <br />
Methods can be chosen according to users’ needs. <br />
<br />
[[File:auc.png|500px| thumb|center]]<br />
<br />
=== Weighted Quantile Sketch ===<br />
Motivation: One key step in the approximate algorithm is to propose candidate split points. Usually, percentiles of feature are used to make candidates distribute evenly on the data.<br />
<br />
let multi-set<br />
<math><br />
\mathbf{D}_k = \math { (\mathbf{x}_{1k}, \mathbf{h}_1), (\mathbf{x}_{2k}, \mathbf{h}_2) , ... (\mathbf{x}_{nk}, \mathbf{h}_n)}<br />
</math><br />
represent the k-th feature values and second order gradient statistics of each training instances. <br />
The define a rank function <math><br />
\mathbf{r}_k : \mathbf{R} \rightarrow [0, \infty)<br />
</math>:<br />
<br />
<br />
<br />
[[File:For4.png|500px| thumb|center ]]<br />
<br />
[[File:For5.png|500px| thumb|center ]]<br />
<br />
<br />
[[File:For3.png|500px| thumb|center ]]<br />
<br />
[[File:For6.png|500px| thumb|center ]]<br />
<br />
=== Sparsity--aware Split Finding ===<br />
<br />
In real world problem, it is quite common for the input <math> X </math> to be sparse. The reasons that causes for sparsity are following: 1. presence of missing values in the data; 2.frequent zero entries in the statistics; and, 3.artifacts of feature engineering such as one-hot encoding.<br />
<br />
[[File:Figure 4.png|500px| thumb|center | Figure 4. Tree structure with default directions. An example will be classified into the default direction when the feature needed for the split is missing]]<br />
This paper propose to add a default direction in each tree node, which is shown in Figure 4. When a value is missing in the sparse matrix <math> X </math>, the instance is classified into the default direction. There are two choices of default direction in each branch. The optimal default direction is learn from data. Algorithm 3 shows Sparsity aware Split Finding. <br />
<br />
[[File:Algorithm 3.png|500px| thumb|center ]]<br />
The important improvement is to only visit the non-missing entries <math> I_k</math>. The presented algorithm treats the non-presence as a missing value and learns the best direction to handle missing values.<br />
XGBoost handles all sparsity patterns in a unified way. More importantly, this paper exploits the sparsity to make computation complexity linear to number of non-missing entries in the input. <br />
[[File:Figure 5.png|500px| thumb|center | Figure 5. Impact of the sparsity aware algorithm on Allstate-10K. The dataset is sparse mainly due to one-hot encoding. The sparsity aware algorithm is more than 50 times faster than the naive version that does not take sparsity into consideration. ]]<br />
Figure 5 shows the comparison of sparsity aware and a naive implementation on an Allstate-10K dataset. The results shows that sparsity aware algorithm runs 50 times faster than the naive version. This confirms the importance of the sparsity aware algorithm.<br />
<br />
= System Design =<br />
<br />
=== Column Blocks and Parallelization ===<br />
<br />
[[File:Figure6.png|500px| thumb|center ]]<br />
1. Feature values are sorted.<br />
2. A block contains one or more feature values.<br />
3. Instance indices are stored in blocks.<br />
4. Missing features are not stored.<br />
5. With column blocks, a parallel split finding algorithm is easy to design.<br />
<br />
=== Cache Aware Access ===<br />
A thread pre-fetches data from non-continuous memory into a continuous buffer.<br />
The main thread accumulates gradients statistics in the continuous buffer.<br />
<br />
=Evaluations=<br />
===System Implementation===<br />
Implemented as an open source package, XGBoost is flexible, portable and reusable. It also supports various languages and ecosystems.<br />
<br />
===Dataset and Setup===<br />
Table 2 summarizes the 4 datasets used in the experiments.<br />
[[File:t2.png|500px| thumb|center ]]<br />
<br />
<br />
The first three datasets are used for the single machine parallel setting, which is run on a Dell PowerEdge R420 with two 8-core Intel Xeon and memory 64GB. Meanwhile, AWS c3.8xlarge machine is used for experiment in distributed and the out-of-core setting, which uses the criteo dataset. All available cores in the machines are devoted into the experiments, unless otherwise specified.<br />
<br />
=== Experiments===<br />
The classification experiment uses Higgs Boson dataset to classify whether an event corresponds to the Higgs boson. The goal is to evaluate XGBoost using Exact Greedy Methods and compare it with column sampling XGBoost and two other well-known systems. Table 3 below summarizes the results.<br />
[[File:t3.png|500px| thumb|center ]]<br />
Overall, both XGBoost and scikit-learn performs better than R.gbm by 25% of Test AUC. In terms of time, however, XGBoost runs significantly faster than both scikit-learn and R.gbm. Note that XGBoost with column sampling performs slightly worse because the number of key features is limited. We could conclude that XGBoost using the exact greedy algorithm without column sampling works the best under this scenario.<br />
<br />
The learning to rank experiment compares XGBoost(using exact greedy algorithm) with pGBRT(using approximate algorithm) in order to evaluate the system's performance on rank problem.<br />
<br />
[[File:t4.png|500px| thumb|center ]]<br />
[[File:f10.png|500px| thumb|center ]]<br />
According to Table 4, XGBoost with column sampling performs almost as well as pGBRT, while using only 1/5 of the time. Note that pGBRT is the best previously published system to solve this rank problem. <br />
<br />
The out-of-score and distributed experiments demonstrate XGBoost's ability to handle large data with a fast speed.<br />
<br />
= Conclusion=<br />
The writer, Tianqi Chen, proposed a scalable end-to-end tree boosting system with cache access patterns, block structure and other essential elements. He also introduced two novel split finding algorithms, one is sparsity-aware algorithm and the other is weighted quantile sketch with proofs. Compared to other popular systems, XGBoost gives state-of-the-art results in different experiment settings, which explains why it it considered one of the best and fastest algorithm for structured or tabular data.<br />
<br />
= Source =</div>C55jianghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=summary&diff=34535summary2018-03-17T17:17:19Z<p>C55jiang: /* Weighted Quantile Sketch */</p>
<hr />
<div>'''''XGBoost: A Scalable Tree Boosting System'''''<br />
<br />
== Presented by ==<br />
<br />
Jiang, Cong<br />
<br />
Song, Ziwei<br />
<br />
Ye, Zhaoshan<br />
<br />
Zhang, Wenling<br />
<br />
= Introduction =<br />
<br />
The Extreme Gradiant Boosting, also known as XGBoost, has been the leading model that dominates Kaggle, as well as other machine learning and data mining challenges. The main reason to its success is that this tree boosting system is highly scalable , which means it could be used to solve various problems with significantly less time and fewer resources. <br />
<br />
This paper is written by the father of XGBoost, Tianqi Chen. It gives us an overview on how he used algorithmic optimizations as well as some important systems to develop XGBoost. He explained it in the following manner:<br />
<br />
1. Based on gradient tree boosting algorithm, Chen makes some modifications to the regularized objective function and introduces an end-to-end tree boosting system. He also applies shrinkage and feature subsampling to prevent overfitting, <br />
<br />
2. The system supports existing split finding algorithms including exact greedy and approximate algorithms. For weighted data, Chen proposes a distributed weighted quantile sketch algorithm which is provable; while for sparse data, he develops a novel sparsity-aware algorithm which is much faster than the naive model.<br />
<br />
3. Storing the data in ''blocks'' is Chen's method to reduce sorting cost. The cache-aware block structure is highly effective for out-of-core tree learning.<br />
<br />
4. XGBoost is evaluated by designed experiments with 4 datasets. Chen compares it with adjusted setting as well as other systems to test its performance.<br />
<br />
= Tree Boosting=<br />
<br />
<br />
=== Regularized Learning Objective ===<br />
Suppose the given data set has m features and n examples. A tree ensemble model uses K additive functions to predict the output.<br />
[[File:eq1.png|center|eq1.png]]<br />
where F is the space of regression trees(CART).<br />
The author suggests the following regularized objective function to make improvements comparing to gradient boosting<br />
[[File:eq2.png|center|eq2.png]]<br />
Regression tree contains continuous scores on each leaf, which is different with decision trees we have learned in class. The following 2 figures represent how to use the decision rules in the trees to classify it into the leaves and calculate the final prediction by summing up the score in the corresponding leaves.<br />
<br />
[[File:Picture2.1.png|500px| thumb| center| Figure 0: Prediction Score of Tree Structure.]]<br />
<br />
<br />
[[File:Picture2.2.png|500px|thumb|center|Figure 1: Tree Ensemble Model. The final prediction for a given example is the sum of predictions from each tree. ]]<br />
<br />
=== Gradient Tree Boosting ===<br />
The tree ensemble model (Equation 2.) is trained in an additive manner. We need to add a new <math>f</math> to our objective.<br />
[[File:eq2.2.png|center|eq2.2.png]]<br />
After applying the second-order Taylor expansion to the above formula and removing constant terms, we can finally have<br />
[[File:eq3.png|center|eq3.png]]<br />
Define <math>I_{j} = \{i| q(X_{i}) \}</math><br />
As the instance set of leaf j. We can rewrite Equation (3) as<br />
[[File:eq4.png|center|eq4.png]]<br />
By computing the optimal weight and corresponding optimal value, we can use following equation as the scoring function which can measure the quality of a tree structure. Lower score means better tree structure.<br />
[[File:eq5.png|center|eq5.png]]<br />
The following figure shows an example how this score is calculated.<br />
[[File:Picture2.3.png|500px| thumb|center|Figure 2: Structure Score Calculation. We only need to sum up the gradient and second order gradient statistics on each leaf, then apply the scoring formula to get the quality score. ]]<br />
<br />
<br />
<br />
Normally it is impossible to enumerate all the possible tree structures. Assume <math>I_{L}</math> and <math>I_{R}</math> are the instance sets of left and right nodes after the split, and <math> I = I_{L} + I_{R}</math>. Then the loss reduction after the split is given by<br />
[[File:eq6.png|center]]<br />
<br />
=== Shrinkage and Column Subsampling ===<br />
In addition to the regularized objective, the paper also introduces two techniques to prevent over-fitting. <br />
<br />
Method 1: Shrinkage. After each step of tree boosting, we scale newly added weights by a shrinkage factor η, which reduces the influence of every individual tree and leaves the space for future trees. <br />
<br />
Method 2: Column (feature) subsampling. We try to build each individual tree with only a portion of predictors chosen randomly. This procedure encourages the variance between the trees (i.e. no same tree occurs). Implemented in an opensource packages, this method prevents over-fitting, compared with the traditional row subsampling, and speeds up the computations of parallel algorithm as well.<br />
<br />
= Split Finding Algorithms =<br />
The ways to find the best split can be realized by two main methods. <br />
<br />
Basic exact greedy algorithm: enumerate all possible splits on all features, which is computationally demanding. The data is first sorted by feature value, then visited in order to greedily accumulate the gradient statistics needed for the structure score. <br />
<br />
[[File:exact_greedy.png|500px| thumb|center ]]<br />
[[File:appro_greedy.png|500px| thumb|center]]<br />
<br />
Approximate algorithm: the exact greedy algorithm becomes impractical when the data does not fit entirely into memory, so the author introduces an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of the feature distribution, maps the continuous features into buckets split by these candidate points, and then finds the best solution among the proposals based on the aggregated statistics. <br />
There are two variants, depending on when the proposals are made. <br />
<br />
The global variant: all candidate splits are proposed at the initial phase of tree construction and are used for all split finding. It requires fewer proposal steps. <br />
<br />
The local variant: the candidate splits are re-proposed after each split. It requires fewer candidate points per proposal, so it is potentially more appropriate for deeper trees. <br />
Users can choose either variant according to their needs. <br />
<br />
[[File:auc.png|500px| thumb|center]]<br />
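The proposal-and-aggregation procedure for a single feature can be sketched as follows; the function names and the simple percentile scheme are illustrative assumptions, not the paper's exact quantile sketch.

```python
import bisect

# Illustrative sketch of the approximate algorithm for one feature:
# propose candidate splits at feature percentiles, aggregate gradient
# statistics per bucket, then pick the best split among the candidates.

def propose_candidates(values, n_buckets):
    """Split points at evenly spaced percentiles of the feature."""
    s = sorted(values)
    return [s[int(len(s) * q / n_buckets)] for q in range(1, n_buckets)]

def best_approx_split(values, grads, hessians, candidates, lam=1.0):
    # Aggregate G and H into buckets delimited by the candidate points.
    G = [0.0] * (len(candidates) + 1)
    H = [0.0] * (len(candidates) + 1)
    for x, g, h in zip(values, grads, hessians):
        b = bisect.bisect_left(candidates, x)
        G[b] += g
        H[b] += h
    G_tot, H_tot = sum(G), sum(H)
    best_gain, best_split = 0.0, None
    G_L = H_L = 0.0
    for j, split in enumerate(candidates):
        G_L += G[j]
        H_L += H[j]
        G_R, H_R = G_tot - G_L, H_tot - H_L
        gain = 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
                      - G_tot**2 / (H_tot + lam))
        if gain > best_gain:
            best_gain, best_split = gain, split
    return best_split, best_gain
```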
<br />
=== Weighted Quantile Sketch ===<br />
Motivation: one key step in the approximate algorithm is to propose candidate split points. Usually, percentiles of a feature are used so that candidates distribute evenly over the data.<br />
<br />
Let the multi-set<br />
<math><br />
\mathbf{D}_k = \{ (x_{1k}, h_1), (x_{2k}, h_2), \ldots, (x_{nk}, h_n) \}<br />
</math><br />
represent the k-th feature values and second-order gradient statistics of the training instances. <br />
Then define a rank function <math><br />
r_k : \mathbb{R} \rightarrow [0, \infty)<br />
</math>:<br />
<br />
<br />
<br />
[[File:For4.png|500px| thumb|center ]]<br />
<br />
[[File:For5.png|500px| thumb|center ]]<br />
<br />
<br />
[[File:For3.png|500px| thumb|center ]]<br />
<br />
[[File:For6.png|500px| thumb|center ]]<br />
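In plainer terms, the rank function r_k(z) returns the fraction of the total second-order weight h carried by instances whose k-th feature value is below z; candidates are then chosen so that consecutive ranks differ by roughly ε. The sketch below uses our own helper names and a simplified greedy candidate selection, not the paper's provable weighted quantile sketch.

```python
# Sketch of the rank function: r_k(z) is the fraction of the total
# second-order weight h carried by instances whose k-th feature value
# is below z. Candidate selection here is a simplified greedy version
# of the paper's epsilon criterion, for illustration only.

def weighted_rank(pairs, z):
    """pairs: list of (feature_value, h); returns r_k(z) in [0, 1)."""
    total = sum(h for _, h in pairs)
    return sum(h for x, h in pairs if x < z) / total

def epsilon_candidates(pairs, eps):
    """Greedily pick candidates roughly eps apart in weighted rank."""
    chosen, last_rank = [], 0.0
    for x in sorted({x for x, _ in pairs}):
        r = weighted_rank(pairs, x)
        if r - last_rank >= eps:
            chosen.append(x)
            last_rank = r
    return chosen
```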
<br />
=== Sparsity-aware Split Finding ===<br />
<br />
In real-world problems, it is quite common for the input <math> X </math> to be sparse. Common causes of sparsity include: 1. presence of missing values in the data; 2. frequent zero entries in the statistics; and 3. artifacts of feature engineering, such as one-hot encoding.<br />
<br />
[[File:Figure 4.png|500px| thumb|center | Figure 4. Tree structure with default directions. An example will be classified into the default direction when the feature needed for the split is missing]]<br />
This paper proposes to add a default direction in each tree node, as shown in Figure 4. When a value is missing in the sparse matrix <math> X </math>, the instance is classified into the default direction. There are two possible default directions in each branch, and the optimal one is learned from the data. Algorithm 3 shows the sparsity-aware split finding procedure. <br />
<br />
[[File:Algorithm 3.png|500px| thumb|center ]]<br />
The important improvement is to visit only the non-missing entries <math> I_k</math>. The presented algorithm treats non-presence as a missing value and learns the best default direction for handling missing values.<br />
XGBoost handles all sparsity patterns in a unified way. More importantly, this paper exploits the sparsity to make the computational complexity linear in the number of non-missing entries in the input. <br />
[[File:Figure 5.png|500px| thumb|center | Figure 5. Impact of the sparsity aware algorithm on Allstate-10K. The dataset is sparse mainly due to one-hot encoding. The sparsity aware algorithm is more than 50 times faster than the naive version that does not take sparsity into consideration. ]]<br />
Figure 5 compares the sparsity-aware algorithm with a naive implementation on the Allstate-10K dataset. The results show that the sparsity-aware algorithm runs more than 50 times faster than the naive version, which confirms its importance.<br />
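The core of Algorithm 3 can be sketched as two linear scans over only the non-missing entries, one per default direction for the missing rows; the function below is an illustrative reconstruction, not the paper's code, and omits the γ penalty for brevity.

```python
# Illustrative reconstruction of Algorithm 3 (sparsity-aware split
# finding) for one feature: scan only the non-missing entries twice,
# once per default direction for the missing rows, keeping the split
# and default direction with the largest gain.

def gain(G_L, H_L, G_R, H_R, G, H, lam):
    return 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
                  - G**2 / (H + lam))

def sparse_best_split(entries, G, H, lam=1.0):
    """entries: (x, g, h) for NON-missing rows only; G and H also
    include the gradient statistics of the missing rows."""
    entries = sorted(entries)
    best_gain, best_split, best_default = 0.0, None, None
    # Default RIGHT: missing rows implicitly stay in the right child.
    G_L = H_L = 0.0
    for x, g, h in entries:
        G_L += g; H_L += h
        sc = gain(G_L, H_L, G - G_L, H - H_L, G, H, lam)
        if sc > best_gain:
            best_gain, best_split, best_default = sc, x, "right"
    # Default LEFT: missing rows implicitly stay in the left child.
    G_R = H_R = 0.0
    for x, g, h in reversed(entries):
        G_R += g; H_R += h
        sc = gain(G - G_R, H - H_R, G_R, H_R, G, H, lam)
        if sc > best_gain:
            best_gain, best_split, best_default = sc, x, "left"
    return best_split, best_default, best_gain
```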
<br />
= System Design =<br />
<br />
=== Column Blocks and Parallelization ===<br />
<br />
[[File:Figure6.png|500px| thumb|center ]]<br />
1. Feature values are sorted.<br />
2. A block contains one or more feature values.<br />
3. Instance indices are stored in blocks.<br />
4. Missing features are not stored.<br />
5. With column blocks, a parallel split finding algorithm is easy to design.<br />
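These points can be made concrete with a small sketch of a column block; the dictionary-based layout is our simplification of the actual compressed column (CSC) format.

```python
# Illustrative sketch: a column block stores, per feature, the feature
# values pre-sorted once, each paired with the index of the instance it
# came from; missing entries are simply absent. Split finding is then
# a single linear scan per feature, and features can be scanned in
# parallel because each column is independent.

def build_column_block(columns):
    """columns: {feature_name: {instance_index: value}}.
    Returns {feature_name: sorted list of (value, instance_index)}."""
    return {f: sorted((v, i) for i, v in col.items())
            for f, col in columns.items()}

def scan_feature(block, feature, grads, hessians):
    """Accumulate prefix gradient statistics along one sorted column."""
    G = H = 0.0
    prefix = []
    for value, i in block[feature]:
        G += grads[i]
        H += hessians[i]
        prefix.append((value, G, H))
    return prefix
```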
<br />
=== Cache Aware Access ===<br />
A thread pre-fetches data from non-contiguous memory into a contiguous buffer.<br />
The main thread then accumulates the gradient statistics from the contiguous buffer.<br />
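A rough sketch of the idea follows; Python cannot exhibit real cache behavior, so this only mirrors the access pattern (a batched gather into a contiguous buffer, followed by sequential accumulation), which in the C++ implementation turns random reads into sequential ones.

```python
# Illustrative access-pattern sketch (Python cannot show real cache
# effects): gather the gradient entries referenced by a sorted column's
# instance indices into a small contiguous buffer, then accumulate
# sequentially from that buffer.

def cache_aware_sum(indices, grads, buffer_size=64):
    total = 0.0
    for start in range(0, len(indices), buffer_size):
        batch = indices[start:start + buffer_size]
        buffer = [grads[i] for i in batch]  # pre-fetch (gather) step
        for g in buffer:                    # sequential accumulation
            total += g
    return total
```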
<br />
=Evaluations=<br />
===System Implementation===<br />
Implemented as an open source package, XGBoost is flexible, portable and reusable. It also supports various languages and ecosystems.<br />
<br />
===Dataset and Setup===<br />
Table 2 summarizes the 4 datasets used in the experiments.<br />
[[File:t2.png|500px| thumb|center ]]<br />
<br />
<br />
The first three datasets are used in the single-machine parallel setting, run on a Dell PowerEdge R420 with two 8-core Intel Xeon CPUs and 64GB of memory. Meanwhile, AWS c3.8xlarge machines are used for the distributed and out-of-core experiments, which use the Criteo dataset. All available cores on the machines are devoted to the experiments, unless otherwise specified.<br />
<br />
=== Experiments===<br />
The classification experiment uses the Higgs boson dataset to classify whether an event corresponds to the Higgs boson. The goal is to evaluate XGBoost using the exact greedy method and compare it with column-sampling XGBoost and two other well-known systems. Table 3 below summarizes the results.<br />
[[File:t3.png|500px| thumb|center ]]<br />
Overall, both XGBoost and scikit-learn perform better than R's gbm by about 25% in test AUC. In terms of running time, however, XGBoost is significantly faster than both scikit-learn and R's gbm. Note that XGBoost with column sampling performs slightly worse because the number of important features in this dataset is limited. We can conclude that XGBoost using the exact greedy algorithm without column sampling works best in this scenario.<br />
<br />
The learning-to-rank experiment compares XGBoost (using the exact greedy algorithm) with pGBRT (using the approximate algorithm) in order to evaluate the system's performance on ranking problems.<br />
<br />
[[File:t4.png|500px| thumb|center ]]<br />
[[File:f10.png|500px| thumb|center ]]<br />
According to Table 4, XGBoost with column sampling performs almost as well as pGBRT while using only about one fifth of the time. Note that pGBRT was the best previously published system for this ranking problem. <br />
<br />
The out-of-core and distributed experiments demonstrate XGBoost's ability to handle large data at high speed.<br />
<br />
= Conclusion=<br />
The author, Tianqi Chen, proposed a scalable end-to-end tree boosting system with cache-aware access patterns, a block structure, and other essential elements. He also introduced two novel split finding algorithms: a sparsity-aware algorithm and a weighted quantile sketch with provable guarantees. Compared with other popular systems, XGBoost gives state-of-the-art results in different experiment settings, which explains why it is considered one of the best and fastest algorithms for structured or tabular data.<br />
<br />
= Source =</div>
<hr />
<div>'''''XGBoost: A Scalable Tree Boosting System'''''<br />
<br />
== Presented by ==<br />
<br />
Jiang, Cong<br />
<br />
Song, Ziwei<br />
<br />
Ye, Zhaoshan<br />
<br />
Zhang, Wenling<br />
<br />
= Introduction =<br />
<br />
The Extreme Gradiant Boosting, also known as XGBoost, has been the leading model that dominates Kaggle, as well as other machine learning and data mining challenges. The main reason to its success is that this tree boosting system is highly scalable , which means it could be used to solve various problems with significantly less time and fewer resources. <br />
<br />
This paper is written by the father of XGBoost, Tianqi Chen. It gives us an overview on how he used algorithmic optimizations as well as some important systems to develop XGBoost. He explained it in the following manner:<br />
<br />
1. Based on gradient tree boosting algorithm, Chen makes some modifications to the regularized objective function and introduces an end-to-end tree boosting system. He also applies shrinkage and feature subsampling to prevent overfitting, <br />
<br />
2. The system supports existing split finding algorithms including exact greedy and approximate algorithms. For weighted data, Chen proposes a distributed weighted quantile sketch algorithm which is provable; while for sparse data, he develops a novel sparsity-aware algorithm which is much faster than the naive model.<br />
<br />
3. Storing the data in ''blocks'' is Chen's method to reduce sorting cost. The cache-aware block structure is highly effective for out-of-core tree learning.<br />
<br />
4. XGBoost is evaluated by designed experiments with 4 datasets. Chen compares it with adjusted setting as well as other systems to test its performance.<br />
<br />
= Tree Boosting=<br />
<br />
<br />
=== Regularized Learning Objective ===<br />
Suppose the given data set has m features and n examples. A tree ensemble model uses K additive functions to predict the output.<br />
[[File:eq1.png|center|eq1.png]]<br />
where F is the space of regression trees(CART).<br />
The author suggests the following regularized objective function to make improvements comparing to gradient boosting<br />
[[File:eq2.png|center|eq2.png]]<br />
Regression tree contains continuous scores on each leaf, which is different with decision trees we have learned in class. The following 2 figures represent how to use the decision rules in the trees to classify it into the leaves and calculate the final prediction by summing up the score in the corresponding leaves.<br />
<br />
[[File:Picture2.1.png|500px| thumb| center| Figure 0: Prediction Score of Tree Structure.]]<br />
<br />
<br />
[[File:Picture2.2.png|500px|thumb|center|Figure 1: Tree Ensemble Model. The final prediction for a given example is the sum of predictions from each tree. ]]<br />
<br />
=== Gradient Tree Boosting ===<br />
The tree ensemble model (Equation 2.) is trained in an additive manner. We need to add a new <math>f</math> to our objective.<br />
[[File:eq2.2.png|center|eq2.2.png]]<br />
After applying the second-order Taylor expansion to the above formula and removing constant terms, we can finally have<br />
[[File:eq3.png|center|eq3.png]]<br />
Define <math>I_{j} = \{i| q(X_{i}) \}</math><br />
As the instance set of leaf j. We can rewrite Equation (3) as<br />
[[File:eq4.png|center|eq4.png]]<br />
By computing the optimal weight and corresponding optimal value, we can use following equation as the scoring function which can measure the quality of a tree structure. Lower score means better tree structure.<br />
[[File:eq5.png|center|eq5.png]]<br />
The following figure shows an example how this score is calculated.<br />
[[File:Picture2.3.png|500px| thumb|center|Figure 2: Structure Score Calculation. We only need to sum up the gradient and second order gradient statistics on each leaf, then apply the scoring formula to get the quality score. ]]<br />
<br />
<br />
<br />
Normally it is impossible to enumerate all the possible tree structures. Assume <math>I_{L}</math> and <math>I_{R}</math> are the instance sets of left and right nodes after the split, and <math> I = I_{L} + I_{R}</math>. Then the loss reduction after the split is given by<br />
[[File:eq6.png|center]]<br />
<br />
=== Shrinkage and Column Subsampling ===<br />
In addition to the regularized objective, the paper also introduces two techniques to prevent over-fitting. <br />
<br />
Method 1: Shrinkage. After each step of tree boosting, we scale newly added weights by a shrinkage factor η, which reduces the influence of every individual tree and leaves the space for future trees. <br />
<br />
Method 2: Column (feature) subsampling. We try to build each individual tree with only a portion of predictors chosen randomly. This procedure encourages the variance between the trees (i.e. no same tree occurs). Implemented in an opensource packages, this method prevents over-fitting, compared with the traditional row subsampling, and speeds up the computations of parallel algorithm as well.<br />
<br />
= Split Finding Algorithms =<br />
The ways to find the best split can be realized by two main methods. <br />
<br />
Basic exact greedy algorithm: enumerate over all the possible splits on all the features, which is computationally demanding. Sort the data first, and visit the data to accumulate the gradient statistics greedily. <br />
<br />
[[File:exact_greedy.png|500px| thumb|center ]]<br />
[[File:appro_greedy.png|500px| thumb|center]]<br />
<br />
Approximate algorithm: To achieve an effective gradient tree boosting model, The author introduces an approximate algorithm. The algorithm first proposes candidate splitting points by the percentiles of feature distribution, maps the continuous features into buckets split and then finds the best solution among the proposals by the aggregated statistics. <br />
The algorithm can be varied by the time the proposal is given. <br />
<br />
The global variant: all candidate splits are proposed at the initial phase of tree construction and are used for all split finding. It requires fewer proposal steps. <br />
<br />
The local variant: the candidate splits are refined after each split. It requires fewer candidate points, so it is potentially more appropriate for deeper trees. <br />
Methods can be chosen according to users’ needs. <br />
<br />
[[File:auc.png|500px| thumb|center]]<br />
<br />
=== Weighted Quantile Sketch ===<br />
Motivation: One key step in the approximate algorithm is to propose candidate split points. Usually, percentiles of feature are used to make candidates distribute evenly on the data.<br />
<br />
let multi-set<br />
<math><br />
\mathbf{D}_k = { (\mathbf{x}_{1k}, \mathbf{h}_1), (\mathbf{x}_{2k}, \mathbf{h}_2) , ... (\mathbf{x}_{nk}, \mathbf{h}_n)}<br />
</math><br />
represent the k-th feature values and second order gradient statistics of each training instances. <br />
The define a rank function <math><br />
\mathbf{r}_k : \mathbf{R} \rightarrow [0, \infty)<br />
</math>:<br />
<br />
<br />
<br />
[[File:For4.png|500px| thumb|center ]]<br />
<br />
[[File:For5.png|500px| thumb|center ]]<br />
<br />
<br />
[[File:For3.png|500px| thumb|center ]]<br />
<br />
[[File:For6.png|500px| thumb|center ]]<br />
<br />
=== Sparsity--aware Split Finding ===<br />
<br />
In real world problem, it is quite common for the input <math> X </math> to be sparse. The reasons that causes for sparsity are following: 1. presence of missing values in the data; 2.frequent zero entries in the statistics; and, 3.artifacts of feature engineering such as one-hot encoding.<br />
<br />
[[File:Figure 4.png|500px| thumb|center | Figure 4. Tree structure with default directions. An example will be classified into the default direction when the feature needed for the split is missing]]<br />
This paper propose to add a default direction in each tree node, which is shown in Figure 4. When a value is missing in the sparse matrix <math> X </math>, the instance is classified into the default direction. There are two choices of default direction in each branch. The optimal default direction is learn from data. Algorithm 3 shows Sparsity aware Split Finding. <br />
<br />
[[File:Algorithm 3.png|500px| thumb|center ]]<br />
The important improvement is to only visit the non-missing entries <math> I_k</math>. The presented algorithm treats the non-presence as a missing value and learns the best direction to handle missing values.<br />
XGBoost handles all sparsity patterns in a unified way. More importantly, this paper exploits the sparsity to make computation complexity linear to number of non-missing entries in the input. <br />
[[File:Figure 5.png|500px| thumb|center | Figure 5. Impact of the sparsity aware algorithm on Allstate-10K. The dataset is sparse mainly due to one-hot encoding. The sparsity aware algorithm is more than 50 times faster than the naive version that does not take sparsity into consideration. ]]<br />
Figure 5 shows the comparison of sparsity aware and a naive implementation on an Allstate-10K dataset. The results shows that sparsity aware algorithm runs 50 times faster than the naive version. This confirms the importance of the sparsity aware algorithm.<br />
<br />
= System Design =<br />
<br />
=== Column Blocks and Parallelization ===<br />
<br />
[[File:Figure6.png|500px| thumb|center ]]<br />
1. Feature values are sorted.<br />
2. A block contains one or more feature values.<br />
3. Instance indices are stored in blocks.<br />
4. Missing features are not stored.<br />
5. With column blocks, a parallel split finding algorithm is easy to design.<br />
<br />
=== Cache Aware Access ===<br />
A thread pre-fetches data from non-continuous memory into a continuous buffer.<br />
The main thread accumulates gradients statistics in the continuous buffer.<br />
<br />
=Evaluations=<br />
===System Implementation===<br />
Implemented as an open source package, XGBoost is flexible, portable and reusable. It also supports various languages and ecosystems.<br />
<br />
===Dataset and Setup===<br />
Table 2 summarizes the 4 datasets used in the experiments.<br />
[[File:t2.png|500px| thumb|center ]]<br />
<br />
<br />
The first three datasets are used for the single machine parallel setting, which is run on a Dell PowerEdge R420 with two 8-core Intel Xeon and memory 64GB. Meanwhile, AWS c3.8xlarge machine is used for experiment in distributed and the out-of-core setting, which uses the criteo dataset. All available cores in the machines are devoted into the experiments, unless otherwise specified.<br />
<br />
=== Experiments===<br />
The classification experiment uses Higgs Boson dataset to classify whether an event corresponds to the Higgs boson. The goal is to evaluate XGBoost using Exact Greedy Methods and compare it with column sampling XGBoost and two other well-known systems. Table 3 below summarizes the results.<br />
[[File:t3.png|500px| thumb|center ]]<br />
Overall, both XGBoost and scikit-learn performs better than R.gbm by 25% of Test AUC. In terms of time, however, XGBoost runs significantly faster than both scikit-learn and R.gbm. Note that XGBoost with column sampling performs slightly worse because the number of key features is limited. We could conclude that XGBoost using the exact greedy algorithm without column sampling works the best under this scenario.<br />
<br />
The learning to rank experiment compares XGBoost(using exact greedy algorithm) with pGBRT(using approximate algorithm) in order to evaluate the system's performance on rank problem.<br />
<br />
[[File:t4.png|500px| thumb|center ]]<br />
[[File:f10.png|500px| thumb|center ]]<br />
According to Table 4, XGBoost with column sampling performs almost as well as pGBRT, while using only 1/5 of the time. Note that pGBRT is the best previously published system to solve this rank problem. <br />
<br />
The out-of-score and distributed experiments demonstrate XGBoost's ability to handle large data with a fast speed.<br />
<br />
= Conclusion=<br />
The writer, Tianqi Chen, proposed a scalable end-to-end tree boosting system with cache access patterns, block structure and other essential elements. He also introduced two novel split finding algorithms, one is sparsity-aware algorithm and the other is weighted quantile sketch with proofs. Compared to other popular systems, XGBoost gives state-of-the-art results in different experiment settings, which explains why it it considered one of the best and fastest algorithm for structured or tabular data.<br />
<br />
= Source =</div>C55jianghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=summary&diff=34533summary2018-03-17T17:12:43Z<p>C55jiang: /* Split Finding Algorithms */</p>
<hr />
<div>'''''XGBoost: A Scalable Tree Boosting System'''''<br />
<br />
== Presented by ==<br />
<br />
Jiang, Cong<br />
<br />
Song, Ziwei<br />
<br />
Ye, Zhaoshan<br />
<br />
Zhang, Wenling<br />
<br />
= Introduction =<br />
<br />
The Extreme Gradiant Boosting, also known as XGBoost, has been the leading model that dominates Kaggle, as well as other machine learning and data mining challenges. The main reason to its success is that this tree boosting system is highly scalable , which means it could be used to solve various problems with significantly less time and fewer resources. <br />
<br />
This paper is written by the father of XGBoost, Tianqi Chen. It gives us an overview on how he used algorithmic optimizations as well as some important systems to develop XGBoost. He explained it in the following manner:<br />
<br />
1. Based on gradient tree boosting algorithm, Chen makes some modifications to the regularized objective function and introduces an end-to-end tree boosting system. He also applies shrinkage and feature subsampling to prevent overfitting, <br />
<br />
2. The system supports existing split finding algorithms including exact greedy and approximate algorithms. For weighted data, Chen proposes a distributed weighted quantile sketch algorithm which is provable; while for sparse data, he develops a novel sparsity-aware algorithm which is much faster than the naive model.<br />
<br />
3. Storing the data in ''blocks'' is Chen's method to reduce sorting cost. The cache-aware block structure is highly effective for out-of-core tree learning.<br />
<br />
4. XGBoost is evaluated by designed experiments with 4 datasets. Chen compares it with adjusted setting as well as other systems to test its performance.<br />
<br />
= Tree Boosting=<br />
<br />
<br />
=== Regularized Learning Objective ===<br />
Suppose the given data set has m features and n examples. A tree ensemble model uses K additive functions to predict the output.<br />
[[File:eq1.png|center|eq1.png]]<br />
where F is the space of regression trees(CART).<br />
The author suggests the following regularized objective function to make improvements comparing to gradient boosting<br />
[[File:eq2.png|center|eq2.png]]<br />
Regression tree contains continuous scores on each leaf, which is different with decision trees we have learned in class. The following 2 figures represent how to use the decision rules in the trees to classify it into the leaves and calculate the final prediction by summing up the score in the corresponding leaves.<br />
<br />
[[File:Picture2.1.png|500px| thumb| center| Figure 0: Prediction Score of Tree Structure.]]<br />
<br />
<br />
[[File:Picture2.2.png|500px|thumb|center|Figure 1: Tree Ensemble Model. The final prediction for a given example is the sum of predictions from each tree. ]]<br />
<br />
=== Gradient Tree Boosting ===<br />
The tree ensemble model (Equation 2.) is trained in an additive manner. We need to add a new <math>f</math> to our objective.<br />
[[File:eq2.2.png|center|eq2.2.png]]<br />
After applying the second-order Taylor expansion to the above formula and removing constant terms, we can finally have<br />
[[File:eq3.png|center|eq3.png]]<br />
Define <math>I_{j} = \{i| q(X_{i}) \}</math><br />
As the instance set of leaf j. We can rewrite Equation (3) as<br />
[[File:eq4.png|center|eq4.png]]<br />
By computing the optimal weight and corresponding optimal value, we can use following equation as the scoring function which can measure the quality of a tree structure. Lower score means better tree structure.<br />
[[File:eq5.png|center|eq5.png]]<br />
The following figure shows an example how this score is calculated.<br />
[[File:Picture2.3.png|500px| thumb|center|Figure 2: Structure Score Calculation. We only need to sum up the gradient and second order gradient statistics on each leaf, then apply the scoring formula to get the quality score. ]]<br />
<br />
<br />
<br />
Normally it is impossible to enumerate all the possible tree structures. Assume <math>I_{L}</math> and <math>I_{R}</math> are the instance sets of left and right nodes after the split, and <math> I = I_{L} + I_{R}</math>. Then the loss reduction after the split is given by<br />
[[File:eq6.png|center]]<br />
<br />
=== Shrinkage and Column Subsampling ===<br />
In addition to the regularized objective, the paper also introduces two techniques to prevent over-fitting. <br />
<br />
Method 1: Shrinkage. After each step of tree boosting, we scale newly added weights by a shrinkage factor η, which reduces the influence of every individual tree and leaves the space for future trees. <br />
<br />
Method 2: Column (feature) subsampling. We try to build each individual tree with only a portion of predictors chosen randomly. This procedure encourages the variance between the trees (i.e. no same tree occurs). Implemented in an opensource packages, this method prevents over-fitting, compared with the traditional row subsampling, and speeds up the computations of parallel algorithm as well.<br />
<br />
= Split Finding Algorithms =<br />
The ways to find the best split can be realized by two main methods. <br />
<br />
Basic exact greedy algorithm: enumerate over all the possible splits on all the features, which is computationally demanding. Sort the data first, and visit the data to accumulate the gradient statistics greedily. <br />
<br />
[[File:exact_greedy.png|500px| thumb|center ]]<br />
[[File:appro_greedy.png|500px| thumb|center]]<br />
<br />
Approximate algorithm: To achieve an effective gradient tree boosting model, The author introduces an approximate algorithm. The algorithm first proposes candidate splitting points by the percentiles of feature distribution, maps the continuous features into buckets split and then finds the best solution among the proposals by the aggregated statistics. <br />
The algorithm can be varied by the time the proposal is given. <br />
<br />
The global variant: all candidate splits are proposed at the initial phase of tree construction and are used for all split finding. It requires fewer proposal steps. <br />
<br />
The local variant: the candidate splits are refined after each split. It requires fewer candidate points, so it is potentially more appropriate for deeper trees. <br />
Methods can be chosen according to users’ needs. <br />
<br />
[[File:auc.png|500px| thumb|center]]<br />
<br />
=== Weighted Quantile Sketch ===<br />
Motivation: <br />
<br />
let multi-set<br />
<math><br />
\mathbf{D}_k = { (\mathbf{x}_{1k}, \mathbf{h}_1), (\mathbf{x}_{2k}, \mathbf{h}_2) , ... (\mathbf{x}_{nk}, \mathbf{h}_n)}<br />
</math><br />
represent the k-th feature values and second order gradient statistics of each training instances. <br />
The define a rank function <math><br />
\mathbf{r}_k : \mathbf{R} --> [0, +INF]<br />
</math>:<br />
<br />
<br />
[[File:For3.png|500px| thumb|center ]]<br />
<br />
[[File:For4.png|500px| thumb|center ]]<br />
<br />
[[File:For5.png|500px| thumb|center ]]<br />
<br />
[[File:For6.png|500px| thumb|center ]]<br />
=== Sparsity--aware Split Finding ===<br />
<br />
In real world problem, it is quite common for the input <math> X </math> to be sparse. The reasons that causes for sparsity are following: 1. presence of missing values in the data; 2.frequent zero entries in the statistics; and, 3.artifacts of feature engineering such as one-hot encoding.<br />
<br />
[[File:Figure 4.png|500px| thumb|center | Figure 4. Tree structure with default directions. An example will be classified into the default direction when the feature needed for the split is missing]]<br />
This paper propose to add a default direction in each tree node, which is shown in Figure 4. When a value is missing in the sparse matrix <math> X </math>, the instance is classified into the default direction. There are two choices of default direction in each branch. The optimal default direction is learn from data. Algorithm 3 shows Sparsity aware Split Finding. <br />
<br />
[[File:Algorithm 3.png|500px| thumb|center ]]<br />
The important improvement is to only visit the non-missing entries <math> I_k</math>. The presented algorithm treats the non-presence as a missing value and learns the best direction to handle missing values.<br />
XGBoost handles all sparsity patterns in a unified way. More importantly, this paper exploits the sparsity to make computation complexity linear to number of non-missing entries in the input. <br />
[[File:Figure 5.png|500px| thumb|center | Figure 5. Impact of the sparsity aware algorithm on Allstate-10K. The dataset is sparse mainly due to one-hot encoding. The sparsity aware algorithm is more than 50 times faster than the naive version that does not take sparsity into consideration. ]]<br />
Figure 5 shows the comparison of sparsity aware and a naive implementation on an Allstate-10K dataset. The results shows that sparsity aware algorithm runs 50 times faster than the naive version. This confirms the importance of the sparsity aware algorithm.<br />
<br />
= System Design =<br />
<br />
=== Column Blocks and Parallelization ===<br />
<br />
[[File:Figure6.png|500px| thumb|center ]]<br />
1. Feature values are sorted.<br />
2. A block contains one or more feature values.<br />
3. Instance indices are stored in blocks.<br />
4. Missing features are not stored.<br />
5. With column blocks, a parallel split finding algorithm is easy to design.<br />
<br />
=== Cache Aware Access ===<br />
A thread pre-fetches data from non-continuous memory into a continuous buffer.<br />
The main thread accumulates gradients statistics in the continuous buffer.<br />
<br />
=Evaluations=<br />
===System Implementation===<br />
Implemented as an open source package, XGBoost is flexible, portable and reusable. It also supports various languages and ecosystems.<br />
<br />
===Dataset and Setup===<br />
Table 2 summarizes the 4 datasets used in the experiments.<br />
[[File:t2.png|500px| thumb|center ]]<br />
<br />
<br />
The first three datasets are used for the single machine parallel setting, which is run on a Dell PowerEdge R420 with two 8-core Intel Xeon and memory 64GB. Meanwhile, AWS c3.8xlarge machine is used for experiment in distributed and the out-of-core setting, which uses the criteo dataset. All available cores in the machines are devoted into the experiments, unless otherwise specified.<br />
<br />
=== Experiments===<br />
The classification experiment uses Higgs Boson dataset to classify whether an event corresponds to the Higgs boson. The goal is to evaluate XGBoost using Exact Greedy Methods and compare it with column sampling XGBoost and two other well-known systems. Table 3 below summarizes the results.<br />
[[File:t3.png|500px| thumb|center ]]<br />
Overall, both XGBoost and scikit-learn performs better than R.gbm by 25% of Test AUC. In terms of time, however, XGBoost runs significantly faster than both scikit-learn and R.gbm. Note that XGBoost with column sampling performs slightly worse because the number of key features is limited. We could conclude that XGBoost using the exact greedy algorithm without column sampling works the best under this scenario.<br />
<br />
The learning to rank experiment compares XGBoost(using exact greedy algorithm) with pGBRT(using approximate algorithm) in order to evaluate the system's performance on rank problem.<br />
<br />
[[File:t4.png|500px| thumb|center ]]<br />
[[File:f10.png|500px| thumb|center ]]<br />
According to Table 4, XGBoost with column sampling performs almost as well as pGBRT, while using only 1/5 of the time. Note that pGBRT is the best previously published system to solve this rank problem. <br />
<br />
The out-of-core and distributed experiments demonstrate XGBoost's ability to handle large datasets at high speed.<br />
<br />
= Conclusion=<br />
The author, Tianqi Chen, proposed a scalable end-to-end tree boosting system with cache-aware access patterns, a block structure, and other essential elements. He also introduced two novel split finding algorithms: a sparsity-aware algorithm and a weighted quantile sketch with provable guarantees. Compared to other popular systems, XGBoost gives state-of-the-art results across different experiment settings, which explains why it is considered one of the best and fastest algorithms for structured or tabular data.<br />
<br />
= Source =</div>
<hr />
<div>'''''XGBoost: A Scalable Tree Boosting System'''''<br />
<br />
== Presented by ==<br />
<br />
Jiang, Cong<br />
<br />
Song, Ziwei<br />
<br />
Ye, Zhaoshan<br />
<br />
Zhang, Wenling<br />
<br />
= Introduction =<br />
<br />
The Extreme Gradiant Boosting, also known as XGBoost, has been the leading model that dominates Kaggle, as well as other machine learning and data mining challenges. The main reason to its success is that this tree boosting system is highly scalable , which means it could be used to solve various problems with significantly less time and fewer resources. <br />
<br />
This paper is written by the father of XGBoost, Tianqi Chen. It gives us an overview on how he used algorithmic optimizations as well as some important systems to develop XGBoost. He explained it in the following manner:<br />
<br />
1. Based on gradient tree boosting algorithm, Chen makes some modifications to the regularized objective function and introduces an end-to-end tree boosting system. He also applies shrinkage and feature subsampling to prevent overfitting, <br />
<br />
2. The system supports existing split finding algorithms including exact greedy and approximate algorithms. For weighted data, Chen proposes a distributed weighted quantile sketch algorithm which is provable; while for sparse data, he develops a novel sparsity-aware algorithm which is much faster than the naive model.<br />
<br />
3. Storing the data in ''blocks'' is Chen's method to reduce sorting cost. The cache-aware block structure is highly effective for out-of-core tree learning.<br />
<br />
4. XGBoost is evaluated by designed experiments with 4 datasets. Chen compares it with adjusted setting as well as other systems to test its performance.<br />
<br />
= Tree Boosting=<br />
<br />
<br />
=== Regularized Learning Objective ===<br />
Suppose the given data set has m features and n examples. A tree ensemble model uses K additive functions to predict the output.<br />
[[File:eq1.png|center|eq1.png]]<br />
where F is the space of regression trees(CART).<br />
The author suggests the following regularized objective function to make improvements comparing to gradient boosting<br />
[[File:eq2.png|center|eq2.png]]<br />
Regression tree contains continuous scores on each leaf, which is different with decision trees we have learned in class. The following 2 figures represent how to use the decision rules in the trees to classify it into the leaves and calculate the final prediction by summing up the score in the corresponding leaves.<br />
<br />
[[File:Picture2.1.png|500px| thumb| center| Figure 0: Prediction Score of Tree Structure.]]<br />
<br />
<br />
[[File:Picture2.2.png|500px|thumb|center|Figure 1: Tree Ensemble Model. The final prediction for a given example is the sum of predictions from each tree. ]]<br />
<br />
=== Gradient Tree Boosting ===<br />
The tree ensemble model (Equation 2.) is trained in an additive manner. We need to add a new <math>f</math> to our objective.<br />
[[File:eq2.2.png|center|eq2.2.png]]<br />
After applying the second-order Taylor expansion to the above formula and removing constant terms, we can finally have<br />
[[File:eq3.png|center|eq3.png]]<br />
Define <math>I_{j} = \{i| q(X_{i}) \}</math><br />
As the instance set of leaf j. We can rewrite Equation (3) as<br />
[[File:eq4.png|center|eq4.png]]<br />
By computing the optimal weight and corresponding optimal value, we can use following equation as the scoring function which can measure the quality of a tree structure. Lower score means better tree structure.<br />
[[File:eq5.png|center|eq5.png]]<br />
The following figure shows an example how this score is calculated.<br />
[[File:Picture2.3.png|500px| thumb|center|Figure 2: Structure Score Calculation. We only need to sum up the gradient and second order gradient statistics on each leaf, then apply the scoring formula to get the quality score. ]]<br />
<br />
<br />
<br />
Normally it is impossible to enumerate all the possible tree structures. Assume <math>I_{L}</math> and <math>I_{R}</math> are the instance sets of left and right nodes after the split, and <math> I = I_{L} + I_{R}</math>. Then the loss reduction after the split is given by<br />
[[File:eq6.png|center]]<br />
<br />
=== Shrinkage and Column Subsampling ===<br />
In addition to the regularized objective, the paper also introduces two techniques to prevent over-fitting. <br />
<br />
Method 1: Shrinkage. After each step of tree boosting, we scale newly added weights by a shrinkage factor η, which reduces the influence of every individual tree and leaves the space for future trees. <br />
<br />
Method 2: Column (feature) subsampling. We try to build each individual tree with only a portion of predictors chosen randomly. This procedure encourages the variance between the trees (i.e. no same tree occurs). Implemented in an opensource packages, this method prevents over-fitting, compared with the traditional row subsampling, and speeds up the computations of parallel algorithm as well.<br />
<br />
= Split Finding Algorithms =<br />
The ways to find the best split can be realized by two main methods. <br />
<br />
Basic exact greedy algorithm: enumerate over all the possible splits on all the features, which is computationally demanding. Sort the data first, and visit the data to accumulate the gradient statistics greedily. <br />
<br />
[[File:exact_greedy.png|500px| thumb|center ]]<br />
[[File:appro_greedy.png|500px| thumb|center]]<br />
<br />
Approximate algorithm: when the data does not fit entirely into memory, the exact greedy algorithm becomes inefficient, so the author introduces an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of the feature distribution, then maps the continuous features into buckets delimited by these candidate points, aggregates the statistics per bucket, and finds the best solution among the proposals based on the aggregated statistics. <br />
The algorithm has two variants, differing in when the proposal is made. <br />
<br />
The global variant: all candidate splits are proposed at the initial phase of tree construction and are used for all split finding. It requires fewer proposal steps. <br />
<br />
The local variant: the candidate splits are refined after each split. It requires fewer candidate points, so it is potentially more appropriate for deeper trees. <br />
Methods can be chosen according to users’ needs. <br />
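The proposal step can be sketched as follows, under the simplifying assumption of plain (unweighted) percentiles; the paper's actual proposal uses the weighted quantile sketch described below, and the function name is hypothetical:<br />

```python
def propose_candidates(values, num_buckets):
    """Candidate split points at evenly spaced percentiles of one
    feature's distribution; continuous values are then bucketed
    between consecutive candidates."""
    s = sorted(values)
    n = len(s)
    # one interior candidate per percentile boundary q/num_buckets
    return [s[int(q * (n - 1) / num_buckets)] for q in range(1, num_buckets)]
```

In the global variant this is run once per tree; in the local variant it is re-run on the instances of each node after every split.<br />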
<br />
[[File:auc.png|500px| thumb|center]]<br />
<br />
=== Weighted Quantile Sketch ===<br />
Motivation: to propose candidate split points on weighted data, the approximate algorithm needs quantiles of a weighted distribution. Let the multi-set<br />
<math><br />
D_k = \{(x_{1k}, h_1), (x_{2k}, h_2), \ldots, (x_{nk}, h_n)\}<br />
</math><br />
represent the k-th feature values and second-order gradient statistics of the training instances. <br />
Then define a rank function <math><br />
r_k : \mathbb{R} \rightarrow [0, +\infty)<br />
</math> as<br />
<math><br />
r_k(z) = \frac{1}{\sum_{(x, h) \in D_k} h} \sum_{(x, h) \in D_k,\, x < z} h,<br />
</math><br />
the proportion of instances whose k-th feature value is smaller than <math>z</math>, with each instance weighted by its second-order gradient statistic <math>h</math> (the weight arises because the objective in Equation (3) can be rewritten as a weighted squared loss with weights <math>h_i</math>). The goal is to find candidate split points such that consecutive candidates differ in rank by less than an approximation factor <math>\epsilon</math>, giving roughly <math>1/\epsilon</math> candidates. Since existing quantile sketches only handle unweighted data, the paper proposes a distributed weighted quantile sketch with a provable theoretical guarantee.<br />
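A minimal Python sketch of this weighted rank function (hypothetical name; a real sketch data structure approximates it without storing all pairs):<br />

```python
def weighted_rank(Dk, z):
    """r_k(z): fraction of the total second-order weight h carried by
    instances whose k-th feature value is strictly below z.
    Dk is a list of (feature_value, h) pairs."""
    total = sum(h for _, h in Dk)
    below = sum(h for x, h in Dk if x < z)
    return below / total
```

Candidate split points are then chosen so that the weighted rank gap between consecutive candidates stays below the target ε.<br />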
<br />
=== Sparsity-aware Split Finding ===<br />
<br />
In real-world problems, it is quite common for the input <math> X </math> to be sparse. Common causes of sparsity are: 1. presence of missing values in the data; 2. frequent zero entries in the statistics; and 3. artifacts of feature engineering such as one-hot encoding.<br />
<br />
[[File:Figure 4.png|500px| thumb|center | Figure 4. Tree structure with default directions. An example will be classified into the default direction when the feature needed for the split is missing]]<br />
This paper proposes to add a default direction in each tree node, as shown in Figure 4. When a value is missing in the sparse matrix <math> X </math>, the instance is classified into the default direction. There are two possible default directions in each branch; the optimal one is learned from the data. Algorithm 3 shows the sparsity-aware split finding procedure. <br />
<br />
[[File:Algorithm 3.png|500px| thumb|center ]]<br />
The key improvement is to visit only the non-missing entries <math> I_k</math>. The presented algorithm treats non-presence as a missing value and learns the best direction in which to send missing values.<br />
XGBoost handles all sparsity patterns in a unified way. More importantly, the paper exploits the sparsity to make the computational complexity linear in the number of non-missing entries in the input. <br />
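For a single candidate split, learning the default direction amounts to trying both assignments of the missing-value statistics and keeping the better one. A hedged sketch (hypothetical names, unscaled gain as before):<br />

```python
def gain(gl, hl, gr, hr, G, H, lam=1.0):
    # unscaled structure-score improvement of a split
    return gl * gl / (hl + lam) + gr * gr / (hr + lam) - G * G / (H + lam)

def split_with_default(gl, hl, gr, hr, g_miss, h_miss, lam=1.0):
    """gl/hl and gr/hr are the statistics of the NON-missing instances
    on each side of the threshold; g_miss/h_miss are the statistics of
    the instances whose feature value is missing. Try sending them
    left, then right, and keep whichever direction scores higher."""
    G, H = gl + gr + g_miss, hl + hr + h_miss
    left = gain(gl + g_miss, hl + h_miss, gr, hr, G, H, lam)
    right = gain(gl, hl, gr + g_miss, hr + h_miss, G, H, lam)
    return ("left", left) if left >= right else ("right", right)
```

Only the non-missing entries are ever scanned; the missing instances enter the score as one aggregated lump, which is what keeps the cost linear in the non-missing entries.<br />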
[[File:Figure 5.png|500px| thumb|center | Figure 5. Impact of the sparsity aware algorithm on Allstate-10K. The dataset is sparse mainly due to one-hot encoding. The sparsity aware algorithm is more than 50 times faster than the naive version that does not take sparsity into consideration. ]]<br />
Figure 5 compares the sparsity-aware algorithm with a naive implementation on the Allstate-10K dataset. The results show that the sparsity-aware algorithm runs more than 50 times faster than the naive version, confirming its importance.<br />
<br />
= System Design =<br />
<br />
=== Column Blocks and Parallelization ===<br />
<br />
1. Feature values are sorted.<br />
2. A block contains one or more feature values.<br />
3. Instance indices are stored in blocks.<br />
4. Missing features are not stored.<br />
5. With column blocks, a parallel split finding algorithm is easy to design.<br />
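The column block idea can be sketched in pure Python (the real system stores this in a compressed CSC-like layout; names here are hypothetical):<br />

```python
def build_column_block(column):
    """Pre-sort one feature column once, storing (value, row_index)
    pairs. Split finding can then do a linear scan per feature at
    every node instead of re-sorting. Missing entries (None here)
    are simply not stored."""
    return sorted((v, i) for i, v in enumerate(column) if v is not None)
```

Because each feature's block is independent, the per-feature scans can run in parallel, which is what makes the parallel split finding algorithm easy to design.<br />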
<br />
=== Cache Aware Access ===<br />
A thread pre-fetches data from non-contiguous memory into a contiguous buffer.<br />
The main thread then accumulates the gradient statistics from the contiguous buffer.<br />
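A toy illustration of the access pattern (hypothetical names; in the real system the gather is done by a separate prefetching thread and the payoff comes from CPU cache behaviour, which Python cannot demonstrate):<br />

```python
def prefetch_gradients(g, row_index_block):
    """Gather the gradient statistics of a block of sorted row indices
    into one contiguous buffer, then accumulate by streaming that
    buffer instead of chasing scattered indices."""
    buf = [g[i] for i in row_index_block]  # non-contiguous -> contiguous
    return sum(buf)                        # accumulation streams the buffer
```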
<br />
=Evaluations=<br />
===System Implementation===<br />
Implemented as an open source package, XGBoost is flexible, portable and reusable. It also supports various languages and ecosystems.<br />
<br />
===Dataset and Setup===<br />
Table 2 summarizes the 4 datasets used in the experiments.<br />
[[File:t2.png|500px| thumb|center ]]<br />
<br />
<br />
The first three datasets are used in the single-machine parallel setting, run on a Dell PowerEdge R420 with two 8-core Intel Xeon processors and 64GB of memory. An AWS c3.8xlarge machine is used for the distributed and out-of-core experiments, which use the criteo dataset. All available cores of the machines are used in the experiments, unless otherwise specified.<br />
<br />
=== Experiments===<br />
The classification experiment uses the Higgs boson dataset to classify whether an event corresponds to the Higgs boson. The goal is to evaluate XGBoost using the exact greedy method and compare it with column-sampling XGBoost and two other well-known systems. Table 3 below summarizes the results.<br />
[[File:t3.png|500px| thumb|center ]]<br />
Overall, both XGBoost and scikit-learn achieve about 25% higher test AUC than R's gbm. In terms of running time, however, XGBoost is significantly faster than both scikit-learn and R's gbm. Note that XGBoost with column sampling performs slightly worse, likely because the number of important features in this dataset is limited. We can conclude that XGBoost using the exact greedy algorithm without column sampling works best in this scenario.<br />
<br />
The learning-to-rank experiment compares XGBoost (using the exact greedy algorithm) with pGBRT (using an approximate algorithm) to evaluate the system's performance on ranking problems.<br />
<br />
[[File:t4.png|500px| thumb|center ]]<br />
[[File:f10.png|500px| thumb|center ]]<br />
According to Table 4, XGBoost with column sampling performs almost as well as pGBRT while using only about 1/5 of the time. Note that pGBRT was the best previously published system for this ranking task. <br />
<br />
The out-of-core and distributed experiments demonstrate XGBoost's ability to handle very large data at high speed.<br />
<br />
= Conclusion=<br />
The author, Tianqi Chen, proposed a scalable end-to-end tree boosting system with cache-aware access patterns, a block structure, and other essential elements. He also introduced two novel split finding algorithms: a sparsity-aware algorithm and a weighted quantile sketch with provable guarantees. Compared to other popular systems, XGBoost gives state-of-the-art results in different experimental settings, which explains why it is considered one of the best and fastest algorithms for structured or tabular data.<br />
<br />
= Source =</div>
[[File:eq5.png|center|eq5.png]]<br />
The following figure shows an example how this score is calculated.<br />
[[File:Picture2.3.png|500px| thumb|center|Figure 2: Structure Score Calculation. We only need to sum up the gradient and second order gradient statistics on each leaf, then apply the scoring formula to get the quality score. ]]<br />
<br />
<br />
<br />
Normally it is impossible to enumerate all the possible tree structures. Assume <math>I_{L}</math> and <math>I_{R}</math> are the instance sets of left and right nodes after the split, and <math> I = I_{L} + I_{R}</math>. Then the loss reduction after the split is given by<br />
[[File:eq6.png|center]]<br />
<br />
=== Shrinkage and Column Subsampling ===<br />
In addition to the regularized objective, the paper also introduces two techniques to prevent over-fitting. <br />
<br />
Method 1: Shrinkage. After each step of tree boosting, we scale newly added weights by a shrinkage factor η, which reduces the influence of every individual tree and leaves the space for future trees. <br />
<br />
Method 2: Column (feature) subsampling. We try to build each individual tree with only a portion of predictors chosen randomly. This procedure encourages the variance between the trees (i.e. no same tree occurs). Implemented in an opensource packages, this method prevents over-fitting, compared with the traditional row subsampling, and speeds up the computations of parallel algorithm as well.<br />
<br />
= Split Finding Algorithms =<br />
The ways to find the best split can be realized by two main methods. <br />
<br />
Basic exact greedy algorithm: enumerate over all the possible splits on all the features, which is computationally demanding. Sort the data first, and visit the data to accumulate the gradient statistics greedily. <br />
<br />
[[File:exact_greedy.png|500px| thumb|center ]]<br />
[[File:appro_greedy.png|500px| thumb|center]]<br />
<br />
Approximate algorithm: To achieve an effective gradient tree boosting model, The author introduces an approximate algorithm. The algorithm first proposes candidate splitting points by the percentiles of feature distribution, maps the continuous features into buckets split and then finds the best solution among the proposals by the aggregated statistics. <br />
The algorithm can be varied by the time the proposal is given. <br />
<br />
The global variant: all candidate splits are proposed at the initial phase of tree construction and are used for all split finding. It requires fewer proposal steps. <br />
<br />
The local variant: the candidate splits are refined after each split. It requires fewer candidate points, so it is potentially more appropriate for deeper trees. <br />
Methods can be chosen according to users’ needs. <br />
<br />
[[File:auc.png|500px| thumb|center]]<br />
<br />
=== Weighted Quantile Sketch ===<br />
<math><br />
\mathbf{D}_k = { (\mathbf{x}_{1k}, \mathbf{h}_1), (\mathbf{x}_{2k}, \mathbf{h}_2) , ... (\mathbf{x}_{nk}, \mathbf{h}_n)}<br />
</math><br />
<br />
=== Sparsity--aware Split Finding ===<br />
<br />
In real world problem, it is quite common for the input <math> X </math> to be sparse. The reasons that causes for sparsity are following: 1. presence of missing values in the data; 2.frequent zero entries in the statistics; and, 3.artifacts of feature engineering such as one-hot encoding.<br />
<br />
[[File:Figure 4.png|500px| thumb|center | Figure 4. Tree structure with default directions. An example will be classified into the default direction when the feature needed for the split is missing]]<br />
This paper propose to add a default direction in each tree node, which is shown in Figure 4. When a value is missing in the sparse matrix <math> X </math>, the instance is classified into the default direction. There are two choices of default direction in each branch. The optimal default direction is learn from data. Algorithm 3 shows Sparsity aware Split Finding. <br />
<br />
[[File:Algorithm 3.png|500px| thumb|center ]]<br />
The important improvement is to only visit the non-missing entries <math> I_k</math>. The presented algorithm treats the non-presence as a missing value and learns the best direction to handle missing values.<br />
XGBoost handles all sparsity patterns in a unified way. More importantly, this paper exploits the sparsity to make computation complexity linear to number of non-missing entries in the input. <br />
[[File:Figure 5.png|500px| thumb|center | Figure 5. Impact of the sparsity aware algorithm on Allstate-10K. The dataset is sparse mainly due to one-hot encoding. The sparsity aware algorithm is more than 50 times faster than the naive version that does not take sparsity into consideration. ]]<br />
Figure 5 shows the comparison of sparsity aware and a naive implementation on an Allstate-10K dataset. The results shows that sparsity aware algorithm runs 50 times faster than the naive version. This confirms the importance of the sparsity aware algorithm.<br />
<br />
=Evaluations=<br />
===System Implementation===<br />
Implemented as an open source package, XGBoost is flexible, portable and reusable. It also supports various languages and ecosystems.<br />
<br />
===Dataset and Setup===<br />
Table 2 summarizes the 4 datasets used in the experiments.<br />
[[File:t2.png|500px| thumb|center ]]<br />
<br />
<br />
The first three datasets are used for the single machine parallel setting, which is run on a Dell PowerEdge R420 with two 8-core Intel Xeon and memory 64GB. Meanwhile, AWS c3.8xlarge machine is used for experiment in distributed and the out-of-core setting, which uses the criteo dataset. All available cores in the machines are devoted into the experiments, unless otherwise specified.<br />
<br />
=== Experiments===<br />
The classification experiment uses Higgs Boson dataset to classify whether an event corresponds to the Higgs boson. The goal is to evaluate XGBoost using Exact Greedy Methods and compare it with column sampling XGBoost and two other well-known systems. Table 3 below summarizes the results.<br />
[[File:t3.png|500px| thumb|center ]]<br />
Overall, both XGBoost and scikit-learn performs better than R.gbm by 25% of Test AUC. In terms of time, however, XGBoost runs significantly faster than both scikit-learn and R.gbm. Note that XGBoost with column sampling performs slightly worse because the number of key features is limited. We could conclude that XGBoost using the exact greedy algorithm without column sampling works the best under this scenario.<br />
<br />
The learning to rank experiment compares XGBoost(using exact greedy algorithm) with pGBRT(using approximate algorithm) in order to evaluate the system's performance on rank problem.<br />
<br />
[[File:t4.png|500px| thumb|center ]]<br />
[[File:f10.png|500px| thumb|center ]]<br />
According to Table 4, XGBoost with column sampling performs almost as well as pGBRT, while using only 1/5 of the time. Note that pGBRT is the best previously published system to solve this rank problem. <br />
<br />
The out-of-score and distributed experiments demonstrate XGBoost's ability to handle large data with a fast speed.<br />
<br />
= Conclusion=<br />
The writer, Tianqi Chen, proposed a scalable end-to-end tree boosting system with cache access patterns, block structure and other essential elements. He also introduced two novel split finding algorithms, one is sparsity-aware algorithm and the other is weighted quantile sketch with proofs. Compared to other popular systems, XGBoost gives state-of-the-art results in different experiment settings, which explains why it it considered one of the best and fastest algorithm for structured or tabular data.<br />
<br />
= Source =</div>C55jianghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=summary&diff=34521summary2018-03-17T16:07:24Z<p>C55jiang: /* Sparsity--aware Split Finding */</p>
<hr />
<div>'''''XGBoost: A Scalable Tree Boosting System'''''<br />
<br />
== Presented by ==<br />
<br />
Jiang, Cong<br />
<br />
Song, Ziwei<br />
<br />
Ye, Zhaoshan<br />
<br />
Zhang, Wenling<br />
<br />
= Introduction =<br />
<br />
The Extreme Gradiant Boosting, also known as XGBoost, has been the leading model that dominates Kaggle, as well as other machine learning and data mining challenges. The main reason to its success is that this tree boosting system is highly scalable , which means it could be used to solve various problems with significantly less time and fewer resources. <br />
<br />
This paper is written by the father of XGBoost, Tianqi Chen. It gives us an overview on how he used algorithmic optimizations as well as some important systems to develop XGBoost. He explained it in the following manner:<br />
<br />
1. Based on gradient tree boosting algorithm, Chen makes some modifications to the regularized objective function and introduces an end-to-end tree boosting system. He also applies shrinkage and feature subsampling to prevent overfitting, <br />
<br />
2. The system supports existing split finding algorithms including exact greedy and approximate algorithms. For weighted data, Chen proposes a distributed weighted quantile sketch algorithm which is provable; while for sparse data, he develops a novel sparsity-aware algorithm which is much faster than the naive model.<br />
<br />
3. Storing the data in ''blocks'' is Chen's method to reduce sorting cost. The cache-aware block structure is highly effective for out-of-core tree learning.<br />
<br />
4. XGBoost is evaluated by designed experiments with 4 datasets. Chen compares it with adjusted setting as well as other systems to test its performance.<br />
<br />
= Tree Boosting=<br />
<br />
<br />
=== Regularized Learning Objective ===<br />
Suppose the given data set has m features and n examples. A tree ensemble model uses K additive functions to predict the output.<br />
[[File:eq1.png|center|eq1.png]]<br />
where F is the space of regression trees(CART).<br />
The author suggests the following regularized objective function to make improvements comparing to gradient boosting<br />
[[File:eq2.png|center|eq2.png]]<br />
Regression tree contains continuous scores on each leaf, which is different with decision trees we have learned in class. The following 2 figures represent how to use the decision rules in the trees to classify it into the leaves and calculate the final prediction by summing up the score in the corresponding leaves.<br />
<br />
[[File:Picture2.1.png|500px| thumb| center| Figure 0: Prediction Score of Tree Structure.]]<br />
<br />
<br />
[[File:Picture2.2.png|500px|thumb|center|Figure 1: Tree Ensemble Model. The final prediction for a given example is the sum of predictions from each tree. ]]<br />
<br />
=== Gradient Tree Boosting ===<br />
The tree ensemble model (Equation 2.) is trained in an additive manner. We need to add a new <math>f</math> to our objective.<br />
[[File:eq2.2.png|center|eq2.2.png]]<br />
After applying the second-order Taylor expansion to the above formula and removing constant terms, we can finally have<br />
[[File:eq3.png|center|eq3.png]]<br />
Define <math>I_{j} = \{i| q(X_{i}) \}</math><br />
As the instance set of leaf j. We can rewrite Equation (3) as<br />
[[File:eq4.png|center|eq4.png]]<br />
By computing the optimal weight and corresponding optimal value, we can use following equation as the scoring function which can measure the quality of a tree structure. Lower score means better tree structure.<br />
[[File:eq5.png|center|eq5.png]]<br />
The following figure shows an example how this score is calculated.<br />
[[File:Picture2.3.png|500px| thumb|center|Figure 2: Structure Score Calculation. We only need to sum up the gradient and second order gradient statistics on each leaf, then apply the scoring formula to get the quality score. ]]<br />
<br />
<br />
<br />
Normally it is impossible to enumerate all the possible tree structures. Assume <math>I_{L}</math> and <math>I_{R}</math> are the instance sets of left and right nodes after the split, and <math> I = I_{L} + I_{R}</math>. Then the loss reduction after the split is given by<br />
[[File:eq6.png|center]]<br />
<br />
=== Shrinkage and Column Subsampling ===<br />
In addition to the regularized objective, the paper also introduces two techniques to prevent over-fitting. <br />
<br />
Method 1: Shrinkage. After each step of tree boosting, we scale newly added weights by a shrinkage factor η, which reduces the influence of every individual tree and leaves the space for future trees. <br />
<br />
Method 2: Column (feature) subsampling. We try to build each individual tree with only a portion of predictors chosen randomly. This procedure encourages the variance between the trees (i.e. no same tree occurs). Implemented in an opensource packages, this method prevents over-fitting, compared with the traditional row subsampling, and speeds up the computations of parallel algorithm as well.<br />
<br />
= Split Finding Algorithms =<br />
The ways to find the best split can be realized by two main methods. <br />
<br />
Basic exact greedy algorithm: enumerate over all the possible splits on all the features, which is computationally demanding. Sort the data first, and visit the data to accumulate the gradient statistics greedily. <br />
<br />
[[File:exact_greedy.png|500px| thumb|center ]]<br />
[[File:appro_greedy.png|500px| thumb|center]]<br />
<br />
Approximate algorithm: To achieve an effective gradient tree boosting model, The author introduces an approximate algorithm. The algorithm first proposes candidate splitting points by the percentiles of feature distribution, maps the continuous features into buckets split and then finds the best solution among the proposals by the aggregated statistics. <br />
The algorithm can be varied by the time the proposal is given. <br />
<br />
The global variant: all candidate splits are proposed at the initial phase of tree construction and are used for all split finding. It requires fewer proposal steps. <br />
<br />
The local variant: the candidate splits are refined after each split. It requires fewer candidate points, so it is potentially more appropriate for deeper trees. <br />
Methods can be chosen according to users’ needs. <br />
<br />
[[File:auc.png|500px| thumb|center]]<br />
<br />
=== Weighted Quantile Sketch ===<br />
<br />
=== Sparsity--aware Split Finding ===<br />
<br />
In real world problem, it is quite common for the input <math> X </math> to be sparse. The reasons that causes for sparsity are following: 1. presence of missing values in the data; 2.frequent zero entries in the statistics; and, 3.artifacts of feature engineering such as one-hot encoding.<br />
<br />
[[File:Figure 4.png|500px| thumb|center | Figure 4. Tree structure with default directions. An example will be classified into the default direction when the feature needed for the split is missing]]<br />
This paper propose to add a default direction in each tree node, which is shown in Figure 4. When a value is missing in the sparse matrix <math> X </math>, the instance is classified into the default direction. There are two choices of default direction in each branch. The optimal default direction is learn from data. Algorithm 3 shows Sparsity aware Split Finding. <br />
<br />
[[File:Algorithm 3.png|500px| thumb|center ]]<br />
The important improvement is to only visit the non-missing entries <math> I_k</math>. The presented algorithm treats the non-presence as a missing value and learns the best direction to handle missing values.<br />
XGBoost handles all sparsity patterns in a unified way. More importantly, this paper exploits the sparsity to make computation complexity linear to number of non-missing entries in the input. <br />
[[File:Figure 5.png|500px| thumb|center | Figure 5. Impact of the sparsity aware algorithm on Allstate-10K. The dataset is sparse mainly due to one-hot encoding. The sparsity aware algorithm is more than 50 times faster than the naive version that does not take sparsity into consideration. ]]<br />
Figure 5 shows the comparison of sparsity aware and a naive implementation on an Allstate-10K dataset. The results shows that sparsity aware algorithm runs 50 times faster than the naive version. This confirms the importance of the sparsity aware algorithm.<br />
<br />
=Evaluations=<br />
===System Implementation===<br />
Implemented as an open source package, XGBoost is flexible, portable and reusable. It also supports various languages and ecosystems.<br />
<br />
===Dataset and Setup===<br />
Table 2 summarizes the 4 datasets used in the experiments.<br />
[[File:t2.png|500px| thumb|center ]]<br />
<br />
<br />
The first three datasets are used for the single machine parallel setting, which is run on a Dell PowerEdge R420 with two 8-core Intel Xeon and memory 64GB. Meanwhile, AWS c3.8xlarge machine is used for experiment in distributed and the out-of-core setting, which uses the criteo dataset. All available cores in the machines are devoted into the experiments, unless otherwise specified.<br />
<br />
=== Experiments===<br />
The classification experiment uses Higgs Boson dataset to classify whether an event corresponds to the Higgs boson. The goal is to evaluate XGBoost using Exact Greedy Methods and compare it with column sampling XGBoost and two other well-known systems. Table 3 below summarizes the results.<br />
[[File:t3.png|500px| thumb|center ]]<br />
Overall, both XGBoost and scikit-learn performs better than R.gbm by 25% of Test AUC. In terms of time, however, XGBoost runs significantly faster than both scikit-learn and R.gbm. Note that XGBoost with column sampling performs slightly worse because the number of key features is limited. We could conclude that XGBoost using the exact greedy algorithm without column sampling works the best under this scenario.<br />
<br />
The learning to rank experiment compares XGBoost(using exact greedy algorithm) with pGBRT(using approximate algorithm) in order to evaluate the system's performance on rank problem.<br />
<br />
[[File:t4.png|500px| thumb|center ]]<br />
[[File:f10.png|500px| thumb|center ]]<br />
According to Table 4, XGBoost with column sampling performs almost as well as pGBRT, while using only 1/5 of the time. Note that pGBRT is the best previously published system to solve this rank problem. <br />
<br />
The out-of-score and distributed experiments demonstrate XGBoost's ability to handle large data with a fast speed.<br />
<br />
= Conclusion=<br />
The writer, Tianqi Chen, proposed a scalable end-to-end tree boosting system with cache access patterns, block structure and other essential elements. He also introduced two novel split finding algorithms, one is sparsity-aware algorithm and the other is weighted quantile sketch with proofs. Compared to other popular systems, XGBoost gives state-of-the-art results in different experiment settings, which explains why it it considered one of the best and fastest algorithm for structured or tabular data.<br />
<br />
= Source =</div>C55jianghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=summary&diff=34520summary2018-03-17T16:06:42Z<p>C55jiang: /* Sparsity--aware Split Finding */</p>
<hr />
<div>'''''XGBoost: A Scalable Tree Boosting System'''''<br />
<br />
== Presented by ==<br />
<br />
Jiang, Cong<br />
<br />
Song, Ziwei<br />
<br />
Ye, Zhaoshan<br />
<br />
Zhang, Wenling<br />
<br />
= Introduction =<br />
<br />
The Extreme Gradiant Boosting, also known as XGBoost, has been the leading model that dominates Kaggle, as well as other machine learning and data mining challenges. The main reason to its success is that this tree boosting system is highly scalable , which means it could be used to solve various problems with significantly less time and fewer resources. <br />
<br />
This paper is written by the father of XGBoost, Tianqi Chen. It gives us an overview on how he used algorithmic optimizations as well as some important systems to develop XGBoost. He explained it in the following manner:<br />
<br />
1. Based on gradient tree boosting algorithm, Chen makes some modifications to the regularized objective function and introduces an end-to-end tree boosting system. He also applies shrinkage and feature subsampling to prevent overfitting, <br />
<br />
2. The system supports existing split finding algorithms including exact greedy and approximate algorithms. For weighted data, Chen proposes a distributed weighted quantile sketch algorithm which is provable; while for sparse data, he develops a novel sparsity-aware algorithm which is much faster than the naive model.<br />
<br />
3. Storing the data in ''blocks'' is Chen's method to reduce sorting cost. The cache-aware block structure is highly effective for out-of-core tree learning.<br />
<br />
4. XGBoost is evaluated by designed experiments with 4 datasets. Chen compares it with adjusted setting as well as other systems to test its performance.<br />
<br />
= Tree Boosting=<br />
<br />
<br />
=== Regularized Learning Objective ===<br />
Suppose the given data set has m features and n examples. A tree ensemble model uses K additive functions to predict the output.<br />
[[File:eq1.png|center|eq1.png]]<br />
where F is the space of regression trees(CART).<br />
The author suggests the following regularized objective function to make improvements comparing to gradient boosting<br />
[[File:eq2.png|center|eq2.png]]<br />
Regression tree contains continuous scores on each leaf, which is different with decision trees we have learned in class. The following 2 figures represent how to use the decision rules in the trees to classify it into the leaves and calculate the final prediction by summing up the score in the corresponding leaves.<br />
<br />
[[File:Picture2.1.png|500px| thumb| center| Figure 0: Prediction Score of Tree Structure.]]<br />
<br />
<br />
[[File:Picture2.2.png|500px|thumb|center|Figure 1: Tree Ensemble Model. The final prediction for a given example is the sum of predictions from each tree. ]]<br />
<br />
=== Gradient Tree Boosting ===<br />
The tree ensemble model (Equation 2.) is trained in an additive manner. We need to add a new <math>f</math> to our objective.<br />
[[File:eq2.2.png|center|eq2.2.png]]<br />
After applying the second-order Taylor expansion to the above formula and removing constant terms, we can finally have<br />
[[File:eq3.png|center|eq3.png]]<br />
Define <math>I_{j} = \{i| q(X_{i}) \}</math><br />
As the instance set of leaf j. We can rewrite Equation (3) as<br />
[[File:eq4.png|center|eq4.png]]<br />
By computing the optimal weight and corresponding optimal value, we can use following equation as the scoring function which can measure the quality of a tree structure. Lower score means better tree structure.<br />
[[File:eq5.png|center|eq5.png]]<br />
The following figure shows an example how this score is calculated.<br />
[[File:Picture2.3.png|500px| thumb|center|Figure 2: Structure Score Calculation. We only need to sum up the gradient and second order gradient statistics on each leaf, then apply the scoring formula to get the quality score. ]]<br />
<br />
<br />
<br />
Normally it is impossible to enumerate all the possible tree structures. Assume <math>I_{L}</math> and <math>I_{R}</math> are the instance sets of left and right nodes after the split, and <math> I = I_{L} + I_{R}</math>. Then the loss reduction after the split is given by<br />
[[File:eq6.png|center]]<br />
<br />
=== Shrinkage and Column Subsampling ===<br />
In addition to the regularized objective, the paper also introduces two techniques to prevent over-fitting. <br />
<br />
Method 1: Shrinkage. After each step of tree boosting, we scale newly added weights by a shrinkage factor η, which reduces the influence of every individual tree and leaves the space for future trees. <br />
<br />
Method 2: Column (feature) subsampling. We try to build each individual tree with only a portion of predictors chosen randomly. This procedure encourages the variance between the trees (i.e. no same tree occurs). Implemented in an opensource packages, this method prevents over-fitting, compared with the traditional row subsampling, and speeds up the computations of parallel algorithm as well.<br />
<br />
= Split Finding Algorithms =<br />
The ways to find the best split can be realized by two main methods. <br />
<br />
Basic exact greedy algorithm: enumerate over all the possible splits on all the features, which is computationally demanding. Sort the data first, and visit the data to accumulate the gradient statistics greedily. <br />
<br />
[[File:exact_greedy.png|500px| thumb|center ]]<br />
[[File:appro_greedy.png|500px| thumb|center]]<br />
<br />
Approximate algorithm: To achieve an effective gradient tree boosting model, The author introduces an approximate algorithm. The algorithm first proposes candidate splitting points by the percentiles of feature distribution, maps the continuous features into buckets split and then finds the best solution among the proposals by the aggregated statistics. <br />
The algorithm can be varied by the time the proposal is given. <br />
<br />
The global variant: all candidate splits are proposed at the initial phase of tree construction and are used for all split finding. It requires fewer proposal steps. <br />
<br />
The local variant: the candidate splits are refined after each split. It requires fewer candidate points, so it is potentially more appropriate for deeper trees. <br />
Methods can be chosen according to users’ needs. <br />
<br />
[[File:auc.png|500px| thumb|center]]<br />
<br />
=== Weighted Quantile Sketch ===<br />
<br />
=== Sparsity--aware Split Finding ===<br />
<br />
In real world problem, it is quite common for the input <math> X </math> to be sparse. The reasons that causes for sparsity are following: 1. presence of missing values in the data; 2.frequent zero entries in the statistics; and, 3.artifacts of feature engineering such as one-hot encoding.<br />
<br />
[[File:Figure 4.png|500px| thumb|center | Figure 4. Tree structure with default directions. An example will be classified into the default direction when the feature needed for the split is missing]]<br />
This paper propose to add a default direction in each tree node, which is shown in Figure 4. When a value is missing in the sparse matrix <math> X </math>, the instance is classified into the default direction. There are two choices of default direction in each branch. The optimal default direction is learn from data. Algorithm 3 shows Sparsity aware Split Finding. <br />
<br />
[[File:Algorithm 3.png|500px| thumb|center ]]<br />
The important improvement is to only visit the non-missing entries <math> I_k</math>. The presented algorithm treats the non-presence as a missing value and learns the best direction to handle missing values.<br />
XGBoost handles all sparsity patterns in a unified way. More importantly, this paper exploits the sparsity to make computation complexity linear to number of non-missing entries in the input. <br />
[[File:Figure 5.png|500px| thumb|center | Figure 5. Impact of the sparsity aware algorithm on Allstate-10K. The dataset is sparse mainly due to one-hot encoding. The sparsity aware algorithm is more than 50 times faster than the naive version that does not take sparsity into consideration. ]]<br />
Figure 5 shows the comparison of sparsity aware and a naive implementation on an Allstate-10K dataset. The results shows that sparsity aware algorithm runs 50 times faster than the naive version.<br />
<br />
=Evaluations=<br />
===System Implementation===<br />
Implemented as an open source package, XGBoost is flexible, portable and reusable. It also supports various languages and ecosystems.<br />
<br />
===Dataset and Setup===<br />
Table 2 summarizes the 4 datasets used in the experiments.<br />
[[File:t2.png|500px| thumb|center ]]<br />
<br />
<br />
The first three datasets are used for the single machine parallel setting, which is run on a Dell PowerEdge R420 with two 8-core Intel Xeon and memory 64GB. Meanwhile, AWS c3.8xlarge machine is used for experiment in distributed and the out-of-core setting, which uses the criteo dataset. All available cores in the machines are devoted into the experiments, unless otherwise specified.<br />
<br />
=== Experiments===<br />
The classification experiment uses Higgs Boson dataset to classify whether an event corresponds to the Higgs boson. The goal is to evaluate XGBoost using Exact Greedy Methods and compare it with column sampling XGBoost and two other well-known systems. Table 3 below summarizes the results.<br />
[[File:t3.png|500px| thumb|center ]]<br />
Overall, both XGBoost and scikit-learn performs better than R.gbm by 25% of Test AUC. In terms of time, however, XGBoost runs significantly faster than both scikit-learn and R.gbm. Note that XGBoost with column sampling performs slightly worse because the number of key features is limited. We could conclude that XGBoost using the exact greedy algorithm without column sampling works the best under this scenario.<br />
<br />
The learning to rank experiment compares XGBoost(using exact greedy algorithm) with pGBRT(using approximate algorithm) in order to evaluate the system's performance on rank problem.<br />
<br />
[[File:t4.png|500px| thumb|center ]]<br />
[[File:f10.png|500px| thumb|center ]]<br />
According to Table 4, XGBoost with column sampling performs almost as well as pGBRT, while using only 1/5 of the time. Note that pGBRT is the best previously published system to solve this rank problem. <br />
<br />
The out-of-score and distributed experiments demonstrate XGBoost's ability to handle large data with a fast speed.<br />
<br />
= Conclusion=<br />
The author, Tianqi Chen, proposed a scalable end-to-end tree boosting system with cache-aware access patterns, a block structure and other essential elements. He also introduced two novel split finding algorithms: a sparsity-aware algorithm and a weighted quantile sketch with provable guarantees. Compared to other popular systems, XGBoost gives state-of-the-art results in different experimental settings, which explains why it is considered one of the best and fastest algorithms for structured or tabular data.<br />
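The weighted quantile sketch mentioned above can be motivated with a toy exact computation (an illustrative sketch only: the paper's contribution is a mergeable, prunable summary structure with provable error bounds, whereas this does a full sorted scan). Each instance is weighted by its second-order gradient h, and candidates are placed so that roughly equal amounts of total weight fall between consecutive candidates:<br />

```python
# Toy weighted-quantile candidate selection (exact scan, no sketching).
# Instance i carries weight h_i; adjacent candidates are at most eps of
# the total weight apart.

def weighted_quantile_candidates(x, h, eps):
    """Return candidate split values at weighted-rank steps of eps*total."""
    pairs = sorted(zip(x, h))
    total = sum(h)
    candidates = []
    acc = 0.0
    next_rank = eps * total
    for value, weight in pairs:
        acc += weight
        if acc >= next_rank:
            candidates.append(value)
            next_rank += eps * total
    return candidates

# Example: uniform weights and eps = 0.25 give roughly quartile boundaries.
x = [5.0, 1.0, 3.0, 2.0, 4.0, 8.0, 7.0, 6.0]
h = [1.0] * 8
cands = weighted_quantile_candidates(x, h, 0.25)
```

With non-uniform h the candidates shift toward regions carrying more gradient weight, which is the point of weighting the quantiles.<br />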
<br />
= Source =</div>C55jianghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=summary&diff=34519summary2018-03-17T16:02:11Z<p>C55jiang: /* Sparsity--aware Split Finding */</p>
<hr />
<div>'''''XGBoost: A Scalable Tree Boosting System'''''<br />
<br />
== Presented by ==<br />
<br />
Jiang, Cong<br />
<br />
Song, Ziwei<br />
<br />
Ye, Zhaoshan<br />
<br />
Zhang, Wenling<br />
<br />
= Introduction =<br />
<br />
The Extreme Gradiant Boosting, also known as XGBoost, has been the leading model that dominates Kaggle, as well as other machine learning and data mining challenges. The main reason to its success is that this tree boosting system is highly scalable , which means it could be used to solve various problems with significantly less time and fewer resources. <br />
<br />
This paper is written by the father of XGBoost, Tianqi Chen. It gives us an overview on how he used algorithmic optimizations as well as some important systems to develop XGBoost. He explained it in the following manner:<br />
<br />
1. Based on gradient tree boosting algorithm, Chen makes some modifications to the regularized objective function and introduces an end-to-end tree boosting system. He also applies shrinkage and feature subsampling to prevent overfitting, <br />
<br />
2. The system supports existing split finding algorithms including exact greedy and approximate algorithms. For weighted data, Chen proposes a distributed weighted quantile sketch algorithm which is provable; while for sparse data, he develops a novel sparsity-aware algorithm which is much faster than the naive model.<br />
<br />
3. Storing the data in ''blocks'' is Chen's method to reduce sorting cost. The cache-aware block structure is highly effective for out-of-core tree learning.<br />
<br />
4. XGBoost is evaluated by designed experiments with 4 datasets. Chen compares it with adjusted setting as well as other systems to test its performance.<br />
<br />
= Tree Boosting=<br />
<br />
<br />
=== Regularized Learning Objective ===<br />
Suppose the given data set has m features and n examples. A tree ensemble model uses K additive functions to predict the output.<br />
[[File:eq1.png|center|eq1.png]]<br />
where F is the space of regression trees(CART).<br />
The author suggests the following regularized objective function to make improvements comparing to gradient boosting<br />
[[File:eq2.png|center|eq2.png]]<br />
Regression tree contains continuous scores on each leaf, which is different with decision trees we have learned in class. The following 2 figures represent how to use the decision rules in the trees to classify it into the leaves and calculate the final prediction by summing up the score in the corresponding leaves.<br />
<br />
[[File:Picture2.1.png|500px| thumb| center| Figure 0: Prediction Score of Tree Structure.]]<br />
<br />
<br />
[[File:Picture2.2.png|500px|thumb|center|Figure 1: Tree Ensemble Model. The final prediction for a given example is the sum of predictions from each tree. ]]<br />
<br />
=== Gradient Tree Boosting ===<br />
The tree ensemble model (Equation 2.) is trained in an additive manner. We need to add a new <math>f</math> to our objective.<br />
[[File:eq2.2.png|center|eq2.2.png]]<br />
After applying the second-order Taylor expansion to the above formula and removing constant terms, we can finally have<br />
[[File:eq3.png|center|eq3.png]]<br />
Define <math>I_{j} = \{i| q(X_{i}) \}</math><br />
As the instance set of leaf j. We can rewrite Equation (3) as<br />
[[File:eq4.png|center|eq4.png]]<br />
By computing the optimal weight and corresponding optimal value, we can use following equation as the scoring function which can measure the quality of a tree structure. Lower score means better tree structure.<br />
[[File:eq5.png|center|eq5.png]]<br />
The following figure shows an example how this score is calculated.<br />
[[File:Picture2.3.png|500px| thumb|center|Figure 2: Structure Score Calculation. We only need to sum up the gradient and second order gradient statistics on each leaf, then apply the scoring formula to get the quality score. ]]<br />
<br />
<br />
<br />
Normally it is impossible to enumerate all the possible tree structures. Assume <math>I_{L}</math> and <math>I_{R}</math> are the instance sets of left and right nodes after the split, and <math> I = I_{L} + I_{R}</math>. Then the loss reduction after the split is given by<br />
[[File:eq6.png|center]]<br />
<br />
=== Shrinkage and Column Subsampling ===<br />
In addition to the regularized objective, the paper also introduces two techniques to prevent over-fitting. <br />
<br />
Method 1: Shrinkage. After each step of tree boosting, we scale newly added weights by a shrinkage factor η, which reduces the influence of every individual tree and leaves the space for future trees. <br />
<br />
Method 2: Column (feature) subsampling. We try to build each individual tree with only a portion of predictors chosen randomly. This procedure encourages the variance between the trees (i.e. no same tree occurs). Implemented in an opensource packages, this method prevents over-fitting, compared with the traditional row subsampling, and speeds up the computations of parallel algorithm as well.<br />
<br />
= Split Finding Algorithms =<br />
The ways to find the best split can be realized by two main methods. <br />
<br />
Basic exact greedy algorithm: enumerate over all the possible splits on all the features, which is computationally demanding. Sort the data first, and visit the data to accumulate the gradient statistics greedily. <br />
<br />
[[File:exact_greedy.png|500px| thumb|center ]]<br />
[[File:appro_greedy.png|500px| thumb|center]]<br />
<br />
Approximate algorithm: To achieve an effective gradient tree boosting model, The author introduces an approximate algorithm. The algorithm first proposes candidate splitting points by the percentiles of feature distribution, maps the continuous features into buckets split and then finds the best solution among the proposals by the aggregated statistics. <br />
The algorithm can be varied by the time the proposal is given. <br />
<br />
The global variant: all candidate splits are proposed at the initial phase of tree construction and are used for all split finding. It requires fewer proposal steps. <br />
<br />
The local variant: the candidate splits are refined after each split. It requires fewer candidate points, so it is potentially more appropriate for deeper trees. <br />
Methods can be chosen according to users’ needs. <br />
<br />
[[File:auc.png|500px| thumb|center]]<br />
<br />
=== Weighted Quantile Sketch ===<br />
<br />
=== Sparsity--aware Split Finding ===<br />
<br />
In real world problem, it is quite common for the input <math> X </math> to be sparse. The reasons that causes for sparsity are following: 1. presence of missing values in the data; 2.frequent zero entries in the statistics; and, 3.artifacts of feature engineering such as one-hot encoding.<br />
<br />
[[File:Figure 4.png|500px| thumb|center | Figure 4. Tree structure with default directions. An example will be classified into the default direction when the feature needed for the split is missing]]<br />
This paper propose to add a default direction in each tree node, which is shown in Figure 4. When a value is missing in the sparse matrix <math> X </math>, the instance is classified into the default direction. There are two choices of default direction in each branch. The optimal default direction is learn from data. Algorithm 3 shows Sparsity aware Split Finding. <br />
<br />
[[File:Algorithm 3.png|500px| thumb|center ]]<br />
The important improvement is to only visit the non-missing entries <math> I_k</math>. The presented algorithm treats the non-presence as a missing value and learns the best direction to handle missing values.<br />
XGBoost handles all sparsity patterns in a unified way. More importantly, this paper exploits the sparsity to make computation complexity linear to number of non-missing entries in the input. <br />
[[File:Figure 5.png|500px| thumb|center | Figure 5. ]]<br />
<br />
=Evaluations=<br />
===System Implementation===<br />
Implemented as an open source package, XGBoost is flexible, portable and reusable. It also supports various languages and ecosystems.<br />
<br />
===Dataset and Setup===<br />
Table 2 summarizes the 4 datasets used in the experiments.<br />
[[File:t2.png|500px| thumb|center ]]<br />
<br />
<br />
The first three datasets are used for the single machine parallel setting, which is run on a Dell PowerEdge R420 with two 8-core Intel Xeon and memory 64GB. Meanwhile, AWS c3.8xlarge machine is used for experiment in distributed and the out-of-core setting, which uses the criteo dataset. All available cores in the machines are devoted into the experiments, unless otherwise specified.<br />
<br />
=== Experiments===<br />
The classification experiment uses Higgs Boson dataset to classify whether an event corresponds to the Higgs boson. The goal is to evaluate XGBoost using Exact Greedy Methods and compare it with column sampling XGBoost and two other well-known systems. Table 3 below summarizes the results.<br />
[[File:t3.png|500px| thumb|center ]]<br />
Overall, both XGBoost and scikit-learn performs better than R.gbm by 25% of Test AUC. In terms of time, however, XGBoost runs significantly faster than both scikit-learn and R.gbm. Note that XGBoost with column sampling performs slightly worse because the number of key features is limited. We could conclude that XGBoost using the exact greedy algorithm without column sampling works the best under this scenario.<br />
<br />
The learning to rank experiment compares XGBoost(using exact greedy algorithm) with pGBRT(using approximate algorithm) in order to evaluate the system's performance on rank problem.<br />
<br />
[[File:t4.png|500px| thumb|center ]]<br />
[[File:f10.png|500px| thumb|center ]]<br />
According to Table 4, XGBoost with column sampling performs almost as well as pGBRT, while using only 1/5 of the time. Note that pGBRT is the best previously published system to solve this rank problem. <br />
<br />
The out-of-score and distributed experiments demonstrate XGBoost's ability to handle large data with a fast speed.<br />
<br />
= Conclusion=<br />
The writer, Tianqi Chen, proposed a scalable end-to-end tree boosting system with cache access patterns, block structure and other essential elements. He also introduced two novel split finding algorithms, one is sparsity-aware algorithm and the other is weighted quantile sketch with proofs. Compared to other popular systems, XGBoost gives state-of-the-art results in different experiment settings, which explains why it it considered one of the best and fastest algorithm for structured or tabular data.<br />
<br />
= Source =</div>C55jianghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=summary&diff=34516summary2018-03-17T15:30:28Z<p>C55jiang: /* Sparsity--aware Split Finding */</p>
<hr />
<div>'''''XGBoost: A Scalable Tree Boosting System'''''<br />
<br />
== Presented by ==<br />
<br />
Jiang, Cong<br />
<br />
Song, Ziwei<br />
<br />
Ye, Zhaoshan<br />
<br />
Zhang, Wenling<br />
<br />
= Introduction =<br />
<br />
The Extreme Gradiant Boosting, also known as XGBoost, has been the leading model that dominates Kaggle, as well as other machine learning and data mining challenges. The main reason to its success is that this tree boosting system is highly scalable , which means it could be used to solve various problems with significantly less time and fewer resources. <br />
<br />
This paper is written by the father of XGBoost, Tianqi Chen. It gives us an overview on how he used algorithmic optimizations as well as some important systems to develop XGBoost. He explained it in the following manner:<br />
<br />
1. Based on gradient tree boosting algorithm, Chen makes some modifications to the regularized objective function and introduces an end-to-end tree boosting system. He also applies shrinkage and feature subsampling to prevent overfitting, <br />
<br />
2. The system supports existing split finding algorithms including exact greedy and approximate algorithms. For weighted data, Chen proposes a distributed weighted quantile sketch algorithm which is provable; while for sparse data, he develops a novel sparsity-aware algorithm which is much faster than the naive model.<br />
<br />
3. Storing the data in ''blocks'' is Chen's method to reduce sorting cost. The cache-aware block structure is highly effective for out-of-core tree learning.<br />
<br />
4. XGBoost is evaluated by designed experiments with 4 datasets. Chen compares it with adjusted setting as well as other systems to test its performance.<br />
<br />
= Tree Boosting=<br />
<br />
<br />
=== Regularized Learning Objective ===<br />
Suppose the given data set has m features and n examples. A tree ensemble model uses K additive functions to predict the output.<br />
[[File:eq1.png|center|eq1.png]]<br />
where F is the space of regression trees(CART).<br />
The author suggests the following regularized objective function to make improvements comparing to gradient boosting<br />
[[File:eq2.png|center|eq2.png]]<br />
Regression tree contains continuous scores on each leaf, which is different with decision trees we have learned in class. The following 2 figures represent how to use the decision rules in the trees to classify it into the leaves and calculate the final prediction by summing up the score in the corresponding leaves.<br />
<br />
[[File:Picture2.1.png|500px| thumb| center| Figure 0: Prediction Score of Tree Structure.]]<br />
<br />
<br />
[[File:Picture2.2.png|500px|thumb|center|Figure 1: Tree Ensemble Model. The final prediction for a given example is the sum of predictions from each tree. ]]<br />
<br />
=== Gradient Tree Boosting ===<br />
The tree ensemble model (Equation 2.) is trained in an additive manner. We need to add a new <math>f</math> to our objective.<br />
[[File:eq2.2.png|center|eq2.2.png]]<br />
After applying the second-order Taylor expansion to the above formula and removing constant terms, we can finally have<br />
[[File:eq3.png|center|eq3.png]]<br />
Define <math>I_{j} = \{i| q(X_{i}) \}</math><br />
As the instance set of leaf j. We can rewrite Equation (3) as<br />
[[File:eq4.png|center|eq4.png]]<br />
By computing the optimal weight and corresponding optimal value, we can use following equation as the scoring function which can measure the quality of a tree structure. Lower score means better tree structure.<br />
[[File:eq5.png|center|eq5.png]]<br />
The following figure shows an example how this score is calculated.<br />
[[File:Picture2.3.png|500px| thumb|center|Figure 2: Structure Score Calculation. We only need to sum up the gradient and second order gradient statistics on each leaf, then apply the scoring formula to get the quality score. ]]<br />
<br />
<br />
<br />
Normally it is impossible to enumerate all the possible tree structures. Assume <math>I_{L}</math> and <math>I_{R}</math> are the instance sets of left and right nodes after the split, and <math> I = I_{L} + I_{R}</math>. Then the loss reduction after the split is given by<br />
[[File:eq6.png|center]]<br />
<br />
=== Shrinkage and Column Subsampling ===<br />
In addition to the regularized objective, the paper also introduces two techniques to prevent over-fitting. <br />
<br />
Method 1: Shrinkage. After each step of tree boosting, we scale newly added weights by a shrinkage factor η, which reduces the influence of every individual tree and leaves the space for future trees. <br />
<br />
Method 2: Column (feature) subsampling. We try to build each individual tree with only a portion of predictors chosen randomly. This procedure encourages the variance between the trees (i.e. no same tree occurs). Implemented in an opensource packages, this method prevents over-fitting, compared with the traditional row subsampling, and speeds up the computations of parallel algorithm as well.<br />
<br />
= Split Finding Algorithms =<br />
The ways to find the best split can be realized by two main methods. <br />
<br />
Basic exact greedy algorithm: enumerate over all the possible splits on all the features, which is computationally demanding. Sort the data first, and visit the data to accumulate the gradient statistics greedily. <br />
<br />
[[File:exact_greedy.png|500px| thumb|center ]]<br />
[[File:appro_greedy.png|500px| thumb|center]]<br />
<br />
Approximate algorithm: To achieve an effective gradient tree boosting model, The author introduces an approximate algorithm. The algorithm first proposes candidate splitting points by the percentiles of feature distribution, maps the continuous features into buckets split and then finds the best solution among the proposals by the aggregated statistics. <br />
The algorithm can be varied by the time the proposal is given. <br />
The global variant: all candidate splits are proposed at the initial phase of tree construction and are used for all split finding. It requires fewer proposal steps. <br />
The local variant: the candidate splits are refined after each split. It requires fewer candidate points, so it is potentially more appropriate for deeper trees. <br />
Methods can be chosen according to users’ needs. <br />
<br />
[[File:auc.png|500px| thumb|center]]<br />
<br />
=== Weighted Quantile Sketch ===<br />
<br />
=== Sparsity--aware Split Finding ===<br />
<br />
In real world problem, it is quite common for the input <math> X </math> to be sparse. The reasons that causes for sparsity are following: 1. presence of missing values in the data; 2.frequent zero entries in the statistics; and, 3.artifacts of feature engineering such as one-hot encoding.<br />
<br />
[[File:Figure 4.png|500px| thumb|center | Figure 4. Tree structure with default directions. An example will be classified into the default direction when the feature needed for the split is missing]]<br />
This paper propose to add a default direction in each tree node, which is shown in Figure 4. When a value is missing in the sparse matrix <math> X </math>, the instance is classified into the default direction. There are two choices of default direction in each branch. The optimal default direction is learn from data. Algorithm 3 shows Sparsity aware Split Finding. <br />
<br />
[[File:Algorithm 3.png|500px| thumb|center ]]<br />
The important improvement is to only visit the non-missing entries <math> I_k</math>. The presented algorithm treats the non-presence as a missing value and learns the best direction to handle missing values.<br />
<br />
=Evaluations=<br />
===System Implementation===<br />
Implemented as an open source package, XGBoost is flexible, portable and reusable. It also supports various languages and ecosystems.<br />
<br />
===Dataset and Setup===<br />
Table 2 summarizes the 4 datasets used in the experiments.<br />
[[File:t2.png|500px| thumb|center ]]<br />
<br />
<br />
The first three datasets are used for the single machine parallel setting, which is run on a Dell PowerEdge R420 with two 8-core Intel Xeon and memory 64GB. Meanwhile, AWS c3.8xlarge machine is used for experiment in distributed and the out-of-core setting, which uses the criteo dataset. All available cores in the machines are devoted into the experiments, unless otherwise specified.<br />
<br />
=== Experiments===<br />
The classification experiment uses Higgs Boson dataset to classify whether an event corresponds to the Higgs boson. The goal is to evaluate XGBoost using Exact Greedy Methods and compare it with column sampling XGBoost and two other well-known systems. Table 3 below summarizes the results.<br />
[[File:t3.png|500px| thumb|center ]]<br />
Overall, both XGBoost and scikit-learn performs better than R.gbm by 20% of Test AUC. In terms of time, however, XGBoost runs significantly faster than both scikit-learn and R.gbm. Note that XGBoost with column sampling performs slightly worse because the number of key features is limited. We could conclude that XGBoost using the exact greedy algorithm without column sampling works the best under this scenario.<br />
<br />
The learning to rank experiment compares XGBoost(using exact greedy algorithm) with pGBRT(using approximate algorithm) in order to evaluate the system's performance on rank problem.<br />
<br />
[[File:t4.png|500px| thumb|center ]]<br />
[[File:f10.png|500px| thumb|center ]]<br />
According to Table 4, XGBoost with column sampling performs almost as well as pGBRT, while using only 1/5 of the time. Note that pGBRT is the best previously published system to solve this rank problem. <br />
<br />
The out-of-score and distributed experiments demonstrate XGBoost's ability to handle large data with a fast speed.<br />
<br />
= Conclusion=<br />
The writer, Tianqi Chen, proposed a scalable end-to-end tree boosting system with cache access patterns, block structure and other essential elements. He also introduced two novel split finding algorithms, one is sparsity-aware algorithm and the other is weighted quantile sketch with proofs. Compared to other popular systems, XGBoost gives state-of-the-art results in different experiment settings, which explains why it it considered one of the best and fastest algorithm for structured or tabular data.<br />
<br />
= Source =</div>C55jianghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=summary&diff=34515summary2018-03-17T15:28:34Z<p>C55jiang: /* Sparsity--aware Split Finding */</p>
<hr />
<div>'''''XGBoost: A Scalable Tree Boosting System'''''<br />
<br />
== Presented by ==<br />
<br />
Jiang, Cong<br />
<br />
Song, Ziwei<br />
<br />
Ye, Zhaoshan<br />
<br />
Zhang, Wenling<br />
<br />
= Introduction =<br />
<br />
The Extreme Gradiant Boosting, also known as XGBoost, has been the leading model that dominates Kaggle, as well as other machine learning and data mining challenges. The main reason to its success is that this tree boosting system is highly scalable , which means it could be used to solve various problems with significantly less time and fewer resources. <br />
<br />
This paper is written by the father of XGBoost, Tianqi Chen. It gives us an overview on how he used algorithmic optimizations as well as some important systems to develop XGBoost. He explained it in the following manner:<br />
<br />
1. Based on gradient tree boosting algorithm, Chen makes some modifications to the regularized objective function and introduces an end-to-end tree boosting system. He also applies shrinkage and feature subsampling to prevent overfitting, <br />
<br />
2. The system supports existing split finding algorithms including exact greedy and approximate algorithms. For weighted data, Chen proposes a distributed weighted quantile sketch algorithm which is provable; while for sparse data, he develops a novel sparsity-aware algorithm which is much faster than the naive model.<br />
<br />
3. Storing the data in ''blocks'' is Chen's method to reduce sorting cost. The cache-aware block structure is highly effective for out-of-core tree learning.<br />
<br />
4. XGBoost is evaluated by designed experiments with 4 datasets. Chen compares it with adjusted setting as well as other systems to test its performance.<br />
<br />
= Tree Boosting=<br />
<br />
<br />
=== Regularized Learning Objective ===<br />
Suppose the given data set has m features and n examples. A tree ensemble model uses K additive functions to predict the output.<br />
[[File:eq1.png|center|eq1.png]]<br />
where F is the space of regression trees(CART).<br />
The author suggests the following regularized objective function to make improvements comparing to gradient boosting<br />
[[File:eq2.png|center|eq2.png]]<br />
Regression tree contains continuous scores on each leaf, which is different with decision trees we have learned in class. The following 2 figures represent how to use the decision rules in the trees to classify it into the leaves and calculate the final prediction by summing up the score in the corresponding leaves.<br />
<br />
[[File:Picture2.1.png|500px| thumb| center| Figure 0: Prediction Score of Tree Structure.]]<br />
<br />
<br />
[[File:Picture2.2.png|500px|thumb|center|Figure 1: Tree Ensemble Model. The final prediction for a given example is the sum of predictions from each tree. ]]<br />
<br />
=== Gradient Tree Boosting ===<br />
The tree ensemble model (Equation 2.) is trained in an additive manner. We need to add a new <math>f</math> to our objective.<br />
[[File:eq2.2.png|center|eq2.2.png]]<br />
After applying the second-order Taylor expansion to the above formula and removing constant terms, we can finally have<br />
[[File:eq3.png|center|eq3.png]]<br />
Define <math>I_{j} = \{i| q(X_{i}) \}</math><br />
As the instance set of leaf j. We can rewrite Equation (3) as<br />
[[File:eq4.png|center|eq4.png]]<br />
By computing the optimal weight and corresponding optimal value, we can use following equation as the scoring function which can measure the quality of a tree structure. Lower score means better tree structure.<br />
[[File:eq5.png|center|eq5.png]]<br />
The following figure shows an example how this score is calculated.<br />
[[File:Picture2.3.png|500px| thumb|center|Figure 2: Structure Score Calculation. We only need to sum up the gradient and second order gradient statistics on each leaf, then apply the scoring formula to get the quality score. ]]<br />
<br />
<br />
<br />
Normally it is impossible to enumerate all the possible tree structures. Assume <math>I_{L}</math> and <math>I_{R}</math> are the instance sets of left and right nodes after the split, and <math> I = I_{L} + I_{R}</math>. Then the loss reduction after the split is given by<br />
[[File:eq6.png|center]]<br />
<br />
=== Shrinkage and Column Subsampling ===<br />
In addition to the regularized objective, the paper also introduces two techniques to prevent over-fitting. <br />
<br />
Method 1: Shrinkage. After each step of tree boosting, we scale newly added weights by a shrinkage factor η, which reduces the influence of every individual tree and leaves the space for future trees. <br />
<br />
Method 2: Column (feature) subsampling. We try to build each individual tree with only a portion of predictors chosen randomly. This procedure encourages the variance between the trees (i.e. no same tree occurs). Implemented in an opensource packages, this method prevents over-fitting, compared with the traditional row subsampling, and speeds up the computations of parallel algorithm as well.<br />
<br />
= Split Finding Algorithms =<br />
The ways to find the best split can be realized by two main methods. <br />
<br />
Basic exact greedy algorithm: enumerate over all the possible splits on all the features, which is computationally demanding. Sort the data first, and visit the data to accumulate the gradient statistics greedily. <br />
<br />
[[File:exact_greedy.png|500px| thumb|center ]]<br />
[[File:appro_greedy.png|500px| thumb|center]]<br />
<br />
Approximate algorithm: To achieve an effective gradient tree boosting model, The author introduces an approximate algorithm. The algorithm first proposes candidate splitting points by the percentiles of feature distribution, maps the continuous features into buckets split and then finds the best solution among the proposals by the aggregated statistics. <br />
The algorithm can be varied by the time the proposal is given. <br />
The global variant: all candidate splits are proposed at the initial phase of tree construction and are used for all split finding. It requires fewer proposal steps. <br />
The local variant: the candidate splits are refined after each split. It requires fewer candidate points, so it is potentially more appropriate for deeper trees. <br />
Methods can be chosen according to users’ needs. <br />
<br />
[[File:auc.png|500px| thumb|center]]<br />
<br />
=== Weighted Quantile Sketch ===<br />
<br />
=== Sparsity--aware Split Finding ===<br />
<br />
In real world problem, it is quite common for the input X to be sparse. The reasons that causes for sparsity are following: 1. presence of missing values in the data; 2.frequent zero entries in the statistics; and, 3.artifacts of feature engineering such as one-hot encoding.<br />
<br />
[[File:Figure 4.png|500px| thumb|center | Tree structure with default directions. An example will be classified into the default direction when the feature needed for the split is missing]]<br />
This paper propose to add a default direction in each tree node, which is shown in Figure 4. When a value is missing in the sparse matrix X, the instance is classified into the default direction. There are two choices of default direction in each branch. The optimal default direction is learn from data. Algorithm 3 shows Sparsity aware Split Finding. <br />
<br />
[[File:Algorithm 3.png|500px| thumb|center ]]<br />
The important improvement is to only visit the non-missing entries <math> I_k</math>. The presented algorithm treats the non-presence as a missing value and learns the best direction to handle missing values.<br />
<br />
=Evaluations=<br />
===System Implementation===<br />
Implemented as an open source package, XGBoost is flexible, portable and reusable. It also supports various languages and ecosystems.<br />
<br />
===Dataset and Setup===<br />
Table 2 summarizes the 4 datasets used in the experiments.<br />
[[File:t2.png|500px| thumb|center ]]<br />
<br />
<br />
The first three datasets are used for the single machine parallel setting, which is run on a Dell PowerEdge R420 with two 8-core Intel Xeon and memory 64GB. Meanwhile, AWS c3.8xlarge machine is used for experiment in distributed and the out-of-core setting, which uses the criteo dataset. All available cores in the machines are devoted into the experiments, unless otherwise specified.<br />
<br />
=== Experiments===<br />
The classification experiment uses Higgs Boson dataset to classify whether an event corresponds to the Higgs boson. The goal is to evaluate XGBoost using Exact Greedy Methods and compare it with column sampling XGBoost and two other well-known systems. Table 3 below summarizes the results.<br />
[[File:t3.png|500px| thumb|center ]]<br />
Overall, both XGBoost and scikit-learn performs better than R.gbm by 20% of Test AUC. In terms of time, however, XGBoost runs significantly faster than both scikit-learn and R.gbm. Note that XGBoost with column sampling performs slightly worse because the number of key features is limited. We could conclude that XGBoost using the exact greedy algorithm without column sampling works the best under this scenario.<br />
<br />
The learning to rank experiment compares XGBoost(using exact greedy algorithm) with pGBRT(using approximate algorithm) in order to evaluate the system's performance on rank problem.<br />
<br />
[[File:t4.png|500px| thumb|center ]]<br />
[[File:f10.png|500px| thumb|center ]]<br />
According to Table 4, XGBoost with column sampling performs almost as well as pGBRT, while using only 1/5 of the time. Note that pGBRT is the best previously published system to solve this rank problem. <br />
<br />
The out-of-score and distributed experiments demonstrate XGBoost's ability to handle large data with a fast speed.<br />
<br />
= Conclusion=<br />
The writer, Tianqi Chen, proposed a scalable end-to-end tree boosting system with cache access patterns, block structure and other essential elements. He also introduced two novel split finding algorithms, one is sparsity-aware algorithm and the other is weighted quantile sketch with proofs. Compared to other popular systems, XGBoost gives state-of-the-art results in different experiment settings, which explains why it it considered one of the best and fastest algorithm for structured or tabular data.<br />
<br />
= Source =</div>
<hr />
<div>'''''XGBoost: A Scalable Tree Boosting System'''''<br />
<br />
== Presented by ==<br />
<br />
Jiang, Cong<br />
<br />
Song, Ziwei<br />
<br />
Ye, Zhaoshan<br />
<br />
Zhang, Wenling<br />
<br />
= Introduction =<br />
<br />
Extreme Gradient Boosting, also known as XGBoost, has been the leading model dominating Kaggle competitions as well as other machine learning and data mining challenges. The main reason for its success is that this tree boosting system is highly scalable, which means it can be used to solve various problems with significantly less time and fewer resources. <br />
<br />
This paper is written by the creator of XGBoost, Tianqi Chen. It gives an overview of how algorithmic optimizations and systems techniques were combined to develop XGBoost, organized as follows:<br />
<br />
1. Based on the gradient tree boosting algorithm, Chen modifies the regularized objective function and introduces an end-to-end tree boosting system. He also applies shrinkage and feature subsampling to prevent overfitting. <br />
<br />
2. The system supports existing split finding algorithms, including the exact greedy and approximate algorithms. For weighted data, Chen proposes a distributed weighted quantile sketch algorithm with provable guarantees, while for sparse data he develops a novel sparsity-aware algorithm that is much faster than the naive version.<br />
<br />
3. Storing the data in ''blocks'' is Chen's method for reducing the cost of sorting. The cache-aware block structure is highly effective for out-of-core tree learning.<br />
<br />
4. XGBoost is evaluated in designed experiments on four datasets. Chen compares it against adjusted settings as well as other systems to test its performance.<br />
<br />
= Tree Boosting=<br />
<br />
<br />
=== Regularized Learning Objective ===<br />
Suppose the given data set has <math>m</math> features and <math>n</math> examples. A tree ensemble model uses <math>K</math> additive functions to predict the output.<br />
[[File:eq1.png|center|eq1.png]]<br />
where <math>F</math> is the space of regression trees (CART).<br />
The author suggests the following regularized objective function, which improves on standard gradient boosting:<br />
[[File:eq2.png|center|eq2.png]]<br />
A regression tree contains a continuous score on each leaf, which differs from the decision trees we have learned in class. The following two figures show how the decision rules in the trees classify an example into the leaves, and how the final prediction is calculated by summing up the scores in the corresponding leaves.<br />
<br />
[[File:Picture2.1.png|500px| thumb| center| Figure 0: Prediction Score of Tree Structure.]]<br />
<br />
<br />
[[File:Picture2.2.png|500px|thumb|center|Figure 1: Tree Ensemble Model. The final prediction for a given example is the sum of predictions from each tree. ]]<br />
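The additive prediction of Figure 1 can be sketched in a few lines of Python. The two toy trees below (splitting on age, gender, and computer use) are hypothetical stand-ins for the trees in the figure, not the actual learned model:

```python
# Each tree maps an example to a leaf score; the ensemble prediction
# is the sum of the per-tree scores (Equation 1).

def tree1(x):
    # hypothetical first tree: split on age, then on gender
    if x["age"] < 15:
        return 2.0 if x["gender"] == "male" else 0.1
    return -1.0

def tree2(x):
    # hypothetical second tree: split on daily computer use
    return 0.9 if x["uses_computer_daily"] else -0.9

def ensemble_predict(x, trees):
    # final prediction = sum of scores from the corresponding leaves
    return sum(t(x) for t in trees)

boy = {"age": 10, "gender": "male", "uses_computer_daily": True}
print(ensemble_predict(boy, [tree1, tree2]))  # 2.0 + 0.9 = 2.9
```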
<br />
=== Gradient Tree Boosting ===<br />
The tree ensemble model (Equation 2) is trained in an additive manner: at each step we add a new function <math>f</math> that improves our objective.<br />
[[File:eq2.2.png|center|eq2.2.png]]<br />
After applying the second-order Taylor expansion to the above formula and removing constant terms, we can finally have<br />
[[File:eq3.png|center|eq3.png]]<br />
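For reference, the second-order Taylor expansion behind Equation (3) is<br />
<math>\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n} \left[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)</math><br />
where <math>g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})</math> and <math>h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})</math> are the first- and second-order gradients of the loss at the previous prediction. Dropping the constant terms <math>l(y_i, \hat{y}_i^{(t-1)})</math> gives Equation (3).<br />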
Define <math>I_{j} = \{i \mid q(x_{i}) = j\}</math> as the instance set of leaf <math>j</math>. We can rewrite Equation (3) as<br />
[[File:eq4.png|center|eq4.png]]<br />
By computing the optimal weight of each leaf and the corresponding optimal objective value, we obtain the following scoring function, which measures the quality of a tree structure; a lower score means a better tree structure.<br />
[[File:eq5.png|center|eq5.png]]<br />
The following figure shows an example of how this score is calculated.<br />
[[File:Picture2.3.png|500px| thumb|center|Figure 2: Structure Score Calculation. We only need to sum up the gradient and second order gradient statistics on each leaf, then apply the scoring formula to get the quality score. ]]<br />
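The per-leaf computation behind this score can be sketched in Python. The leaf statistics and penalty values below are made up for illustration; <code>lam</code> and <code>gamma</code> stand for the L2 weight penalty and per-leaf penalty in the regularized objective:

```python
# Structure score (Equation 5): for a fixed tree structure, sum the
# first-order gradients G_j and second-order gradients H_j on each leaf j,
# then apply the scoring formula. Lower is better.

def leaf_weight(G, H, lam):
    # optimal leaf weight w_j* = -G_j / (H_j + lambda)
    return -G / (H + lam)

def structure_score(leaf_stats, lam, gamma):
    # leaf_stats is a list of (G_j, H_j) pairs, one per leaf
    T = len(leaf_stats)
    return -0.5 * sum(G * G / (H + lam) for G, H in leaf_stats) + gamma * T

# hypothetical three-leaf tree
stats = [(-2.0, 3.0), (1.0, 2.0), (4.0, 5.0)]
print(leaf_weight(-2.0, 3.0, lam=1.0))            # 0.5
print(structure_score(stats, lam=1.0, gamma=0.1))  # -1.7
```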
<br />
<br />
<br />
Normally it is impossible to enumerate all possible tree structures, so a greedy algorithm that adds one split at a time is used instead. Assume <math>I_{L}</math> and <math>I_{R}</math> are the instance sets of the left and right nodes after the split, and <math> I = I_{L} \cup I_{R}</math>. Then the loss reduction after the split is given by<br />
[[File:eq6.png|center]]<br />
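A minimal Python sketch of this gain formula; the gradient statistics and penalties are hypothetical:

```python
# Loss reduction of a split (Equation 6): compare the scores of the two
# child leaves against the score of the unsplit leaf, minus the per-leaf
# penalty gamma for adding one more leaf.

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

# a split is only worth making if the gain is positive, i.e. the
# improvement beats the penalty for growing the tree
print(split_gain(G_L=-4.0, H_L=4.0, G_R=4.0, H_R=4.0, lam=1.0, gamma=0.5))  # 2.7
```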
<br />
=== Shrinkage and Column Subsampling ===<br />
In addition to the regularized objective, the paper also introduces two techniques to prevent over-fitting. <br />
<br />
Method 1: Shrinkage. After each step of tree boosting, newly added weights are scaled by a shrinkage factor <math>\eta</math>, which reduces the influence of each individual tree and leaves room for future trees to improve the model. <br />
<br />
Method 2: Column (feature) subsampling. Each individual tree is built with only a randomly chosen subset of the predictors. This procedure encourages variety among the trees (i.e., no two identical trees occur). Implemented in open-source packages, this method prevents over-fitting and also speeds up the computation of the parallel algorithm.<br />
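Both techniques are easy to sketch in Python; the tree outputs, feature names, and sampling fraction below are hypothetical:

```python
import random

# Shrinkage: after each boosting round, the newly added tree's output is
# scaled by eta before being added to the running prediction.
def boosted_prediction(tree_outputs, eta):
    pred = 0.0
    for out in tree_outputs:
        pred += eta * out  # each tree's contribution is damped by eta
    return pred

# Column subsampling: each tree sees only a random fraction of the features.
def subsample_columns(features, fraction, rng):
    k = max(1, int(len(features) * fraction))
    return rng.sample(features, k)

print(boosted_prediction([1.0, 0.8, 0.5], eta=0.1))  # about 0.23
print(subsample_columns(["f0", "f1", "f2", "f3"], 0.5, random.Random(0)))
```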
<br />
= Split Finding Algorithms =<br />
The best split can be found by two main methods. <br />
<br />
Basic exact greedy algorithm: enumerate all possible splits on all the features, which is computationally demanding. The data is first sorted by feature value and then visited in order to accumulate the gradient statistics needed to evaluate each candidate split. <br />
<br />
Approximate algorithm: first propose candidate splitting points according to percentiles of the feature distribution, then map the continuous features into buckets defined by these candidates, aggregate the statistics per bucket, and find the best solution among the proposals based on the aggregated statistics. <br />
The algorithm varies according to when the proposals are given. <br />
The global variant: all candidate splits are proposed in the initial phase of tree construction and used at all levels. <br />
The local variant: candidate splits are re-proposed after each split; this variant is potentially better suited for deeper trees and requires fewer candidates per proposal. <br />
The method can be chosen according to users' needs.<br />
[[File:exact_greedy.png|500px| thumb|center ]]<br />
[[File:appro_greedy.png|500px| thumb|center]]<br />
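A simplified, unweighted sketch of the candidate-proposal step of the approximate algorithm. The paper's actual procedure is the weighted quantile sketch, which uses second-order gradients as weights; this toy version just takes evenly spaced percentiles of one feature:

```python
# Propose candidate split points at evenly spaced percentiles of a
# feature's values; only these candidates are then evaluated as splits.
def propose_candidates(values, num_buckets):
    ordered = sorted(values)
    n = len(ordered)
    # take the (i/num_buckets)-th percentile for i = 1..num_buckets-1
    return [ordered[(i * n) // num_buckets] for i in range(1, num_buckets)]

feature = [5, 1, 9, 3, 7, 2, 8, 4, 6, 10]
print(propose_candidates(feature, num_buckets=4))  # [3, 6, 8]
```

With the continuous values bucketed by these candidates, the gradient statistics only need to be aggregated once per bucket instead of once per distinct value.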
<br />
<br />
=== Weighted Quantile Sketch ===<br />
<br />
=== Sparsity-aware Split Finding ===<br />
<br />
In real-world problems, it is quite common for the input <math>X</math> to be sparse. Common causes of sparsity are: 1. presence of missing values in the data; 2. frequent zero entries in the statistics; and 3. artifacts of feature engineering such as one-hot encoding. To handle this, XGBoost adds a default direction to each tree node: when the feature needed for a split is missing, the example is classified into the default direction, which is learned from the non-missing data (see Figure 4).<br />
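XGBoost handles such sparsity by learning a default direction at each tree node, which missing values follow. A minimal sketch of the traversal, with a hypothetical node layout rather than XGBoost's actual data structure:

```python
# Traverse a tree whose nodes each store a learned default direction;
# an example missing the split feature follows that direction.
def traverse(node, x):
    while "leaf" not in node:
        v = x.get(node["feature"])        # None means the value is missing
        if v is None:
            branch = node["default"]      # learned from the non-missing data
        else:
            branch = "left" if v < node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]

# hypothetical one-split tree
tree = {
    "feature": "age", "threshold": 20, "default": "right",
    "left": {"leaf": 0.1},
    "right": {"leaf": -0.4},
}
print(traverse(tree, {"age": 15}))  # goes left: 0.1
print(traverse(tree, {}))           # age missing -> default right: -0.4
```

Because only non-missing entries are visited during split finding, the complexity grows with the number of present values rather than the total matrix size, which is what makes the algorithm fast on sparse data.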
<br />
=Evaluations=<br />
===System Implementation===<br />
Implemented as an open-source package, XGBoost is portable and reusable, and it is available in various languages and ecosystems.<br />
===Dataset and Setup===<br />
Table 2 summarizes the 4 datasets used in the experiments.<br />
[[File:t2.png|500px| thumb|center ]]<br />
<br />
<br />
The first three datasets are used in the single-machine parallel setting, run on a Dell PowerEdge R420 with two 8-core Intel Xeon processors and 64GB of memory. Meanwhile, AWS c3.8xlarge machines are used for the distributed and out-of-core experiments, which use the Criteo dataset. All available cores in the machines are devoted to the experiments unless otherwise specified.<br />
<br />
===Multiple Experiments===<br />
The classification experiment uses the Higgs Boson dataset to classify whether an event corresponds to the Higgs boson. The goal is to evaluate XGBoost using the exact greedy method and compare it with column-sampling XGBoost and two other well-known systems. Table 3 below summarizes the results.<br />
[[File:t3.png|500px| thumb|center ]]<br />
Overall, both XGBoost and scikit-learn perform better than R.gbm by about 20% in test AUC. In terms of time, however, XGBoost runs significantly faster than both scikit-learn and R.gbm. Note that XGBoost with column sampling performs slightly worse because the number of important features is limited. We can conclude that XGBoost using the exact greedy algorithm without column sampling works best in this scenario.<br />
<br />
The learning-to-rank experiment compares XGBoost (using the exact greedy algorithm) with pGBRT (using the approximate algorithm) in order to evaluate the system's performance on ranking problems.<br />
<br />
[[File:t4.png|500px| thumb|center ]]<br />
[[File:f10.png|500px| thumb|center ]]<br />
According to Table 4, XGBoost with column sampling performs almost as well as pGBRT while using only about 1/5 of the time. Note that pGBRT was the best previously published system for this ranking problem. <br />
<br />
The out-of-core and distributed experiments demonstrate XGBoost's ability to handle large data at high speed.<br />
<br />
= Conclusion=<br />
The author, Tianqi Chen, proposed a scalable end-to-end tree boosting system with cache-aware access patterns, a block structure, and other essential elements. He also introduced two novel split finding algorithms: a sparsity-aware algorithm and a weighted quantile sketch with provable guarantees. Compared to other popular systems, XGBoost gives outstanding performance in different experimental settings, which explains why it is considered one of the best and fastest open-source packages for boosted trees.<br />
<br />
= Source =</div>C55jianghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441w18&diff=31832stat441w182018-02-13T17:34:34Z<p>C55jiang: </p>
<hr />
<div><br />
<br />
[https://docs.google.com/forms/d/1HrpW_lnn4jpFmoYKJBRAkm-GYa8djv9iZXcESeVB7Ts/prefill Your feedback on presentations]<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|-<br />
|Feb 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [http://wikicoursenote.com/wiki/Stat946f15/Sequence_to_sequence_learning_with_neural_networks#Long_Short-Term_Memory_Recurrent_Neural_Network Summary]<br />
|-<br />
|Mar 8 || || 1|| || || <br />
|-<br />
|Mar 8 || || 2|| || || <br />
|-<br />
|Mar 13 || || 3|| || || <br />
|-<br />
|Mar 13 || || 4|| || || <br />
|-<br />
|Mar 15 || || 5|| || || <br />
|-<br />
|Mar 15 || || 6|| || || <br />
|-<br />
|Mar 20 || || 7|| || || <br />
|-<br />
|Mar 20 || Wenling Zhang, Cong Jiang, Ziwei Song, Zhaoshan Ye || 8|| XGBoost: A Scalable Tree Boosting System. || [https://arxiv.org/pdf/1603.02754.pdf Paper] || <br />
|-<br />
|Mar 22 || Alice Wang, Robert Huang, Roger Wang, Renato Ferreira || 9|| || || <br />
|-<br />
|Mar 22 || || 10|| || || <br />
|-<br />
|Mar 27 || || 11|| || || <br />
|-<br />
|Mar 27 || || 12|| || || <br />
|-<br />
|Mar 29 || || 13|| || || <br />
|-<br />
|Mar 29 || Ammar Mehvee, Chen Yuan || 14|| Bag of Tricks for Efficient Text Classification || || <br />
|-<br />
|Apr 3 || Qici Tan, Qi Mai, Ziheng Chu, Minghao Lu || 15|| || || <br />
|-<br />
|Apr 3 || Haochen, Yang, Patrick, Selina, Sigeng, Angel || 16|| || || <br />
|-</div>C55jiang