# One or Many Trees: Classification Models

Businesses use supervised machine learning techniques such as decision trees to make better decisions and increase profit. The decision tree is a very popular technique because of its simplicity, ease of use, and interpretability. It is a tree-based classification model. The technique has shortcomings, however: sub-optimal performance, high variance, and a lack of robustness. An individual tree tends to overfit the training data. Decision trees have been around for a long time and are known to suffer from both bias and variance: a shallow tree tends to underfit (high bias), while a deep tree tends to overfit (high variance).

Ensemble methods combine several decision trees to produce better predictive performance than a single decision tree. The primary idea behind an ensemble model is that a group of weak learners come together to form a strong learner, which improves the stability and predictive power of the model.

One of the simplest ensemble models is Bagging, also known as Bootstrap aggregating. It creates many different bootstrapped samples of the data (samples drawn with replacement) and grows a decision tree on each one of them. At the deployment stage, the predictions of the trees within the ensemble are averaged to generate the final probabilities. The average of the predictions from different trees is more robust than the prediction of a single decision tree. Bagging constructs the learners in parallel and averages the responses of N learners; each model is built independently and has equal weight in the average. At each split, the tree-growing algorithm looks through all variables in order to select the best split point.
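The bagging procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the helper names (`fit_stump`, `bagging_fit`, `bagging_predict_proba`), the toy data, and the use of one-split trees ("stumps") on a single feature are all assumptions made to keep the example short.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split tree on 1-D data with 0/1 labels.

    Tries every observed value as a threshold and keeps the one with the
    lowest squared error; each side predicts its share of class-1 labels.
    """
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        p_left = sum(left) / len(left)  # left is never empty: t is in xs
        p_right = sum(right) / len(right) if right else p_left
        err = (sum((y - p_left) ** 2 for y in left)
               + sum((y - p_right) ** 2 for y in right))
        if best is None or err < best[0]:
            best = (err, t, p_left, p_right)
    _, t, p_left, p_right = best
    return lambda x: p_left if x <= t else p_right

def bagging_fit(xs, ys, n_trees=25, seed=0):
    """Grow one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap indices
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees

def bagging_predict_proba(trees, x):
    """Average the trees' predictions; every tree has equal weight."""
    return sum(tree(x) for tree in trees) / len(trees)

# Hypothetical toy data: larger x values are mostly class 1.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

trees = bagging_fit(xs, ys)
p_low = bagging_predict_proba(trees, 0.75)
p_high = bagging_predict_proba(trees, 4.75)
print(f"P(class 1 | x=0.75) = {p_low:.2f}, P(class 1 | x=4.75) = {p_high:.2f}")
```

Because each stump sees a different bootstrap sample, the individual thresholds vary, and the averaged probability changes more smoothly across `x` than any single stump's output.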

Random Forest is an extension of bagging. In addition to drawing a random subset of the data, it randomly selects a subset of features (variables) rather than using all features to grow the trees. The random forest algorithm brings extra randomness into the model: when splitting a node, it searches for the best feature among a random subset of features instead of among all features. This makes the individual trees less correlated with one another, which generally results in a better model. Random forests fit a more accurate model by averaging many decision trees, which reduces the variance and mitigates the overfitting problem of a single tree. The main features of RF include bootstrap resampling, random feature selection, out-of-bag error estimation, and growing full-depth decision trees. The output of all these randomly generated trees is aggregated to obtain one final prediction, which is the average of the predictions of all trees in the forest. Random forest is a nonlinear classifier that works well on large datasets. Training these models takes time, but accuracy typically improves as well.
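To make the difference from bagging concrete, the sketch below contrasts a split search over all features with a split search over a random subset of features. It is only an illustration under stated assumptions: the function names, the toy data, the Gini impurity criterion, and the common choice of roughly the square root of the number of features as the subset size are not taken from the text above.

```python
import math
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, labels, feature_ids):
    """Best (feature, threshold, weighted impurity) among feature_ids only."""
    best = None
    n = len(rows)
    for f in feature_ids:
        # Skip the largest value so the right-hand side is never empty.
        for t in sorted({row[f] for row in rows})[:-1]:
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, t, score)
    return best

def random_forest_split(rows, labels, rng):
    """Search for the best split among a random subset of the features."""
    n_features = len(rows[0])
    m = max(1, int(math.sqrt(n_features)))  # assumed subset size
    candidates = rng.sample(range(n_features), m)
    return candidates, best_split(rows, labels, candidates)

# Hypothetical toy data: feature 2 separates the classes perfectly.
rows = [
    [1.0, 7.0, 0.1, 3.0],
    [2.0, 3.0, 0.2, 1.0],
    [3.0, 5.0, 0.8, 2.0],
    [4.0, 1.0, 0.9, 4.0],
    [5.0, 6.0, 0.3, 2.5],
    [6.0, 2.0, 0.7, 1.5],
]
labels = [0, 0, 1, 1, 0, 1]

# Bagging-style search: every feature is a candidate at every split.
best_all = best_split(rows, labels, range(len(rows[0])))

# Random-forest-style search: only a random subset is considered.
candidates, best_subset = random_forest_split(rows, labels, random.Random(1))
print("all features:", best_all)
print("candidates:", candidates, "chosen split:", best_subset)
```

When the dominant feature is left out of the candidate set, the node has to split on a weaker feature. Across many trees this is what keeps the trees from all looking alike.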

Gradient Boosting is a sequential process: the models are built one after another and have unequal weights in the final combination (weighted averaging). It uses the gradient descent algorithm, which can optimize any differentiable loss function. The trees in the ensemble are built one by one; each new tree tries to reduce the remaining loss (the difference between actual and predicted values). An important parameter in gradient descent is the size of the steps, known as the learning rate. If the learning rate is too small, the algorithm will take many iterations to find the minimum. If the learning rate is too large, there is a higher chance of overshooting the minimum and ending up somewhere else.
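The sequential fitting and the role of the learning rate can be illustrated with a small plain-Python sketch. It makes several simplifying assumptions that are not in the text above: a squared-error loss (for which the negative gradient is simply the residual), one-split regression trees on a single feature, made-up data, and hypothetical function names. The learning rate of 2.5 is deliberately unrealistic and is included only to show overshooting.

```python
def fit_regression_stump(xs, residuals):
    """One-split regression tree on 1-D data, minimizing squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        mean_l = sum(left) / len(left)
        mean_r = sum(right) / len(right)
        sse = (sum((r - mean_l) ** 2 for r in left)
               + sum((r - mean_r) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, mean_l, mean_r)
    _, t, mean_l, mean_r = best
    return lambda x: mean_l if x <= t else mean_r

def gradient_boost(xs, ys, n_rounds, learning_rate):
    """Return the mean squared error before boosting and after each round."""
    base = sum(ys) / len(ys)  # initial prediction: the mean of the targets
    preds = [base] * len(xs)
    losses = [sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)]
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_regression_stump(xs, residuals)
        # Take a step of size learning_rate in the direction of the new tree.
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return losses

# Hypothetical toy regression data.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 8.0]

slow_losses = gradient_boost(xs, ys, n_rounds=20, learning_rate=0.1)
fast_losses = gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5)
big_losses = gradient_boost(xs, ys, n_rounds=20, learning_rate=2.5)

print("lr=0.1:", [round(v, 3) for v in slow_losses[:4]])
print("lr=0.5:", [round(v, 3) for v in fast_losses[:4]])
print("lr=2.5:", [round(v, 3) for v in big_losses[:4]])
```

In this setup the small learning rate lowers the loss steadily but slowly, the moderate one lowers it faster per round, and the oversized one makes the loss grow because every step overshoots the minimum.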

These ensemble methods reduce the variance of the predictions, thus improving predictive performance. The gain in predictive accuracy comes at the cost of model interpretability.

The following libraries are available to build these models in R:

Decision tree – *‘rpart’, ‘party’, ‘tree’*

Bagging – *‘ipred’, ‘adabag’*

Random Forest – *‘randomForest’, ‘randomForestSRC’*

Boosting – *‘gbm’, ‘mboost’, ‘xgboost’*

All these models can be built in Python using *‘scikit-learn’*.
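As a rough illustration, the four model types discussed in this article could be fitted with scikit-learn along the following lines. The synthetic dataset, the hyperparameter values, and the variable names are assumptions chosen for the example, and the accuracies it prints say nothing about how the models would rank on real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumed for illustration only).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    # BaggingClassifier uses a decision tree as its default base learner.
    "bagging": BaggingClassifier(n_estimators=100, random_state=0),
    # max_features="sqrt" gives the random feature subset at each split;
    # oob_score=True requests the out-of-bag accuracy estimate.
    "random forest": RandomForestClassifier(
        n_estimators=100, max_features="sqrt", oob_score=True, random_state=0
    ),
    "gradient boosting": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, random_state=0
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on held-out data
    print(f"{name}: test accuracy = {scores[name]:.3f}")

oob_accuracy = models["random forest"].oob_score_
print(f"random forest out-of-bag accuracy = {oob_accuracy:.3f}")
```

Each fitted model also exposes `predict_proba`, which returns the class probabilities that the ensembles obtain by combining their trees.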