# Decision Tree Explained

Decision tree also known as classification tree, segmentation tree or recursive partitioning. It incorporates non-linear effect and high order interaction between predictors. It creates a tree-based classification model. It creates rules and use those rules to predict future events such as likelihood of default on a loan or classify the population into different groups such as high, medium, low risk groups and then take the action accordingly. This technique is widely used in banking risk management.

Decision trees can be used for numerous tasks within a data mining project including data exploration, preparation, variable selection, building a baseline model, segmentation. The tree has 3 types of nodes – root node, internal or child nodes, leaf or terminal nodes.

There are many **decision tree algorithms** such as AID (Automatic Interaction Detector), MAID (Extended AID for multivariate outcome), THAID (Extended AID for categorical outcome using theta criterion), CHAID (Chi-Square Automatic Interaction Detector), CART (Classification and Regression Tree), ID3, C4.5, QUEST (Quick, unbiased, efficient statistical tree). The difference in these algorithms are based on split (binary, n-array), dependent variable (continuous, categorical), splitting criterion (Entropy, Gini, chi-square). CHAID is the most popular method among these algorithms. CHAID determines optimal n-array split, used for both types of dependent variable and used chi-square value (p-value) as splitting criterion.

Algorithm splits the data on the feature which have largest Information Gain and the splitting procedure repeats until the leaves become pure. However, deep tree leads to overfitting and require pruning the tree. There are two ways – **pre-pruning** (forward pruning) or **post-pruning** (backward pruning). Pre-pruning is an early stopping rule, which prune the tree through setting a limit for the maximal depth of the tree. Post-pruning is used after generating a full decision tree to remove branches in a manner that improves the accuracy of the overall classification when applied to the validation dataset.

**Information gain** is the difference between the impurity of the parent node and sum of the child node impurities. Examples of **impurity measures** – Entropy, Gini, Classification error. Entropy measure the variability of the distribution of DV with respect to IV. Entropy of 1 indicates high variability in the distribution and 0 indicates low variability. As the variability is reduced, there is gain in information. Entropy is 0 if all observations at a node belong to the same class and 1 if classes are uniformly distributed.

**CHAID** is non-parametric approach that makes no assumption of training data or prediction residuals. Its output is highly visual and easy to interpret. The splitting methods attempt to maximize the homogeneity of each group by means of purity measures. It searches for independent variables that are most significant to the dependent variable based on measure (Adjusted P-value Bonferroni Adjustment) which used to evaluate the best split. CHAID generates non-binary trees and choose optimal split based on chi-square value unlike CART which generates binary trees and use entropy or Gini measures.

A **p-value** is used in hypothesis testing to help you support or reject the null hypothesis. The smaller the p-value, the strong the evidence that you reject the null hypothesis. For example, a p-value of 0.03. This means that there is a 3% chance your results could be random (i.e happened by chance). The p-value indicates whether a coefficient is significantly different from 0. **Chi-square** test is a statistical test, which also tests the null hypothesis. CHAID uses chi-square test of independence. A statistically significant results indicate that the two variables are not independent i.e there is a relationship between them (DV and IV).

The classification table (cross-section of actual and predicted values (Good/Bad) provide a quick evaluation of how the model works. The table is also known as **confusion matrix**. The errors in classification model are generally divided into two types – training error (number of misclassification error on training records) generalization error (expected error on unseen or test records). A good model must have low training error as well as low generalization error.

To build a good classification model, one need to keep in mind few key points:

- Choose simple model over complex model if model errors are similar.
- Compute the performance metrics obtained using different estimation methods such as holdout method, random sampling, k fold cross-validation, bootstrapping.
- Prune the tree to avoid overfitting.
- Use CHAID algorithm which uses statistical chi-square test to determine the best split during the tree growing process.

The following libraries are available to build decision tree in R – ‘rpart’, ‘party’ and in Python – ‘scikit -learn’.

**Limitations of CHAID**

A small change in training data can result in a change in the tree and therefore the final predictions. Decision tree results in low bias but high variance which leads to overfitting. Information gain in decision trees is biased in favour of categorical variables having more levels.