Decision Tree
A decision tree is a classification method, used in Machine Learning and Data Mining, that is based on tree structures.
Suppose we have a set of rules:
 if we group them by condition
 and nest the repeated checks inside one another
 we get a decision tree
Rules:
 IF pclass='1' AND sex='female' THEN survive=yes
 IF pclass='1' AND sex='male' AND age < 5 THEN survive=yes
 IF pclass='2' AND sex='female' THEN survive=yes
 IF pclass='2' AND sex='male' THEN survive=no
 IF pclass='3' AND sex='male' THEN survive=no
 IF pclass='3' AND sex='female' AND age < 4 THEN survive=yes
 IF pclass='3' AND sex='female' AND age >= 4 THEN survive=no
If we nest the IF conditions inside one another:
 IF pclass='1' THEN
  IF sex='female' THEN survive=yes
  IF sex='male' AND age < 5 THEN survive=yes
 IF pclass='2' THEN
  IF sex='female' THEN survive=yes
  IF sex='male' THEN survive=no
 IF pclass='3' THEN
  IF sex='male' THEN survive=no
  IF sex='female' THEN
   IF age < 4 THEN survive=yes
   IF age >= 4 THEN survive=no
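The same nesting, written as a runnable Python function - a small sketch; note that the rule set leaves the case pclass='1', sex='male', age >= 5 unspecified, so the code defaults it to 'no':

```python
def survive(pclass, sex, age):
    """The rule set above, transcribed as nested conditionals."""
    if pclass == '1':
        if sex == 'female':
            return 'yes'
        # sex == 'male'; the rules only cover age < 5,
        # so we default the unspecified age >= 5 case to 'no'
        return 'yes' if age < 5 else 'no'
    if pclass == '2':
        return 'yes' if sex == 'female' else 'no'
    # pclass == '3'
    if sex == 'male':
        return 'no'
    return 'yes' if age < 4 else 'no'

print(survive('1', 'female', 30))   # -> yes
print(survive('3', 'female', 10))   # -> no
```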
This nested structure is exactly a decision tree.
Decision Trees
Consider this dataset
| Tid | Refund | Marital Status | Income | Cheat |
|---|---|---|---|---|
| 1 | Yes | Single | 125K | No |
| 2 | No | Married | 100K | No |
| 3 | No | Single | 70K | No |
| 4 | Yes | Married | 120K | No |
| 5 | No | Divorced | 95K | Yes |
| 6 | No | Married | 60K | No |
| 7 | Yes | Divorced | 220K | No |
| 8 | No | Single | 85K | Yes |
| 9 | No | Married | 75K | No |
| 10 | No | Single | 90K | Yes |

There could be several decision trees for this dataset:
Decision Tree:
 leaves are labeled with the predicted class
 internal nodes are labeled with the attributes used to decide which path to take
 edges are labeled with values of these attributes or with boolean tests
Classification:
 suppose we take a previously unseen record: (Tid = 11, Refund = No, Marital Status = Married, Income = 112K, Cheat = ?)
 we need to predict a class for this record
 we start at the top of the tree and follow the path down to a leaf
 the leaf we end up in gives our class: cheat = no
Error Measures
A tree is perfect if it makes no errors on the training set

 this tree is perfect w.r.t. the training dataset
 learning error = 0%
Error measures:
 learning error: # of errors on the training dataset
 testing error: # of errors on the test set
Learning Algorithms
There are several learning algorithms
 CART
 ID3, C4.5
 SLIQ, SPRINT, etc.
Main Principles
(the ID3 algorithm)
Given: a training dataset $D$
The structure is recursive:
 let $D_t$ be the set of records that reach node $t$
 at the beginning $t = \text{root}$ and $D_t \equiv D$
 if all $d \in D_t$ belong to the same class $y_t$,
 then $t$ is labeled by $y_t$
 if $D_t \equiv \varnothing$,
 then $t$ is labeled with the default class (e.g. the majority class)
 if the records $d \in D_t$ belong to different classes,
 split $D_t$ into smaller subsets and recursively apply the procedure to each (see the sketch below)
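A minimal sketch of this recursion in Python; here records are (attributes, label) pairs, and `best_split` is a placeholder for the impurity-based split selection discussed in the next sections:

```python
from collections import Counter

def build_tree(records, default_class, best_split):
    """Recursive construction of a decision tree, following the scheme above.

    records       -- list of (attributes, label) pairs: the set D_t
    default_class -- label used when D_t is empty
    best_split    -- callable returning (attribute, {value: subset}), or
                     (None, None) when no useful split exists; it stands in
                     for the impurity-based selection described below
    """
    if not records:                        # D_t is empty
        return default_class
    labels = [label for _, label in records]
    if len(set(labels)) == 1:              # all records in the same class y_t
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    attribute, partition = best_split(records)
    if attribute is None:                  # nothing left to split on
        return majority
    return (attribute, {value: build_tree(subset, majority, best_split)
                        for value, subset in partition.items()})
```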
Splitting
 there is a huge # of ways to split a set
 how to determine the best split?
Splitting:

 multiway split: good for nominal attributes; usually one branch per attribute value
 binary split: good for numeric attributes, but we need to find where to split
Splitting numerical attributes
 goal: want to find subsets that are more "pure" than the original one
 "pure" = degree of homogeneity
 $\fbox{$C_1$: 5, $C_2$: 5}$  homogeneous, high degree of impurity
 $\fbox{$C_1$: 9, $C_2$: 1}$  nonhomogeneous, low degree of impurity
 the lower the better
 use Information Gain for that
Stopping Conditions
Stop when
 all records in node $t$ belong to the same class
 the information gain is lower than some given threshold
Given a set $S$ with $K$ classes $C_1, ..., C_K$
 let $p_k = \cfrac{|C_k|}{|S|}$ - the probability that a record belongs to class $k$
 $I(S)$ - a function of impurity that we want to minimize
Measures of Impurity
 Misclassification Error:
 $I(S) = 1 - \max_k p_k$
 Entropy:
 $I(S) = -\sum_k p_k \cdot \log_2 p_k$
 Gini index:
 $I(S) = 1 - \sum_k p_k^2$
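A small sketch of these three measures in Python; each function takes a tuple of per-class counts:

```python
from math import log2

def misclassification_error(counts):
    """I(S) = 1 - max_k p_k."""
    total = sum(counts)
    return 1 - max(c / total for c in counts)

def entropy(counts):
    """I(S) = -sum_k p_k log2 p_k (with 0 log 0 taken as 0)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gini(counts):
    """I(S) = 1 - sum_k p_k^2."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# the two boxed nodes from above:
for counts in [(5, 5), (9, 1)]:
    print(counts, misclassification_error(counts), entropy(counts), gini(counts))
# (5, 5): 0.5, 1.0, 0.5       - high impurity
# (9, 1): 0.1, ~0.469, 0.18   - low impurity
```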
Information Gain
 how much information do we gain when we partition the data?
 def: $\Delta I = I(S) - p_L \cdot I(S_L) - p_R \cdot I(S_R)$
 $p_L = \cfrac{|S_L|}{|S|}, \quad p_R = \cfrac{|S_R|}{|S|}$
 $I$ - a measure of impurity
It's easy to generalize IG to multiway splits:
 split $S$ into $K$ subsets $S_1, ..., S_K$
 $\Delta I = I(S) - \sum_k p_k \cdot I(S_k)$, with $p_k = \cfrac{|S_k|}{|S|}$
 $\sum_k p_k \cdot I(S_k) = E \big[ I \big( \{ S_1, ..., S_K \} \big) \big]$ is the expected impurity of the partitioned dataset
 but if we simply maximize this $\Delta I$ (i.e. minimize the expected impurity), we end up with $K = N$
 i.e. we obtain as many subsets as there are records - and a set with just one element is 100% pure
 to counter this, use the Gain Ratio instead:
 $\Delta I_K = \cfrac{\Delta I}{-\sum_k p_k \cdot \log_2 p_k}$
 the denominator is the entropy of the split itself (the "split information"), which penalizes splitting into many small subsets
Example
Suppose we have a node with $S \equiv \fbox{$C_1$: 20, $C_2$: 30}$
 it is split into $S_L \equiv \fbox{$C_1$: 15, $C_2$: 5}$ and $S_R \equiv \fbox{$C_1$: 5, $C_2$: 25}$
 let $I$ be the Entropy function
 $I(S) = - \cfrac{20}{50} \log_2 \cfrac{20}{50} - \cfrac{30}{50} \log_2 \cfrac{30}{50} \approx 0.971$
 $I(S_L) = - \cfrac{15}{20} \log_2 \cfrac{15}{20} - \cfrac{5}{20} \log_2 \cfrac{5}{20} \approx 0.811$
 $I(S_R) = - \cfrac{5}{30} \log_2 \cfrac{5}{30} - \cfrac{25}{30} \log_2 \cfrac{25}{30} \approx 0.650$
 $\Delta I = I(S) - p_L \cdot I(S_L) - p_R \cdot I(S_R) = 0.971 - 0.4 \cdot 0.811 - 0.6 \cdot 0.650 \approx 0.26$
Another example with $S \equiv \fbox{$C_1$: 20, $C_2$: 30}$
 suppose that now we split into $S_L \equiv \fbox{$C_1$: 10, $C_2$: 15}$ and $S_R \equiv \fbox{$C_1$: 10, $C_2$: 15}$
 both subsets keep the same class proportions as $S$, so it's clear that we gain no information: $\Delta I = 0$
 so the minimal value of $\Delta I$ is 0; the maximum is $I(S)$, reached when every subset is pure (see the code check below)
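The same computations in code, as a check of the numbers above (entropy impurity assumed, as in the example):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Delta I = I(S) - sum_k p_k I(S_k), with p_k = |S_k| / |S|."""
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

# first split: (15, 5) and (5, 25)
print(information_gain((20, 30), [(15, 5), (5, 25)]))    # ~0.26
# second split: both children keep the 2:3 proportions of the parent
print(information_gain((20, 30), [(10, 15), (10, 15)]))  # ~0 (floating-point noise)
```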
Splitting Algorithm
for each attribute $A_i$
 find the best partitioning $P^*_{A_i} = \{S_1, ..., S_K\}$ of $S$
 i.e. the $P^*_{A_i}$ that maximizes the information gain
 then, among all $\{ P^*_{A_i} \}$, select the one with the maximal gain
 and split by it
It's best to split in halves: $K = 2$
 how do we choose the threshold $\alpha$ that splits $S$ into $S_L$ and $S_R$?
 try several candidates and pick the best one
 note that $\Delta I$ is not monotonic in $\alpha$ - we have to try all the candidates (see the sketch below)
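A sketch of this threshold search in Python, run here on the Income column of the Tid dataset above; entropy impurity and midpoint candidates are assumptions, not prescribed by the notes:

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n)
                for c in set(labels))

def best_threshold(values, labels):
    """Try a cut between each pair of consecutive distinct values and keep
    the one with the maximal gain; Delta I is not monotonic in alpha,
    so every candidate has to be checked."""
    pairs = sorted(zip(values, labels))
    parent = entropy([y for _, y in pairs])
    best_gain, best_alpha = -1.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                       # no boundary between equal values
        alpha = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x < alpha]
        right = [y for x, y in pairs if x >= alpha]
        gain = (parent - len(left) / len(pairs) * entropy(left)
                       - len(right) / len(pairs) * entropy(right))
        if gain > best_gain:
            best_gain, best_alpha = gain, alpha
    return best_gain, best_alpha

# Income column of the Tid dataset above, with the Cheat labels:
income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ['No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes']
print(best_threshold(income, cheat))   # -> (~0.281, 97.5)
```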
Perfect decision trees are 100% accurate on the training set
How to get the optimal complexity of a tree?
 Pre-pruning:
 stop before the tree becomes perfect (fully grown)
 Post-pruning:
 grow a perfect tree
 then prune its nodes bottom-up
Pre-Pruning
It's about finding a good stopping criterion
Simple ones:
 when # of instances in a node goes below some threshold
 when the misclassification error is lower than some threshold $\beta$
 when expanding current node doesn't give a significant information gain $\Delta I$
More complex:
 when the class distribution of the instances becomes independent of the available features (e.g. checked with a statistical test such as $\chi^2$)
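A sketch of how the simple checks might look in code; the thresholds are illustrative placeholders, not canonical values:

```python
def should_stop(records, gain, min_instances=5, max_error=0.05, min_gain=0.01):
    """Pre-pruning: return True if the node should not be expanded.

    records -- list of (attributes, label) pairs reaching the node
    gain    -- the best information gain any split of this node achieves
    """
    labels = [label for _, label in records]
    majority_share = max(labels.count(c) for c in set(labels)) / len(labels)
    return (len(records) < min_instances         # too few instances in the node
            or 1 - majority_share <= max_error   # error already below the beta threshold
            or gain < min_gain)                  # expanding gives no significant gain
```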
Post-Pruning
Cost-Complexity Pruning
 by Breiman, Friedman, Olshen and Stone (1984)
Practical Issues
Decision Boundaries
the decision boundaries of a tree are rectilinear: each split tests a single attribute, so the boundaries are axis-parallel
Handling Missing Values
Two options
 discard all records with missing values (NAs)
 modify the algorithm
Algorithm Modification
Modification:
 suppose we have some splitting test criterion $T$ and a dataset $S$
 the information gain for splitting $S$ using $T$ is
 $\Delta I (S, T) = I(S) - \sum_k \alpha_{T, k} \cdot I(S_k)$, where $\alpha_{T, k}$ is the fraction of $S$ that falls into subset $S_k$
 let $S_0 \subseteq S$ be the records for which we can't evaluate $T$ (because the needed values are missing)
 if $S_0 \not \equiv \varnothing$,
 calculate the information gain as
 $\cfrac{|S \setminus S_0|}{|S|} \cdot \Delta I (S \setminus S_0, T)$
 suppose such a $T$ is chosen - what do we do with the records from $S_0$?
 add them to all the subsets, with weight proportional to the size of each subset:
 $w_k = \cfrac{|S_k|}{|S \setminus S_0|}$
 and the information gain is then computed using sums of weights instead of counts
Example
| $X$ | $Y$ | $Z$ | Class |
|---|---|---|---|
| a | 70 | Yes | + |
| a | 90 | Yes | − |
| a | 85 | No | − |
| a | 95 | No | − |
| a | 70 | No | + |
| ? | 90 | Yes | + |
| b | 78 | No | + |
| b | 65 | Yes | + |
| b | 75 | No | + |
| c | 80 | Yes | − |
| c | 70 | Yes | − |
| c | 80 | No | + |
| c | 80 | No | + |
| c | 96 | No | + |


 There is one missing value for $X$: $(?, 90, \text{Yes}, +)$
 let $I$ be the misclassification error
 $I(S \setminus S_0) = 5/13$ (5 records in "−", 8 in "+")
 $I(S \setminus S_0 \mid X = a) = 2/5$
 $I(S \setminus S_0 \mid X = b) = 0$
 $I(S \setminus S_0 \mid X = c) = 2/5$
 calculate the IG as $\cfrac{|S \setminus S_0|}{|S|} \cdot \Delta I (S \setminus S_0, T)$:
 $\Delta I = \cfrac{13}{14} \cdot \left( \cfrac{5}{13} - \cfrac{5}{13} \cdot \cfrac{2}{5} - \cfrac{3}{13} \cdot 0 - \cfrac{5}{13} \cdot \cfrac{2}{5} \right) = \cfrac{1}{14}$
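A quick check of this computation with exact fractions; the class counts per subset are read off the table above:

```python
from fractions import Fraction

def misclassification_error(counts):
    total = sum(counts)
    return 1 - Fraction(max(counts), total)

# records with known X, as (plus, minus) class counts per subset:
subsets = {'a': (2, 3), 'b': (3, 0), 'c': (3, 2)}
known = tuple(sum(col) for col in zip(*subsets.values()))  # (8, 5)

n_total, n_known = 14, sum(known)   # one of the 14 records has X missing
gain_known = misclassification_error(known) - sum(
    Fraction(sum(c), n_known) * misclassification_error(c)
    for c in subsets.values())
print(Fraction(n_known, n_total) * gain_known)   # -> 1/14
```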

Distribute the records to the subsets:
 according to the value of $X$
 the rows with a missing value are put into all the subsets
 with a weight proportional to the size of the subset prior to adding these rows
$X = a$:

| $Y$ | $Z$ | Class | $w$ |
|---|---|---|---|
| 70 | Yes | + | 1 |
| 90 | Yes | − | 1 |
| 85 | No | − | 1 |
| 95 | No | − | 1 |
| 70 | No | + | 1 |
| 90 | Yes | + | 5/13 |

$X = b$:

| $Y$ | $Z$ | Class | $w$ |
|---|---|---|---|
| 78 | No | + | 1 |
| 65 | Yes | + | 1 |
| 75 | No | + | 1 |
| 90 | Yes | + | 3/13 |


$X = c$:

| $Y$ | $Z$ | Class | $w$ |
|---|---|---|---|
| 80 | Yes | − | 1 |
| 70 | Yes | − | 1 |
| 80 | No | + | 1 |
| 80 | No | + | 1 |
| 96 | No | + | 1 |
| 90 | Yes | + | 5/13 |


Classification with Modification
Classification:
 let $P(C \mid E, T)$ be the probability of classifying case $E$ into class $C$ using tree $T$
 define it recursively:
 if $t = \text{root}(T)$ is a leaf (i.e. $T$ is a singleton tree),
 then $P(C \mid E, T)$ is the relative frequency of the training cases in class $C$ that reach this leaf
 if $t = \text{root}(T)$ is not a leaf and $t$ is partitioned using attribute $X$:
 if $E.X = x_k$,
 then $P(C \mid E, T) = P(C \mid E, T_k)$, where $T_k$ is the subtree of $T$ with $X = x_k$
 if $E.X$ is unknown,
 then $P(C \mid E, T) = \sum_{k=1}^{K} \cfrac{|S_k|}{|S \setminus S_0|} \cdot P(C \mid E, T_k)$
 so we sum up the probabilities of belonging to class $C$ over all the children of $t$
 predict that the record belongs to the class $C$ with the highest probability $P(C \mid E, T)$ (a code sketch follows)
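A sketch of this recursion in Python; the tree encoding (leaves as dicts of relative class frequencies, internal nodes as (attribute, weights, children) tuples) is my own choice, not prescribed by the notes:

```python
def class_probability(tree, record, cls):
    """P(C | E, T), computed recursively.

    A leaf is a dict of relative class frequencies; an internal node is a
    tuple (attribute, weights, children), where weights[v] is the fraction
    of the node's training records (with known attribute) that went to v.
    """
    if isinstance(tree, dict):            # leaf
        return tree.get(cls, 0.0)
    attribute, weights, children = tree
    value = record.get(attribute)         # None when the attribute is unknown
    if value is not None:
        return class_probability(children[value], record, cls)
    return sum(weights[v] * class_probability(children[v], record, cls)
               for v in children)

# the node from the earlier example: 20 of 50 records go left, 30 go right
tree = ('X', {'L': 20 / 50, 'R': 30 / 50},
        {'L': {'+': 15 / 20, '-': 5 / 20},
         'R': {'+': 5 / 30, '-': 25 / 30}})
print(class_probability(tree, {}, '+'))   # 0.4
print(class_probability(tree, {}, '-'))   # 0.6
```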
Example
 assume that $X$ is unknown - how do we classify the case?
 $P(+ \mid E, T) = \sum_{k} \cfrac{|S_k|}{|S|} \cdot P(+ \mid E, T_k) = \cfrac{20}{50} \cdot \cfrac{15}{20} + \cfrac{30}{50} \cdot \cfrac{5}{30} = \cfrac{20}{50}$
 $P(- \mid E, T) = \sum_{k} \cfrac{|S_k|}{|S|} \cdot P(- \mid E, T_k) = \cfrac{20}{50} \cdot \cfrac{5}{20} + \cfrac{30}{50} \cdot \cfrac{25}{30} = \cfrac{30}{50}$
 $P(- \mid E, T) > P(+ \mid E, T) \Rightarrow$ predict "$-$"
Decision Tree Learning Examples
Example 1
Suppose we have the following data:

 2 classes: $\square$ (S) and $\bigcirc$ (C)
 there are 3 S and 5 C, 8 records in total:
 $I(S) = - \frac{3}{8} \log_2 \frac{3}{8} - \frac{5}{8} \log_2 \frac{5}{8} = 0.954$
 there are 2 attributes: $X$ and $Y$
 and 2 ways to split each: $X < 1.5$, $X < 2.5$, $Y < 1.5$, $Y < 2.5$


| split | $S_L$ | $I(S_L)$ | $S_R$ | $I(S_R)$ | $E[I(\{S_L, S_R\})]$ | $\Delta I$ |
|---|---|---|---|---|---|---|
| $X < 1.5$ | 2C | 0 | 3C, 3S | 1 | $\frac{2}{8} \cdot 0 + \frac{6}{8} \cdot 1 = 0.75$ | $0.954 - 0.75 = \fbox{0.204}$ |
| $X < 2.5$ | 3C, 2S | 0.971 | 2C, 1S | 0.918 | $\frac{5}{8} \cdot 0.971 + \frac{3}{8} \cdot 0.918 = 0.951$ | 0.003 |
| $Y < 1.5$ | 2C, 1S | 0.918 | 3C, 2S | 0.971 | 0.951 | 0.003 |
| $Y < 2.5$ | 2C | 0 | 3C, 3S | 1 | 0.75 | 0.204 |

We decide to use $X < 1.5$ to split the data

 the left part is pure - no need to split it
 the right part, 3C, 3S, is not pure: $I = 1$

| split | $I(S_L)$ | $I(S_R)$ | $\Delta I$ |
|---|---|---|---|
| $X < 2.5$ | 0.918 | 0.918 | 0.082 |
| $Y < 1.5$ | 1 | 1 | 0 |
| $Y < 2.5$ | 0.811 | 0 | $\fbox{0.459}$ |

 $Y < 2.5$ - the best splitting criterion at this step

 if we stop here, we obtain the following tree:

 this tree is not a perfect tree - we can continue splitting and eventually obtain a perfect one (see the code check below)
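The first split table can be reproduced from the class counts alone; a small sketch, with counts written as (circles, squares):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def split_gain(left, right):
    """Delta I of a binary split, given per-side class counts (C, S)."""
    parent = tuple(l + r for l, r in zip(left, right))
    n = sum(parent)
    expected = (sum(left) / n * entropy(left)
                + sum(right) / n * entropy(right))
    return entropy(parent) - expected

for name, left, right in [('X < 1.5', (2, 0), (3, 3)),
                          ('X < 2.5', (3, 2), (2, 1)),
                          ('Y < 1.5', (2, 1), (3, 2)),
                          ('Y < 2.5', (2, 0), (3, 3))]:
    print(name, round(split_gain(left, right), 3))
# X < 1.5 and Y < 2.5 tie at 0.204; the notes pick X < 1.5
```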
Pros and Cons
Advantages
 fast
 handles both symbolic and numerical attributes
 works well with redundant attributes
 invariant to any monotonic transformation of the attributes
 e.g. a tree built on $\text{income}$ will behave the same way as a tree built on $\sqrt{\text{income}}$
 robust to outliers
 easy to interpret and explain to non-specialists
 helps to understand which attributes are the most important
Disadvantages
 prone to overfitting - pruning techniques are needed
 even a small variation in the data can lead to a very different tree
 decision boundaries are rectilinear