Cost-Complexity Pruning
Post-pruning algorithm for Decision Trees
- by Breiman, Olshen, Stone (1984)
Cost-Complexity Function
- need to optimize the cost-complexity function
- $R_\alpha (T) = R(T) + \alpha \cdot | f(T) |$ where
- $R(T)$ is the training/learning error
- $f(T)$ a function that returns the set of leaves of tree $T$
- $\alpha$ is a Regularization parameter
- $R(T) = \sum_{t \in f(T)} r(t) \cdot p(t) = \sum_{t \in f(T)} R(t)$
- $\sum_{t \in f(T)} R(t)$ - sum of misclassification errors at each leaf
- $r(t) = 1 - \max_k p(C_k - t)$ - misclassification rate
- $p(t) = \cfrac{n(t)}{n}$ with $n(t)$ being the # of records in node $t$ and $n$ - total # of records
Pruning Subtrees
Subtrees:
Pruning a subtree $T_{t}$
- $R_\alpha(T - T_t) - R_\alpha(T)$ - variation of the cost-complexity function
- this is the cont-complexity when pruning subtree $T_t$
- $R_\alpha(T - T_t) - R_\alpha(T) = R(T - T_t) - R(T) + \alpha ( | f(T - T_t) | - |f(T)| ) = R(t) - R(T_t) + \alpha ( 1 - |f(T_t)| )$
- let $\alpha' = \cfrac{R(t) - R(T_t)}{|f(T_t)| - 1}$
- variation is
- null if $\alpha = \alpha'$
- negative if $\alpha < \alpha'$
- positive if $\alpha > \alpha'$
Algorithm
Pruning Algorithm:
- Initialization:
- let $T^1$ be the tree obtained with $\alpha^1 = 0$
- by minimizing $R(T)$
- Step 1
- select node $t \in T^1 $ that minimizes
- $g_1(t) = \cfrac{R(t) - R(T^1_t)}{|f(T^1_t)| - 1}$
- let $t_1$ be this node
- let $\alpha^2 = g_1(t_1)$ and $T^2 = T^1 - T^1_{t_1}$
- step $i$
- select node $t \in T^i $ that minimizes
- $g_i(t) = \cfrac{R(t) - R(T^i_t)}{|f(T^i_t)| - 1}$
- let $t_i$ be this node
- let $\alpha^{i + 1} = g_i(t_i)$ and $T^{i+1} = T^i - T^i_{t_i}$
Output:
- a sequence of trees $T^1 \supseteq T^2 \supseteq \ ... \ \supseteq T^k \supseteq \ ... \ \supseteq \{ \text{root} \}$
- a sequence of parameters $\alpha^1 \leqslant \alpha^2 \leqslant \ ... \ \leqslant \alpha^k \leqslant \ ... $
Choosing $\alpha$
The algorithm outputs $\alpha^1 \leqslant \alpha^2 \leqslant \ ... \ \leqslant \alpha^k \leqslant \ ... $
- need to choose some $\alpha \in [\alpha^k, \alpha^{k+1} )$
- let $\alpha \in [\alpha^k, \alpha^{k+1} )$
How to choose $\alpha$
Example
Example 1
Suppose we have the following tree:
-

- we want to prune it
- we have 3 inner nodes where we can prune: $t_1 \equiv \text{root}, t_2, t_3$
Some formulas:
- $R(T_t)$ - training error of a subtree $T_t$ - a tree with root at node $t$
- $R(T_t) = \sum_{l \in f(T_t)} R(l)$ - sum of all training errors over all leaves
- $R_(t)$ - training error of node $t$
- $R_(t) = r(t) \cdot p(t)$
- $r(t)$ - misclassification error at this none (without considering the leaves)
- $p(t)$ - proportion of data items reached $t$ (i.e. # of items reached $t$ divided by # of training items)
- $g(t) = \cfrac{R(t) - R(T_{T_t})}{| f(T_t) | - 1}$
- $| f(T_t) | - 1$ is the number of leaves to prune
Iteration 1:
$t$ |
$R_(t)$ |
$R(T_t)$ |
$g(t)$
|
$t_1$ |
$\cfrac{8}{16} \cdot \cfrac{16}{16}$ |
$T_{t_1}$ - the entire tree all leaves are pure $R(T_{t_1}) = 0$ |
$\cfrac{8/16 - 0}{4 - 1} = \cfrac{1}{6}$
|
$t_2$ |
$\cfrac{4}{12} \cdot \cfrac{12}{16} = \cfrac{4}{16}$ (there are 12 records, 4 $\blacksquare$ + 8 $\bigcirc$ ) |
$R(T_{t_2}) = 0$ |
$\cfrac{4/16 - 0}{3 - 1} = \cfrac{1}{8}$
|
$t_3$ |
$\cfrac{2}{6} \cdot \cfrac{6}{16} = \cfrac{2}{16}$ |
$R(T_{t_3}) = 0$ |
$\cfrac{2/16 - 0}{3 - 1} = \cfrac{1}{8}$
|
We want to find the minimal $g(t)$
- it's $g(t_2)$ and $g(t_3)$
- in case of a tie, we choose the one that prunes fewer nodes
- i.e. $g(t_3)$
- so prune at $t_3$
- let $\alpha^{(2)} = 1/8$ (the min $g(t)$)
Iteration 2:
- in the tree now we have only candidates: $t_1$ and $t_2$
$t$ |
$R_(t)$ |
$R(T_t)$ |
$g(t)$
|
$t_1$ |
$\cfrac{8}{16} \cdot \cfrac{16}{16}$ |
$\cfrac{2}{16}$ |
$\cfrac{8/16 - 2/16}{3 - 1} = \cfrac{6}{32}$
|
$t_2$ |
$\cfrac{4}{12} \cdot \cfrac{12}{16}$ |
$\cfrac{2}{16}$ |
$\cfrac{4/16 - 2/16}{2 - 1} = \cfrac{1}{8}$
|
Find minimal $g(t)$:
- it's $g(t_2) = 1/8$
- let $\alpha^{(3)} = 1/8$
- prune at $t_2$
-

Iteration 3:
- only one candidate for pruning: $t_1$
- $\alpha^{(4)} = g(t_1) = \cfrac{8/16 - 4/16}{2 - 1} = \cfrac{1}{4}$
Selecting the best:
- we have these values: $\alpha^{(0)} = 0, \alpha^{(1)} = 1/8, \alpha^{(2)} = 1/8, \alpha^{(3)} = 1/4$
- by the theorem we want to find tree such $T$ that minimizes the cost-complexity function
- if $0 \geqslant \alpha < 1/8$, then $T_1$ is the best
- if $\alpha = 1/8$, then $T_2$ is the best
- if $1/8 < \alpha < 1/4$, then $T_3$ is the best
- if $1/8 < \alpha < 1/4$, then $T_3$ is the best
- to choose $\alpha$ use Cross-Validation
Example 2
From IT4BI 2013 year exam:
Sources