ML Wiki
Machine Learning Wiki - A collection of ML concepts, algorithms, and resources.

Information Gain

The expected information gain is

  • the change in entropy when going
  • from a prior state
  • to another, new state

Entropy

Information

Consider two sequences of coin flips

  • HHTHTTH…
  • TTHHTTH…

How much information do we get after flipping each coin?

  • information: $I(X) = - \log_2 p(X)$
  • we need a measure of unpredictability in a sequence of events
  • the Expected Value of information gives such a measure - the entropy:
    • $\text{Entropy}(X) = E[I(X)] = \sum_x p(x) \cdot I(x) = - \sum_x p(x) \cdot \log_2 p(x)$

E.g.

  • coin: $P(H) = P(T) = 0.5$
    • $\text{Entropy}(X) = - \sum_i p(i) \log_2 p(i) = - (0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1$
  • fair die: $p(i) = 1/6$
    • $\text{Entropy}(X) = - \sum_i p(i) \log_2 p(i) = - 6 \cdot \frac{1}{6} \log_2 \frac{1}{6} \approx 2.58$
    • more unpredictable than a coin
  • weighted die: $p(1) = \dots = p(5) = 0.1$ and $p(6) = 0.5$
    • $\text{Entropy}(X) = - 5 \cdot (0.1 \log_2 0.1) - 0.5 \log_2 0.5 \approx 2.16$
    • less unpredictable than a fair die
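
These numbers are easy to verify. Below is a minimal sketch in Python; the `entropy` helper is ours, not part of the wiki:

```python
# Compute Entropy(X) = -sum_x p(x) * log2 p(x) for the examples above.
from math import log2

def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution given as probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))          # coin: 1.0 bit
print(entropy([1/6] * 6))           # fair die: ~2.58 bits
print(entropy([0.1] * 5 + [0.5]))   # weighted die: ~2.16 bits
```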

Entropy Function

Given $m$ classes, the entropy of a set $S$ of records is

  • $I(S) = - \sum_{i=1}^{m} p_i \log_2 (p_i)$
  • where $p_i$ is the probability of seeing class $C_i$ in $S$, i.e. $p_i = |S_i| / |S|$
  • $S_i$ - the set of all records of class $C_i$

Split in Halves

Entropy for Two Classes

For two classes:

  • If records in $S$ belong to classes $C_1$ or $C_2$
  • Let $p_1 = |S_1| / |S| = p$ and $p_2 = |S_2| / |S| = 1 - p$
  • Then $I(S) = -p \cdot \log_2 (p) - (1 - p) \cdot \log_2 (1-p)$
    • this is the binary entropy function - the expected information for two classes

Entropy is maximal when $p_1 = p_2 = 0.5$

  • when $p_1 = 0$ and $p_2 = 1$ (or vice versa), the entropy is zero
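
A quick numerical check of this behaviour (an illustrative snippet, not code from the wiki):

```python
# Binary entropy -p*log2(p) - (1-p)*log2(1-p): 0 at p = 0 or p = 1, maximal (1 bit) at p = 0.5.
from math import log2

def binary_entropy(p):
    if p in (0.0, 1.0):   # define 0 * log2(0) = 0
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"p = {p:.1f}  entropy = {binary_entropy(p):.3f}")
```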

Image: the binary entropy curve as a function of $p$ - zero at $p = 0$ and $p = 1$, maximal at $p = 0.5$

Splitting

Suppose we want to split a set $S$ into two parts $S_1$ and $S_2$

  • we compute the entropy before splitting
  • and compute the entropy after splitting at some point $\alpha$
  • the "information gain" of the split is the entropy before minus the expected entropy after the split

$I(S, \alpha) = I(S) - E[I(\{S_1, S_2\})] = I(S) - p_1 \cdot I(S_1) - p_2 \cdot I(S_2)$

  • where $p_1 = \cfrac{|S_1|}{|S|}$, $p_2 = \cfrac{|S_2|}{|S|}$, and $S_1, S_2$ are subsets of $S$ s.t. $S_1 \equiv \{ s \ : \ s < \alpha \}$ and $S_2 = S - S_1$
  • note that $I(S)$ is a constant and doesn’t depend on $\alpha$
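
This can be written as a short sketch; a minimal, illustrative implementation (the function and variable names are ours, not from the wiki):

```python
# Information gain of splitting the records at a threshold alpha:
# I(S, alpha) = I(S) - |S1|/|S| * I(S1) - |S2|/|S| * I(S2),
# where S1 = {s in S : s < alpha} and S2 = S - S1.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, alpha):
    left  = [y for x, y in zip(values, labels) if x < alpha]   # S1
    right = [y for x, y in zip(values, labels) if x >= alpha]  # S2
    n = len(labels)
    after = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - after

# hypothetical example: a split at 5 separates the two classes perfectly
values = [1, 2, 3, 7, 8, 9]
labels = ['A', 'A', 'A', 'B', 'B', 'B']
print(information_gain(values, labels, alpha=5))   # 1.0
```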

Best Splitting

What is the best $\alpha$?

  • we want to select a boundary $\alpha$ that maximizes the information gain

So for all possible values of $\alpha$

  • we compute the information gain
  • and select the best splitting
  • the process can be repeated recursively
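
A minimal sketch of this search; the helpers are repeated from the previous sketch so the block runs on its own, and names like `best_split` are ours, not the wiki’s:

```python
# Pick the boundary alpha with the highest information gain by trying
# midpoints between consecutive distinct sorted values.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, alpha):
    left  = [y for x, y in zip(values, labels) if x < alpha]
    right = [y for x, y in zip(values, labels) if x >= alpha]
    n = len(labels)
    after = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - after

def best_split(values, labels):
    xs = sorted(set(values))   # assumes at least two distinct values
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return max(candidates, key=lambda alpha: information_gain(values, labels, alpha))

values = [1, 2, 3, 7, 8, 9]
labels = ['A', 'A', 'B', 'B', 'B', 'B']
print(best_split(values, labels))   # 2.5: the boundary between the two classes
```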

Example 1

Entropy answers the question “how unpredictable is your data?”

  • suppose that in the Titanic dataset (http://www.kaggle.com/c/titanic-gettingStarted/data) 342 out of 891 passengers survived
  • $p(\text{Survive}) = \cfrac{342}{891}$
  • $\text{Entropy} = - \cfrac{342}{891} \log_2 \cfrac{342}{891} - \cfrac{549}{891} \log_2 \cfrac{549}{891} \approx 0.96$
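
The $\approx 0.96$ figure can be checked directly with a tiny illustrative snippet:

```python
# Entropy of the Survive variable: 342 survivors out of 891 passengers.
from math import log2

p = 342 / 891
print(-p * log2(p) - (1 - p) * log2(1 - p))   # ~0.96 bits
```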

This is an important measure for splitting in Decision Trees:

  • select the attribute with the highest information gain
  • it will reduce the unpredictability the most

Example 2

Suppose we have a data set of customers

  • class $C_1$ - drinks milk, class $C_2$ - doesn’t drink milk
  • what’s the best boundary to split the age variable? (see the sketch below)
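
The wiki’s customer data isn’t reproduced here, so the sketch below uses a small hypothetical age/milk dataset purely to illustrate how such a boundary would be found; all numbers are made up:

```python
# Hypothetical data (not from the wiki): customer ages and whether they drink milk.
from collections import Counter
from math import log2

ages = [23, 25, 30, 35, 40, 45, 50, 55, 60, 65]
milk = [ 1,  1,  1,  1,  0,  1,  0,  0,  0,  0]   # 1 = drinks milk (C1), 0 = doesn't (C2)

def H(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

best = None
for alpha in sorted(set(ages))[1:]:                      # candidate boundaries
    s1 = [m for a, m in zip(ages, milk) if a < alpha]    # S1 = {age < alpha}
    s2 = [m for a, m in zip(ages, milk) if a >= alpha]   # S2 = S - S1
    gain = H(milk) - len(s1) / len(milk) * H(s1) - len(s2) / len(milk) * H(s2)
    if best is None or gain > best[1]:
        best = (alpha, gain)

print("best boundary:", best[0], "gain:", round(best[1], 3))
```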

Sources

  • http://en.wikipedia.org/wiki/Information_gain_in_decision_trees
  • http://en.wikipedia.org/wiki/Mutual_information
  • http://en.wikipedia.org/wiki/Information_entropy