Information Gain
The expected information gain is the change in entropy when going from a prior state to another, new state.
Entropy
Information
Consider two sequences of coin flips
How much information do we get after flipping each coin?
 information: $I(X) = -\log_2 p(X)$
 need a measure of unpredictability in a sequence of events
 Expected Value of information:
 $\text{Entropy}(X) = E[I(X)] = \sum_x p(x) \cdot I(x) = -\sum_x p(x) \cdot \log_2 p(x)$
E.g.
 coin: $P(H) = P(T) = 0.5$
 $\text{Entropy}(X) = -\sum_i p(i) \log_2 p(i) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1$
 die: $p(i) = 1/6$
 $\text{Entropy}(X) = -\sum_i p(i) \log_2 p(i) = -6 \cdot (\frac{1}{6} \log_2 \frac{1}{6}) \approx 2.58$
 more unpredictable than a coin
 weighted die: $p(1) = ... = p(5) = 0.1$ and $p(6) = 0.5$
 $\text{Entropy}(X) = -5 \cdot (0.1 \log_2 0.1) - 0.5 \log_2 0.5 \approx 2.16$
 less unpredictable than a fair die
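A minimal Python sketch of this entropy formula (the helper name entropy is ours, not from the text) reproduces the three numbers above:

import math

def entropy(probs):
    # Entropy(X) = -sum_x p(x) * log2(p(x)); terms with p = 0 contribute 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))           # coin: 1.0
print(entropy([1/6] * 6))            # fair die: ~2.585
print(entropy([0.1] * 5 + [0.5]))    # weighted die: ~2.161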
Entropy Function
Given $m$ classes, the entropy of a set $S$ is
 $I(S) = -\sum_{i=1}^{m} p_i \log_2 (p_i)$
 where $p_i = |S_i| / |S|$ is the probability of seeing class $C_i$ in $S$
 and $S_i$ is the set of all records of class $C_i$
Split in Halves
Entropy for Two Classes
For two classes:
 if the records in $S$ belong either to class $C_1$ or to class $C_2$
 let $p_1 = |S_1| / |S| = p$ and $p_2 = |S_2| / |S| = 1 - p$
 then $I(S) = -p \cdot \log_2 (p) - (1 - p) \cdot \log_2 (1 - p)$
 so the entropy of a two-class set is a function of $p$ alone
Entropy is maximal when $p_1 = p_2 = 0.5$
 when $p_1 = 0$ and $p_2 = 1$ (or vice versa), the entropy is zero
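A short sketch of this two-class entropy (the function name binary_entropy is ours) shows the maximum at $p = 0.5$ and zero at the endpoints:

import math

def binary_entropy(p):
    # I(S) = -p*log2(p) - (1-p)*log2(1-p); pure sets have zero entropy
    if p == 0 or p == 1:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))   # 1.0 -> maximal
print(binary_entropy(0.9))   # ~0.469
print(binary_entropy(1.0))   # 0.0 -> all records in one class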
Splitting
Suppose we want to split a set $S$ into two parts $S_1$ and $S_2$
 we compute the entropy before splitting
 and compute the entropy after splitting at some point $\alpha$
 the information gain of splitting is the entropy before minus the expected entropy after
$\text{Gain}(S, \alpha) = I(S) - E[I(\{S_1, S_2\})] = I(S) - p_1 \cdot I(S_1) - p_2 \cdot I(S_2)$
 where $p_1 = \cfrac{|S_1|}{|S|}$ and $p_2 = \cfrac{|S_2|}{|S|}$
 and $S_1, S_2$ are subsets of $S$ s.t. $S_1 = \{ s \in S \ : \ s < \alpha \}$ and $S_2 = S \setminus S_1$
 note that $I(S)$ is a constant and doesn't depend on $\alpha$
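The gain of a split at a given $\alpha$ can be computed directly from this definition; the sketch below (helper names entropy_of_labels and information_gain are ours) assumes numeric values paired with class labels:

import math

def entropy_of_labels(labels):
    # empirical entropy of a list of class labels
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(values, labels, alpha):
    # Gain(S, alpha) = I(S) - p1 * I(S1) - p2 * I(S2),
    # with S1 = {s : s < alpha} and S2 = S \ S1
    left  = [y for x, y in zip(values, labels) if x < alpha]
    right = [y for x, y in zip(values, labels) if x >= alpha]
    n = len(labels)
    gain = entropy_of_labels(labels)
    if left:
        gain -= (len(left) / n) * entropy_of_labels(left)
    if right:
        gain -= (len(right) / n) * entropy_of_labels(right)
    return gain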
Best Splitting
What is the best $\alpha$:
 we want to select a boundary $\alpha$ that maximizes the information gain
So for all possible values of $\alpha$
 we compute the information gain
 and select the best splitting
 the process can be repeated recursively
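One way to implement this search, as a sketch building on the information_gain helper above: try a boundary between every pair of adjacent distinct values and keep the $\alpha$ with the highest gain.

def best_split(values, labels):
    # uses information_gain() from the previous sketch
    candidates = sorted(set(values))
    best_alpha, best_gain = None, -1.0
    for lo, hi in zip(candidates, candidates[1:]):
        alpha = (lo + hi) / 2        # boundary between two adjacent values
        g = information_gain(values, labels, alpha)
        if g > best_gain:
            best_alpha, best_gain = alpha, g
    return best_alpha, best_gain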
Example 1
Entropy answers the question "how unpredictable is your data?"
 suppose that in the Titanic dataset ([1]) 342 out of 891 passengers survived
 $p(\text{Survive}) = \cfrac{342}{891}$
 $\text{Entropy} = -\cfrac{342}{891} \log_2 \cfrac{342}{891} - \cfrac{549}{891} \log_2 \cfrac{549}{891} \approx 0.96$
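This number is easy to reproduce in a couple of lines (the survivor counts are the ones from the text above):

import math

p = 342 / 891                            # P(Survive)
h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
print(round(h, 2))                       # 0.96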
This is an important measure for splitting in Decision Trees
 select the attribute with the highest information gain
 it will reduce the unpredictability the most
Example 2
Suppose we have a dataset of customers
 class $C_1$ (drinks milk) and class $C_2$ (doesn't drink milk)
 what's the best boundary $\alpha$ to split the age variable?
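With the best_split sketch from above and a small made-up age/label sample (the numbers here are purely illustrative, not from any real dataset), the answer is one call:

ages   = [5, 7, 10, 12, 30, 35, 40, 50]                    # hypothetical ages
labels = ['C1', 'C1', 'C1', 'C1', 'C2', 'C2', 'C2', 'C1']  # hypothetical classes
alpha, gain = best_split(ages, labels)
print(alpha, round(gain, 3))                               # 21.0 0.549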

Links
Sources