
Residual Analysis

Residuals

"Residuals" are the differences between the actual values and the predicted values

The $i$th residual is:

  • $e_i = y_i - (b_0 + b_1 x_i) = y_i - b_0 - b_1 x_i$

Mean of $e$:

  • $\bar{e} = \bar{y} - b_0 - b_1 \bar{x} = \bar{y} - (\bar{y} - b_1 \bar{x}) - b_1 \bar{x} = 0$
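
As a quick numerical check of the formulas above (the least-squares slope $b_1 = R \cfrac{s_y}{s_x}$, the intercept $b_0 = \bar{y} - b_1 \bar{x}$, and $\bar{e} = 0$), here is a minimal Python sketch; the data is made up purely for illustration:

```python
import numpy as np

# Made-up data, just for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.7, 12.3])

# Least-squares coefficients written as in this section:
# b1 = R * s_y / s_x,  b0 = y_bar - b1 * x_bar
R = np.corrcoef(x, y)[0, 1]
s_x, s_y = x.std(ddof=1), y.std(ddof=1)
b1 = R * s_y / s_x
b0 = y.mean() - b1 * x.mean()

# Residuals e_i = y_i - b0 - b1 * x_i, and their mean (zero up to rounding)
e = y - b0 - b1 * x
print(e.mean())
```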

What about $\text{Var}(e)$?

  • $e_i = y_i - b_0 - b_1 x_i = y_i - (\bar{y} - b_1 \bar{x}) - b_1 x_i = (y_i - \bar{y}) - b_1 (x_i - \bar{x}) = (y_i - \bar{y}) - R \cfrac{s_y}{s_x} (x_i - \bar{x})$
  • so,
    $(e_i - \bar{e})^2 = (y_i - \bar{y})^2 + \left( R \cfrac{s_y}{s_x} \right)^2 (x_i - \bar{x})^2 - 2 R \cfrac{s_y}{s_x} (y_i - \bar{y})(x_i - \bar{x})$
  • $\text{Var}(e) = \cfrac{1}{n - 1} \sum (e_i - \bar{e})^2 = s_y^2 + \left( R \cfrac{s_y}{s_x} \right)^2 s_x^2 - 2 R \cfrac{s_y}{s_x} \cdot \cfrac{1}{n - 1} \sum (x_i - \bar{x})(y_i - \bar{y})$
  • note the correlation coefficient again: $\cfrac{1}{n - 1} \sum (x_i - \bar{x})(y_i - \bar{y}) = R \, s_x s_y$, so
    $\text{Var}(e) = s_y^2 + R^2 s_y^2 - 2 R \cfrac{s_y}{s_x} \cdot R \, s_x s_y = s_y^2 + R^2 s_y^2 - 2 R^2 s_y^2 = s_y^2 (1 - R^2)$
    So
    $\text{Var}(e) = \text{Var}(y)(1 - R^2)$
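
The same made-up data as above can be used to check $\text{Var}(e) = \text{Var}(y)(1 - R^2)$ numerically (sample variances, i.e. ddof=1):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.7, 12.3])

R = np.corrcoef(x, y)[0, 1]
b1 = R * y.std(ddof=1) / x.std(ddof=1)
b0 = y.mean() - b1 * x.mean()
e = y - b0 - b1 * x

# Var(e) computed directly vs. via the identity Var(e) = Var(y) * (1 - R^2)
print(e.var(ddof=1))
print(y.var(ddof=1) * (1 - R**2))   # the two values agree
```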

Coefficient of Determination

The regression multiplies the variance of $y$ by $(1 - R^2)$

  • Or, the regression line "removes" (or "reduces") a fraction $R^2$ of the variance of $y$
  • Or we say it "explains a fraction $R^2$ of the variation"

$R^2$ is called the "coefficient of determination" - it says what fraction of $\text{Var}(y)$ has been explained by the linear relationship
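
In simple linear regression this fraction can be computed in two equivalent ways: as the squared correlation coefficient, or as $1 - \text{Var}(e)/\text{Var}(y)$. A short sketch with the same made-up data as above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.7, 12.3])

# R^2 as the squared correlation coefficient ...
R = np.corrcoef(x, y)[0, 1]
r2_corr = R ** 2

# ... and as the explained fraction of Var(y): 1 - Var(e) / Var(y)
b1 = R * y.std(ddof=1) / x.std(ddof=1)
b0 = y.mean() - b1 * x.mean()
e = y - b0 - b1 * x
r2_frac = 1 - e.var(ddof=1) / y.var(ddof=1)

print(r2_corr, r2_frac)   # identical up to rounding
```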

Examples:

  • $R^2 = 0$: the linear relationship explains nothing (so no linear relationship between $X$ and $Y$)
  • $R^2 = 1$: the linear relationship explains everything - no left-overs, no uncertainty
  • $R^2 = 0.0186$: only 1.86% of the variation is explained by the linear model - so there's hardly any linear relation. The rest of the variance (about 98%) is due to something else

Let’s take a look at the example again:

  • <img src="Image" />
  • $R^2 = 0.4033$
  • so quite a bit of the variance is explained by the linear model
  • but it still doesn't explain everything - indeed, the real data doesn't seem to follow a purely linear relationship

Residual Analysis

Are there any other kinds of relationships between $X$ and $Y$, not captured by regression?

Ideal case

<img src="Image" />

  • This is a good case: after taking out the linear relationship there's no particular pattern in the residuals - only independent errors are left
  • So overall there's no particular trend, which means the regression really tells us something about the relationship between $X$ and $Y$
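
A common way to check this in practice is to plot the residuals against $x$ and look for structure. A minimal sketch on synthetic data (matplotlib assumed available); in the ideal case the points scatter around zero with no visible pattern:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: a true linear relationship plus independent noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3 + 2 * x + rng.normal(scale=1.0, size=x.size)

b1, b0 = np.polyfit(x, y, deg=1)   # slope, intercept
e = y - (b0 + b1 * x)              # residuals

# Residual plot: in the ideal case, no trend - just noise around zero
plt.scatter(x, e)
plt.axhline(0, color="gray")
plt.xlabel("x")
plt.ylabel("residual")
plt.show()
```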

Another Example

<img src="Image" />

And the same here

<img src="Image" />

In both cases the linear relationship doesn't tell the whole story: there are apparent patterns in the residuals

Logarithmic Transformation

  • To improve the situation, we could try to transform the variables before applying the regression.
  • The most common transformation is the logarithmic one

So we have the following:

<img src="Image" />

Recall that in this case $R^2 = 0.40$

If we transform $x$ to $\log_{10} x$, what we get is

<img src="Image" />

Now we’re able to fit a better regression line and in this case $R^2 = 0.6576$
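
A sketch of the same procedure on synthetic data (the numbers below are made up; only the idea of comparing $R^2$ before and after the transformation follows the text):

```python
import numpy as np

# Synthetic data where y grows roughly with log10(x)
rng = np.random.default_rng(1)
x = np.linspace(1, 100, 80)
y = 5 + 15 * np.log10(x) + rng.normal(scale=1.0, size=x.size)

# R^2 of a straight-line fit on the original x ...
r2_raw = np.corrcoef(x, y)[0, 1] ** 2

# ... vs. R^2 after transforming x to log10(x) before fitting
r2_log = np.corrcoef(np.log10(x), y)[0, 1] ** 2

print(r2_raw, r2_log)   # the log-transformed fit explains more of Var(y)
```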

Here we interpret the slope of 14.93 as follows:

  • if $\log_b x$ increases by $1$, $y$ increases by 14.93
  • or, if $x$ is multiplied by $b$ (here $b = 10$), $y$ increases by 14.93
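
A tiny numerical check of this reading (only the slope 14.93 comes from the example above; the intercept and the $x$ values below are hypothetical):

```python
import numpy as np

b0, b1 = 2.0, 14.93                 # hypothetical intercept, slope from the example

def y_hat(x):
    return b0 + b1 * np.log10(x)    # fitted line on the log10 scale

# Multiplying x by the base (10) raises the prediction by exactly the slope
print(y_hat(50) - y_hat(5))         # prints ~14.93
```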
