Processing math: 100%

Regularization

  • Regularization is a technique used to address overfitting
  • Main idea of regularization is to keep all the features, but reduce magnitude of parameters θ
  • It works well when we have a lot of features, each of which contributes a bit to predicting y


Penalization

  • Suppose we want to fit θ0+θ1x1+θ2x2+θ3x3+θ4x4
  • We want to penalize θ3 and θ4 - and make them small
  • we modify out cost function J:
J(θ)=1m[mi=1cost(hθ(x(i)),y(i))+1000θ23+1000θ24]
  • where 1000θ23 and 1000θ24 - penalty for using θ3 and θ4 respectively
  • As a result, we'll have θ30 and θ40


So, regularization ensures that values for θ1...θn are small

  • it makes the hypotheses simple
  • and less prone to overfitting


Cost Function

  • we have features x1,...,xm - they may be polynomials
  • and parameters θ0,θ1,...,θm

We don't know which parameter to penalize.


Here is our cost function J with regularization:

  • J(θ)=1m[mi=1cost(hθ(x(i)),y(i))+λnj=1θ2j]
  • In the cost function we include the penalty for all θs
    • we typically don't penalize θ0, only θ1,...,θn
    • λ is called regularization parameter


Regularization Parameter λ

When we find the optimum for our cost function J we have two goals:

  • we would like to fit the training data well
    • 1st term of the expression reflects that:
    • mi=1(cost(hθ(x(i))),y(i))
  • we want to keep the parameters small
    • 2nd term ensures that:
    • λnj=1θ2j


λ is the paramenter that controls the trade-off between these two goals

  • We need to choose λ carefully
    • large λ will lead to underfitting (we'll end up with hθ(x)θ0 )


Say we want to fit hθ(x)=θ0+θ1x+...+θ4x4

  • diagnosis-regularization.png
  • (a) If λ if large (say λ=10000), all θ are penalized and θ1θ2...0, hθ(x)θ0
  • (b) if λ is intermediate, we fit well
  • (c) if λ is small (close to 0) we fit too well, i.e. we overfit

To find the best value for this parameter, Model Selection techniques can be used. For example, Cross-Validation


Usage

Regularized Linear Regression

Gradient Descent

When we use Gradient Descent (or other optimization technique), we have the following algorithm:

  • repeat:
    • for all j
    • θj=θjθjJ(θ)


Because we have changed the J(θ) by adding the regularization term, we need to change the partial derivatives of J(θ). So the algorithm now looks as follows:

  • repeat
    • θ0=θ0αm(hθ(x(i))y(i))x(i)0 // no change for θ0
    • θj=θjα[1m(hθ(x(i))y(i))x(i)0+λmθj]

Or we can rewrite the last one as: θj=θj(1αλm)αm(hθ(x(i))y(i))x(i)0


Normal Equation

We have the following input:

  • X=[...(x(1))T.........(x(m))T...]Rm×(n+1)
  • y=[y1ym]Rm

We find θ by calculating θ=(XTX+λE)1XTy

  • where ER(n+1)×(n+1)
    • and E is almost identity matrix (1s on the main diagonal, the rest is 0s), except that the very first element is 0
    • i.e. for n=2 : [000010001]
  • (XTX+λE) is always invertible


Regularized Logistic Regression

Jold(θ)=1m[ylog(hθ(x))+(1y)log(1hθ(x))]
  • We need to modify it to penalize θ1,...,θn:
J(θ)=Jold(θ)+λ2mnj=1θ2j


Similarly, for Gradient Descent we have

  • repeat
    • θ0=θ0αm(hθ(x(i))y(i))x(i)0 // no change for θ0
    • θj=θjα[1m(hθ(x(i))y(i))x(i)0+λmθj]


See also

Sources