Regularization
- Regularization is a technique used to address overfitting
- The main idea of regularization is to keep all the features, but reduce the magnitude of the parameters θ
- It works well when we have a lot of features, each of which contributes a bit to predicting y
Penalization
- Suppose we want to fit θ_0 + θ_1·x_1 + θ_2·x_2 + θ_3·x_3 + θ_4·x_4
- We want to penalize θ_3 and θ_4 and make them small
- To do that, we modify our cost function J:
- J(θ) = (1/m)·[ ∑_{i=1..m} cost(h_θ(x^(i)), y^(i)) + 1000·θ_3^2 + 1000·θ_4^2 ]
- where 1000·θ_3^2 and 1000·θ_4^2 are the penalties for using θ_3 and θ_4 respectively
- As a result, we'll have θ_3 ≈ 0 and θ_4 ≈ 0
So, regularization ensures that the values of θ_1, ..., θ_n are small
- this makes the hypothesis simpler
- and less prone to overfitting
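As a sketch, the penalized cost above can be written in numpy. I assume a squared-error cost() here (the notes leave cost() generic), and penalized_cost is a hypothetical helper name:

```python
import numpy as np

def penalized_cost(theta, X, y):
    """Cost with extra penalties on theta_3 and theta_4 (weight 1000, as above)."""
    m = len(y)
    base = np.sum((X @ theta - y) ** 2)               # assuming squared-error cost()
    penalty = 1000 * theta[3] ** 2 + 1000 * theta[4] ** 2
    return (base + penalty) / m                       # J = (1/m)*[sum of costs + penalties]
```

With a weight as large as 1000, any nonzero θ_3 or θ_4 dominates the cost, so the optimizer drives both towards 0.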
Cost Function
- we have features x_1, ..., x_n - they may be polynomial terms
- and parameters θ_0, θ_1, ..., θ_n
In general, we don't know in advance which parameters to penalize, so we shrink all of them.
Here is our cost function J with regularization:
- J(θ) = (1/m)·[ ∑_{i=1..m} cost(h_θ(x^(i)), y^(i)) + λ·∑_{j=1..n} θ_j^2 ]
- In the cost function we include a penalty for all the θs
- we typically don't penalize θ_0, only θ_1, ..., θ_n
- λ is called the regularization parameter
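A minimal numpy sketch of this regularized cost (again assuming a squared-error cost(); regularized_cost is a hypothetical name):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    m = len(y)
    base = np.sum((X @ theta - y) ** 2)       # sum of per-example costs (squared error assumed)
    penalty = lam * np.sum(theta[1:] ** 2)    # penalize theta_1..theta_n, but not theta_0
    return (base + penalty) / m
```

Note the slice theta[1:]: it is what keeps θ_0 out of the penalty.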
Regularization Parameter λ
When we find the optimum for our cost function J we have two goals:
- we would like to fit the training data well
- 1st term of the expression reflects that:
- ∑_{i=1..m} cost(h_θ(x^(i)), y^(i))
- we want to keep the parameters small
- 2nd term ensures that:
- λ·∑_{j=1..n} θ_j^2
λ is the parameter that controls the trade-off between these two goals
- We need to choose λ carefully
- large λ will lead to underfitting (we'll end up with h_θ(x) ≈ θ_0)
Say we want to fit h_θ(x) = θ_0 + θ_1·x + ... + θ_4·x^4
- (a) If λ is large (say λ = 10000), all θs are penalized, θ_1 ≈ θ_2 ≈ ... ≈ 0, and h_θ(x) ≈ θ_0
- (b) if λ is intermediate, we fit well
- (c) if λ is small (close to 0), we fit the training data too well, i.e. we overfit
To find the best value for this parameter, Model Selection techniques can be used - for example, Cross-Validation.
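One way to sketch this selection in numpy: fit θ for each candidate λ (here via the regularized normal-equation form, one of several possible fitting methods) and keep the λ with the lowest error on a held-out validation set. The validation error is computed without the penalty term; choose_lambda is a hypothetical helper name:

```python
import numpy as np

def choose_lambda(X_tr, y_tr, X_val, y_val, lambdas):
    best_lam, best_err = None, np.inf
    for lam in lambdas:
        n = X_tr.shape[1]
        E = np.eye(n)
        E[0, 0] = 0.0                                 # don't penalize theta_0
        # fit theta with the regularized closed-form solution
        theta = np.linalg.solve(X_tr.T @ X_tr + lam * E, X_tr.T @ y_tr)
        err = np.mean((X_val @ theta - y_val) ** 2)   # plain validation error, no penalty
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam, best_err
```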
Usage
When we use Gradient Descent (or another optimization technique), we have to take the regularization term into account.
Because we have changed J(θ) by adding the regularization term, the partial derivatives of J(θ) change as well. So the algorithm now looks as follows:
- repeat
- θ_0 = θ_0 - (α/m)·∑_{i=1..m} (h_θ(x^(i)) - y^(i))·x_0^(i) // no change for θ_0
- θ_j = θ_j - α·[ (1/m)·∑_{i=1..m} (h_θ(x^(i)) - y^(i))·x_j^(i) + (λ/m)·θ_j ] // for j = 1, ..., n
Or we can rewrite the update for θ_j as:
θ_j = θ_j·(1 - α·λ/m) - (α/m)·∑_{i=1..m} (h_θ(x^(i)) - y^(i))·x_j^(i)
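A vectorized numpy sketch of one such update (gradient_step is a hypothetical name; X is the m×(n+1) matrix of examples with x_0 = 1):

```python
import numpy as np

def gradient_step(theta, X, y, alpha, lam):
    m = len(y)
    err = X @ theta - y              # h_theta(x^(i)) - y^(i) for every example
    grad = (X.T @ err) / m           # (1/m) * sum over i of err_i * x_j^(i)
    reg = (lam / m) * theta
    reg[0] = 0.0                     # theta_0 is not regularized
    return theta - alpha * (grad + reg)
```

The -α·(λ/m)·θ_j term is exactly the (1 - α·λ/m) shrinkage from the rewritten form: each step shrinks θ_1, ..., θ_n a little before applying the usual gradient update.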
Normal Equation
Regularization can also be used when we solve for θ directly with the Normal Equation. We have the following input:
- X ∈ R^{m×(n+1)}, where the i-th row is (x^(i))^T
- y = (y_1, ..., y_m)^T ∈ R^m
We find θ by calculating θ = (X^T·X + λ·E*)^(-1)·X^T·y
- where E* ∈ R^{(n+1)×(n+1)}
- E* is almost the identity matrix (1s on the main diagonal, 0s everywhere else), except that the very first element is 0
- i.e. for n = 2:
  E* = [0 0 0]
       [0 1 0]
       [0 0 1]
- for λ > 0, (X^T·X + λ·E*) is always invertible
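A short numpy sketch of this closed-form solution (normal_equation is a hypothetical name; np.linalg.solve is used instead of forming the inverse explicitly):

```python
import numpy as np

def normal_equation(X, y, lam):
    n = X.shape[1]                  # n+1 columns: the bias column plus n features
    E = np.eye(n)
    E[0, 0] = 0.0                   # E*: identity with the very first element zeroed
    # solve (X^T X + lam * E*) theta = X^T y
    return np.linalg.solve(X.T @ X + lam * E, X.T @ y)
```

Solving the linear system is numerically preferable to computing the inverse, but the result is the same θ.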
Logistic Regression
The same idea applies to Logistic Regression. Its usual cost function is:
- J_old(θ) = -(1/m)·∑_{i=1..m} [ y^(i)·log(h_θ(x^(i))) + (1 - y^(i))·log(1 - h_θ(x^(i))) ]
- We need to modify it to penalize θ_1, ..., θ_n:
- J(θ) = J_old(θ) + (λ/(2m))·∑_{j=1..n} θ_j^2
Similarly, for Gradient Descent we have:
- repeat
- θ_0 = θ_0 - (α/m)·∑_{i=1..m} (h_θ(x^(i)) - y^(i))·x_0^(i) // no change for θ_0
- θ_j = θ_j - α·[ (1/m)·∑_{i=1..m} (h_θ(x^(i)) - y^(i))·x_j^(i) + (λ/m)·θ_j ]
- the updates look the same as for Linear Regression, but h_θ(x) is now the sigmoid hypothesis
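The regularized logistic cost J(θ) = J_old(θ) + (λ/(2m))·∑ θ_j^2 can be sketched in numpy as follows (logistic_cost is a hypothetical name):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y, lam):
    m = len(y)
    h = sigmoid(X @ theta)                                    # the sigmoid hypothesis
    j_old = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    return j_old + (lam / (2 * m)) * np.sum(theta[1:] ** 2)   # theta_0 not penalized
```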
See also
Sources