This formulation, however, has strong limitations: it is prone to over-fitting and is sensitive to the presence of outliers. As a remedy, we add a regularization term to the error function in order to control over-fitting:
```math
E_{data} = y_{true} - y_p(x, w)
```

```math
E_{total} = E_{data} + \lambda E_{regularization}
```
where λ defines the relative effect of the regularization term. $`E_{regularization}`$ is typically defined as a function of the weight vector (w), and variations in this dependency lead to alternative regularization methods. The underlying idea is to encourage the optimizer to decay the weight values towards zero unless the data dictates otherwise. In statistics, this is known as a [parameter shrinkage method](https://en.wikipedia.org/wiki/Shrinkage_(statistics)).
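As a concrete illustration, the following minimal NumPy sketch implements this decomposition. It assumes a sum-of-squares data term and an L2 (sum-of-squared-weights) penalty, which are common but by no means the only choices; the function names are illustrative and not taken from the text above.

```python
import numpy as np

def data_error(y_true, y_pred):
    # Data term: a sum-of-squares misfit between targets and predictions
    # (one common concrete choice for E_data).
    return 0.5 * np.sum((y_true - y_pred) ** 2)

def regularization_error(w):
    # L2 penalty: depends only on the weight vector; replacing it with
    # np.sum(np.abs(w)) would give an L1 (lasso-style) penalty instead.
    return 0.5 * np.sum(w ** 2)

def total_error(y_true, y_pred, w, lam):
    # E_total = E_data + lambda * E_regularization
    return data_error(y_true, y_pred) + lam * regularization_error(w)
```

Changing only `regularization_error` while keeping the rest fixed is exactly how the alternative regularization methods mentioned above differ from one another.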
With regularization, we can reduce the effective model complexity, so models can be trained on a limited amount of data with much less over-fitting. It should be noted, however, that this addition introduces another hyperparameter (λ), which needs to be determined for the case of interest.
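One common way to determine λ is to fit the model for several candidate values and keep the one with the lowest error on held-out validation data. The sketch below is a hypothetical NumPy example, using the closed-form solution of an L2-regularized linear model; the function names and the candidate grid are assumptions, not part of the original text.

```python
import numpy as np

def fit_ridge(X, y, lam):
    # Closed-form minimizer of the L2-regularized sum-of-squares error for a
    # linear model: w = (X^T X + lambda * I)^{-1} X^T y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

def select_lambda(X_train, y_train, X_val, y_val, candidates):
    # Fit one model per candidate lambda and keep the value with the lowest
    # mean squared error on the validation set.
    best_lam, best_err = None, np.inf
    for lam in candidates:
        w = fit_ridge(X_train, y_train, lam)
        err = np.mean((y_val - X_val @ w) ** 2)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam

# Hypothetical usage, given pre-split training and validation arrays:
# lam = select_lambda(X_train, y_train, X_val, y_val, [0.0, 0.01, 0.1, 1.0, 10.0])
```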
## Additional Sources