Let's think of a simple model: a linear function of the input vector with a bias:

```math
y_p(x, w) = w_0 + w_1 x_1 + \ldots + w_i x_i
```

To convert the regression model into a classifier, we can imagine a binary classification scenario and claim that we have Class 1 if $`y_p>=0`$; otherwise, Class 2. By enforcing such a rule, we can figure out the weights that would work for our training dataset by minimizing an error function (e.g. least squares). This toy example is good for demonstration but not very useful in practice: (i) the least-squares approach is not robust and fails drastically in the presence of outliers, (ii) the model always makes perfectly confident predictions (either 0 or 1) and cannot distinguish examples close to the decision boundary from those far away, and (iii) more importantly, under the hood we assume a Gaussian conditional distribution for the labels, which is usually not the case.

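As a rough illustration, here is a minimal NumPy sketch of this toy approach on synthetic data (the data, the ±1 label encoding, and all variable names are made up for demonstration):

```python
import numpy as np

# Illustrative synthetic data: 50 samples per class, 2 features each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 1.0, (50, 2)),    # Class 1 cloud
               rng.normal(-2.0, 1.0, (50, 2))])  # Class 2 cloud
t = np.hstack([np.ones(50), -np.ones(50)])       # encode Class 1 as +1, Class 2 as -1

# Least-squares fit of y_p(x, w) = w_0 + w_1 x_1 + w_2 x_2 (bias via a column of ones).
X_b = np.hstack([np.ones((X.shape[0], 1)), X])
w, *_ = np.linalg.lstsq(X_b, t, rcond=None)

# Decision rule: Class 1 if y_p >= 0, otherwise Class 2.
pred = np.where(X_b @ w >= 0, 1, -1)
print("training accuracy:", (pred == t).mean())
```
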
We can alleviate such issues by using another threshold function, the logistic function:

```math
\sigma(x) = \frac{1}{1 + \exp(-x)}
```

This function is also continuous and differentiable, so we can use the gradient of the error surface to learn. To build a decision surface from the logistic function, we will just use it as a basis function:

```math
y_p(x, w) = \sigma(w_0 + w_1 x_1 + \ldots + w_i x_i)
```

The continuous nature of the basis function gives us a gentle transition from Class 1 to Class 2, which can be interpreted as the probability of belonging to a class.

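As a minimal sketch (assuming NumPy; the weights `w`, bias `b`, and sample points below are made up for illustration), the model outputs values in (0, 1) that approach 0.5 near the decision boundary:

```python
import numpy as np

def sigmoid(a):
    # Logistic function: maps any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(X, w, b):
    # y_p(x, w) = sigmoid(w_0 + w_1 x_1 + ... + w_i x_i), evaluated for every row of X.
    return sigmoid(X @ w + b)

# Illustrative parameters: the decision boundary is the line x_1 + x_2 = 0.
w, b = np.array([1.0, 1.0]), 0.0
X = np.array([[0.1, -0.1],    # close to the boundary  -> probability near 0.5
              [3.0, 2.0],     # far on the Class 1 side -> probability near 1
              [-4.0, -1.0]])  # far on the Class 2 side -> probability near 0
print(predict_proba(X, w, b))
```
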
Learning is similar to regression: for an M-dimensional input vector, we learn M trainable parameters (plus a bias), so model training is very fast even at high dimensions. As in regression, learning is typically handled by the gradient descent algorithm.

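A possible sketch of such a training loop, using batch gradient descent on the cross-entropy error and reusing the `sigmoid` helper and NumPy import from the sketch above (labels are assumed to be 0/1; the learning rate and iteration count are arbitrary choices):

```python
def fit_logistic(X, t, lr=0.1, n_iters=1000):
    # X: (n, M) inputs, t: (n,) labels in {0, 1}.
    # M weights plus one bias are the only trainable parameters.
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(n_iters):
        y = sigmoid(X @ w + b)       # current predictions in (0, 1)
        grad_w = X.T @ (y - t) / n   # gradient of the mean cross-entropy w.r.t. w
        grad_b = np.mean(y - t)      # ... and w.r.t. the bias
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```
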
Also note that logistic regression suffers from over-fitting if the training dataset is linearly separable. Once trained, the logistic sigmoid function will be very steep, like [a Heaviside step function](https://en.wikipedia.org/wiki/Heaviside_step_function). Therefore, you should add regularization to the error function (penalize w for taking very large values).
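One way to add such a penalty, sketched under the same assumptions as the training loop above: a quadratic (L2) penalty on w simply adds a `lam * w` term to the weight gradient, so the weights can no longer grow without bound on separable data (the bias is usually left unpenalized):

```python
def fit_logistic_l2(X, t, lr=0.1, n_iters=1000, lam=0.01):
    # Same loop as fit_logistic, plus an L2 penalty that keeps w from blowing up
    # when the classes are linearly separable.
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(n_iters):
        y = sigmoid(X @ w + b)
        grad_w = X.T @ (y - t) / n + lam * w   # extra lam * w term from the penalty
        grad_b = np.mean(y - t)                # bias is typically not regularized
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```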