The continuous nature of the basis function will give us a gentle transition from one class to the other.

Learning is similar to regression: for an M-dimensional input, we learn M trainable parameters (plus a bias), so model training is very fast even at high dimensions. As in regression, learning is typically driven by the gradient descent (GD) algorithm. It should be noted that logistic regression also suffers from over-fitting if the training dataset is perfectly linearly separable: trained on such a dataset, the logistic sigmoid becomes very steep, approaching a [Heaviside step function](https://en.wikipedia.org/wiki/Heaviside_step_function). Therefore, we should add a regularization term to the error function that we minimize with GD (penalizing w from growing to very large values).
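
A minimal sketch of this idea, assuming a NumPy-only implementation with an arbitrary toy dataset and hyper-parameters (the function name `train_logistic_regression`, the learning rate `lr`, and the penalty strength `lam` are my own choices, not from the original post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, lam=0.01, n_iters=2000):
    """Gradient descent on the L2-regularized cross-entropy error."""
    n, m = X.shape
    w = np.zeros(m)
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)                 # predicted class-1 probabilities
        grad_w = X.T @ (p - y) / n + lam * w   # lam * w penalizes large weights
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Perfectly separable toy data: the case where regularization matters most.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = train_logistic_regression(X, y)
print(w, b, sigmoid(X @ w + b))
```

With `lam = 0`, the weights on this separable toy set would keep growing at every iteration and the fitted sigmoid would approach the step function described above; the penalty term keeps them finite.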
## Wisdom of the Crowd: Random Forests and Ensemble Methods
Nobody is perfect, and that is also true for our ML models. Generalization is a major challenge in all ML implementations, so we select our model complexity very carefully and deploy various regularization strategies to overcome this issue. That is, however, only one way of solving a difficult problem. At the opposite end, we aim to have many weak learners instead of one powerful, tailored model. In the most basic case, we ask N models for their opinion and then use the most frequent answer or the average prediction. Another way is to split the data space into regions and train “expert” models on sub-domains of the input space (mixtures of experts). The training of such individual models can be decoupled (independently trained models, preferably relying on different learning methods) or coupled. One smart way of coupling them is to use the feedback obtained from the i-th model during the training of the (i+1)-th model; this strategy is referred to as boosting.
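
As a rough illustration of the "ask N models and average" idea (a sketch under my own assumptions: the weak learner is a polynomial fitted to a bootstrap resample of a noisy toy problem, and the committee size N is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy 1D toy regression problem.
X = np.linspace(0.0, 1.0, 40)
y = np.sin(2 * np.pi * X) + rng.normal(scale=0.3, size=X.shape)

# Train N independent "weak" learners, each on its own bootstrap resample.
N = 25
models = []
for _ in range(N):
    idx = rng.integers(0, len(X), size=len(X))        # sample with replacement
    models.append(np.polyfit(X[idx], y[idx], deg=5))  # an over-flexible polynomial fit

# Committee prediction: average the N individual predictions.
x_new = np.linspace(0.0, 1.0, 100)
predictions = np.stack([np.polyval(coeffs, x_new) for coeffs in models])
committee = predictions.mean(axis=0)
print(committee[:5])
```

Each individual fit over-reacts to its own resample, but averaging across the committee washes much of that variance out, which is the basic motivation behind bagging-style ensembles.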
One popular building block here is the decision tree. The model can be interpreted as a graph consisting of sequential decisions performed at each node, and these decisions define how the data space is divided into sub-regions. To illustrate, let's imagine a 2D data space. In the first step, the data space is divided into two based on a model parameter $`γ_1`$. Then, each sub-domain is further split based on the model parameters $`γ_2`$ and $`γ_3`$. Once we have a tree, we can pass a new input x' through it and find out into which zone that new instance falls. At the leaf nodes, the output is a constant value or a class index. During training, the split parameters are chosen greedily, node by node, by minimizing a criterion such as the sum-of-squares error for regression or a measure like the Gini impurity or cross-entropy for classification.
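
To make the splitting (and the forest built from many such trees) concrete, here is a small scikit-learn sketch; it is one possible toolchain rather than anything prescribed by the text, and the dataset, tree depth, and number of trees are arbitrary choices:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# 2D toy classification data.
X, y = make_moons(n_samples=400, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A single tree: each node thresholds one feature, carving the 2D space into zones.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
print(export_text(tree, feature_names=["x1", "x2"]))  # the learned split thresholds
print("single tree accuracy:", tree.score(X_te, y_te))

# A random forest: many decorrelated trees whose votes are combined.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("random forest accuracy:", forest.score(X_te, y_te))

# Route a new input x' through the trained models to find its zone / class.
x_new = np.array([[0.25, 0.5]])
print(tree.predict(x_new), forest.predict(x_new))
```

The `export_text` output makes the sequential decisions explicit: every internal node compares one feature against a learned threshold, playing the role of the $`γ`$ parameters sketched above.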
## Additional references
### Useful posts