|
|
|
|
|
|
|
## Bayesian Linear Regression
|
|
|
|
|
|
|
|
In the “regression problem” (see the lecture notes for “ode to learning” and “regression”), we discussed that the model complexity must be chosen to be compatible with the data (dimensions, volume) in order to minimize over-fitting. We also saw that we can impose regularization on the loss function to add an extra penalty against over-fitting. Nonetheless, its impact is limited: the nature of the basis function (i.e., our scientific hypothesis) is still there, affecting the overall behavior of the deployed ML model.
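As a quick reminder of that idea, here is a minimal sketch of such a regularized loss; the design matrix `Phi`, targets `y`, and regularization strength `lam` are illustrative names, not taken from the lecture:

```python
import numpy as np

def regularized_loss(w, Phi, y, lam):
    """Sum-of-squares error plus an L2 (ridge) penalty on the weights.

    Phi : (N, M) design matrix, y : (N,) targets, w : (M,) weights,
    lam : regularization strength (larger -> stronger shrinkage of w).
    """
    residuals = Phi @ w - y
    data_term = 0.5 * np.sum(residuals ** 2)   # fit to the data
    penalty = 0.5 * lam * np.sum(w ** 2)       # discourages large weights
    return data_term + penalty
```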
|
|
|
|
|
|
|
|
As an alternative path, we can follow probabilistic learning to alleviate over-fitting in the regression analysis. Let us consider a simple, linear regression case:
|
|
|
|
|
|
|
|
```
y_p(x, w) = w_0 + w_1 x_1 + ... + w_i x_i
```
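In code, this prediction is just a bias term plus a dot product; a minimal sketch with illustrative values:

```python
import numpy as np

def predict(x, w):
    """Linear model y_p(x, w) = w_0 + w_1*x_1 + ... + w_i*x_i."""
    return w[0] + np.dot(w[1:], x)

# example: two input features, three weights (bias + two slopes)
w = np.array([0.5, -0.3, 1.2])
x = np.array([2.0, 4.0])
print(predict(x, w))  # 0.5 - 0.3*2.0 + 1.2*4.0 = 4.7
```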
|
|
|
|
|
|
|
|
In the probabilistic world, the trainable parameters (w) are represented by probability distributions, here Gaussians. These initial statements act as prior knowledge on the fit. In the next step, we update our PDFs via a Bayesian update, obtaining the posterior distributions -- again Gaussians (we love Gaussians because they are very easy to work with: combining Gaussians yields a new Gaussian distribution). By examining new cases (training data), we iteratively build up a better understanding of the system behavior in this probabilistic landscape.
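A minimal sketch of this iterative update, assuming a Gaussian prior N(m0, S0) over the weights and Gaussian observation noise with a known precision `beta` (the variable names are illustrative; this is the standard closed-form conjugate update, not code from the lecture):

```python
import numpy as np

def bayesian_update(m0, S0, Phi, y, beta):
    """Update a Gaussian prior N(m0, S0) over the weights with new data.

    Phi  : (N, M) design matrix of the new examples
    y    : (N,) observed targets
    beta : assumed precision (1/variance) of the observation noise
    Returns the posterior mean and covariance, which are again Gaussian.
    """
    S0_inv = np.linalg.inv(S0)
    SN_inv = S0_inv + beta * Phi.T @ Phi      # precision grows with data
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ y)
    return mN, SN
```

Feeding the examples in one at a time and reusing the returned mean and covariance as the next prior reproduces the iterative picture described above.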
|
|
|
|
|
|
|
|
Let's look at a simple, 2D case. In the linear model, we now have only two trainable parameters:
|
|
|
|
|
|
Herein, we will represent the probabilities for the weights in terms of Gaussians:
|
|
|
|
|
|
|
<img src="uploads/c785c19d15ea93d24d4ca41e01bf9848/br1.png" width="600">
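As a hedged illustration of this starting point, the sketch below defines an isotropic Gaussian prior over (w_0, w_1) and draws a few weight samples before any data are seen; the zero mean and the prior precision `alpha` are assumptions chosen for illustration:

```python
import numpy as np

# isotropic Gaussian prior over (w0, w1): zero mean, covariance alpha^{-1} * I
alpha = 2.0                     # assumed prior precision
m0 = np.zeros(2)                # prior mean
S0 = np.eye(2) / alpha          # prior covariance

# each sample defines one candidate line y = w0 + w1*x
rng = np.random.default_rng(0)
w_samples = rng.multivariate_normal(m0, S0, size=5)
print(w_samples)                # 5 plausible (w0, w1) pairs before seeing any data
```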
|
|
|
|
|
|
|
|
At this stage, the model is not trained at all. What happens as we see examples? Let's show the model the very first example (blue circle below). In Bayesian learning, we use the likelihood function, p(y|x,w) -- the probability of observing the true y value given the weights and the example. For the example we pass in, the calculated likelihood is shown below (left); the true weights are marked with "+" for comparison. The likelihood indicates that the weights of the model should lie somewhere in this zone. By combining the likelihood with our prior, we then calculate the posterior probabilities, shown in the middle:
|
|
|
|
|
|
|
|
<img src="uploads/056765b94996330d69d7dc932f5576b0/br2.png" width="600">
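To make this step concrete, here is a small sketch that evaluates the Gaussian likelihood of a single example (x1, y1) on a grid of (w_0, w_1) values and multiplies it with the prior to obtain the (unnormalized) posterior; the noise precision, the example values, and the grid limits are assumptions for illustration:

```python
import numpy as np

beta = 25.0                       # assumed noise precision of the observations
x1, y1 = 0.5, 0.3                 # a single hypothetical training example

# grid over the two weights
w0, w1 = np.meshgrid(np.linspace(-1, 1, 200), np.linspace(-1, 1, 200))

# likelihood p(y1 | x1, w) for every grid point: Gaussian around w0 + w1*x1
likelihood = np.exp(-0.5 * beta * (y1 - (w0 + w1 * x1)) ** 2)

# isotropic Gaussian prior over the weights (same assumption as above)
alpha = 2.0
prior = np.exp(-0.5 * alpha * (w0 ** 2 + w1 ** 2))

# Bayes' rule: posterior is proportional to likelihood times prior
posterior = likelihood * prior
posterior /= posterior.sum()      # normalize on the grid for plotting
```

Plotting `likelihood` and `posterior` over the weight grid corresponds to the left and middle panels of the figure above.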
|
|
|
|
|
|