|
|
|
## Recurrent neural networks (RNN)
|
|
|
|
|
|
|
|
In the simple regression models, what we had under the hood was a mathematical model that figured out the relationships between the features. The order of the observations we passed to the model was not informative. The training set contained about 20,000 load values, yet by their mathematical definitions our statistical models looked back only a very short time. We tried to carry this information as additional features in SVM/MLP, but that strategy has its own limitations in practice.
|
|
|
|
|
|
|
|
An alternative approach is to create a mathematical model that keeps track of changes over time and can pass this information on to the other learners it is connected to, in addition to performing the auto-regression task on the immediate past. With such a building block, we can create ensemble models that watch over changes in the system for a long period of time.
|
|
|
|
|
|
|
|
One alternative model structure here is the recurrent neuron cell. Here, we assume that the neuron not only produces an output signal (X to y), but also carries an embedded hidden state (a vector) that is updated as it sees more cases ($`h_{t}=f(h_{t-1},X_t)`$). Furthermore, the new prediction is based on this updated state ($`y_t=g(h_t)`$). In other words, we do not calculate y directly from X, but use an intermediate filter function to add a touch of time. Let's be more explicit. Imagine that X has a dimension of 4 (four features) and we use a hidden state embedding of size 6. Let's also assume that the hidden state function is a tanh:
|
|
|
|
|
|
|
|
```math
|
|
|
|
h_t = \tanh(W^{xh}_{6\times4}X_t + W^{hh}_{6\times6}h_{t-1})
|
|
|
|
```
|
|
|
|
During training, what we learn is how to combine $`X_t`$ and $`h_{t-1}`$ by finding the weights that give the minimum error. Note that I use superscripts to indicate the relationships encoded in the weights and subscripts to denote the size of the weight matrix. Also note that tanh is applied in an elementwise fashion. Since the equation is additive, h can simply be initialized as a vector of zeros.
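
To make the shapes concrete, here is a minimal NumPy sketch of this hidden-state update. The weights are random placeholders rather than trained values; only the (6, 4) and (6, 6) shapes and the elementwise tanh follow the equation above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Weight matrices sized as in the equation: W_xh is (6, 4), W_hh is (6, 6)
W_xh = rng.normal(size=(6, 4))   # combines the 4 input features into the 6-dim state
W_hh = rng.normal(size=(6, 6))   # combines the previous hidden state into the new one

x_t = rng.normal(size=(4,))      # one observation with 4 features
h_prev = np.zeros(6)             # the hidden state can simply start as zeros

# Additive combination followed by an elementwise tanh
h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)
print(h_t.shape)                 # (6,)
```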
|
|
|
|
In the next step, we find y from h. If we want an output of the same size as the input (four values), we shape the weight matrix accordingly:
|
|
|
|
|
|
|
|
```math
|
|
|
|
y_t = W^{yh}_{4\times6}h_t
|
|
|
|
```
|
|
|
|
|
|
|
|
For a single output:
|
|
|
|
|
|
|
|
```math
|
|
|
|
y_t = W^{yh}_{1\times6}h_t
|
|
|
|
```
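
Continuing the sketch above (reusing `rng` and the hidden state `h_t`), the output is just a linear readout of the hidden state: a (4, 6) matrix gives an output the same size as the input, while a (1, 6) matrix gives a single value such as the next load.

```python
# Continuation of the previous sketch: rng and h_t are reused.
W_yh_full   = rng.normal(size=(4, 6))  # output of size 4, same as the input features
W_yh_single = rng.normal(size=(1, 6))  # single output, e.g. the next load value

y_full   = W_yh_full @ h_t    # shape (4,)
y_single = W_yh_single @ h_t  # shape (1,)
```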
|
|
|
|
|
|
|
|
Imagine that we have an input X of shape (1, 3, 4), where the second dimension is time. If we apply our model at every time step, we get the following:
|
|
|
|
|
|
|
|
```math
|
|
|
|
y_1 = W^{yh}_{1\times6} \tanh(W^{xh}_{6\times4}X_1 + W^{hh}_{6\times6}h_{0})
|
|
|
|
```
|
|
|
|
|
|
|
|
```math
|
|
|
|
y_2 = W^{yh}_{1\times6} \tanh(W^{xh}_{6\times4}X_2 + W^{hh}_{6\times6}h_{1})
|
|
|
|
```
|
|
|
|
|
|
|
|
```math
|
|
|
|
y_3 = W^{yh}_{1\times6} \tanh(W^{xh}_{6\times4}X_3 + W^{hh}_{6\times6}h_{2})
|
|
|
|
```
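
The same three equations can be written as a loop over the time axis. Below is a minimal NumPy sketch that unrolls the cell over an input of shape (1, 3, 4); again the weights are random placeholders, not trained values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shapes as in the equations: 4 input features, 6 hidden units, 1 output
W_xh = rng.normal(size=(6, 4))
W_hh = rng.normal(size=(6, 6))
W_yh = rng.normal(size=(1, 6))

X = rng.normal(size=(1, 3, 4))   # (batch, time, features)
h = np.zeros(6)                  # h_0 starts as a vector of zeros

outputs = []
for t in range(X.shape[1]):              # unroll over the time dimension
    x_t = X[0, t]                        # the 4-feature observation at step t
    h = np.tanh(W_xh @ x_t + W_hh @ h)   # h_t = tanh(W_xh X_t + W_hh h_{t-1})
    outputs.append(W_yh @ h)             # y_t = W_yh h_t

print([y.item() for y in outputs])       # y_1, y_2, y_3
```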
|
|
|
|
|
|
|
|
If we substitute the hidden states,
|
|
|
|
|
|
|
|
```math
|
|
|
|
y_3 = W^{yh}_{1\times6} \tanh\left(W^{xh}_{6\times4}X_3 + W^{hh}_{6\times6}\tanh\left(W^{xh}_{6\times4}X_2 + W^{hh}_{6\times6}\tanh\left(W^{xh}_{6\times4}X_1 + W^{hh}_{6\times6}h_{0}\right)\right)\right)
|
|
|
|
```
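
Reusing the weights, input, and outputs from the loop sketch above, we can check numerically that this substituted (nested) expression for y_3 gives the same value as the last output of the loop.

```python
# Uses W_xh, W_hh, W_yh, X and outputs from the previous sketch.
h0 = np.zeros(6)
y3_nested = W_yh @ np.tanh(
    W_xh @ X[0, 2]
    + W_hh @ np.tanh(W_xh @ X[0, 1] + W_hh @ np.tanh(W_xh @ X[0, 0] + W_hh @ h0))
)
print(np.allclose(y3_nested, outputs[-1]))   # True
```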
|
|
|
|
|
|
|
|
In short, we can say that both h and y are functions of all past observations:
|
|
|
|
|
|
|
|
```math
|
|
|
|
h_t = f(X_1, X_2, X_3, \ldots, X_t)
|
|
|
|
```
|
|
|
|
|
|
|
|
```math
|
|
|
|
y_t = g(X_1, X_2, X_3, \ldots, X_t)
|
|
|
|
```
|
|
|
|
|
|
|
|
and this is where the recurrent (recursive) nature of the model comes from.