LSTM Networks in Detail

Today I’d like to dig deeper into the details of the LSTM algorithm. Below is a flow chart of an LSTM network.


LSTMs were designed specifically to overcome the long-term dependency problem faced by recurrent neural networks (RNNs), which arises from the vanishing gradient problem. LSTMs have feedback connections, which make them different from more traditional feedforward neural networks. This property enables LSTMs to process entire sequences of data (e.g. time series) without treating each point in the sequence independently; instead, they retain useful information about previous data in the sequence to help with the processing of new data points. As a result, LSTMs are particularly good at processing sequences of data such as text, speech, and general time series.

LSTMs use a series of ‘gates’ to control how the information in a sequence of data comes into, is stored in, and leaves the network. There are three gates in a typical LSTM: forget gate, input gate, and output gate.

These gates can be thought of as filters, and each is its own small neural network.

First Step

The first step in the process is the forget gate. Here we decide which bits of the cell state (the long-term memory of the network) are useful, given both the previous hidden state and the new input data.

To do this, the previous hidden state and the new input data are fed into a neural network. This network generates a vector in which each element lies in the interval [0,1] (ensured by the sigmoid activation). The network (within the forget gate) is trained so that it outputs values close to 0 when a component of the input is deemed irrelevant and closer to 1 when it is relevant. It is useful to think of each element of this vector as a sort of filter/sieve that lets more information through as the value gets closer to 1.

These output values are then pointwise multiplied with the previous cell state. This pointwise multiplication means that components of the cell state which the forget gate network has deemed irrelevant are multiplied by a number close to 0 and thus have less influence on the following steps.

In summary, the forget gate decides which pieces of the long-term memory should now be forgotten (have less weight) given the previous hidden state and the new data point in the sequence.
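To make the arithmetic concrete, here is a minimal NumPy sketch of the forget gate. The weight matrix `W_f`, bias `b_f`, and the dimensions are illustrative assumptions for this post, not values from a real trained model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)

# Illustrative parameters for the forget gate (learned during training in a real network).
W_f = rng.normal(size=(hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)

h_prev = rng.normal(size=hidden_size)   # previous hidden state
x_t    = rng.normal(size=input_size)    # new input data
c_prev = rng.normal(size=hidden_size)   # previous cell state (long-term memory)

# Forget gate: the sigmoid keeps every element in [0, 1].
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)

# Pointwise multiplication scales down cell-state components deemed irrelevant.
c_after_forget = f_t * c_prev
```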


Second Step

Both the new memory network and the input gate are neural networks in themselves, and both take the same inputs: the previous hidden state and the new input data. It is worth noting that these are the same inputs as those to the forget gate!

1. The new memory network is a tanh-activated neural network which has learned how to combine the previous hidden state and the new input data to generate a ‘new memory update vector’. This vector essentially contains information from the new input data, given the context of the previous hidden state. It tells us how much to update each component of the long-term memory (cell state) of the network given the new data.

Note that we use a tanh here because its values lie in [-1,1] and so can be negative. The possibility of negative values here is necessary if we wish to reduce the impact of a component in the cell state.

2. However, in part 1 above, where we generate the new memory vector, there is a big problem: it doesn’t check whether the new input data is even worth remembering. This is where the input gate comes in. The input gate is a sigmoid-activated network which acts as a filter, identifying which components of the ‘new memory vector’ are worth retaining. This network outputs a vector of values in [0,1] (due to the sigmoid activation), allowing it to act as a filter through pointwise multiplication. As with the forget gate, an output near zero tells us we don’t want to update that element of the cell state.

3. The outputs of parts 1 and 2 are pointwise multiplied. This regulates the magnitude of the new information we decided on in part 2, setting it to 0 if need be. The resulting combined vector is then added to the cell state, updating the long-term memory of the network (a continuation of the code sketch above appears after this list).
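Continuing the sketch from the first step (same illustrative shapes, `sigmoid` helper, and inputs), the second step might look like this; `W_c`, `W_i`, and their biases are again assumed, randomly initialised parameters.

```python
# New memory network: tanh keeps the candidate update in [-1, 1].
W_c = rng.normal(size=(hidden_size, hidden_size + input_size))
b_c = np.zeros(hidden_size)
c_candidate = np.tanh(W_c @ np.concatenate([h_prev, x_t]) + b_c)

# Input gate: a sigmoid filter deciding which candidate components to keep.
W_i = rng.normal(size=(hidden_size, hidden_size + input_size))
b_i = np.zeros(hidden_size)
i_t = sigmoid(W_i @ np.concatenate([h_prev, x_t]) + b_i)

# Update the cell state: the forgotten memory plus the gated new information.
c_t = c_after_forget + i_t * c_candidate
```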


Third Step

Now that our updates to the long-term memory of the network are complete, we can move to the final step: the output gate, which decides the new hidden state. To decide this, we use three things: the newly updated cell state, the previous hidden state, and the new input data.

One might think that we could just output the updated cell state; however, this would be comparable to someone unloading everything they had ever learned about the stock market when only asked if they think it will go up or down tomorrow.

To prevent this from happening we create a filter, the output gate, exactly as we did in the forget gate network. The inputs are the same (previous hidden state and new data), and the activation is also sigmoid (since we want the filter property gained from outputs in [0,1]).

As mentioned, we want to apply this filter to the newly updated cell state. This ensures that only necessary information is output (saved to the new hidden state). However, before applying the filter, we pass the cell state through a tanh to force the values into the interval [-1,1].

The step-by-step process for this final step is as follows (a short code sketch follows the list):
1. Apply the tanh function to the current cell state pointwise to obtain the squished cell state, which now lies in [-1,1].
2. Pass the previous hidden state and current input data through the sigmoid activated neural network to obtain the filter vector.
3. Apply this filter vector to the squished cell state by pointwise multiplication.
4. Output the new hidden state!
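Under the same illustrative setup as the earlier sketches, the output gate and the new hidden state could be computed like this; `W_o` and `b_o` are once more assumed parameters.

```python
# Output gate: another sigmoid filter over the same inputs.
W_o = rng.normal(size=(hidden_size, hidden_size + input_size))
b_o = np.zeros(hidden_size)
o_t = sigmoid(W_o @ np.concatenate([h_prev, x_t]) + b_o)

# Squash the updated cell state into [-1, 1] with tanh, then apply the filter.
h_t = o_t * np.tanh(c_t)   # the new hidden state
```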

Although this third step is the final step inside the LSTM cell, there are a few more things we need to think about before our LSTM outputs predictions of the type we are looking for.

Firstly, the steps above are repeated many times. For example, if you are trying to predict the following day’s stock price based on the previous 30 days of pricing data, then the steps will be repeated 30 times. In other words, your model will have iteratively produced 30 hidden states to predict tomorrow’s price.

But the output is still a hidden state. In our example above we wanted tomorrow’s price, and we can’t make any money off tomorrow’s hidden state! So, to convert the hidden state into the output, we apply a linear layer as the very last step in the LSTM process. This linear layer step only happens once, at the very end, which is why it is often not included in diagrams of an LSTM cell.
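Putting the pieces together, here is a rough sketch of how the cell might be applied over a 30-day window, with a final linear layer turning the last hidden state into a single predicted price. The `lstm_step` function simply wraps the three steps above; all shapes, parameter names, and the random data are illustrative assumptions, and a real model would use trained weights (typically via a framework such as PyTorch or Keras).

```python
def lstm_step(x_t, h_prev, c_prev, params):
    """One pass through the forget, input, and output gates."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(params["W_f"] @ z + params["b_f"])        # forget gate
    i = sigmoid(params["W_i"] @ z + params["b_i"])        # input gate
    c_cand = np.tanh(params["W_c"] @ z + params["b_c"])   # new memory vector
    c = f * c_prev + i * c_cand                           # updated cell state
    o = sigmoid(params["W_o"] @ z + params["b_o"])        # output gate
    h = o * np.tanh(c)                                    # new hidden state
    return h, c

# 30 time steps of illustrative input features (e.g. 30 days of pricing data).
sequence = rng.normal(size=(30, input_size))
params = {name: rng.normal(size=(hidden_size, hidden_size + input_size))
          for name in ("W_f", "W_i", "W_c", "W_o")}
params.update({b: np.zeros(hidden_size) for b in ("b_f", "b_i", "b_c", "b_o")})

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in sequence:                 # the cell's steps repeat once per day in the window
    h, c = lstm_step(x_t, h, c, params)

# Final linear layer: map the last hidden state to a single predicted price.
W_out, b_out = rng.normal(size=(1, hidden_size)), np.zeros(1)
prediction = W_out @ h + b_out
```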


References:

https://towardsdatascience.com/lstm-networks-a-detailed-explanation-8fae6aefc7f9
