The main question to ask when designing a neural network is: What is the input and what is the output?
Is the data sequential? If so, then we need to use a sequence model (e.g. an RNN)
Sequence modeling: How we can build models that learn from sequences of data
What is sequential modeling?
Predicting where a ball will travel next – you need to know the previous position of the ball
Predicting the next word in a sentence – you need to know the previous words in the sentence
Audio recognition, etc.
Sentiment classification
Traditional binary classification (e.g. will I pass my class?) does not require sequential modeling and can be modeled with a simple feed-forward neural network
Even if we stack layers together in a feed-forward neural network we still don’t have sequential modeling
We can think of this as evaluating inputs at a certain time
What if we ran this operation multiple times for each timestep?
They would be independent copies of each other with different inputs at different timestamps
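A tiny sketch of this "independent copies" picture: a feed-forward layer applied separately at each timestep, sharing nothing between steps (the numpy code and dimensions here are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, output_dim = 8, 4                      # made-up dimensions for illustration
W = rng.normal(scale=0.1, size=(output_dim, input_dim))

def feed_forward(x_t):
    # one "copy" of the network: it only sees the input at its own timestep
    return np.tanh(W @ x_t)

# each output depends only on that timestep's input; nothing is shared between steps
x_seq = rng.normal(size=(5, input_dim))
outputs = [feed_forward(x_t) for x_t in x_seq]
```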
But what if an output at a certain timestamp depended on the output of the previous timestamp?
What if we linked something that is being computed at a given timestamp to the computations at a later timestamp?
This is making each timestamp computation stateful, and we call this internal state h (a memory term)
The output of a model is thus a function of the individual inputs and the past state
So, for each execution timestamp the state h is updated and fed back into the neurons via a recurrence relation
This ensures that the state h is updated as time progresses
State doesn’t just belong to the model, it belongs to individual neurons
The cell state is the memory of the neuron, and is calculated as a function of the weights, inputs, and previous state
Computing the state at a given timestamp is done by:
Multiplying the previous state by a set of state-update weights
Multiplying the input by a set of input weights
Adding the two together
Applying a non-linear activation function to the result
Note that the weight matrices for state updates are different than the weight matrices for input updates
To generate an output at the given timestamp, we can multiply the state by another weight matrix
Thus, there are three main weight matrices in a recurrent neural network (a minimal code sketch follows below):
State update weights (state -> state, W_hh)
Input weights (input -> state, W_xh)
Output weights (state -> output, W_hy)
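A minimal numpy sketch of one timestamp of this recurrence, i.e. h_t = tanh(W_hh h_(t-1) + W_xh x_t) and y_t = W_hy h_t (the dimensions and weight scales are made-up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 8, 16, 4      # made-up dimensions for illustration

W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input -> state
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # state -> state
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))  # state -> output

def rnn_step(x_t, h_prev):
    # h_t = tanh(W_hh @ h_prev + W_xh @ x_t): combine the previous state and the current input
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    # y_t = W_hy @ h_t: read an output from the state
    y_t = W_hy @ h_t
    return y_t, h_t
```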
Another way to picture RNNs (or compute them) is to iteratively process the input sequence over time, passing the state from one timestamp to the next (sketched in the loop below)
This is called unrolling the RNN
This is a way to visualize the recurrence relation
The same weight matrices are used at each timestamp
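Continuing the numpy sketch above, unrolling is just a loop that reuses the same weights and threads the state through time:

```python
def run_rnn(x_seq):
    # unrolled RNN: the same weight matrices are reused at every timestamp
    h = np.zeros(hidden_dim)           # initial state
    outputs = []
    for x_t in x_seq:                  # iterate over the sequence in time order
        y_t, h = rnn_step(x_t, h)      # the state from timestamp t feeds timestamp t+1
        outputs.append(y_t)
    return outputs, h

# usage: a random sequence of 5 timestamps
outputs, final_state = run_rnn(rng.normal(size=(5, input_dim)))
```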
RNNs can take a single input and produce many outputs, (one-to-many)
For example, a single image can be used to generate a caption
RNNs can also take many inputs and produce a single output (many-to-one)
For example, a sequence of words can be used to predict the sentiment of a sentence (a Keras sketch of this case follows below)
RNNs can also take many inputs and produce many outputs (many-to-many)
For example, a sequence of words can be used to generate a sequence of words
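As a rough example of the many-to-one case, a sentiment classifier could look something like this in Keras (the vocabulary size and layer widths are placeholder values, not from the lecture):

```python
import tensorflow as tf

# many-to-one: a sequence of word ids in, a single sentiment probability out
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10_000, output_dim=64),  # word ids -> vectors
    tf.keras.layers.SimpleRNN(64),                               # carries a state across the sequence
    tf.keras.layers.Dense(1, activation="sigmoid"),              # single sentiment score
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```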
Backpropagation through time (BPTT) is the training algorithm for RNNs since information is time-dependent
Backpropagation through time needs to:
Backpropagate the loss through each timestamp: e.g. the last timestamp, then the second-to-last timestamp, then the third-to-last, etc.
This is computationally expensive, and if the repeated factors are large the gradient explodes as it is multiplied across many timestamps (exploding gradients)
Vanishing gradient: the gradient approaches 0 as you backpropagate through time. This is a very real problem since there are so many multiplications of small numbers – without a gradient you don't know how to optimize (a small numeric sketch of this follows below)
Intuition:
Use a sentence as an example (“The clouds are in the __”)
If the relevant information ("clouds") is close to the blank in the sentence, the gradient will still be large (it hasn't had time to diminish) and learning works
Another sentence (e.g. “I grew up in France…I speak fluent __”)
The relevant information ("France") is far from the blank, so by the time you backpropagate that far the gradient has been multiplied by many small numbers, and the update attributed to "France" barely matters
This is the problem of long-term dependencies
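A small numeric illustration of the vanishing-gradient effect (the recurrent matrix here is arbitrary; the point is that backpropagating further in time multiplies in more small factors):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 16
# a "small" recurrent weight matrix (spectral radius well below 1)
W_hh = 0.5 * rng.normal(scale=1.0 / np.sqrt(hidden_dim), size=(hidden_dim, hidden_dim))

grad = np.eye(hidden_dim)
for t in range(1, 51):
    grad = grad @ W_hh  # every extra timestamp back multiplies in another factor of W_hh
    if t in (1, 10, 25, 50):
        print(f"{t:>2} timestamps back: gradient norm ~ {np.linalg.norm(grad):.2e}")
```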
How to alleviate the long-term dependency problems:
Activation Functions: use the ReLU activation function instead of the sigmoid function (ReLU's derivative is 1 for positive inputs, rather than a value close to 0)
Parameter Initialization – initialize the recurrent weights to the identity matrix and set the biases to 0 (see the code sketch after this list)
Gated Cells. The most complicated but most effective solution. These are called LSTMs and GRUs
Selectively control what flows into the neuron, so only what is relevant is used for the computation. This filters out what isn't important and keeps the most relevant information
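A minimal sketch of the first two fixes applied to a simple numpy RNN cell (identity-initialized recurrent weights, zero biases, ReLU activation; dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 8, 16, 4     # made-up dimensions for illustration
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))

# the two fixes: identity-initialized recurrent weights, zero biases, ReLU activation
W_hh = np.eye(hidden_dim)        # the state is initially carried through unchanged
b_h = np.zeros(hidden_dim)       # biases start at 0

def relu_rnn_step(x_t, h_prev):
    h_t = np.maximum(0.0, W_hh @ h_prev + W_xh @ x_t + b_h)  # ReLU: derivative is 1 when active
    return W_hy @ h_t, h_t
```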
LSTM (Long Short-Term Memory) networks
This is a type of RNN that uses gated cells
They maintain a cell state and use gates to control the flow of information used to update the network
LSTMs can:
forget – forget irrelevant information
store – store relevant information from the current input