LSTM Networks

Review: The gradient is the value used to update a network's weights

Problems with RNN

  • Short-term memory

    • RNNs may leave out important information from the beginning when analyzing a long paragraph

  • Vanishing gradient problem - gradients get so small that weights aren't updated significantly, so no learning happens

    • Early layers in RNNs suffer from this most (since the early layers can't learn, RNNs forget what they saw in longer sequences; see the sketch below)
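
To see why, here's a tiny illustrative sketch (made-up numbers): backpropagation through time multiplies many per-step derivatives together, and if each factor is below 1, the gradient reaching early steps shrinks exponentially.

```python
# Illustrative only: a gradient flowing back through 100 time steps,
# shrinking by a made-up factor of 0.9 per step, all but vanishes.
grad = 1.0
for _ in range(100):
    grad *= 0.9  # stand-in for a per-step derivative < 1

print(grad)  # ~2.66e-05: far too small to update early weights meaningfully
```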

Solution: LSTM and GRU

  • Both have internal mechanisms called gates, which regulate the flow of information by deciding what to keep and what to throw away. e.g. compare to your brain when reading a paragraph: you only really remember the key words and main ideas

LSTM

Components

  • Cell State

  • Gates (Forget Gate, Input Gate, Output Gate)

Cell State

  • Memory of the network

  • Carries information from very early time steps all the way through sequence processing

Gates

  • Small neural networks that decide what information is important (what to keep or forget during training)

  • Contain sigmoid activation functions

    • Helpful because any number multiplied by 0 is simply forgotten (turning the vanishing gradient problem's shrink-toward-zero behavior to our advantage; see the sketch below)
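
A minimal sketch of that gating idea in plain NumPy (toy numbers, nothing learned): multiplying information by a sigmoid output near 0 erases it, while a value near 1 lets it pass through.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

info = np.array([2.0, -1.5, 0.8])            # candidate information
gate = sigmoid(np.array([-6.0, 6.0, 0.0]))   # ~[0, 1, 0.5]

print(gate * info)  # ~[0.005, -1.496, 0.4]: first value forgotten, second kept
```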

Step-by-Step Flow

Review: Hidden state is information from previous inputs (h_1, for example)

Forget Gate

  • Overall: Decides what info should be forgotten or kept

  • Outputs values between 0 and 1 since it has a sigmoid function

    • Values closer to 0 are forgotten

    • Values closer to 1 are kept (see the sketch below)
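
A sketch of the forget gate under the standard formulation f = sigmoid(W_f · [h_prev, x] + b_f). The sizes and random weights below are purely illustrative; in a real network they are learned.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)

W_f = rng.normal(size=(hidden_size, hidden_size + input_size))  # illustrative weights
b_f = np.zeros(hidden_size)

h_prev = rng.normal(size=hidden_size)   # previous hidden state
x = rng.normal(size=input_size)         # current input

combined = np.concatenate([h_prev, x])  # combine hidden state and input
f = sigmoid(W_f @ combined + b_f)       # each value in (0, 1): near 0 = forget, near 1 = keep
print(f)
```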

Input Gate

  • Overall:

    • Updates the cell state

    • Decides which values should be updated with new information from the current step

  • Again uses a sigmoid, so it outputs values between 0 and 1

  • While this ^ happens, the hidden state (information from previous inputs) + input are also passed into a tanh activation function, which outputs values between -1 and 1

  • Then it all comes together!

    • Sigmoid output * tanh output

      • The sigmoid output determines what is worth keeping from the tanh output (see the sketch below)
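
A sketch of the input gate's two branches (shapes and weights illustrative, biases omitted for brevity): the sigmoid branch picks what to update, the tanh branch proposes candidate values, and their product is what actually gets added.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

hidden_size, input_size = 4, 3
rng = np.random.default_rng(1)

W_i = rng.normal(size=(hidden_size, hidden_size + input_size))  # sigmoid-branch weights
W_c = rng.normal(size=(hidden_size, hidden_size + input_size))  # tanh-branch weights

combined = rng.normal(size=hidden_size + input_size)  # [h_prev, x], as in the forget gate

i = sigmoid(W_i @ combined)        # between 0 and 1: which values to update
c_tilde = np.tanh(W_c @ combined)  # between -1 and 1: candidate values
print(i * c_tilde)                 # sigmoid decides what's worth keeping from tanh
```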

Forget Gate + Input Gate together

  • Cell state * forget vector (pointwise; one more check on whether cell-state content should be dropped)

  • Take this ^ value and pointwise-add the input gate output

    • This updates the cell state to its new values (giving the new cell state; a worked example follows below)
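
Putting the two gates together gives the cell-state update c_new = f * c_prev + i * c_tilde. Toy vectors below to make the pointwise operations concrete.

```python
import numpy as np

c_prev = np.array([1.0, -2.0, 0.5])      # old cell state
f = np.array([0.1, 0.9, 0.5])            # forget gate output
i_c_tilde = np.array([0.3, 0.0, -0.4])   # input gate output (sigmoid * tanh)

c_new = f * c_prev + i_c_tilde           # pointwise multiply, then pointwise add
print(c_new)                             # [ 0.4  -1.8  -0.15]
```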

Output Gate

  • Decides what next hidden state should be

  • Pass [hidden state + input] into sigmoid

  • Pass newly modified cell state into tanh

  • Multiply the sigmoid output with the tanh output to decide what the new hidden state should keep

  • The output is the new hidden state (see the sketch below)
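
And a sketch of the output gate (illustrative weights, biases omitted): o = sigmoid(W_o · [h_prev, x]) filters a tanh-squashed copy of the new cell state to produce the next hidden state.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

hidden_size, input_size = 4, 3
rng = np.random.default_rng(2)

W_o = rng.normal(size=(hidden_size, hidden_size + input_size))  # output-gate weights
combined = rng.normal(size=hidden_size + input_size)            # [h_prev, x]
c_new = rng.normal(size=hidden_size)                            # freshly updated cell state

o = sigmoid(W_o @ combined)   # decides which parts of the cell state to expose
h_new = o * np.tanh(c_new)    # the new hidden state
print(h_new)
```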

OVERALL FLOW (a complete single-step sketch follows the list):

  1. Combine the hidden state and the input into a single vector

  2. Feed this vector into the forget gate's sigmoid function

  3. Information is dropped or kept depending on whether the sigmoid output is closer to 0 or to 1

  4. The vector is also passed to the input gate's sigmoid function (outputs between 0 and 1)

  5. The vector is also passed to the input gate's tanh function (outputs between -1 and 1)

  6. Multiply these two results to determine which new information should be kept

  7. The cell state and the forget gate output are pointwise multiplied

  8. This value and the input gate output are pointwise added, updating the cell state to its new values

  9. Pass the vector into the output gate's sigmoid

  10. Pass the newly modified cell state into the output gate's tanh

  11. Multiply both of these outputs to decide what the new hidden state keeps

  12. The output gate outputs the new hidden state, which is passed back into the network at the next step
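
Tying the whole flow together, here's a minimal single-step LSTM cell in plain NumPy that follows the 12 steps above. It assumes one weight matrix and bias per gate and uses random, untrained parameters; real implementations (e.g., torch.nn.LSTM) learn them.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step: returns the new hidden state and cell state."""
    W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o = params
    v = np.concatenate([h_prev, x])      # 1. combine hidden state and input

    f = sigmoid(W_f @ v + b_f)           # 2-3. forget gate: near 0 = drop, near 1 = keep
    i = sigmoid(W_i @ v + b_i)           # 4. input gate, sigmoid branch
    c_tilde = np.tanh(W_c @ v + b_c)     # 5. input gate, tanh branch
    c_new = f * c_prev + i * c_tilde     # 6-8. update the cell state
    o = sigmoid(W_o @ v + b_o)           # 9. output gate
    h_new = o * np.tanh(c_new)           # 10-11. new hidden state

    return h_new, c_new                  # 12. fed back in at the next step

# Toy usage with random (untrained) parameters
hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)
params = []
for _ in range(4):
    params += [rng.normal(size=(hidden_size, hidden_size + input_size)),
               np.zeros(hidden_size)]

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):   # a sequence of 5 inputs
    h, c = lstm_step(x, h, c, params)
print(h)
```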

Note to help remember this: the sigmoid is always what decides whether information is kept or dropped
