Deep Learning Primer: RNN & LSTM

Let’s understand the BPTT algorithm in RNNs

Image of Deep Neural Net

\[S_t = \sigma(U x_t + W S_{t-1} + b)\] \[y_t = O(V S_t + c)\]
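Before deriving gradients, here is a minimal NumPy sketch of this forward pass. The dimensions, the random initialization, and the choice of $\tanh$ for $\sigma$ and softmax for the output activation $O$ are illustrative assumptions, not part of the equations above.

```python
import numpy as np

def rnn_step(x_t, S_prev, U, W, V, b, c):
    """One step of the vanilla RNN: S_t = sigma(U x_t + W S_{t-1} + b), y_t = O(V S_t + c)."""
    S_t = np.tanh(U @ x_t + W @ S_prev + b)     # hidden state (sigma = tanh here)
    o_t = V @ S_t + c
    y_t = np.exp(o_t) / np.exp(o_t).sum()       # output activation O = softmax (assumed)
    return S_t, y_t

# Illustrative sizes: input 3, hidden 4, output 2
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
b, c = np.zeros(4), np.zeros(2)
S = np.zeros(4)
for x in rng.normal(size=(5, 3)):               # unroll over 5 time steps
    S, y = rnn_step(x, S, U, W, V, b, c)
```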

Derivation of Gradients in RNNs

1. Gradient Flow

\[\frac{\partial L_t(\theta)}{\partial W} = \underbrace{\frac{\partial L_t(\theta)}{\partial S_t}}_{\text{straight-forward}} \cdot \overbrace{\frac{\partial S_t}{\partial W}}^{\text{How to compute this?}}\]

The Challenge: How do we compute $\frac{\partial S_t}{\partial W}$?

Recall the state update equation:

\[S_t = \sigma(U x_t + W S_{t-1} + b)\]

We cannot simply treat $S_{t-1}$ as a constant because it also depends on $W$.

In such networks, the total derivative $\frac{\partial S_t}{\partial W}$ has 2 parts:

  1. Explicit: $\frac{\partial^+ S_t}{\partial W}$, treating all other inputs (like $S_{t-1}$) as constants.
  2. Implicit: Summing over all indirect paths from $S_t$ to $W$.

2. Recursive Expansion

We can expand the derivative as follows:

\[\frac{\partial S_t}{\partial W} = \underbrace{\frac{\partial^+ S_t}{\partial W}}_{\text{exp.}} + \underbrace{\frac{\partial S_t}{\partial S_{t-1}} \frac{\partial S_{t-1}}{\partial W}}_{\text{imp.}}\]

Substituting $\frac{\partial S_{t-1}}{\partial W}$ recursively:

\[= \frac{\partial^+ S_t}{\partial W} + \frac{\partial S_t}{\partial S_{t-1}} \left[ \frac{\partial^+ S_{t-1}}{\partial W} + \frac{\partial S_{t-1}}{\partial S_{t-2}} \frac{\partial S_{t-2}}{\partial W} \right]\]

Expanding the recursion completely yields a sum of terms:

\[= \frac{\partial^+ S_t}{\partial W} + \frac{\partial S_t}{\partial S_{t-1}} \frac{\partial^+ S_{t-1}}{\partial W} + \frac{\partial S_t}{\partial S_{t-1}} \frac{\partial S_{t-1}}{\partial S_{t-2}} \frac{\partial^+ S_{t-2}}{\partial W} + ... + \frac{\partial S_t}{\partial S_{t-1}} \frac{\partial S_{t-1}}{\partial S_{t-2}}...\frac{\partial S_2}{\partial S_1} \frac{\partial^+ S_1}{\partial W}\]

Backpropagation Through Time (BPTT)

For simplicity, we “short-circuit” the notation for these paths and define $\frac{\partial S_t}{\partial S_t} = 1$.

\[\frac{\partial S_t}{\partial W} = \frac{\partial^+ S_t}{\partial W} + \frac{\partial S_t}{\partial S_{t-1}} \frac{\partial^+ S_{t-1}}{\partial W} + \frac{\partial S_t}{\partial S_{t-1}} \frac{\partial S_{t-1}}{\partial S_{t-2}} \frac{\partial^+ S_{t-2}}{\partial W} + ... + \frac{\partial S_t}{\partial S_{t-1}} \frac{\partial S_{t-1}}{\partial S_{t-2}}...\frac{\partial S_2}{\partial S_1} \frac{\partial^+ S_1}{\partial W}\] \[= \sum_{k=1}^{t} \frac{\partial S_t}{\partial S_k} \frac{\partial^+ S_k}{\partial W}\]

In General:

For a loss at time $t$, $L_t(\theta)$:

\[\frac{\partial L_t(\theta)}{\partial W} = \frac{\partial L_t(\theta)}{\partial S_t} \times \left[ \sum_{k=1}^{t} \frac{\partial S_t}{\partial S_k} \cdot \frac{\partial^+ S_k}{\partial W} \right] \quad \text{--- (Eq 1)}\]

This algorithm is called BPTT (Backpropagation Through Time).
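To make Eq 1 concrete, here is a small NumPy sketch that unrolls a toy RNN, accumulates the sum over $k$ by walking backwards through time, and checks the result against a finite-difference estimate. The squared-error loss at the final step, the $\tanh$ activation, and all sizes are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, T = 4, 3, 6                                 # hidden size, input size, steps
W = rng.normal(size=(d, d)) * 0.5
U = rng.normal(size=(d, n)) * 0.5
b = np.zeros(d)
xs = rng.normal(size=(T, n))
target = rng.normal(size=d)

def forward(W):
    """Unroll the RNN and return all states plus an illustrative loss at time T."""
    S = [np.zeros(d)]                             # S_0
    for x in xs:
        S.append(np.tanh(W @ S[-1] + U @ x + b))  # S_t = sigma(W S_{t-1} + U x_t + b)
    loss = 0.5 * np.sum((S[-1] - target) ** 2)
    return S, loss

S, loss = forward(W)

# BPTT: accumulate the sum over k from Eq 1, walking backwards from k = T to 1
g = S[-1] - target                                # dL/dS_T
dW = np.zeros_like(W)
for k in range(T, 0, -1):
    delta = g * (1 - S[k] ** 2)                   # dL/da_k, since tanh'(a_k) = 1 - S_k^2
    dW += np.outer(delta, S[k - 1])               # explicit part  d+S_k/dW
    g = W.T @ delta                               # propagate to dL/dS_{k-1}

# Sanity check one entry of dW against a central finite difference
i, j, eps = 1, 2, 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[i, j] += eps
Wm[i, j] -= eps
print(dW[i, j], (forward(Wp)[1] - forward(Wm)[1]) / (2 * eps))   # should agree closely
```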

Exploding / Vanishing Gradient

In Equation 1, let’s focus on the term $\frac{\partial S_t}{\partial S_k}$. By the chain rule across time steps:

\[\frac{\partial S_t}{\partial S_k} = \frac{\partial S_t}{\partial S_{t-1}} \cdot \frac{\partial S_{t-1}}{\partial S_{t-2}} \cdot \dots \cdot \frac{\partial S_{k+1}}{\partial S_k} = \prod_{j=k+1}^{t} \frac{\partial S_{j}}{\partial S_{j-1}}\]

Let’s look at the individual terms using the forward pass definitions:

  • $a_j = W S_{j-1} + U x_j + b$
  • $S_j = \sigma(a_j)$

The derivative using the chain rule is:

\[\frac{\partial S_j}{\partial S_{j-1}} = \frac{\partial S_j}{\partial a_j} \cdot \frac{\partial a_j}{\partial S_{j-1}} = \frac{\partial S_j}{\partial a_j} \cdot W\]

Where:

  • $a_j = [a_{j1}, a_{j2}, \dots, a_{jd}]$
  • $S_j = [\sigma(a_{j1}), \sigma(a_{j2}), \dots, \sigma(a_{jd})]$

The Jacobian Matrix:

The term $\frac{\partial S_j}{\partial a_j}$ is a diagonal matrix containing the derivatives of the activation function:

\[\frac{\partial S_j}{\partial a_j} = \begin{bmatrix} \frac{\partial S_{j1}}{\partial a_{j1}} & & \\ & \frac{\partial S_{j2}}{\partial a_{j2}} & \\ & & \ddots \\ & & & \frac{\partial S_{jd}}{\partial a_{jd}} \end{bmatrix} = \begin{bmatrix} \sigma'(a_{j1}) & & \\ & \sigma'(a_{j2}) & \\ & & \ddots \\ & & & \sigma'(a_{jd}) \end{bmatrix}\] \[\frac{\partial S_j}{\partial S_{j-1}} = \text{diag}(\sigma'(a_j)) \cdot W\]

Part A: The “Why” - Derivative of a Vector w.r.t. a Vector

To understand how and why the Jacobian matrix appears here, we need to break down the derivative of a vector function with respect to another vector.

In single-variable calculus, if you have $y = f(x)$, the derivative is a single number (scalar). However, in an RNN:

  • Input: $S_{j-1}$ is a hidden state vector of size $(d \times 1)$.
  • Output: $S_j$ is the next hidden state vector of size $(d \times 1)$.

When you ask, “How does the output vector change when I nudge the input vector?”, you aren’t asking one question. You are asking $d \times d$ questions:

  • How does output unit 1 change when I nudge input unit 1?
  • How does output unit 1 change when I nudge input unit 2?
  • …and so on.

The collection of all these partial derivatives is the Jacobian Matrix. Since input and output are both dimension $d$, the resulting matrix must be $(d \times d)$.

Part B: The Activation Part ($\frac{\partial S_j}{\partial a_j}$)

This is where the diagonal matrix comes from. Recall:

\[S_j = \sigma(a_j)\]

Crucially, activations in a standard layer are element-wise.

  • The activation of neuron 1 ($S_{j,1}$) depends only on the pre-activation of neuron 1 ($a_{j,1}$).
  • Changing $a_{j,2}$ has zero effect on $S_{j,1}$.

If we write out the Jacobian matrix for this part:

\[\frac{\partial S_j}{\partial a_j} = \begin{bmatrix} \frac{\partial S_{j,1}}{\partial a_{j,1}} & \frac{\partial S_{j,1}}{\partial a_{j,2}} & \dots \\ \frac{\partial S_{j,2}}{\partial a_{j,1}} & \frac{\partial S_{j,2}}{\partial a_{j,2}} & \dots \\ \vdots & \vdots & \ddots \end{bmatrix}\]

Because of the element-wise independence:

  1. Off-diagonal terms (like $\frac{\partial S_{j,1}}{\partial a_{j,2}}$) are 0.
  2. Diagonal terms (like $\frac{\partial S_{j,1}}{\partial a_{j,1}}$) are simply the scalar derivative of the activation function, denoted $\sigma'(a_{j,1})$.

This results in a Diagonal Matrix of size $(d \times d)$:

\[\frac{\partial S_j}{\partial a_j} = \text{diag}(\sigma'(a_j))\]

Part C: Putting it together

Now we multiply the activation Jacobian from Part B by the recurrent weight matrix $W$ (which is the term $\frac{\partial a_j}{\partial S_{j-1}}$ derived earlier):

\[\frac{\partial S_j}{\partial S_{j-1}} = \underbrace{\text{diag}(\sigma'(a_j))}_{\text{activation Jacobian } (d \times d)} \cdot \underbrace{W}_{\text{recurrent weights } (d \times d)}\]
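As a quick check of this result, the sketch below compares the analytic Jacobian $\text{diag}(\sigma'(a_j)) \cdot W$ against a finite-difference Jacobian for one random step (with $\sigma = \tanh$; the sizes and values are arbitrary assumptions).

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
W = rng.normal(size=(d, d))
u = rng.normal(size=d)                    # stands in for U x_j + b at this step
S_prev = rng.normal(size=d)

def step(S_prev):
    return np.tanh(W @ S_prev + u)

# Analytic Jacobian: diag(sigma'(a_j)) @ W, with sigma = tanh
a_j = W @ S_prev + u
analytic = np.diag(1 - np.tanh(a_j) ** 2) @ W

# Numerical Jacobian via central finite differences, one input unit at a time
eps = 1e-6
numeric = np.zeros((d, d))
for q in range(d):
    e = np.zeros(d); e[q] = eps
    numeric[:, q] = (step(S_prev + e) - step(S_prev - e)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))   # True
```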

Back to the derivation. Taking the norm of both sides of our original equation:

\[\frac{\partial S_j}{\partial S_{j-1}} = \text{diag}(\sigma'(a_j)) \cdot W\] \[\Rightarrow \lVert \frac{\partial S_j}{\partial S_{j-1}} \rVert = \lVert \text{diag}(\sigma'(a_j)) \cdot W \rVert\]

Using the sub-multiplicative property of matrix norms:

\[\lVert \text{diag}(\sigma'(a_j)) \cdot W \rVert \le \| \text{diag}(\sigma'(a_j)) \| \cdot \| W \|\]

Bounding the Activation Derivative: We define an upper bound $\gamma$ for the derivative of the activation function:

  • $\sigma'(a_j) \le \frac{1}{4} = \gamma$ [if $\sigma$ is the logistic sigmoid]
  • $\sigma'(a_j) \le 1 = \gamma$ [if $\sigma$ is $\tanh$]

Substituting this back into the inequality (and letting $\lVert W \rVert = \lambda$):

\[\Rightarrow \left\| \frac{\partial S_j}{\partial S_{j-1}} \right\| \le \gamma \cdot \| W \| = \gamma \cdot \lambda\]
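As a quick numerical sanity check of the bound $\gamma$ (a one-off sketch; the evaluation grid is an arbitrary assumption):

```python
import numpy as np

a = np.linspace(-10, 10, 100_001)                    # dense grid around 0
logistic = 1 / (1 + np.exp(-a))
print((logistic * (1 - logistic)).max())             # sigma'(a) peaks at 0.25
print((1 - np.tanh(a) ** 2).max())                   # tanh'(a) peaks at 1.0
```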

The Product over Time Steps: Now, looking at the total gradient across time steps from $k$ to $t$:

\[\Rightarrow \left\| \frac{\partial S_t}{\partial S_k} \right\| = \left\| \prod_{j=k+1}^{t} \frac{\partial S_j}{\partial S_{j-1}} \right\| \le \prod_{j=k+1}^{t} \gamma \lambda = (\gamma \lambda)^{t-k}\]
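The sketch below illustrates both regimes in the cleanest possible setting: an orthogonal recurrent matrix scaled to a chosen norm $\lambda$, with the state held at zero so that $\tanh'(a_j) = 1$ and each Jacobian is exactly $W$ (making the bound tight). The chosen $\lambda$ values, dimension, and step count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
d, steps, gamma = 8, 50, 1.0                   # gamma = 1 for tanh

Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # orthogonal matrix, spectral norm 1

for lam in (0.9, 1.1):                         # ||W|| just below / above 1
    W = lam * Q
    # With S = 0 and zero input, a_j = 0, so diag(tanh'(a_j)) = I and the
    # product of Jacobians over `steps` time steps is exactly W^steps.
    J = np.linalg.matrix_power(W, steps)
    print(lam, np.linalg.norm(J, 2), (gamma * lam) ** steps)
# lambda = 0.9 -> norm ~0.005 (vanishing); lambda = 1.1 -> norm ~117 (exploding)
```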

Conclusions:

  • If $(\gamma \cdot \lambda) > 1$, we have an exploding gradient problem.
  • If $(\gamma \cdot \lambda) < 1$, we have a vanishing gradient problem.

Two Solutions

  1. Truncated BPTT: restrict the backward pass (and hence the product of Jacobians) to at most $\tau$ time steps, where $\tau < t-k$.
  2. Gradient Clipping: for exploding gradients, simply rescale (clip) the gradients whenever their norm exceeds a threshold; see the sketch after this list.
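Here is a minimal sketch of both ideas, assuming the gradients are held in a list of NumPy arrays; the threshold of 5.0 and the name `clip_by_global_norm` are arbitrary choices for illustration.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale gradient arrays so their combined L2 norm is at most max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads

# Truncated BPTT is a one-line change to the backward loop in the earlier BPTT
# sketch: walk back at most tau steps instead of all the way to k = 1, e.g.
#   for k in range(T, max(T - tau, 0), -1): ...
```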

How LSTM/GRU improve over RNN

The Problem with Standard RNNs

  • Global History: The state $S_t$ is responsible for capturing information from all previous time steps.
  • Constant Transformation: At every time step, the entire existing memory $S_{t-1}$ is transformed (and potentially distorted) by the new input $x_t$ and weights $W$.
  • Loss of Context: After $t$ steps, the information originally stored at step $t-k$ has been transformed so many times that the original signal is effectively lost. This makes it impossible for the network to learn long-term dependencies.
  • Gradient Issues: This forward-pass instability manifests mathematically during Backpropagation as the Vanishing or Exploding Gradient problem, preventing the network from learning how to correct errors from the distant past.

The Whiteboard Analogy (Fixed Memory)

Here is the step-by-step evolution from a standard RNN to an LSTM, using a Whiteboard Theorem Proving analogy to explain the logic.

In this analogy:

  • The Whiteboard ($S_{t-1}$): Contains the mathematical proofs and lemmas derived up to the previous step.
  • The New Input ($x_t$): A new logical axiom or observation introduced at the current step.

Recall that the original RNN equations were:

\[S_t = \sigma(U x_t + W S_{t-1} + b)\] \[y_t = O(V S_t + c)\]

Phase 1: Selective Forget (The Eraser)

The Analogy: You look at the whiteboard full of equations from yesterday ($S_{t-1}$). Before you start proving today’s part of the theorem, you realize that “Lemma 3,” derived yesterday, was a logical dead-end or is irrelevant to the new direction the proof is taking.

  • Action: You grab the eraser and wipe out “Lemma 3,” but you keep “Lemma 1” and “Lemma 2” because they are still useful.
  • Why: If you don’t erase the bad logic, it will confuse your new derivation.

The Equation Evolution: In a standard RNN, the previous state $S_{t-1}$ is multiplied by a weight matrix $W$, which “morphs” the entire memory. It doesn’t have a specific mechanism to just “keep” or “drop” exact items.

To fix this, we add a Forget Gate ($f_t$). This is a vector of numbers between 0 and 1.

  • 0 = Completely erase.
  • 1 = Keep perfectly.

Modified RNN Equation: \(S_t = \underbrace{f_t \odot S_{t-1}}_{\text{Selective Forget}} + \dots\)

(We are not adding new info yet; we are just cleaning up the old state.)

Phase 2: Selective Write (The Marker)

The Analogy: Now that the board is clean, you analyze the new input axiom ($x_t$). You derive a new intermediate conclusion (Candidate $\tilde{S}_t$).

  • Action: You don’t immediately write this new conclusion in permanent marker. First, you check if it is strong and valid.
    • If the new conclusion is weak (noise), you write very faintly or not at all.
    • If it is a breakthrough, you write it boldly next to the old lemmas you kept.
  • Why: You only want to write information that adds value to the proof.

The Equation Evolution: In a standard RNN, the new input term $U x_t$ is always folded into the state update; you have no choice but to incorporate it.

To fix this, we create a Candidate State ($\tilde{S}_t$) (the draft) and an Input Gate ($i_t$) (the decision to write).

Modified RNN Equation: \(S_t = \underbrace{(f_t \odot S_{t-1})}_{\text{Cleaned Old Memory}} + \underbrace{(i_t \odot \tilde{S}_t)}_{\text{Selective Write}}\)

  • \(\tilde{S}_t = \tanh(W h_{t-1} + U x_t)\) (The proposed new content)
  • $i_t$: a gate vector with entries between 0 and 1 (how much of the proposal to write)

Phase 3: Selective Read (The Presentation)

The Analogy: You have now updated the board. It contains the preserved old lemmas plus the new breakthrough.

  • Action: A student raises their hand and asks, “What is the current result?”
  • Analysis: The board contains everything: intermediate steps, scratchpad notes, and the final result. You don’t read the whole board to the student. You filter the information and only speak out (Read) the parts that answer the specific question.
  • Why: The internal memory (Whiteboard) should be rich and detailed, but the output (what you say) should be concise and relevant.

The Equation Evolution: In a standard RNN, the output $y_t$ is usually just a function of the whole state $S_t$.

To fix this, we distinguish between the Internal Cell State ($S_t$) (the actual whiteboard) and the Hidden State ($h_t$) (what we actually output/pass to the next step). We use an Output Gate ($O_t$).

Modified RNN Equation: \(h_t = \underbrace{O_t \odot \tanh(S_t)}_{\text{Selective Read}}\)

  • $S_t$: The full, messy whiteboard.
  • $h_t$: The clean, filtered report given to the outside world (and passed as “previous context” to the next time step).

Summary: The Complete LSTM Block

By combining these three logical steps, we transform the single RNN update line into the LSTM system:

  1. Forget: $f_t \odot S_{t-1}$ (Clean the past)
  2. Write: $+ i_t \odot \tilde{S}_t$ (Add the future)
  3. Read: $h_t = O_t \odot \tanh(S_t)$ (Report the result)

Image of Deep Neural Net

Understanding the Equations

Here are the complete equations for the LSTM unit, following the Forget $\rightarrow$ Write $\rightarrow$ Read sequence established in our Whiteboard analogy.

We have used the standard notation where $S_t$ is the Cell State (the Whiteboard) and $h_t$ is the Hidden State (the Report/Output).

1. Selective Forget (The Eraser)

First, we calculate the Forget Gate vector ($f_t$). This looks at the previous output ($h_{t-1}$) and current input ($x_t$) to decide which parts of the old whiteboard ($S_{t-1}$) are no longer relevant.

  • The Decision ($f_t$): A vector of values between 0 (completely erase) and 1 (keep everything).

    \[f_t = \sigma(W_f \cdot h_{t-1} + U_f \cdot x_t + b_f)\]
  • The Operation: We multiply the old state by this vector element-wise.

    \[S_{t(\text{cleaned})} = f_t \odot S_{t-1}\]

2. Selective Write (The Marker)

Next, we determine what new information to add. This has two parts: creating the “draft” of new info (Candidate) and deciding how strongly to write it (Input Gate).

  • The “Draft” (Candidate $\tilde{S}_t$): The raw new information derived from the current input.

    \[\tilde{S}_t = \tanh(W_c \cdot h_{t-1} + U_c \cdot x_t + b_c)\]

    (Note: Standard LSTMs use $\tanh$ here to allow for negative values).

  • The “Pressure” (Input Gate $i_t$): Decides which parts of the draft are worth writing.

    \[i_t = \sigma(W_i \cdot h_{t-1} + U_i \cdot x_t + b_i)\]
  • The Operation (Update Cell State): We add the filtered new info to our cleaned old state.

    \[S_t = \underbrace{(f_t \odot S_{t-1})}_{\text{Forget Old}} + \underbrace{(i_t \odot \tilde{S}_t)}_{\text{Write New}}\]

3. Selective Read (The Presentation)

Finally, we have the updated whiteboard ($S_t$). Now we must decide what to reveal to the outside world as the output ($h_t$).

  • The Filter (Output Gate $O_t$): Decides which parts of the internal state are relevant to the immediate task.

    \[O_t = \sigma(W_o \cdot h_{t-1} + U_o \cdot x_t + b_o)\]
  • The Operation (Calculate Hidden State): We normalize the whiteboard values (using $\tanh$) and filter them.

    \[h_t = O_t \odot \tanh(S_t)\]

Summary of Variables

  • $x_t$: Input vector at time $t$.
  • $h_{t-1}$: Previous hidden state (output).
  • $S_{t-1}$: Previous cell state (memory).
  • $\sigma$: Sigmoid function (pushes values to range $[0, 1]$).
  • $\tanh$: Hyperbolic tangent (pushes values to range $[-1, 1]$).
  • $W, U, b$: Learnable weights and biases specific to each gate ($f, i, c, o$).

Summary of LSTM Equations

Gates:

  • Forget Gate: $f_t = \sigma(W_f h_{t-1} + U_f x_t + b_f)$
  • Input Gate: $i_t = \sigma(W_i h_{t-1} + U_i x_t + b_i)$
  • Candidate Gate: \(\tilde{S}_t = \tanh(W_c h_{t-1} + U_c x_t + b_c)\)
  • Output: $O_t = \sigma(W_o h_{t-1} + U_o x_t + b_o)$

States:

  • Cell State: $S_t = f_t \odot S_{t-1} + i_t \odot \tilde{S}_t$
    • Long-Term Memory
    • Designed for stable gradient flow
    • The ‘whiteboard’ in our analogy
  • Hidden State: $h_t = O_t \odot \tanh(S_t)$
    • Short-term ‘readable’ memory
    • Used for outputs and fed into the next time step
    • The ‘lecture summary’ in our analogy

(Note: $S_t$ is popularly denoted by $C_t$ [cell state] in LSTM literature)
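Putting the four gate equations and the two state updates together, here is a minimal NumPy sketch of a single LSTM step. The dictionary-based parameters, the 0.1 initialization scale, and the toy dimensions are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, S_prev, p):
    """One LSTM step following the Forget -> Write -> Read equations above."""
    f_t = sigmoid(p["W_f"] @ h_prev + p["U_f"] @ x_t + p["b_f"])      # forget gate
    i_t = sigmoid(p["W_i"] @ h_prev + p["U_i"] @ x_t + p["b_i"])      # input gate
    S_tilde = np.tanh(p["W_c"] @ h_prev + p["U_c"] @ x_t + p["b_c"])  # candidate
    o_t = sigmoid(p["W_o"] @ h_prev + p["U_o"] @ x_t + p["b_o"])      # output gate
    S_t = f_t * S_prev + i_t * S_tilde        # cell state: forget old + write new
    h_t = o_t * np.tanh(S_t)                  # hidden state: selective read
    return h_t, S_t

# Illustrative sizes: input 3, hidden 4
rng = np.random.default_rng(0)
n, d = 3, 4
params = {}
for g in "fico":                              # gates f, i, c (candidate), o
    params[f"W_{g}"] = rng.normal(size=(d, d)) * 0.1
    params[f"U_{g}"] = rng.normal(size=(d, n)) * 0.1
    params[f"b_{g}"] = np.zeros(d)

h, S = np.zeros(d), np.zeros(d)
for x in rng.normal(size=(5, n)):             # unroll over 5 time steps
    h, S = lstm_step(x, h, S, params)
```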


GRU (Gated Recurrent Unit)

To derive the Gated Recurrent Unit (GRU) equations, we apply three major simplifications to the LSTM architecture. The goal of the GRU is to achieve similar performance (capturing long-term dependencies) with fewer parameters and operations.

We will transform the LSTM (Forget $\to$ Write $\to$ Read) model into the GRU (Reset $\to$ Update) model.

Simplification 1: Merge the States (The Whiteboard is the Report)

LSTM Approach: We maintained two separate variables:

  1. Cell State ($S_t$): The internal Whiteboard (protected memory).
  2. Hidden State ($h_t$): The exposed Report (filtered output).

GRU Simplification: We assume that we don’t need to hide anything. The internal memory and the output can be the same.

  • Action: Merge $S_t$ and $h_t$ into a single hidden state $h_t$.

Simplification 2: Couple Forget and Write (The “Update” Slider)

LSTM Approach: We had two separate decisions:

  1. Forget Gate ($f_t$): “How much old stuff should I erase?”
  2. Input Gate ($i_t$): “How much new stuff should I write?”

GRU Simplification: In practice, these two decisions are complementary. If you are writing 100% new information (e.g., a new scene in a movie), you likely want to forget 100% of the old context. We combine them into a single Update Gate ($z_t$). This acts like a slider:

  • If $z_t = 0$: Keep the old memory (pure “retain” behavior).
  • If $z_t = 1$: Overwrite with new info (pure “write” behavior).

Derivation: Instead of:

\[S_t = f_t \odot S_{t-1} + i_t \odot \tilde{S}_t\]

We use:

\[h_t = \underbrace{(1 - z_t) \odot h_{t-1}}_{\text{Retain Old}} + \underbrace{z_t \odot \tilde{h}_t}_{\text{Write New}}\]

The Update Gate equation is standard:

\[z_t = \sigma(W_z \cdot h_{t-1} + U_z \cdot x_t)\]

Simplification 3: Drop Output Gate, Add Reset Gate

LSTM Approach: We calculated the candidate $\tilde{S}_t$ using the full history, but we filtered the final output using an Output Gate ($O_t$).

GRU Simplification: Since we merged the states (Simplification 1), we have no Output Gate. We output $h_t$ directly.

However, to compensate for losing the detailed control of the LSTM, we introduce a Reset Gate ($r_t$) inside the candidate generation.

  • Purpose: It allows the network to “short-term forget” the previous state when computing the new candidate, even if it hasn’t overwritten the actual state yet.
  • Analogy: It’s like squinting at the whiteboard. You briefly ignore parts of the history to form a new thought, without actually erasing the board yet.

Derivation: LSTM Candidate:

\[\tilde{S}_t = \tanh(W h_{t-1} + U x_t)\]

GRU Candidate (with Reset):

\[\tilde{h}_t = \tanh(W (\mathbf{r_t} \odot h_{t-1}) + U x_t)\]

The Reset Gate equation:

\[r_t = \sigma(W_r \cdot h_{t-1} + U_r \cdot x_t)\]

Image of Deep Neural Net

  • Image credit: By Jeblad - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=66225938

Final Summary: The GRU Equations

By applying these simplifications, we reduce the system from 3 Gates + 2 States (LSTM) to 2 Gates + 1 State (GRU).

1. The Gates

  • Reset Gate ($r_t$): (Decides how much past to ignore for the candidate) \(r_t = \sigma(W_r h_{t-1} + U_r x_t)\)
  • Update Gate ($z_t$): (Decides the balance between old memory and new candidate) \(z_t = \sigma(W_z h_{t-1} + U_z x_t)\)

2. The Candidate (Proposed New State)

  • Action: Apply the Reset gate to the previous state, then combine with input. \(\tilde{h}_t = \tanh(W (r_t \odot h_{t-1}) + U x_t)\)

3. The Final Update (Slider)

  • Action: Interpolate between the old state and the new candidate (a NumPy sketch of the complete cell follows below). \(h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t\)
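Here is a minimal NumPy sketch of one GRU step, assembled from the two gates, the candidate, and the final interpolation above (parameter names, sizes, and initialization are illustrative assumptions; biases are omitted to match the equations as written):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step following the Reset -> Update equations above."""
    r_t = sigmoid(p["W_r"] @ h_prev + p["U_r"] @ x_t)              # reset gate
    z_t = sigmoid(p["W_z"] @ h_prev + p["U_z"] @ x_t)              # update gate
    h_tilde = np.tanh(p["W"] @ (r_t * h_prev) + p["U"] @ x_t)      # candidate
    h_t = (1 - z_t) * h_prev + z_t * h_tilde                       # slider update
    return h_t

rng = np.random.default_rng(0)
n, d = 3, 4                                   # illustrative input / hidden sizes
params = {
    "W_r": rng.normal(size=(d, d)) * 0.1, "U_r": rng.normal(size=(d, n)) * 0.1,
    "W_z": rng.normal(size=(d, d)) * 0.1, "U_z": rng.normal(size=(d, n)) * 0.1,
    "W":   rng.normal(size=(d, d)) * 0.1, "U":  rng.normal(size=(d, n)) * 0.1,
}
h = np.zeros(d)
for x in rng.normal(size=(5, n)):             # unroll over 5 time steps
    h = gru_step(x, h, params)
```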

How LSTMs Avoid the Problem of Vanishing Gradient

To understand how LSTMs and GRUs solve the vanishing gradient problem, we have to look at the mathematical operation connecting the state at time $t-1$ to the state at time $t$.

The short answer is: They convert multiplicative updates (which decay) into additive updates (which persist).

1. The Villain: Vanilla RNN (Multiplicative Decay)

In a standard RNN, to get from $S_{t-1}$ to $S_t$, we forcefully pass the information through a matrix multiplication and a squashing function ($\tanh$).

\[S_t = \tanh(W S_{t-1} + U x_t)\]

When we backpropagate, the gradient flows backwards using the chain rule. As we derived in our notes, the gradient at step $k$ relative to step $t$ involves a product of Jacobians:

\[\left\| \frac{\partial S_t}{\partial S_k} \right\| \le \left( \gamma \cdot \lVert W \rVert \right)^{t-k}\]
  • The Problem: If $\lVert W \rVert < 1$, or if the activation derivative $\tanh'$ is small (it is at most 1.0 and usually much smaller), this product decays exponentially.
  • Analogy: It’s like photocopying a photocopy 100 times. Each step slightly degrades the image (multiplies quality by 0.9) until the original information is unrecognizable.

2. The LSTM Solution: The Additive Superhighway

The LSTM architecture introduces the Cell State ($S_t$), which is our “Whiteboard.” The key innovation is how this whiteboard is updated. Look at the equation again:

\[S_t = \underbrace{f_t \odot S_{t-1}}_{\text{Retain Old}} + \underbrace{i_t \odot \tilde{S}_t}_{\text{Add New}}\]

Notice the operation between $S_t$ and $S_{t-1}$? It is a plus sign (+), not a matrix multiplication.

The Derivative Check

If we compute the partial derivative of the current cell state with respect to the previous cell state (holding other gates constant for simplicity):

\[\frac{\partial S_t}{\partial S_{t-1}} = f_t\]
  • The Magic: The gradient is controlled directly by the Forget Gate $f_t$.
  • The “Remember” Setting: If the network learns that a piece of information is important for the long term, it sets $f_t \approx 1.0$.
  • The Result: The gradient becomes $1.0 \times 1.0 \times 1.0 \dots$
  • The gradient can flow backwards through time (across hundreds of steps) without vanishing.

This is often called the Constant Error Carousel (CEC). The error signal gets trapped in the cell state loop and can spin round and round without degrading until the network decides to “forget” it.
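A tiny numerical illustration of the carousel, comparing how a scalar gradient signal survives 200 steps along the vanilla RNN path versus the gated cell-state path (the specific values 0.7, 0.9, and 0.99 are assumed purely for illustration):

```python
steps = 200

# Vanilla RNN path: each step multiplies the gradient by sigma'(a) * w
w, grad_rnn = 0.9, 1.0
for _ in range(steps):
    grad_rnn *= 0.7 * w            # a typical tanh'(a) value well inside (0, 1)

# LSTM cell-state path: each step multiplies the gradient by f_t
f, grad_lstm = 0.99, 1.0           # a forget gate the network keeps close to 1
for _ in range(steps):
    grad_lstm *= f

print(grad_rnn)    # ~1e-40: vanished
print(grad_lstm)   # ~0.13: still usable after 200 steps
```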

3. The GRU Solution: The Learned Slider

The GRU uses the same logic but implements it via the Update Gate ($z_t$).

\[h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t\]

If we look at how the gradient flows from $h_t$ back to $h_{t-1}$:

\[\frac{\partial h_t}{\partial h_{t-1}} \approx (1 - z_t)\]
  • If $z_t \approx 0$ (meaning “don’t update, keep the old memory”): The derivative is $\approx 1$.
  • The gradient passes through the unit untouched, preserving the dependency from the distant past.

Summary Comparison

| | Vanilla RNN | LSTM / GRU |
|---|---|---|
| State Update | $S_t = \tanh(W S_{t-1})$ | $S_t = S_{t-1} + \dots$ |
| Math Operation | Multiplication (matrices) | Addition (gated) |
| Gradient Flow | Decays exponentially ($W^n$) | Preserved linearly ($1^n$) |
| Analogy | Photocopying a photocopy | Writing in permanent marker (stays until erased) |

Conclusion: LSTMs and GRUs don’t eliminate the vanishing gradient; they control it. They turn “vanishing” from a bug into a feature. The network effectively says: “I will allow the gradient to vanish for irrelevant noise (by setting $f_t=0$), but I will force it to persist for critical context (by setting $f_t=1$).”


