Deep Learning Primer: Diving into the Activation Function Pool
In the architecture of a deep neural network, activation functions are the crucial component that breathes life into the model. Without them, even the deepest network would be nothing more than a giant linear regression model.
In this post, we’ll explore what activation functions are, why they are essential, and how to choose the right one for your model.
A. What are Activation Functions and Why are They Essential?
An activation function is a mathematical gatekeeper found at the end of every neuron in a neural network. It takes the neuron’s weighted sum input ($z$) and transforms it into an output ($a$), deciding whether the neuron should “fire” (be active) or not.
\[a = f(z) = f(w \cdot x + b)\]Why are they essential?
The primary purpose of an activation function is to introduce non-linearity into the network. Imagine a neural network without activation functions. It would simply be a stack of linear operations.
Layer 1: $h_1 = W_1 x + b_1$ (Linear)
Layer 2: $h_2 = W_2 h_1 + b_2$ (Linear)
Output: $y = W_3 h_2 + b_3$ (Linear)
If you substitute the equations, the entire network collapses into one single linear equation:
\[y = W_{new} x + b_{new}\]This means a 100-layer “deep” network without activation functions has the exact same power as a single-layer Linear Regression model. It cannot learn complex patterns like curves, shapes, or decision boundaries. Activation functions break this linearity, allowing the network to approximate any complex function (the Universal Approximation Theorem).
B. Proving Non-Linearity: The ReLU Example
A function $f(x)$ is linear if and only if it satisfies two properties:
Additivity: $f(x + y) = f(x) + f(y)$
Homogeneity: $f(\alpha x) = \alpha f(x)$
Let’s test this on ReLU (Rectified Linear Unit), which is defined as $f(x) = \max(0, x)$.
Test 1: Additivity
Let $x = 5$ and $y = -5$.
$f(x + y) = f(5 + -5) = f(0) = \mathbf{0}$
$f(x) + f(y) = f(5) + f(-5) = 5 + 0 = \mathbf{5}$
Result: $0 \neq 5$. Additivity fails.
Test 2: Homogeneity
Let $x = 5$ and $\alpha = -1$
$f(\alpha x) = f(-5) = \mathbf{0}$
$\alpha f(x) = -1 \cdot f(5) = -1 \cdot 5 = \mathbf{-5}$
Result: $0 \neq -5$. Homogeneity fails.
Because it fails these tests, ReLU is mathematically non-linear. By stacking layers of ReLU neurons, a network can bend and fold the feature space to solve complex problems that a linear model never could.
C. Major Activation Functions: Formulations & Analysis
1. Sigmoid (Logistic)
- Formula: $\sigma(z) = \frac{1}{1 + e^{-z}}$
- Range: $(0, 1)$
- Use Case: Ideally only for the output layer of binary classification.
- Issues:
- Vanishing Gradient: For very high or low inputs (e.g., $z=10$ or $z=-10$), the gradient is near zero. This kills learning in deep networks.
- Not Zero-Centered: Outputs are always positive, which can cause zig-zagging dynamics during gradient descent.
- Scaling Needed: Yes (Standard Scaling). Large inputs push neurons into the “saturation” zone where gradients die.
2. Tanh (Hyperbolic Tangent)
- Formula: $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$
- Range: $(-1, 1)$
- Benefit: Unlike Sigmoid, it is zero-centered, which makes optimization easier.
- Issues: Still suffers from the Vanishing Gradient problem when inputs are large.
- Scaling Needed: Yes (Standard Scaling) to keep inputs near 0 where the gradient is steepest.
3. ReLU (Rectified Linear Unit)
- Formula: $f(z) = \max(0, z)$
- Range: $[0, \infty)$
- Benefit: Computationally efficient (just a max check). Solves the vanishing gradient problem for positive values. Converges much faster than Sigmoid/Tanh.
- Issue: “Dying ReLU”. If inputs are negative, the gradient is 0. The neuron “dies” and never updates again.
- Scaling: Not strictly required but highly recommended (Batch Norm) to keep inputs in the active positive range.
4. Leaky ReLU
- Formula: $f(z) = \max(\alpha z, z)$ where $\alpha$ is small (e.g., 0.01).
- Range: $(-\infty, \infty)$
- Benefit: Fixes the “Dying ReLU” problem by allowing a small gradient to flow even when the input is negative.
5. ELU (Exponential Linear Unit)
- Formula: $z$ if $z > 0$, else $\alpha(e^z - 1)$
- Range: $(-\alpha, \infty)$
- Benefit: Smooth curve for negative values (unlike the sharp turn of ReLU/Leaky ReLU). Pushes mean activation closer to zero, speeding up learning.
- Issue: Computationally expensive due to the $e^z$ calculation.
6. Softmax
- Formula: Given a vector of scores/logits: $z = (z_1, z_2, …, z_K)$ the softmax converts these into probabilities:
- Range: $(0, 1)$, sum of outputs = 1.
- Use Case: Exclusively for the output layer of multi-class classification problems. It turns raw scores (logits) into a probability distribution.
D. The Power of Asymmetry (ReLU & ELU)
ReLU and ELU are asymmetric functions (they behave differently for positive vs. negative inputs). This asymmetry is a feature, not a bug.
Sparsity (ReLU): Because ReLU outputs exactly 0 for all negative inputs, typically 50% or more of the neurons in a hidden layer are “inactive” (output 0) for any given input. This creates a sparse representation, which is computationally efficient and helps disentangle complex features.
Specialization: The network learns to activate specific subnetworks for specific features. One path might light up for “cats” while another stays dark (0), effectively switching parts of the network on and off.
E. Tackling Exploding and Vanishing Gradients
The choice of activation function is the primary defense against these gradient problems.
1. The Vanishing Gradient Problem:
Cause: In deep networks with Sigmoid/Tanh, gradients become tiny ($< 1$) as they backpropagate through many layers (chain rule multiplication). The early layers stop learning.
Solution: Use ReLU (or its variants). For positive inputs, the gradient of ReLU is exactly 1. No matter how many layers you multiply $1 \times 1 \times 1$, the gradient does not vanish.
2. The Exploding Gradient Problem:
Cause: Gradients accumulate and become massive, causing weights to update wildly (NaN errors).
Solution:
- Gradient Clipping: Manually cap the gradient norm (e.g., max 1.0) during training. Weight Initialization: Use He Initialization (for ReLU) or Xavier/Glorot Initialization (for Sigmoid/Tanh) to ensure the variance of outputs stays consistent across layers.
- Batch Normalization: Normalizes layer inputs to keep them in a stable range, preventing them from drifting into saturation or explosion zones.