Welcome back to our blog series, “Learn Deep Learning with NumPy”! In Part 3.1, we introduced neural networks with a single-layer perceptron, building on logistic regression to classify data using the sigmoid activation function. While sigmoid is useful for binary classification, it’s just one of many activation functions that power neural networks. In Part 3.2, we’ll dive deeper into activation functions, implementing two crucial ones—ReLU (Rectified Linear Unit) and softmax—that enable neural networks to learn complex, non-linear patterns.
By the end of this post, you’ll understand why activation functions are essential for introducing non-linearity, implement relu() and softmax() in NumPy, and see them applied to synthetic data and MNIST-like outputs. These functions will be key components of our toolkit, reused in multi-layer perceptrons (MLPs) and beyond. Let’s explore the math and code behind activation functions!
Activation functions are the heart of neural networks, transforming linear combinations of inputs and weights into non-linear outputs. Without non-linear activation functions, stacking multiple linear layers (like X @ W + b) would still result in a linear model, incapable of solving complex problems like image classification or natural language processing. Non-linearity allows neural networks to learn intricate patterns and decision boundaries.
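To make this concrete, here is a minimal sketch (the shapes and random weights are arbitrary, chosen only for illustration) showing that two stacked linear layers with no activation in between collapse into a single equivalent linear layer:

import numpy as np

# Hypothetical sizes: 4 samples, 3 features, 5 hidden units, 2 outputs
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
W1, b1 = rng.standard_normal((3, 5)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((5, 2)), rng.standard_normal(2)

# Two stacked linear layers, no activation in between
out_stacked = (X @ W1 + b1) @ W2 + b2

# The same mapping expressed as one linear layer
W_combined = W1 @ W2
b_combined = b1 @ W2 + b2
out_single = X @ W_combined + b_combined

print(np.allclose(out_stacked, out_single))  # True: stacking added no expressive power

This is exactly why every hidden layer needs a non-linear activation between the matrix multiplications.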
In this post, we’ll focus on two widely used activation functions: ReLU, which we’ll use in hidden layers, and softmax, which we’ll use in the output layer for multi-class classification.
We’ll also briefly touch on their derivatives, which are critical for computing gradients during backpropagation (covered in Part 3.4). Let’s dive into the mathematics of these functions.
ReLU is a simple yet powerful activation function that outputs the input if it’s positive and 0 otherwise. For a scalar input $z$, it is defined as:

$$\text{ReLU}(z) = \max(0, z)$$
Applied element-wise to a matrix or vector $Z$, it sets all negative values to 0 while preserving positive values. ReLU is popular in hidden layers because it is very cheap to compute, it avoids the saturation that makes sigmoid gradients vanish (the gradient is exactly 1 for positive inputs), and it produces sparse activations by zeroing out negative values.
The derivative of ReLU, used in backpropagation, is also straightforward:

$$\text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \le 0 \end{cases}$$

(Strictly speaking the derivative is undefined at $z = 0$; in practice we simply use 0 there.)
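We won’t need it until backpropagation in Part 3.4, but a minimal sketch of this derivative in NumPy (the name relu_derivative() is just a placeholder here, not yet part of our toolkit) looks like this:

import numpy as np
from numpy.typing import NDArray

def relu_derivative(Z: NDArray[np.floating]) -> NDArray[np.floating]:
    # 1.0 where Z > 0, 0.0 elsewhere (using the common convention of 0 at Z == 0)
    return (Z > 0).astype(Z.dtype)

print(relu_derivative(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 1.]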
Softmax is used in the output layer for multi-class classification, converting raw scores (logits) into probabilities that sum to 1. For a vector $z$ of length $K$ (representing scores for $K$ classes), the softmax output for the $i$-th element is:

$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
Applied to a matrix $Z$ of shape $(n, K)$ (e.g., $n$ samples, $K$ classes), softmax operates row-wise, ensuring each row’s outputs sum to 1. This makes it ideal for interpreting outputs as class probabilities (e.g., for MNIST with 10 digits). The derivative of softmax is more complex due to its dependency on all elements, but we’ll cover it in detail during backpropagation.
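As a quick worked example (with arbitrary scores), take a single row z = [2.0, 1.0, 0.1]. The exponentials are roughly [7.39, 2.72, 1.11], their sum is about 11.21, so softmax(z) ≈ [0.659, 0.242, 0.099], which sums to 1 as expected.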
Now, let’s implement these activation functions in NumPy and see them in action with examples.
We’ll create reusable relu() and softmax() functions, adding them to our deep learning toolkit. These will be essential for building multi-layer neural networks in upcoming posts.
Here’s the implementation of ReLU using NumPy’s maximum() function for efficiency:
import numpy as np
from numpy.typing import NDArray
from typing import Union

def relu(Z: NDArray[np.floating]) -> NDArray[np.floating]:
    """
    Compute the ReLU activation function element-wise.

    Args:
        Z: Input array of any shape with floating-point values

    Returns:
        Array of the same shape with ReLU applied element-wise, max(0, Z)
    """
    return np.maximum(0, Z)
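A quick sanity check on a small vector (values chosen arbitrarily) shows that relu() works on arrays of any shape, zeroing out the negatives:

print(relu(np.array([-2.0, 0.0, 3.5])))  # [0.  0.  3.5]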
Here’s the implementation of softmax, ensuring numerical stability by subtracting the maximum value to prevent overflow in the exponential operation:
def softmax(Z: NDArray[np.floating]) -> NDArray[np.floating]:
    """
    Compute the softmax activation function row-wise.

    Args:
        Z: Input array of shape (n_samples, n_classes) with floating-point values

    Returns:
        Array of the same shape with softmax applied row-wise, probabilities summing to 1 per row
    """
    # Subtract the max for numerical stability (avoid overflow in exp)
    Z_max = np.max(Z, axis=1, keepdims=True)
    exp_Z = np.exp(Z - Z_max)
    sum_exp_Z = np.sum(exp_Z, axis=1, keepdims=True)
    return exp_Z / sum_exp_Z
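Note that because softmax() reduces along axis=1, it expects a 2-D input of shape (n_samples, n_classes); a single example must be passed as a 1xK row. A quick check against the hand-computed values from the worked example above:

print(softmax(np.array([[2.0, 1.0, 0.1]])))  # approximately [[0.659 0.242 0.099]]

This row-wise design choice keeps the function directly usable on whole mini-batches of logits.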
Let’s test these activation functions on synthetic data. We’ll apply ReLU to a small matrix and softmax to a matrix simulating MNIST classification outputs (4 samples, 10 classes).
# Example 1: Applying ReLU to a 3x2 matrix
Z_relu = np.array([[-1.0, 2.0],
                   [0.0, -3.0],
                   [4.0, -0.5]])
A_relu = relu(Z_relu)
print("Input for ReLU (3x2):\n", Z_relu)
print("Output after ReLU (3x2):\n", A_relu)

# Example 2: Applying Softmax to a 4x10 matrix (simulating MNIST outputs)
Z_softmax = np.random.randn(4, 10)  # Random scores for 4 samples, 10 classes
A_softmax = softmax(Z_softmax)
print("\nInput for Softmax (4x10, first few columns):\n", Z_softmax[:, :3])
print("Output after Softmax (4x10, first few columns):\n", A_softmax[:, :3])
print("Sum of probabilities per sample (should be ~1):\n", np.sum(A_softmax, axis=1))
Output (approximate, softmax values will vary due to randomness):
Input for ReLU (3x2):
 [[-1.   2. ]
 [ 0.  -3. ]
 [ 4.  -0.5]]
Output after ReLU (3x2):
 [[0. 2.]
 [0. 0.]
 [4. 0.]]

Input for Softmax (4x10, first few columns):
 [[ 0.123 -0.456  0.789]
 [-1.234  0.567 -0.321]
 [ 0.987 -0.654  1.234]
 [-0.111  0.222 -0.333]]
Output after Softmax (4x10, first few columns):
 [[0.134 0.075 0.262]
 [0.032 0.195 0.080]
 [0.247 0.048 0.316]
 [0.098 0.137 0.079]]
Sum of probabilities per sample (should be ~1):
 [1. 1. 1. 1.]
In the ReLU example, negative values are set to 0, while positive values remain unchanged, demonstrating its thresholding behavior. In the softmax example, raw scores are converted to probabilities summing to 1 for each sample, suitable for multi-class classification like MNIST (10 digits). The subtraction of the maximum value (Z_max) in softmax() ensures numerical stability by preventing overflow in np.exp().
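To see why the max-subtraction matters, here is a small sketch (the naive_softmax() helper is only for illustration, not part of our toolkit) comparing a naive implementation with our stable one on large scores:

def naive_softmax(Z):
    # No max subtraction: np.exp overflows to inf for large scores, and inf/inf gives nan
    exp_Z = np.exp(Z)
    return exp_Z / np.sum(exp_Z, axis=1, keepdims=True)

Z_big = np.array([[1000.0, 999.0, 998.0]])
print(naive_softmax(Z_big))  # [[nan nan nan]] plus overflow warnings
print(softmax(Z_big))        # approximately [[0.665 0.245 0.090]]

Subtracting the row maximum shifts the largest exponent to exp(0) = 1 without changing the resulting probabilities.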
Let’s update our neural_network.py file to include the relu() and softmax() functions alongside our previous implementations. These activation functions will be critical for building multi-layer neural networks in upcoming posts.
# neural_network.py
import numpy as np
from numpy.typing import NDArray
from typing import Union, Callable, Tuple, List, Dict


def normalize(X: NDArray[np.floating]) -> NDArray[np.floating]:
    """
    Normalize the input array X by subtracting the mean and dividing by the standard deviation.

    Parameters:
        X (NDArray[np.floating]): Input array to normalize. Should be a numerical array
            (float or compatible type).

    Returns:
        NDArray[np.floating]: Normalized array with mean approximately 0 and standard
            deviation approximately 1 along each axis.
    """
    mean = np.mean(X, axis=0)
    std = np.std(X, axis=0)
    normalized_X = X - mean  # Start with mean subtraction
    mask = std != 0  # Create a mask for non-zero std
    if np.any(mask):
        normalized_X[:, mask] = normalized_X[:, mask] / std[mask]
    return normalized_X


def matrix_multiply(X: NDArray[np.floating], W: NDArray[np.floating]) -> NDArray[np.floating]:
    """
    Perform matrix multiplication between two arrays.

    Args:
        X: First input array/matrix of shape (m, n) with floating-point values
        W: Second input array/matrix of shape (n, p) with floating-point values

    Returns:
        Result of matrix multiplication, shape (m, p) with floating-point values
    """
    return np.matmul(X, W)


def sigmoid(Z: NDArray[np.floating]) -> NDArray[np.floating]:
    """
    Compute the sigmoid activation function element-wise.

    Args:
        Z: Input array of any shape with floating-point values

    Returns:
        Array of the same shape with sigmoid applied element-wise, values in [0, 1]
    """
    return 1 / (1 + np.exp(-Z))


def relu(Z: NDArray[np.floating]) -> NDArray[np.floating]:
    """
    Compute the ReLU activation function element-wise.

    Args:
        Z: Input array of any shape with floating-point values

    Returns:
        Array of the same shape with ReLU applied element-wise, max(0, Z)
    """
    return np.maximum(0, Z)


def softmax(Z: NDArray[np.floating]) -> NDArray[np.floating]:
    """
    Compute the softmax activation function row-wise.

    Args:
        Z: Input array of shape (n_samples, n_classes) with floating-point values

    Returns:
        Array of the same shape with softmax applied row-wise, probabilities summing to 1 per row
    """
    Z_max = np.max(Z, axis=1, keepdims=True)
    exp_Z = np.exp(Z - Z_max)
    sum_exp_Z = np.sum(exp_Z, axis=1, keepdims=True)
    return exp_Z / sum_exp_Z


def mse_loss(y_pred: NDArray[np.floating], y: NDArray[np.floating]) -> float:
    """
    Compute the Mean Squared Error loss between predicted and true values.

    Args:
        y_pred: Predicted values, array of shape (n,) or (n,1) with floating-point values
        y: True values, array of shape (n,) or (n,1) with floating-point values

    Returns:
        Mean squared error as a single float
    """
    return np.mean((y_pred - y) ** 2)


def binary_cross_entropy(A: NDArray[np.floating], y: NDArray[np.floating]) -> float:
    """
    Compute the Binary Cross-Entropy loss between predicted probabilities and true labels.

    Args:
        A: Predicted probabilities (after sigmoid), array of shape (n,) or (n,1), values in [0, 1]
        y: True binary labels, array of shape (n,) or (n,1), values in {0, 1}

    Returns:
        Binary cross-entropy loss as a single float
    """
    epsilon = 1e-15
    return -np.mean(y * np.log(A + epsilon) + (1 - y) * np.log(1 - A + epsilon))


def gradient_descent(X: NDArray[np.floating], y: NDArray[np.floating], W: NDArray[np.floating],
                     b: NDArray[np.floating], lr: float, num_epochs: int, batch_size: int,
                     loss_fn: Callable[[NDArray[np.floating], NDArray[np.floating]], float],
                     activation_fn: Callable[[NDArray[np.floating]], NDArray[np.floating]] = lambda x: x) -> Tuple[NDArray[np.floating], NDArray[np.floating], List[float]]:
    """
    Perform mini-batch gradient descent to minimize loss.

    Args:
        X: Input data, shape (n_samples, n_features)
        y: True values, shape (n_samples, 1)
        W: Initial weights, shape (n_features, 1)
        b: Initial bias, shape (1,) or (1,1)
        lr: Learning rate, step size for updates
        num_epochs: Number of full passes through the dataset
        batch_size: Size of each mini-batch
        loss_fn: Loss function to compute error, e.g., mse_loss or binary_cross_entropy
        activation_fn: Activation function to apply to linear output (default: identity)

    Returns:
        Tuple of (updated W, updated b, list of loss values over epochs)
    """
    n_samples = X.shape[0]
    loss_history = []
    for epoch in range(num_epochs):
        indices = np.random.permutation(n_samples)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        for start_idx in range(0, n_samples, batch_size):
            end_idx = min(start_idx + batch_size, n_samples)
            X_batch = X_shuffled[start_idx:end_idx]
            y_batch = y_shuffled[start_idx:end_idx]
            batch_size_actual = X_batch.shape[0]
            Z_batch = X_batch @ W + b
            y_pred_batch = activation_fn(Z_batch)
            error = y_pred_batch - y_batch
            grad_W = (X_batch.T @ error) / batch_size_actual
            grad_b = np.mean(error)
            W = W - lr * grad_W
            b = b - lr * grad_b
        y_pred_full = activation_fn(X @ W + b)
        loss = loss_fn(y_pred_full, y)
        loss_history.append(loss)
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss:.4f}")
    return W, b, loss_history


def numerical_gradient(X: NDArray[np.floating], y: NDArray[np.floating], params: Dict[str, NDArray[np.floating]],
                       loss_fn: Callable[[NDArray[np.floating], NDArray[np.floating]], float],
                       forward_fn: Callable[[NDArray[np.floating], Dict[str, NDArray[np.floating]]], NDArray[np.floating]],
                       h: float = 1e-4) -> Dict[str, NDArray[np.floating]]:
    """
    Compute numerical gradients for parameters using central difference approximation.

    Args:
        X: Input data, shape (n_samples, n_features)
        y: True values, shape (n_samples, 1)
        params: Dictionary of parameters (e.g., {'W': ..., 'b': ...})
        loss_fn: Loss function to compute error, e.g., mse_loss
        forward_fn: Function to compute predictions from X and params
        h: Step size for finite difference approximation (default: 1e-4)

    Returns:
        Dictionary of numerical gradients for each parameter
    """
    num_grads = {}
    for param_name, param_value in params.items():
        num_grad = np.zeros_like(param_value)
        it = np.nditer(param_value, flags=['multi_index'])
        while not it.finished:
            idx = it.multi_index
            original_value = param_value[idx]
            param_value[idx] = original_value + h
            y_pred_plus = forward_fn(X, params)
            loss_plus = loss_fn(y_pred_plus, y)
            param_value[idx] = original_value - h
            y_pred_minus = forward_fn(X, params)
            loss_minus = loss_fn(y_pred_minus, y)
            num_grad[idx] = (loss_plus - loss_minus) / (2 * h)
            param_value[idx] = original_value
            it.iternext()
        num_grads[param_name] = num_grad
    return num_grads


def forward_perceptron(X: NDArray[np.floating], W: NDArray[np.floating], b: NDArray[np.floating]) -> NDArray[np.floating]:
    """
    Compute the forward pass of a single-layer perceptron.

    Args:
        X: Input data, shape (n_samples, n_features)
        W: Weights, shape (n_features, 1)
        b: Bias, shape (1, 1) or (1,)

    Returns:
        Output after sigmoid activation, shape (n_samples, 1)
    """
    Z = X @ W + b  # Linear combination
    A = sigmoid(Z)  # Sigmoid activation
    return A
You can now import these new functions using from neural_network import relu, softmax. They will be essential for building and training multi-layer neural networks in the upcoming posts.
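As a small preview of where these functions are headed (the shapes and random weights below are purely illustrative, not a trained model), a two-layer forward pass might chain relu() in the hidden layer with softmax() at the output:

import numpy as np
from neural_network import relu, softmax

rng = np.random.default_rng(42)
X = rng.standard_normal((4, 784))            # 4 flattened 28x28 images (illustrative)
W1, b1 = rng.standard_normal((784, 64)) * 0.01, np.zeros(64)
W2, b2 = rng.standard_normal((64, 10)) * 0.01, np.zeros(10)

hidden = relu(X @ W1 + b1)                   # non-linear hidden layer
probs = softmax(hidden @ W2 + b2)            # class probabilities per sample
print(probs.shape, np.sum(probs, axis=1))    # (4, 10), each row sums to ~1

We’ll build and train exactly this kind of architecture properly in Part 3.3.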
To reinforce your understanding of activation functions, try these Python-focused coding exercises. They’ll prepare you for using non-linear activations in neural networks. Run the code and compare outputs to verify your solutions.
Applying ReLU to Synthetic Data
Create a 4x3 matrix Z = np.array([[-1.5, 2.0, 0.0], [-0.5, -2.0, 1.5], [3.0, -1.0, 0.5], [-3.0, 4.0, -0.2]]). Apply relu() to it and print the input and output. Verify that all negative values are set to 0 while positive values remain unchanged.
# Your code here
Z = np.array([[-1.5, 2.0, 0.0], [-0.5, -2.0, 1.5], [3.0, -1.0, 0.5], [-3.0, 4.0, -0.2]])
A_relu = relu(Z)
print("Input Z (4x3):\n", Z)
print("Output after ReLU (4x3):\n", A_relu)
Applying Softmax to Synthetic Classification Data
Create a 3x5 matrix Z = np.random.randn(3, 5) to simulate scores for 3 samples across 5 classes. Apply softmax() to it and print the input, output, and the sum of probabilities per sample (should be ~1). Verify that the output values are between 0 and 1 and sum to 1 per row.
# Your code here
Z = np.random.randn(3, 5)
A_softmax = softmax(Z)
sums = np.sum(A_softmax, axis=1)
print("Input Z (3x5):\n", Z)
print("Output after Softmax (3x5):\n", A_softmax)
print("Sum of probabilities per sample (should be ~1):\n", sums)
ReLU on Larger Matrix with Mixed Values
Create a 5x4 matrix Z with a mix of positive, negative, and zero values (e.g., using np.random.uniform(-5, 5, (5, 4))). Apply relu() and count the number of elements set to 0 (negative or zero inputs). Print the input, output, and count.
# Your code here
Z = np.random.uniform(-5, 5, (5, 4))
A_relu = relu(Z)
zero_count = np.sum(A_relu == 0)
print("Input Z (5x4):\n", Z)
print("Output after ReLU (5x4):\n", A_relu)
print("Number of elements set to 0:", zero_count)
Softmax Numerical Stability Test
Create a 2x3 matrix Z = np.array([[1000, 1000, 1000], [-1000, -1000, -1000]]) with extreme values that could cause overflow in np.exp(). Apply softmax() and print the output. Observe how subtracting the maximum value in our implementation prevents overflow and still produces valid probabilities summing to 1.
# Your code here
Z = np.array([[1000, 1000, 1000], [-1000, -1000, -1000]])
A_softmax = softmax(Z)
sums = np.sum(A_softmax, axis=1)
print("Input Z (2x3, extreme values):\n", Z)
print("Output after Softmax (2x3):\n", A_softmax)
print("Sum of probabilities per sample (should be ~1):\n", sums)
These exercises will help you build intuition for how ReLU and softmax activation functions work, their behavior with different input ranges, and their role in introducing non-linearity or probabilistic outputs in neural networks.
Congratulations on mastering ReLU and softmax activation functions, essential tools for building powerful neural networks! In this post, we’ve explored why non-linear activations are critical, implemented relu() and softmax() in NumPy, and applied them to synthetic data simulating real-world scenarios like MNIST classification. These functions expand our toolkit, preparing us for deeper architectures.
In the next chapter (Part 3.3: Multi-Layer Perceptrons and Forward Propagation), we’ll combine multiple layers with activations like ReLU to build MLPs capable of solving non-linear problems, moving beyond the limitations of single-layer perceptrons. We’ll apply this to MNIST for digit classification.
Until then, experiment with the code and exercises above. If you have questions or want to share your solutions, drop a comment below—I’m excited to hear from you. Let’s keep building our deep learning toolkit together!
Next Up: Part 3.3 – Multi-Layer Perceptrons and Forward Propagation