Welcome back to our blog series, “Learn Deep Learning with NumPy”! In Part 4.3, we introduced pooling layers and constructed a simple CNN architecture for MNIST, leveraging spatial hierarchies for efficient image processing. Now, in Part 4.4, we’ll address a critical challenge in training deep models: overfitting. We’ll implement two powerful regularization techniques—L2 regularization and dropout—to prevent overfitting and improve generalization on unseen data.
By the end of this post, you’ll understand the mathematics behind L2 regularization and dropout, implement `l2_regularization()` and `dropout()` in NumPy, and apply them to train a 3-layer MLP on MNIST, comparing accuracy with and without regularization. These techniques enhance our training process, reusing `gradient_descent()` and other toolkit functions. Let’s dive into the math and code for regularization!
Deep neural networks, with their large number of parameters, are prone to overfitting—memorizing the training data rather than learning general patterns. This results in high accuracy on training data but poor performance on unseen test data. Regularization techniques combat overfitting by adding constraints or randomness to the model, encouraging simpler solutions or reducing reliance on specific neurons.
In deep learning, regularization is crucial because:

- Deep networks have a very large number of parameters and can easily memorize a limited training set instead of learning general patterns.
- An overfit model shows a wide gap between training accuracy and test accuracy, which is exactly what we want to avoid.
- Adding constraints (like weight penalties) or randomness (like dropping neurons) pushes the model toward simpler, more robust solutions.

In this post, we’ll focus on two popular regularization methods:

- L2 regularization (weight decay), which adds a penalty on the squared magnitude of the weights to the loss.
- Dropout, which randomly deactivates a fraction of neurons during training.
We’ll apply these to a 3-layer MLP on MNIST to observe their impact on training and test accuracy. Let’s explore the math behind these techniques.
L2 regularization, also known as weight decay, adds a penalty term to the loss function based on the squared magnitude of the weights. For a loss function $L_{\text{data}}$ (e.g., cross-entropy), the regularized loss becomes:

$$L = L_{\text{data}} + \lambda \sum_{l} \|W_l\|_2^2 = L_{\text{data}} + \lambda \sum_{l} \sum_{i,j} W_{l,ij}^2$$

Where:

- $L_{\text{data}}$ is the original, unregularized loss (e.g., cross-entropy),
- $W_l$ are the weight matrices of the network (biases are typically not penalized),
- $\lambda$ (`lambda_` in our code) is the regularization strength, e.g., 0.01.
During backpropagation, the gradient of the loss with respect to a weight matrix includes an additional term:

$$\frac{\partial L}{\partial W} = \frac{\partial L_{\text{data}}}{\partial W} + 2 \lambda W$$

This effectively “decays” weights during updates, as the update rule becomes:

$$W \leftarrow W - \eta \left( \frac{\partial L_{\text{data}}}{\partial W} + 2 \lambda W \right) = (1 - 2 \eta \lambda)\, W - \eta \frac{\partial L_{\text{data}}}{\partial W}$$

Where $\eta$ is the learning rate. The penalty can therefore be implemented either by adding $2\lambda W$ to each weight gradient (as we do below) or, equivalently, by scaling the weights by $(1 - 2\eta\lambda)$ after a standard update.
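To see this equivalence concretely, here’s a minimal NumPy sketch with made-up numbers showing that the two forms produce identical updates:

```python
import numpy as np

# Toy weight matrix and a pretend data-loss gradient (illustrative values only)
W = np.array([[1.0, -2.0], [0.5, 3.0]])
grad_data = np.array([[0.1, -0.2], [0.05, 0.3]])
lr, lambda_ = 0.1, 0.01

# Form 1: add the L2 term 2 * lambda * W to the gradient, then take a step
W_form1 = W - lr * (grad_data + 2 * lambda_ * W)

# Form 2: shrink ("decay") the weights first, then apply the plain data update
W_form2 = (1 - 2 * lr * lambda_) * W - lr * grad_data

print(np.allclose(W_form1, W_form2))  # True: both forms give the same new weights
```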
Dropout is a technique that randomly “drops” (deactivates) a fraction of neurons during training, forcing the network to learn redundant representations. For an activation matrix $A$ (e.g., the output of a layer), dropout applies a random binary mask:

$$A_{\text{drop}} = \frac{1}{p}\, (A \odot M), \qquad M_{ij} \sim \text{Bernoulli}(p)$$

Where:

- $A$ is the activation matrix of a layer,
- $M$ is a random mask with the same shape as $A$, where each entry is independently 1 with keep probability $p$ (e.g., 0.8) and 0 otherwise,
- $\odot$ denotes element-wise multiplication, and the $1/p$ factor (known as inverted dropout) keeps the expected activation equal to $A$, so no rescaling is needed at evaluation time when dropout is turned off.
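As a quick illustration (a toy example, separate from the training code below), here’s what a Bernoulli mask and the inverted scaling look like on a small matrix:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the example is reproducible
A = np.ones((2, 4))             # toy activations
p = 0.8                         # keep probability

M = rng.binomial(1, p, size=A.shape)   # 1 = keep, 0 = drop
A_drop = A * M / p                     # survivors scaled by 1/p

print(M)       # 0/1 mask; most entries are 1 since p = 0.8
print(A_drop)  # kept entries become 1/p = 1.25, dropped entries become 0.0
```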
Dropout reduces overfitting by preventing neurons from overly specializing or co-adapting, acting as a form of ensemble averaging over many sub-networks. Now, let’s implement these regularization techniques in NumPy and apply them to MNIST training.
We’ll create `l2_regularization()` to compute the L2 penalty and its gradient, and `dropout()` to randomly deactivate neurons during training. We’ll then train a 3-layer MLP on MNIST with these techniques, comparing performance to a baseline without regularization.
Here’s the implementation of L2 regularization to compute the penalty term and its gradient contribution:
import numpy as np
from numpy.typing import NDArray
from typing import Union, List
def l2_regularization(weights: List[NDArray[np.floating]], lambda_: float) -> tuple[float, List[NDArray[np.floating]]]:
"""
Compute L2 regularization penalty and gradients for a list of weight matrices.
Args:
weights: List of weight matrices (e.g., [W1, W2, W3])
lambda_: Regularization strength (e.g., 0.01)
Returns:
Tuple of (l2_penalty, l2_grads):
- l2_penalty: Scalar penalty term to add to loss (lambda * sum of squared weights)
- l2_grads: List of gradients for each weight matrix (2 * lambda * W)
"""
l2_penalty = 0.0
l2_grads = []
for W in weights:
l2_penalty += np.sum(W ** 2)
l2_grads.append(2 * lambda_ * W)
l2_penalty *= lambda_
return l2_penalty, l2_grads
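As a quick sanity check, here’s a small sketch (with illustrative values) that compares the analytic gradient `2 * lambda_ * W` against a central-difference estimate of the penalty, using the function we just defined:

```python
import numpy as np

W = np.array([[1.0, -2.0], [0.5, 3.0]])
lambda_ = 0.01
h = 1e-5

penalty, grads = l2_regularization([W], lambda_)
print(penalty)  # 0.01 * (1 + 4 + 0.25 + 9) = 0.1425

# Finite-difference check of the gradient for a single entry, W[0, 0]
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += h
W_minus[0, 0] -= h
num_grad = (l2_regularization([W_plus], lambda_)[0] -
            l2_regularization([W_minus], lambda_)[0]) / (2 * h)
print(num_grad, grads[0][0, 0])  # both approximately 2 * 0.01 * 1.0 = 0.02
```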
Here’s the implementation of dropout to randomly deactivate neurons during training:
def dropout(A: NDArray[np.floating], p: float, training: bool = True) -> NDArray[np.floating]:
    """
    Apply inverted dropout to an activation matrix by randomly setting elements to 0.
    Args:
        A: Activation matrix, shape (any)
        p: Keep probability (e.g., 0.8 to keep 80% of neurons)
        training: Boolean, apply dropout only during training (default: True)
    Returns:
        Activation matrix after dropout (same shape as input); surviving elements
        are scaled by 1/p so that expected activations match evaluation mode
    """
    if training:
        # Bernoulli mask: each element is kept with probability p, dropped with probability 1 - p
        mask = np.random.binomial(1, p, size=A.shape)
        # Inverted dropout: scale survivors by 1/p so train- and eval-mode activations have the same scale
        return A * mask / p
    return A
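A rough way to check the inverted scaling (the mask is random, so this is only approximate) is to compare the mean activation in training and evaluation mode:

```python
import numpy as np

A = np.ones((1000, 100))  # large toy activation matrix so the average is stable
p = 0.8

A_train = dropout(A, p, training=True)   # random mask applied, survivors scaled by 1/p
A_eval = dropout(A, p, training=False)   # returned unchanged

print(A_train.mean())  # close to 1.0, matching evaluation mode on average
print(A_eval.mean())   # exactly 1.0
```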
Let’s train a 3-layer MLP (784 → 256 → 128 → 10) on MNIST with L2 regularization and dropout, comparing performance to a baseline without regularization. We’ll visualize loss and accuracy over epochs using `matplotlib`.
Note: Ensure you have `sklearn` (`pip install scikit-learn`) for loading MNIST and `matplotlib` (`pip install matplotlib`) for plotting. We’ll use a subset of MNIST for CPU efficiency.
import numpy as np
from numpy.typing import NDArray
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from neural_network import normalize, relu, softmax, cross_entropy, forward_mlp_3layer, backward_mlp_3layer
# Load MNIST data (subset for faster training on CPU)
print("Loading MNIST data...")
X_full, y_full = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X_full = X_full.astype(float)
y_full = y_full.astype(int)
# Limit to 5000 samples for faster training
X = X_full[:5000]
y = y_full[:5000]
# Convert labels to one-hot encoding
n_classes = 10
y_one_hot = np.zeros((y.shape[0], n_classes))
y_one_hot[np.arange(y.shape[0]), y] = 1
# Normalize input data
X = normalize(X)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y_one_hot, test_size=0.2, random_state=42)
# Initialize parameters for 3-layer MLP (784 -> 256 -> 128 -> 10)
n_features = X_train.shape[1] # 784 for MNIST
n_hidden1 = 256
n_hidden2 = 128
W1 = np.random.randn(n_features, n_hidden1) * 0.01
b1 = np.zeros((1, n_hidden1))
W2 = np.random.randn(n_hidden1, n_hidden2) * 0.01
b2 = np.zeros((1, n_hidden2))
W3 = np.random.randn(n_hidden2, n_classes) * 0.01
b3 = np.zeros((1, n_classes))
# Training loop with L2 regularization and dropout
lr = 0.1
num_epochs = 20
batch_size = 64
lambda_l2 = 0.01 # L2 regularization strength
dropout_p = 0.8 # Keep probability for dropout (80% keep, 20% drop)
n_samples = X_train.shape[0]
loss_history_with_reg = []
accuracy_history_with_reg = []
loss_history_no_reg = []
accuracy_history_no_reg = []
# Function to train with or without regularization
def train_mlp(with_reg: bool, W1, b1, W2, b2, W3, b3):
W1_copy = W1.copy()
b1_copy = b1.copy()
W2_copy = W2.copy()
b2_copy = b2.copy()
W3_copy = W3.copy()
b3_copy = b3.copy()
loss_history = []
accuracy_history = []
for epoch in range(num_epochs):
indices = np.random.permutation(n_samples)
X_shuffled = X_train[indices]
y_shuffled = y_train[indices]
total_loss = 0.0
for start_idx in range(0, n_samples, batch_size):
end_idx = min(start_idx + batch_size, n_samples)
X_batch = X_shuffled[start_idx:end_idx]
y_batch = y_shuffled[start_idx:end_idx]
# Forward pass with dropout if enabled
A1, A2, A3 = forward_mlp_3layer(X_batch, W1_copy, b1_copy, W2_copy, b2_copy, W3_copy, b3_copy)
if with_reg:
A1 = dropout(A1, dropout_p, training=True)
A2 = dropout(A2, dropout_p, training=True)
# Compute loss
data_loss = cross_entropy(A3, y_batch)
l2_penalty = 0.0
l2_grads = []
if with_reg:
l2_penalty, l2_grads = l2_regularization([W1_copy, W2_copy, W3_copy], lambda_l2)
total_loss += data_loss + l2_penalty
# Backpropagation
Z1 = X_batch @ W1_copy + b1_copy
Z2 = A1 @ W2_copy + b2_copy
grad_W1, grad_b1, grad_W2, grad_b2, grad_W3, grad_b3 = backward_mlp_3layer(
X_batch, A1, A2, A3, y_batch, W1_copy, W2_copy, W3_copy, Z1, Z2)
if with_reg:
grad_W1 += l2_grads[0]
grad_W2 += l2_grads[1]
grad_W3 += l2_grads[2]
# Update parameters
W1_copy -= lr * grad_W1
b1_copy -= lr * grad_b1
W2_copy -= lr * grad_W2
b2_copy -= lr * grad_b2
W3_copy -= lr * grad_W3
b3_copy -= lr * grad_b3
# Compute loss on full training set (without dropout)
_, _, A3_full = forward_mlp_3layer(X_train, W1_copy, b1_copy, W2_copy, b2_copy, W3_copy, b3_copy)
loss = cross_entropy(A3_full, y_train)
loss_history.append(loss)
# Compute accuracy on test set (without dropout)
_, _, A3_test = forward_mlp_3layer(X_test, W1_copy, b1_copy, W2_copy, b2_copy, W3_copy, b3_copy)
predictions = np.argmax(A3_test, axis=1)
true_labels = np.argmax(y_test, axis=1)
accuracy = np.mean(predictions == true_labels)
accuracy_history.append(accuracy)
print(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss:.4f}, Test Accuracy: {accuracy:.4f} (With Reg: {with_reg})")
return loss_history, accuracy_history
# Train with regularization
print("Training with L2 Regularization and Dropout...")
loss_history_with_reg, accuracy_history_with_reg = train_mlp(
True, W1.copy(), b1.copy(), W2.copy(), b2.copy(), W3.copy(), b3.copy())
# Train without regularization
print("Training without Regularization...")
loss_history_no_reg, accuracy_history_no_reg = train_mlp(
False, W1.copy(), b1.copy(), W2.copy(), b2.copy(), W3.copy(), b3.copy())
# Plot loss and accuracy history for comparison
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 8))
ax1.plot(range(1, num_epochs + 1), loss_history_with_reg, label='Training Loss (With Reg)')
ax1.plot(range(1, num_epochs + 1), loss_history_no_reg, label='Training Loss (No Reg)')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Cross-Entropy Loss')
ax1.set_title('Training Loss Over Epochs')
ax1.legend()
ax1.grid(True)
ax2.plot(range(1, num_epochs + 1), accuracy_history_with_reg, label='Test Accuracy (With Reg)')
ax2.plot(range(1, num_epochs + 1), accuracy_history_no_reg, label='Test Accuracy (No Reg)')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy')
ax2.set_title('Test Accuracy Over Epochs')
ax2.legend()
ax2.grid(True)
plt.tight_layout()
plt.show()
Output (approximate, values will vary due to randomness and subset size):
Loading MNIST data...
Training with L2 Regularization and Dropout...
Epoch 1/20, Loss: 2.3015, Test Accuracy: 0.1100 (With Reg: True)
...
Epoch 20/20, Loss: 0.3652, Test Accuracy: 0.8850 (With Reg: True)
Training without Regularization...
Epoch 1/20, Loss: 2.3018, Test Accuracy: 0.1080 (With Reg: False)
...
Epoch 20/20, Loss: 0.3127, Test Accuracy: 0.8700 (With Reg: False)
Loss and Accuracy Plots: (Two `matplotlib` subplots will display, comparing training loss and test accuracy over epochs for models with and without regularization. Typically, the regularized model shows slightly higher training loss but better test accuracy due to reduced overfitting.)
In this example, we train a 3-layer MLP (784 → 256 → 128 → 10) on a subset of MNIST (5,000 samples for CPU efficiency) over 20 epochs with batch size 64. We compare two setups: one with L2 regularization (`lambda_=0.01`) and dropout (`p=0.8`), and one without. The model with regularization often achieves slightly better test accuracy (~88-90% vs. ~85-87%) due to reduced overfitting, though its training loss may be higher. Training time remains manageable (~3-6 minutes on CPU).
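If you want to explore how sensitive these results are to the regularization strength, a simple sweep over a few values of `lambda_l2` is a natural next step. The sketch below is only illustrative: it reuses the `train_mlp()` helper defined above (which reads the module-level `lambda_l2`, so we reassign it before each run), keeps dropout enabled, and reruns training once per value, so expect it to take several times longer than a single run:

```python
# Hypothetical sweep over the L2 strength, reusing train_mlp() from the script above
results = {}
for lam in [0.0, 0.001, 0.01, 0.1]:    # illustrative values
    lambda_l2 = lam                    # train_mlp() reads this module-level variable
    _, acc_history = train_mlp(True, W1.copy(), b1.copy(), W2.copy(), b2.copy(),
                               W3.copy(), b3.copy())
    results[lam] = acc_history[-1]     # final test accuracy for this lambda
print(results)  # very large penalties typically hurt accuracy; small ones help slightly
```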
Let’s update our `neural_network.py` file to include the `l2_regularization()` and `dropout()` functions alongside our previous implementations. These will enhance our ability to train deep models without overfitting.
# neural_network.py
import numpy as np
from numpy.typing import NDArray
from typing import Union, Callable, Tuple, List, Dict
from scipy import signal
def normalize(X: NDArray[np.floating]) -> NDArray[np.floating]:
"""
Normalize the input array X by subtracting the mean and dividing by the standard deviation.
Parameters:
X (NDArray[np.floating]): Input array to normalize. Should be a numerical array
(float or compatible type).
Returns:
NDArray[np.floating]: Normalized array with mean approximately 0 and standard
deviation approximately 1 along each axis.
"""
mean = np.mean(X, axis=0)
std = np.std(X, axis=0)
normalized_X = X - mean # Start with mean subtraction
mask = std != 0 # Create a mask for non-zero std
if np.any(mask):
normalized_X[:, mask] = normalized_X[:, mask] / std[mask]
return normalized_X
def matrix_multiply(X: NDArray[np.floating], W: NDArray[np.floating]) -> NDArray[np.floating]:
"""
Perform matrix multiplication between two arrays.
Args:
X: First input array/matrix of shape (m, n) with floating-point values
W: Second input array/matrix of shape (n, p) with floating-point values
Returns:
Result of matrix multiplication, shape (m, p) with floating-point values
"""
return np.matmul(X, W)
def sigmoid(Z: NDArray[np.floating]) -> NDArray[np.floating]:
"""
Compute the sigmoid activation function element-wise.
Args:
Z: Input array of any shape with floating-point values
Returns:
Array of the same shape with sigmoid applied element-wise, values in [0, 1]
"""
return 1 / (1 + np.exp(-Z))
def relu(Z: NDArray[np.floating]) -> NDArray[np.floating]:
"""
Compute the ReLU activation function element-wise.
Args:
Z: Input array of any shape with floating-point values
Returns:
Array of the same shape with ReLU applied element-wise, max(0, Z)
"""
return np.maximum(0, Z)
def softmax(Z: NDArray[np.floating]) -> NDArray[np.floating]:
"""
Compute the softmax activation function row-wise.
Args:
Z: Input array of shape (n_samples, n_classes) with floating-point values
Returns:
Array of the same shape with softmax applied row-wise, probabilities summing to 1 per row
"""
Z_max = np.max(Z, axis=1, keepdims=True)
exp_Z = np.exp(Z - Z_max)
sum_exp_Z = np.sum(exp_Z, axis=1, keepdims=True)
return exp_Z / sum_exp_Z
def mse_loss(y_pred: NDArray[np.floating], y: NDArray[np.floating]) -> float:
"""
Compute the Mean Squared Error loss between predicted and true values.
Args:
y_pred: Predicted values, array of shape (n,) or (n,1) with floating-point values
y: True values, array of shape (n,) or (n,1) with floating-point values
Returns:
Mean squared error as a single float
"""
return np.mean((y_pred - y) ** 2)
def binary_cross_entropy(A: NDArray[np.floating], y: NDArray[np.floating]) -> float:
"""
Compute the Binary Cross-Entropy loss between predicted probabilities and true labels.
Args:
A: Predicted probabilities (after sigmoid), array of shape (n,) or (n,1), values in [0, 1]
y: True binary labels, array of shape (n,) or (n,1), values in {0, 1}
Returns:
Binary cross-entropy loss as a single float
"""
epsilon = 1e-15
return -np.mean(y * np.log(A + epsilon) + (1 - y) * np.log(1 - A + epsilon))
def cross_entropy(A: NDArray[np.floating], y: NDArray[np.floating]) -> float:
"""
Compute categorical cross-entropy loss for multi-class classification.
Args:
A: Predicted probabilities after softmax, shape (n_samples, n_classes)
y: True labels, one-hot encoded, shape (n_samples, n_classes)
Returns:
Cross-entropy loss as a single float
"""
epsilon = 1e-15 # Small value to prevent log(0)
return -np.mean(np.sum(y * np.log(A + epsilon), axis=1))
def gradient_descent(X: NDArray[np.floating], y: NDArray[np.floating], W: NDArray[np.floating],
b: NDArray[np.floating], lr: float, num_epochs: int, batch_size: int,
loss_fn: Callable[[NDArray[np.floating], NDArray[np.floating]], float],
activation_fn: Callable[[NDArray[np.floating]], NDArray[np.floating]] = lambda x: x) -> Tuple[NDArray[np.floating], NDArray[np.floating], List[float]]:
"""
Perform mini-batch gradient descent to minimize loss.
Args:
X: Input data, shape (n_samples, n_features)
y: True values, shape (n_samples, 1)
W: Initial weights, shape (n_features, 1)
b: Initial bias, shape (1,) or (1,1)
lr: Learning rate, step size for updates
num_epochs: Number of full passes through the dataset
batch_size: Size of each mini-batch
loss_fn: Loss function to compute error, e.g., mse_loss or binary_cross_entropy
activation_fn: Activation function to apply to linear output (default: identity)
Returns:
Tuple of (updated W, updated b, list of loss values over epochs)
"""
n_samples = X.shape[0]
loss_history = []
for epoch in range(num_epochs):
indices = np.random.permutation(n_samples)
X_shuffled = X[indices]
y_shuffled = y[indices]
for start_idx in range(0, n_samples, batch_size):
end_idx = min(start_idx + batch_size, n_samples)
X_batch = X_shuffled[start_idx:end_idx]
y_batch = y_shuffled[start_idx:end_idx]
batch_size_actual = X_batch.shape[0]
Z_batch = X_batch @ W + b
y_pred_batch = activation_fn(Z_batch)
error = y_pred_batch - y_batch
grad_W = (X_batch.T @ error) / batch_size_actual
grad_b = np.mean(error)
W = W - lr * grad_W
b = b - lr * grad_b
y_pred_full = activation_fn(X @ W + b)
loss = loss_fn(y_pred_full, y)
loss_history.append(loss)
print(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss:.4f}")
return W, b, loss_history
def numerical_gradient(X: NDArray[np.floating], y: NDArray[np.floating], params: Dict[str, NDArray[np.floating]],
loss_fn: Callable[[NDArray[np.floating], NDArray[np.floating]], float],
forward_fn: Callable[[NDArray[np.floating], Dict[str, NDArray[np.floating]]], NDArray[np.floating]],
h: float = 1e-4) -> Dict[str, NDArray[np.floating]]:
"""
Compute numerical gradients for parameters using central difference approximation.
Args:
X: Input data, shape (n_samples, n_features)
y: True values, shape (n_samples, 1)
params: Dictionary of parameters (e.g., {'W': ..., 'b': ...})
loss_fn: Loss function to compute error, e.g., mse_loss
forward_fn: Function to compute predictions from X and params
h: Step size for finite difference approximation (default: 1e-4)
Returns:
Dictionary of numerical gradients for each parameter
"""
num_grads = {}
for param_name, param_value in params.items():
num_grad = np.zeros_like(param_value)
it = np.nditer(param_value, flags=['multi_index'])
while not it.finished:
idx = it.multi_index
original_value = param_value[idx]
param_value[idx] = original_value + h
y_pred_plus = forward_fn(X, params)
loss_plus = loss_fn(y_pred_plus, y)
param_value[idx] = original_value - h
y_pred_minus = forward_fn(X, params)
loss_minus = loss_fn(y_pred_minus, y)
num_grad[idx] = (loss_plus - loss_minus) / (2 * h)
param_value[idx] = original_value
it.iternext()
num_grads[param_name] = num_grad
return num_grads
def forward_perceptron(X: NDArray[np.floating], W: NDArray[np.floating], b: NDArray[np.floating]) -> NDArray[np.floating]:
"""
Compute the forward pass of a single-layer perceptron.
Args:
X: Input data, shape (n_samples, n_features)
W: Weights, shape (n_features, 1)
b: Bias, shape (1, 1) or (1,)
Returns:
Output after sigmoid activation, shape (n_samples, 1)
"""
Z = X @ W + b # Linear combination
A = sigmoid(Z) # Sigmoid activation
return A
def forward_mlp(X: NDArray[np.floating], W1: NDArray[np.floating], b1: NDArray[np.floating],
W2: NDArray[np.floating], b2: NDArray[np.floating]) -> Tuple[NDArray[np.floating], NDArray[np.floating]]:
"""
Compute the forward pass of a 2-layer MLP.
Args:
X: Input data, shape (n_samples, n_features, e.g., 784 for MNIST)
W1: Weights for first layer, shape (n_features, n_hidden, e.g., 784x256)
b1: Bias for first layer, shape (1, n_hidden)
W2: Weights for second layer, shape (n_hidden, n_classes, e.g., 256x10)
b2: Bias for second layer, shape (1, n_classes)
Returns:
Tuple of (A1, A2):
- A1: Hidden layer output after ReLU, shape (n_samples, n_hidden)
- A2: Output layer output after softmax, shape (n_samples, n_classes)
"""
Z1 = X @ W1 + b1 # First layer linear combination
A1 = relu(Z1) # ReLU activation for hidden layer
Z2 = A1 @ W2 + b2 # Second layer linear combination
A2 = softmax(Z2) # Softmax activation for output layer
return A1, A2
def backward_mlp(X: NDArray[np.floating], A1: NDArray[np.floating], A2: NDArray[np.floating],
y: NDArray[np.floating], W1: NDArray[np.floating], W2: NDArray[np.floating],
Z1: NDArray[np.floating]) -> Tuple[NDArray[np.floating], NDArray[np.floating], NDArray[np.floating], NDArray[np.floating]]:
"""
Compute gradients for a 2-layer MLP using backpropagation.
Args:
X: Input data, shape (n_samples, n_features)
A1: Hidden layer output after ReLU, shape (n_samples, n_hidden)
A2: Output layer output after softmax, shape (n_samples, n_classes)
y: True labels, one-hot encoded, shape (n_samples, n_classes)
W1: Weights for first layer, shape (n_features, n_hidden)
W2: Weights for second layer, shape (n_hidden, n_classes)
Z1: Pre-activation values for hidden layer, shape (n_samples, n_hidden)
Returns:
Tuple of gradients (grad_W1, grad_b1, grad_W2, grad_b2)
"""
n = X.shape[0]
# Output layer error (delta2)
delta2 = A2 - y # Shape (n_samples, n_classes)
# Gradients for output layer (W2, b2)
grad_W2 = (A1.T @ delta2) / n # Shape (n_hidden, n_classes)
grad_b2 = np.mean(delta2, axis=0, keepdims=True) # Shape (1, n_classes)
# Hidden layer error (delta1)
delta1 = (delta2 @ W2.T) * (Z1 > 0) # ReLU derivative: 1 if Z1 > 0, 0 otherwise
# Shape (n_samples, n_hidden)
# Gradients for hidden layer (W1, b1)
grad_W1 = (X.T @ delta1) / n # Shape (n_features, n_hidden)
grad_b1 = np.mean(delta1, axis=0, keepdims=True) # Shape (1, n_hidden)
return grad_W1, grad_b1, grad_W2, grad_b2
def forward_mlp_3layer(X: NDArray[np.floating], W1: NDArray[np.floating], b1: NDArray[np.floating],
W2: NDArray[np.floating], b2: NDArray[np.floating],
W3: NDArray[np.floating], b3: NDArray[np.floating]) -> Tuple[NDArray[np.floating], NDArray[np.floating], NDArray[np.floating]]:
"""
Compute the forward pass of a 3-layer MLP.
Args:
X: Input data, shape (n_samples, n_features, e.g., 784 for MNIST)
W1: Weights for first layer, shape (n_features, n_hidden1, e.g., 784x256)
b1: Bias for first layer, shape (1, n_hidden1)
W2: Weights for second layer, shape (n_hidden1, n_hidden2, e.g., 256x128)
b2: Bias for second layer, shape (1, n_hidden2)
W3: Weights for third layer, shape (n_hidden2, n_classes, e.g., 128x10)
b3: Bias for third layer, shape (1, n_classes)
Returns:
Tuple of (A1, A2, A3):
- A1: First hidden layer output after ReLU, shape (n_samples, n_hidden1)
- A2: Second hidden layer output after ReLU, shape (n_samples, n_hidden2)
- A3: Output layer output after softmax, shape (n_samples, n_classes)
"""
Z1 = X @ W1 + b1
A1 = relu(Z1)
Z2 = A1 @ W2 + b2
A2 = relu(Z2)
Z3 = A2 @ W3 + b3
A3 = softmax(Z3)
return A1, A2, A3
def backward_mlp_3layer(X: NDArray[np.floating], A1: NDArray[np.floating], A2: NDArray[np.floating],
A3: NDArray[np.floating], y: NDArray[np.floating],
W1: NDArray[np.floating], W2: NDArray[np.floating], W3: NDArray[np.floating],
Z1: NDArray[np.floating], Z2: NDArray[np.floating]) -> Tuple[NDArray[np.floating], NDArray[np.floating], NDArray[np.floating], NDArray[np.floating], NDArray[np.floating], NDArray[np.floating]]:
"""
Compute gradients for a 3-layer MLP using backpropagation.
Args:
X: Input data, shape (n_samples, n_features)
A1: First hidden layer output after ReLU, shape (n_samples, n_hidden1)
A2: Second hidden layer output after ReLU, shape (n_samples, n_hidden2)
A3: Output layer output after softmax, shape (n_samples, n_classes)
y: True labels, one-hot encoded, shape (n_samples, n_classes)
W1: Weights for first layer, shape (n_features, n_hidden1)
W2: Weights for second layer, shape (n_hidden1, n_hidden2)
W3: Weights for third layer, shape (n_hidden2, n_classes)
Z1: Pre-activation values for first hidden layer, shape (n_samples, n_hidden1)
Z2: Pre-activation values for second hidden layer, shape (n_samples, n_hidden2)
Returns:
Tuple of gradients (grad_W1, grad_b1, grad_W2, grad_b2, grad_W3, grad_b3)
"""
n = X.shape[0]
# Output layer error (delta3)
delta3 = A3 - y # Shape (n_samples, n_classes)
# Gradients for output layer (W3, b3)
grad_W3 = (A2.T @ delta3) / n # Shape (n_hidden2, n_classes)
grad_b3 = np.mean(delta3, axis=0, keepdims=True) # Shape (1, n_classes)
# Second hidden layer error (delta2)
delta2 = (delta3 @ W3.T) * (Z2 > 0) # ReLU derivative: 1 if Z2 > 0, 0 otherwise
# Shape (n_samples, n_hidden2)
# Gradients for second hidden layer (W2, b2)
grad_W2 = (A1.T @ delta2) / n # Shape (n_hidden1, n_hidden2)
grad_b2 = np.mean(delta2, axis=0, keepdims=True) # Shape (1, n_hidden2)
# First hidden layer error (delta1)
delta1 = (delta2 @ W2.T) * (Z1 > 0) # ReLU derivative: 1 if Z1 > 0, 0 otherwise
# Shape (n_samples, n_hidden1)
# Gradients for first hidden layer (W1, b1)
grad_W1 = (X.T @ delta1) / n # Shape (n_features, n_hidden1)
grad_b1 = np.mean(delta1, axis=0, keepdims=True) # Shape (1, n_hidden1)
return grad_W1, grad_b1, grad_W2, grad_b2, grad_W3, grad_b3
def conv2d(image: NDArray[np.floating], filter_kernel: NDArray[np.floating], stride: int = 1) -> NDArray[np.floating]:
"""
Perform 2D convolution on an image using a filter kernel.
Args:
image: Input image, shape (height, width)
filter_kernel: Convolution filter, shape (filter_height, filter_width)
stride: Stride of the convolution operation (default: 1)
Returns:
Output feature map after convolution, shape depends on input, filter size, and stride
"""
output = signal.convolve2d(image, filter_kernel, mode='valid', boundary='fill', fillvalue=0)
if stride > 1:
output = output[::stride, ::stride]
return output
def max_pool(X: NDArray[np.floating], size: int = 2, stride: int = 2) -> NDArray[np.floating]:
"""
Perform 2D max pooling on an input feature map.
Args:
X: Input feature map, shape (height, width) or (height, width, channels)
size: Size of the pooling window (default: 2 for 2x2 pooling)
stride: Stride of the pooling operation (default: 2)
Returns:
Output after max pooling, shape depends on input, size, and stride
"""
if len(X.shape) == 2:
height, width = X.shape
channels = 1
X = X.reshape(height, width, 1)
else:
height, width, channels = X.shape
out_height = (height - size) // stride + 1
out_width = (width - size) // stride + 1
output = np.zeros((out_height, out_width, channels))
for i in range(out_height):
for j in range(out_width):
x_start = i * stride
x_end = x_start + size
y_start = j * stride
y_end = y_start + size
region = X[x_start:x_end, y_start:y_end, :]
output[i, j, :] = np.max(region, axis=(0, 1))
if channels == 1:
output = output[:, :, 0]
return output
def l2_regularization(weights: List[NDArray[np.floating]], lambda_: float) -> Tuple[float, List[NDArray[np.floating]]]:
"""
Compute L2 regularization penalty and gradients for a list of weight matrices.
Args:
weights: List of weight matrices (e.g., [W1, W2, W3])
lambda_: Regularization strength (e.g., 0.01)
Returns:
Tuple of (l2_penalty, l2_grads):
- l2_penalty: Scalar penalty term to add to loss (lambda * sum of squared weights)
- l2_grads: List of gradients for each weight matrix (2 * lambda * W)
"""
l2_penalty = 0.0
l2_grads = []
for W in weights:
l2_penalty += np.sum(W ** 2)
l2_grads.append(2 * lambda_ * W)
l2_penalty *= lambda_
return l2_penalty, l2_grads
def dropout(A: NDArray[np.floating], p: float, training: bool = True) -> NDArray[np.floating]:
    """
    Apply inverted dropout to an activation matrix by randomly setting elements to 0.
    Args:
        A: Activation matrix, shape (any)
        p: Keep probability (e.g., 0.8 to keep 80% of neurons)
        training: Boolean, apply dropout only during training (default: True)
    Returns:
        Activation matrix after dropout (same shape as input); surviving elements
        are scaled by 1/p so that expected activations match evaluation mode
    """
    if training:
        # Bernoulli mask: each element is kept with probability p, dropped with probability 1 - p
        mask = np.random.binomial(1, p, size=A.shape)
        # Inverted dropout: scale survivors by 1/p so train- and eval-mode activations have the same scale
        return A * mask / p
    return A
You can now import these new functions with `from neural_network import l2_regularization, dropout`. They enhance our training capabilities by adding regularization to prevent overfitting in deep models.
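For example, a minimal usage sketch with toy values (assuming `neural_network.py` is on your Python path):

```python
import numpy as np
from neural_network import l2_regularization, dropout

W = np.random.randn(4, 3) * 0.1
penalty, grads = l2_regularization([W], lambda_=0.01)
print(penalty, grads[0].shape)            # scalar penalty, gradient with W's shape

A = np.random.rand(2, 3)
print(dropout(A, p=0.8, training=True))   # random entries zeroed, survivors scaled by 1/p
print(dropout(A, p=0.8, training=False))  # unchanged at evaluation time
```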
To reinforce your understanding of L2 regularization and dropout, try these Python-focused coding exercises. They’ll help you build intuition for how regularization prevents overfitting and impacts model performance. Run the code and compare outputs to verify your solutions.
L2 Regularization Penalty Calculation

Create two small weight matrices `W1 = np.array([[1.0, 2.0], [3.0, 4.0]])` and `W2 = np.array([[0.5, 1.5], [2.5, 3.5]])`. Compute the L2 regularization penalty and gradients using `l2_regularization()` with `lambda_=0.01`. Print the penalty and gradients, and verify that the penalty equals `lambda_ * sum of squared weights` and each gradient equals `2 * lambda_ * W`.
# Your code here
import numpy as np
from neural_network import l2_regularization
W1 = np.array([[1.0, 2.0], [3.0, 4.0]])
W2 = np.array([[0.5, 1.5], [2.5, 3.5]])
lambda_ = 0.01
l2_penalty, l2_grads = l2_regularization([W1, W2], lambda_)
print("L2 Penalty:", l2_penalty)
print("Gradient for W1:\n", l2_grads[0])
print("Gradient for W2:\n", l2_grads[1])
Dropout on Synthetic Activations

Create a synthetic activation matrix `A = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])`. Apply `dropout()` with `p=0.5` (50% keep probability) during training (`training=True`). Run it twice to observe the randomness in which elements are dropped (set to 0). Then apply it with `training=False` to verify that no dropout occurs. Print the results each time.
# Your code here
import numpy as np
from neural_network import dropout
A = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
p = 0.5
# First run with training=True
A_drop1 = dropout(A, p, training=True)
print("First Dropout (training=True):\n", A_drop1)
# Second run with training=True (different randomness)
A_drop2 = dropout(A, p, training=True)
print("Second Dropout (training=True):\n", A_drop2)
# Run with training=False (no dropout)
A_no_drop = dropout(A, p, training=False)
print("No Dropout (training=False):\n", A_no_drop)
L2 Regularization in Single-Epoch Training

Use a small synthetic dataset `X = np.random.randn(5, 3)` (5 samples, 3 features) and one-hot labels `y = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 0]])` (2 classes). Initialize a 3 → 4 → 2 MLP with small random weights. Perform one epoch of training with a forward pass, backpropagation, and parameter update, adding L2 regularization (`lambda_=0.01`). Compare the loss with and without the L2 penalty, and print both losses.
# Your code here
import numpy as np
from neural_network import relu, softmax, cross_entropy, l2_regularization
X = np.random.randn(5, 3)
y = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 0]])
W1 = np.random.randn(3, 4) * 0.1
b1 = np.zeros((1, 4))
W2 = np.random.randn(4, 2) * 0.1
b2 = np.zeros((1, 2))
lr = 0.1
lambda_ = 0.01
# Forward pass
Z1 = X @ W1 + b1
A1 = relu(Z1)
Z2 = A1 @ W2 + b2
A2 = softmax(Z2)
# Loss without regularization
data_loss = cross_entropy(A2, y)
print("Data Loss (without L2):", data_loss)
# Loss with L2 regularization
l2_penalty, l2_grads = l2_regularization([W1, W2], lambda_)
total_loss = data_loss + l2_penalty
print("Total Loss (with L2):", total_loss)
# Backpropagation with L2 gradients
delta2 = A2 - y
grad_W2 = (A1.T @ delta2) / X.shape[0] + l2_grads[1]
grad_b2 = np.mean(delta2, axis=0, keepdims=True)
delta1 = (delta2 @ W2.T) * (Z1 > 0)
grad_W1 = (X.T @ delta1) / X.shape[0] + l2_grads[0]
grad_b1 = np.mean(delta1, axis=0, keepdims=True)
# Update parameters
W1 -= lr * grad_W1
b1 -= lr * grad_b1
W2 -= lr * grad_W2
b2 -= lr * grad_b2
Training with Dropout on a Small MNIST Subset

Load a small subset of MNIST (e.g., 1,000 samples) using `fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)`. Initialize a 784 → 256 → 128 → 10 MLP. Train for 10 epochs with `lr = 0.1` and `batch_size = 32`, applying dropout (`p=0.8`) to the hidden layers during training (and disabling it during evaluation). Compare test accuracy to a model trained without dropout. Print the results and plot the loss histories.
# Your code here
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from neural_network import normalize, cross_entropy, forward_mlp_3layer, backward_mlp_3layer, dropout
X_full, y_full = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X_full = X_full.astype(float)
y_full = y_full.astype(int)
X = X_full[:1000]
y = y_full[:1000]
n_classes = 10
y_one_hot = np.zeros((y.shape[0], n_classes))
y_one_hot[np.arange(y.shape[0]), y] = 1
X = normalize(X)
X_train, X_test, y_train, y_test = train_test_split(X, y_one_hot, test_size=0.2, random_state=42)
W1 = np.random.randn(784, 256) * 0.01
b1 = np.zeros((1, 256))
W2 = np.random.randn(256, 128) * 0.01
b2 = np.zeros((1, 128))
W3 = np.random.randn(128, 10) * 0.01
b3 = np.zeros((1, 10))
lr = 0.1
num_epochs = 10
batch_size = 32
dropout_p = 0.8
n_samples = X_train.shape[0]
loss_history_with_dropout = []
accuracy_history_with_dropout = []
loss_history_no_dropout = []
accuracy_history_no_dropout = []
def train_mlp_with_dropout(use_dropout: bool, W1, b1, W2, b2, W3, b3):
W1_copy = W1.copy()
b1_copy = b1.copy()
W2_copy = W2.copy()
b2_copy = b2.copy()
W3_copy = W3.copy()
b3_copy = b3.copy()
loss_history = []
accuracy_history = []
for epoch in range(num_epochs):
indices = np.random.permutation(n_samples)
X_shuffled = X_train[indices]
y_shuffled = y_train[indices]
for start_idx in range(0, n_samples, batch_size):
end_idx = min(start_idx + batch_size, n_samples)
X_batch = X_shuffled[start_idx:end_idx]
y_batch = y_shuffled[start_idx:end_idx]
A1, A2, A3 = forward_mlp_3layer(X_batch, W1_copy, b1_copy, W2_copy, b2_copy, W3_copy, b3_copy)
if use_dropout:
A1 = dropout(A1, dropout_p, training=True)
A2 = dropout(A2, dropout_p, training=True)
Z1 = X_batch @ W1_copy + b1_copy
Z2 = A1 @ W2_copy + b2_copy
grad_W1, grad_b1, grad_W2, grad_b2, grad_W3, grad_b3 = backward_mlp_3layer(
X_batch, A1, A2, A3, y_batch, W1_copy, W2_copy, W3_copy, Z1, Z2)
W1_copy -= lr * grad_W1
b1_copy -= lr * grad_b1
W2_copy -= lr * grad_W2
b2_copy -= lr * grad_b2
W3_copy -= lr * grad_W3
b3_copy -= lr * grad_b3
_, _, A3_full = forward_mlp_3layer(X_train, W1_copy, b1_copy, W2_copy, b2_copy, W3_copy, b3_copy)
loss = cross_entropy(A3_full, y_train)
loss_history.append(loss)
_, _, A3_test = forward_mlp_3layer(X_test, W1_copy, b1_copy, W2_copy, b2_copy, W3_copy, b3_copy)
predictions = np.argmax(A3_test, axis=1)
true_labels = np.argmax(y_test, axis=1)
accuracy = np.mean(predictions == true_labels)
accuracy_history.append(accuracy)
print(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss:.4f}, Test Accuracy: {accuracy:.4f} (Dropout: {use_dropout})")
return loss_history, accuracy_history
print("Training with Dropout...")
loss_history_with_dropout, accuracy_history_with_dropout = train_mlp_with_dropout(
True, W1, b1, W2, b2, W3, b3)
print("Training without Dropout...")
loss_history_no_dropout, accuracy_history_no_dropout = train_mlp_with_dropout(
False, W1, b1, W2, b2, W3, b3)
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 8))
ax1.plot(range(1, num_epochs + 1), loss_history_with_dropout, label='Training Loss (With Dropout)')
ax1.plot(range(1, num_epochs + 1), loss_history_no_dropout, label='Training Loss (No Dropout)')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Cross-Entropy Loss')
ax1.set_title('Training Loss Over Epochs')
ax1.legend()
ax1.grid(True)
ax2.plot(range(1, num_epochs + 1), accuracy_history_with_dropout, label='Test Accuracy (With Dropout)')
ax2.plot(range(1, num_epochs + 1), accuracy_history_no_dropout, label='Test Accuracy (No Dropout)')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy')
ax2.set_title('Test Accuracy Over Epochs')
ax2.legend()
ax2.grid(True)
plt.tight_layout()
plt.show()
These exercises will help you build intuition for implementing L2 regularization and dropout, understanding their impact on loss and gradients, and observing how they improve generalization in deep models like MLPs.
Congratulations on mastering L2 regularization and dropout to combat overfitting! In this post, we’ve explored the mathematics behind these techniques, implemented `l2_regularization()` and `dropout()` in NumPy, and trained a 3-layer MLP on MNIST, demonstrating improved generalization with regularization (often achieving slightly better test accuracy than the unregularized baseline). This enhances our training capabilities, preparing us for more robust deep learning models.
In the final chapter of our series (Part 4.5: Training CNNs with Backpropagation), we’ll implement backpropagation for CNNs, including gradients for convolutional and pooling layers, and train a complete CNN on MNIST to achieve high accuracy, wrapping up our deep learning journey with a powerful image classification model.
Until then, experiment with the code and exercises above. If you have questions or want to share your solutions, drop a comment below—I’m excited to hear from you. Let’s keep building our deep learning toolkit together!
Next Up: Part 4.5 – Training CNNs with Backpropagation