Welcome back to our blog series, “Learn Deep Learning with NumPy”! In Part 2.2, we implemented gradient descent to minimize loss by iteratively updating model parameters for linear regression. However, when dealing with large datasets—common in deep learning—processing the entire dataset in one go (batch gradient descent) can be computationally expensive. In Part 2.3, we’re extending gradient descent to mini-batch gradient descent, a more efficient approach that processes smaller subsets of data at a time, making it critical for training neural networks on large datasets.
By the end of this post, you’ll understand the concept of mini-batches, why they improve training efficiency, and how to modify our gradient_descent() function to support mini-batch processing. We’ll also apply this to a binary classification task using a subset of the MNIST dataset (digits 0 vs. 1) with logistic regression. This enhancement to our toolkit will ensure scalability for larger models like MLPs and CNNs in future chapters. Let’s dive into the math and code for mini-batch optimization!
Gradient descent, as implemented in Part 2.2, processes the entire dataset in each iteration to compute gradients and update parameters. This approach, known as batch gradient descent, provides accurate gradient estimates but can be slow and memory-intensive for large datasets like MNIST (60,000 images) or real-world datasets with millions of samples.
Mini-batch gradient descent addresses this by splitting the dataset into smaller subsets called mini-batches (e.g., 32 or 64 samples at a time). In each iteration, it computes gradients and updates parameters using only one mini-batch, cycling through all mini-batches to complete an epoch (one full pass through the dataset). This offers several advantages:

- Lower memory use: only one mini-batch needs to be held in working memory at a time, rather than the full dataset.
- Faster, more frequent updates: parameters are updated once per mini-batch instead of once per full pass, which often speeds up convergence in wall-clock time.
- Helpful noise: gradients estimated from a mini-batch are slightly noisy, which can help the optimizer escape shallow local minima and plateaus.
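To make the epoch/mini-batch mechanics concrete, here is a minimal sketch (on a made-up toy array) of how one epoch shuffles the data and walks through it in slices — exactly the pattern our updated function will use:

import numpy as np

# Toy dataset: 10 samples, 1 feature (illustrative values only)
X = np.arange(10, dtype=float).reshape(-1, 1)
batch_size = 4

# One epoch: shuffle once, then slice off consecutive mini-batches
indices = np.random.permutation(X.shape[0])
X_shuffled = X[indices]
for start in range(0, X.shape[0], batch_size):
    batch = X_shuffled[start:start + batch_size]  # last batch may be smaller
    print(f"Mini-batch of {batch.shape[0]} samples:", batch.ravel())

With 10 samples and a batch size of 4, each epoch produces three mini-batches of sizes 4, 4, and 2.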
Mini-batch gradient descent is the standard in deep learning, striking a balance between the accuracy of batch gradient descent and the speed of stochastic gradient descent (which uses just one sample per update). Let’s explore the math behind it and extend our implementation.
In batch gradient descent, the gradient of the loss with respect to a parameter (e.g., the weights $W$) is computed over the entire dataset of $n$ samples. For Mean Squared Error (MSE), it is:

$$\nabla_W L = \frac{1}{n} X^T (y_{\text{pred}} - y)$$

where $X$ is the input matrix, $y_{\text{pred}} = XW + b$ is the predicted output, and $y$ is the true output (the constant factor of 2 from differentiating the square is conventionally folded into the learning rate, matching the code below).

In mini-batch gradient descent, we compute the gradient over a smaller subset (mini-batch) of $m$ samples, where $m$ is the batch size (e.g., 32). For a mini-batch with inputs $X_{\text{batch}}$ (shape $m \times d$), predictions $y_{\text{pred}}$, and true values $y_{\text{batch}}$, the gradient is:

$$\nabla_W L = \frac{1}{m} X_{\text{batch}}^T (y_{\text{pred}} - y_{\text{batch}})$$

Similarly, for the bias $b$:

$$\nabla_b L = \frac{1}{m} \sum_{i=1}^{m} (y_{\text{pred},i} - y_{\text{batch},i})$$

The parameters are updated using the same rule as before:

$$W \leftarrow W - \eta \, \nabla_W L, \qquad b \leftarrow b - \eta \, \nabla_b L$$

where $\eta$ is the learning rate. By iterating over mini-batches, we approximate the full gradient with less computation per update, cycling through all mini-batches to cover the entire dataset in one epoch. A quick numeric check of these formulas follows below; after that, let’s modify our gradient_descent() function to support mini-batches and apply it to a real-world task.
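Here is a minimal sketch (with arbitrary, hand-checkable values) that evaluates the mini-batch gradients for a batch of $m = 2$ samples, so you can verify the formulas against the code we are about to write:

import numpy as np

# A mini-batch of m = 2 samples with d = 2 features (arbitrary values)
X_batch = np.array([[1.0, 2.0],
                    [3.0, 4.0]])    # shape (2, 2)
y_batch = np.array([[1.0], [0.0]])  # shape (2, 1)
W = np.array([[0.1], [0.2]])        # shape (2, 1)
b = np.array([[0.0]])               # shape (1, 1)

y_pred = X_batch @ W + b  # linear predictions, shape (2, 1)
error = y_pred - y_batch  # (y_pred - y_batch), shape (2, 1)

grad_W = X_batch.T @ error / 2  # (1/m) X_batch^T (y_pred - y_batch) -> [[1.4], [1.7]]
grad_b = np.mean(error)         # (1/m) sum of the errors            -> 0.3
print(grad_W, grad_b)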
We’ll update our gradient_descent() function to process data in mini-batches, making it more scalable for large datasets. We’ll also introduce an optional activation_fn parameter to support logistic regression (for binary classification) in addition to linear regression. Finally, we’ll test it on a binary subset of the MNIST dataset (digits 0 vs. 1) using logistic regression with sigmoid activation and binary cross-entropy loss.
Here’s the enhanced implementation with support for mini-batches and flexibility for activation functions:
import numpy as np
from numpy.typing import NDArray
from typing import Callable, Tuple, List
def gradient_descent(X: NDArray[np.floating], y: NDArray[np.floating], W: NDArray[np.floating],
b: NDArray[np.floating], lr: float, num_epochs: int, batch_size: int,
loss_fn: Callable[[NDArray[np.floating], NDArray[np.floating]], float],
activation_fn: Callable[[NDArray[np.floating]], NDArray[np.floating]] = lambda x: x) -> Tuple[NDArray[np.floating], NDArray[np.floating], List[float]]:
"""
Perform mini-batch gradient descent to minimize loss.
Args:
X: Input data, shape (n_samples, n_features)
y: True values, shape (n_samples, 1)
W: Initial weights, shape (n_features, 1)
b: Initial bias, shape (1,) or (1,1)
lr: Learning rate, step size for updates
num_epochs: Number of full passes through the dataset
batch_size: Size of each mini-batch
loss_fn: Loss function to compute error, e.g., mse_loss or binary_cross_entropy
activation_fn: Activation function to apply to linear output (default: identity)
Returns:
Tuple of (updated W, updated b, list of loss values over epochs)
"""
n_samples = X.shape[0]
loss_history = []
for epoch in range(num_epochs):
# Shuffle the dataset to ensure random mini-batches
indices = np.random.permutation(n_samples)
X_shuffled = X[indices]
y_shuffled = y[indices]
# Process mini-batches
for start_idx in range(0, n_samples, batch_size):
end_idx = min(start_idx + batch_size, n_samples)
X_batch = X_shuffled[start_idx:end_idx]
y_batch = y_shuffled[start_idx:end_idx]
batch_size_actual = X_batch.shape[0]
# Forward pass: Compute linear output and apply activation
Z_batch = X_batch @ W + b
y_pred_batch = activation_fn(Z_batch)
# Compute gradients (works for MSE with identity or BCE with sigmoid)
error = y_pred_batch - y_batch
grad_W = (X_batch.T @ error) / batch_size_actual
grad_b = np.mean(error)
# Update parameters
W = W - lr * grad_W
b = b - lr * grad_b
# Compute loss on full dataset at end of epoch
y_pred_full = activation_fn(X @ W + b)
loss = loss_fn(y_pred_full, y)
loss_history.append(loss)
print(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss:.4f}")
return W, b, loss_history
Let’s apply mini-batch gradient descent to train a logistic regression model on a binary subset of the MNIST dataset (digits 0 vs. 1). We’ll use sigmoid activation and binary cross-entropy loss. First, ensure you have the necessary dependencies and data.
Note: If you don’t have sklearn installed, run pip install scikit-learn to load MNIST. We’ll load a small subset for simplicity and CPU efficiency.
# Load MNIST data (subset for digits 0 and 1)
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from neural_network import normalize, sigmoid, binary_cross_entropy, gradient_descent
# Load MNIST dataset (this may take a moment)
X_full, y_full = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
y_full = y_full.astype(int)
X_full = X_full.astype(float)
# Filter for digits 0 and 1 (binary classification)
mask = (y_full == 0) | (y_full == 1)
X = X_full[mask][:1000] # Limit to 1000 samples for faster training on CPU
y = y_full[mask][:1000]
y = y.reshape(-1, 1) # Shape (n_samples, 1)
X = normalize(X) # Normalize pixel values using our function
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize parameters
n_features = X_train.shape[1]
W_init = np.zeros((n_features, 1)) # Initial weights
b_init = np.zeros((1, 1)) # Initial bias
lr = 0.1 # Learning rate
num_epochs = 10 # Number of epochs
batch_size = 32 # Mini-batch size
# Train logistic regression with sigmoid activation and BCE loss
W_final, b_final, losses = gradient_descent(
X_train, y_train, W_init, b_init, lr, num_epochs, batch_size,
loss_fn=binary_cross_entropy, activation_fn=sigmoid
)
# Evaluate on test set
y_pred_test = sigmoid(X_test @ W_final + b_final)
test_loss = binary_cross_entropy(y_pred_test, y_test)
accuracy = np.mean((y_pred_test > 0.5) == y_test)
print("Test Loss:", test_loss)
print("Test Accuracy:", accuracy)
Output (values are approximate and may vary due to randomness):
Epoch 1/10, Loss: 0.4123
Epoch 2/10, Loss: 0.3567
...
Epoch 10/10, Loss: 0.1892
Test Loss: 0.1925
Test Accuracy: 0.935
In this example, we train a logistic regression model on a binary subset of MNIST (digits 0 vs. 1) using mini-batch gradient descent. The model processes 32 samples at a time, shuffling the data each epoch to ensure randomness. Over 10 epochs, the loss decreases, and we achieve a reasonable accuracy on the test set (e.g., ~93%), demonstrating the effectiveness of mini-batch training for scalability. Training times are manageable on a CPU (a few seconds per epoch) thanks to the small subset and mini-batch approach.
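If you’d like to visualize the convergence, here is an optional sketch that plots the losses list returned above (this assumes matplotlib is installed — pip install matplotlib — it isn’t otherwise required in this series):

# Optional: plot the per-epoch training loss (requires matplotlib)
import matplotlib.pyplot as plt

plt.plot(range(1, len(losses) + 1), losses, marker="o")
plt.xlabel("Epoch")
plt.ylabel("Binary cross-entropy loss")
plt.title("Mini-batch training loss on MNIST (0 vs. 1)")
plt.show()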
Let’s update our neural_network.py file to reflect the enhanced gradient_descent() function with mini-batch support and flexibility for activation functions.
# neural_network.py
import numpy as np
from numpy.typing import NDArray
from typing import Callable, Tuple, List
def normalize(X: NDArray[np.floating]) -> NDArray[np.floating]:
"""
Normalize an array to have mean=0 and std=1.
Args:
X: NumPy array of any shape with floating-point values
Returns:
Normalized array of the same shape with floating-point values
"""
mean = np.mean(X)
std = np.std(X)
if std == 0: # Avoid division by zero
return X - mean
return (X - mean) / std
def matrix_multiply(X: NDArray[np.floating], W: NDArray[np.floating]) -> NDArray[np.floating]:
"""
Perform matrix multiplication between two arrays.
Args:
X: First input array/matrix of shape (m, n) with floating-point values
W: Second input array/matrix of shape (n, p) with floating-point values
Returns:
Result of matrix multiplication, shape (m, p) with floating-point values
"""
return np.matmul(X, W)
def sigmoid(Z: NDArray[np.floating]) -> NDArray[np.floating]:
"""
Compute the sigmoid activation function element-wise.
Args:
Z: Input array of any shape with floating-point values
Returns:
Array of the same shape with sigmoid applied element-wise, values in [0, 1]
"""
return 1 / (1 + np.exp(-Z))
def mse_loss(y_pred: NDArray[np.floating], y: NDArray[np.floating]) -> float:
"""
Compute the Mean Squared Error loss between predicted and true values.
Args:
y_pred: Predicted values, array of shape (n,) or (n,1) with floating-point values
y: True values, array of shape (n,) or (n,1) with floating-point values
Returns:
Mean squared error as a single float
"""
return np.mean((y_pred - y) ** 2)
def binary_cross_entropy(A: NDArray[np.floating], y: NDArray[np.floating]) -> float:
"""
Compute the Binary Cross-Entropy loss between predicted probabilities and true labels.
Args:
A: Predicted probabilities (after sigmoid), array of shape (n,) or (n,1), values in [0, 1]
y: True binary labels, array of shape (n,) or (n,1), values in {0, 1}
Returns:
Binary cross-entropy loss as a single float
"""
epsilon = 1e-15
return -np.mean(y * np.log(A + epsilon) + (1 - y) * np.log(1 - A + epsilon))
def gradient_descent(X: NDArray[np.floating], y: NDArray[np.floating], W: NDArray[np.floating],
b: NDArray[np.floating], lr: float, num_epochs: int, batch_size: int,
loss_fn: Callable[[NDArray[np.floating], NDArray[np.floating]], float],
activation_fn: Callable[[NDArray[np.floating]], NDArray[np.floating]] = lambda x: x) -> Tuple[NDArray[np.floating], NDArray[np.floating], List[float]]:
"""
Perform mini-batch gradient descent to minimize loss.
Args:
X: Input data, shape (n_samples, n_features)
y: True values, shape (n_samples, 1)
W: Initial weights, shape (n_features, 1)
b: Initial bias, shape (1,) or (1,1)
lr: Learning rate, step size for updates
num_epochs: Number of full passes through the dataset
batch_size: Size of each mini-batch
loss_fn: Loss function to compute error, e.g., mse_loss or binary_cross_entropy
activation_fn: Activation function to apply to linear output (default: identity)
Returns:
Tuple of (updated W, updated b, list of loss values over epochs)
"""
n_samples = X.shape[0]
loss_history = []
for epoch in range(num_epochs):
indices = np.random.permutation(n_samples)
X_shuffled = X[indices]
y_shuffled = y[indices]
for start_idx in range(0, n_samples, batch_size):
end_idx = min(start_idx + batch_size, n_samples)
X_batch = X_shuffled[start_idx:end_idx]
y_batch = y_shuffled[start_idx:end_idx]
batch_size_actual = X_batch.shape[0]
Z_batch = X_batch @ W + b
y_pred_batch = activation_fn(Z_batch)
error = y_pred_batch - y_batch
grad_W = (X_batch.T @ error) / batch_size_actual
grad_b = np.mean(error)
W = W - lr * grad_W
b = b - lr * grad_b
y_pred_full = activation_fn(X @ W + b)
loss = loss_fn(y_pred_full, y)
loss_history.append(loss)
print(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss:.4f}")
return W, b, loss_history
You can now import this updated function using from neural_network import gradient_descent. The addition of mini-batch support and flexibility with activation functions makes it scalable and adaptable for training various models, from linear regression to logistic regression, and eventually neural networks.
To reinforce your understanding of mini-batch gradient descent, try these Python-focused coding exercises. They’ll prepare you for training larger models on real datasets in future chapters. Run the code and compare outputs to verify your solutions.
Exercise 1: Mini-Batch Gradient Descent on Synthetic Linear Regression Data

Create synthetic data with X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]]) and y = np.array([[2.0], [4.0], [6.0], [8.0], [10.0], [12.0], [14.0], [16.0]]) (true relationship $y = 2x$). Initialize W = np.array([[0.0]]) and b = np.array([[0.0]]). Run gradient_descent() with lr = 0.01, num_epochs = 5, and batch_size = 2. Print the initial and final values of W, b, and the loss history.
# Your code here
import numpy as np
from neural_network import gradient_descent, mse_loss
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y = np.array([[2.0], [4.0], [6.0], [8.0], [10.0], [12.0], [14.0], [16.0]])
W_init = np.array([[0.0]])
b_init = np.array([[0.0]])
lr = 0.01
num_epochs = 5
batch_size = 2
W_final, b_final, losses = gradient_descent(X, y, W_init, b_init, lr, num_epochs, batch_size, mse_loss)
print("Initial weight W:", W_init)
print("Initial bias b:", b_init)
print("Final weight W:", W_final)
print("Final bias b:", b_final)
print("Loss history:", losses)
Exercise 2: Effect of Batch Size on Convergence

Using the same data as in Exercise 1, run gradient_descent() with lr = 0.01 and num_epochs = 5, but with a larger batch_size = 4. Compare the final W, b, and loss history to Exercise 1. Observe how a larger batch size might affect convergence speed or stability.
# Your code here
import numpy as np
from neural_network import gradient_descent, mse_loss
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y = np.array([[2.0], [4.0], [6.0], [8.0], [10.0], [12.0], [14.0], [16.0]])
W_init = np.array([[0.0]])
b_init = np.array([[0.0]])
lr = 0.01
num_epochs = 5
batch_size = 4
W_final, b_final, losses = gradient_descent(X, y, W_init, b_init, lr, num_epochs, batch_size, mse_loss)
print("Initial weight W:", W_init)
print("Initial bias b:", b_init)
print("Final weight W:", W_final)
print("Final bias b:", b_final)
print("Loss history:", losses)
Exercise 3: Mini-Batch Logistic Regression on Synthetic Data

Create synthetic binary classification data with X = np.array([[0.5], [1.5], [1.0], [2.0], [3.0], [2.5]]) and y = np.array([[0.0], [0.0], [0.0], [1.0], [1.0], [1.0]]) (a decision boundary between x = 1.5 and x = 2.0). Initialize W = np.array([[0.0]]) and b = np.array([[0.0]]). Run gradient_descent() with lr = 0.01, num_epochs = 10, and batch_size = 2, using binary_cross_entropy as loss_fn and sigmoid as activation_fn. Print the initial and final W, b, and loss history.
# Your code here
import numpy as np
from neural_network import gradient_descent, binary_cross_entropy, sigmoid
X = np.array([[0.5], [1.5], [1.0], [2.0], [3.0], [2.5]])
y = np.array([[0.0], [0.0], [0.0], [1.0], [1.0], [1.0]])
W_init = np.array([[0.0]])
b_init = np.array([[0.0]])
lr = 0.01
num_epochs = 10
batch_size = 2
W_final, b_final, losses = gradient_descent(X, y, W_init, b_init, lr, num_epochs, batch_size, binary_cross_entropy, sigmoid)
print("Initial weight W:", W_init)
print("Initial bias b:", b_init)
print("Final weight W:", W_final)
print("Final bias b:", b_final)
print("Loss history:", losses)
Exercise 4: Effect of Shuffling in Mini-Batches

Modify the gradient_descent() function temporarily to remove shuffling (comment out the np.random.permutation part and use X and y directly). Run it on the data from Exercise 1 with lr = 0.01, num_epochs = 5, and batch_size = 2. Compare the loss history to Exercise 1. Observe how the lack of shuffling might affect convergence due to consistent batch ordering.
# Your code here (modify gradient_descent temporarily)
import numpy as np
from neural_network import mse_loss

# Same as gradient_descent(), but with the per-epoch shuffling removed
def gradient_descent_no_shuffle(X, y, W, b, lr, num_epochs, batch_size, loss_fn, activation_fn=lambda x: x):
n_samples = X.shape[0]
loss_history = []
for epoch in range(num_epochs):
for start_idx in range(0, n_samples, batch_size):
end_idx = min(start_idx + batch_size, n_samples)
X_batch = X[start_idx:end_idx]
y_batch = y[start_idx:end_idx]
batch_size_actual = X_batch.shape[0]
Z_batch = X_batch @ W + b
y_pred_batch = activation_fn(Z_batch)
error = y_pred_batch - y_batch
grad_W = (X_batch.T @ error) / batch_size_actual
grad_b = np.mean(error)
W = W - lr * grad_W
b = b - lr * grad_b
y_pred_full = activation_fn(X @ W + b)
loss = loss_fn(y_pred_full, y)
loss_history.append(loss)
print(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss:.4f}")
return W, b, loss_history
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y = np.array([[2.0], [4.0], [6.0], [8.0], [10.0], [12.0], [14.0], [16.0]])
W_init = np.array([[0.0]])
b_init = np.array([[0.0]])
lr = 0.01
num_epochs = 5
batch_size = 2
W_final, b_final, losses = gradient_descent_no_shuffle(X, y, W_init, b_init, lr, num_epochs, batch_size, mse_loss)
print("Initial weight W:", W_init)
print("Initial bias b:", b_init)
print("Final weight W:", W_final)
print("Final bias b:", b_final)
print("Loss history:", losses)
These exercises will help you build intuition for how mini-batch gradient descent works, the impact of batch size, and the importance of shuffling for effective training.
Congratulations on mastering mini-batch gradient descent, a crucial optimization technique for scalable deep learning! In this post, we’ve extended our gradient_descent() function to process data in mini-batches, making it efficient for large datasets, and added flexibility with activation functions to support logistic regression. We’ve also applied it to a binary classification task on a subset of MNIST, demonstrating real-world applicability with reasonable training times on a CPU.
In the next chapter (Part 2.4: Debugging with Numerical Gradients), we’ll learn how to verify our gradient computations using numerical methods, ensuring the correctness of our implementations—a key skill for building reliable deep learning models.
Until then, experiment with the code and exercises above. If you have questions or want to share your solutions, drop a comment below—I’m excited to hear from you. Let’s keep building our deep learning toolkit together!
Next Up: Part 2.4 – Debugging with Numerical Gradients