Welcome back to our blog series, “Learn Deep Learning with NumPy”! In Part
2.3, we enhanced our gradient_descent()
function to support mini-batch
processing, making it scalable for larger datasets like MNIST. As we build more
complex models, ensuring the correctness of our gradient computations becomes
critical. In Part 2.4, we’re focusing on a key debugging skill for deep
learning: verifying gradients using numerical methods. This technique helps
confirm that our analytical gradients—computed via formulas—are implemented
correctly, preventing subtle bugs from derailing training.
By the end of this post, you’ll understand how to approximate gradients
numerically using finite differences, implement a reusable
numerical_gradient()
function in NumPy, and apply it to verify the analytical
gradients for linear regression. This debugging tool will be invaluable as we
move toward multi-layer neural networks in Module 3. Let’s dive into the math
and code for ensuring robust gradient implementations!
In deep learning, training relies on gradient descent, which updates model parameters (weights and biases) based on the gradient of the loss function with respect to those parameters. We typically compute these gradients analytically using formulas derived from calculus (e.g., for MSE loss in linear regression, ). However, deriving and implementing these formulas can be error-prone, especially in complex models with multiple layers and activation functions.
If the analytical gradient is incorrect due to a bug or misderived formula, gradient descent will update parameters in the wrong direction, leading to poor model performance or divergence (increasing loss, as some of you might have seen with an improperly tuned learning rate). Numerical gradient checking offers a way to verify analytical gradients by approximating them using a simple, brute-force method based on the definition of a derivative. This acts as a sanity check, helping us debug implementations before scaling up to larger models.
Numerical gradient checking is particularly useful when:
While numerical gradients are too slow to use during actual training (they require multiple loss evaluations per parameter), they are a powerful tool for validation during development. Let’s explore the math behind this approach.
The gradient of a function with respect to a parameter represents the rate of change of at . Mathematically, the derivative (or gradient for multi-dimensional ) at a point is defined as the limit of the difference quotient:
In practice, we can’t compute the exact limit, so we approximate it with a small value of using the finite difference method. A more accurate version, called the central difference approximation, evaluates the function on both sides of to reduce approximation error:
Where:
This method doesn’t require knowing the analytical formula for the gradient—it simply evaluates the loss at nearby points. By comparing the numerical gradient to our analytical gradient, we can check if they are close (within a small tolerance, e.g., ). If they differ significantly, it indicates a bug in the analytical gradient implementation.
Now, let’s implement this numerical gradient checking in NumPy and use it to verify the gradients for linear regression.
We’ll create a reusable numerical_gradient()
function to approximate gradients
using the central difference method. We’ll apply it to verify the analytical
gradients computed in our gradient_descent()
function for linear regression
with MSE loss. This debugging tool will be added to our growing library.
Here’s the implementation of numerical_gradient()
for a general loss function
and parameters. We’ll use a dictionary for parameters to make it flexible for
models with multiple parameters (e.g., W
and b
).
import numpy as np
from numpy.typing import NDArray
from typing import Union, Callable, Dict
def numerical_gradient(
X: NDArray[np.floating],
y: NDArray[np.floating],
params: Dict[str, NDArray[np.floating]],
loss_fn: Callable[[NDArray[np.floating], NDArray[np.floating]], np.floating],
forward_fn: Callable[
[NDArray[np.floating], Dict[str, NDArray[np.floating]]], NDArray[np.floating]
],
h: float = 1e-4,
) -> Dict[str, NDArray[np.floating]]:
"""
Compute numerical gradients for parameters using central difference approximation.
Args:
X: Input data, shape (n_samples, n_features)
y: True values, shape (n_samples, 1)
params: Dictionary of parameters (e.g., {'W': ..., 'b': ...})
loss_fn: Loss function to compute error, e.g., mse_loss
forward_fn: Function to compute predictions from X and params
h: Step size for finite difference approximation (default: 1e-4)
Returns:
Dictionary of numerical gradients for each parameter
"""
num_grads = {}
for param_name, param_value in params.items():
num_grad = np.zeros_like(param_value)
it = np.nditer(param_value, flags=["multi_index"])
while not it.finished:
idx = it.multi_index
original_value = param_value[idx]
# Compute loss at W + h
param_value[idx] = original_value + h
y_pred_plus = forward_fn(X, params)
loss_plus = loss_fn(y_pred_plus, y)
# Compute loss at W - h
param_value[idx] = original_value - h
y_pred_minus = forward_fn(X, params)
loss_minus = loss_fn(y_pred_minus, y)
# Central difference approximation
num_grad[idx] = (loss_plus - loss_minus) / (2 * h)
# Restore original value
param_value[idx] = original_value
it.iternext()
num_grads[param_name] = num_grad
return num_grads
Let’s use numerical_gradient()
to verify the analytical gradients for linear
regression with MSE loss. We’ll compare the numerical gradients to the
analytical gradients computed as grad_W = (X.T @ (y_pred - y)) / n
and
grad_b = np.mean(y_pred - y)
.
# Define forward function for linear regression
def linear_forward(X: NDArray[np.floating], params: Dict[str, NDArray[np.floating]]) -> NDArray[np.floating]:
return X @ params['W'] + params['b']
# Example: Synthetic data for linear regression (y = 2x + 1)
X = np.array([[1.0], [2.0], [3.0], [4.0]]) # Input (4 samples, 1 feature)
y = np.array([[3.0], [5.0], [7.0], [9.0]]) # True values (y = 2x + 1)
n = X.shape[0]
# Initialize parameters
params = {
'W': np.array([[1.0]]), # Initial weight (not the true value)
'b': np.array([[0.5]]) # Initial bias (not the true value)
}
# Compute analytical gradients
y_pred = linear_forward(X, params)
error = y_pred - y
analytical_grad_W = (2.0 / n) * (
X.T @ error
) # Include factor of 2 from derivative of (y_pred - y)^2
analytical_grad_b = (2.0 / n) * np.sum(
error
) # Include factor of 2 (equivalent to np.mean(error) * 2)
# Compute numerical gradients
numerical_grads = numerical_gradient(X, y, params, mse_loss, linear_forward, h=1e-4)
# Compare analytical and numerical gradients
print("Analytical Gradient for W:", analytical_grad_W)
print("Numerical Gradient for W:", numerical_grads['W'])
print("Difference for W:", np.abs(analytical_grad_W - numerical_grads['W']))
print("Analytical Gradient for b:", analytical_grad_b)
print("Numerical Gradient for b:", numerical_grads['b'])
print("Difference for b:", np.abs(analytical_grad_b - numerical_grads['b']))
Output (values are approximate):
Analytical Gradient for W: [[-5.]]
Numerical Gradient for W: [[-5.00000001]]
Difference for W: [[1e-08]]
Analytical Gradient for b: -1.5
Numerical Gradient for b: [[-1.50000001]]
Difference for b: [[1e-08]]
In this example, the numerical gradients computed using the central difference approximation are extremely close to the analytical gradients (differences on the order of ), confirming that our analytical gradient implementation for linear regression with MSE loss is correct. If the differences were large (e.g., greater than ), it would indicate a bug in the analytical gradient formula or implementation.
Let’s update our neural_network.py
file to include the numerical_gradient()
function alongside our previous implementations. This debugging tool will be
essential for verifying gradients as we build more complex models.
# neural_network.py
import numpy as np
from numpy.typing import NDArray
from typing import Union, Callable, Tuple, List, Dict
def normalize(X: NDArray[np.floating]) -> NDArray[np.floating]:
"""
Normalize the input array X by subtracting the mean and dividing by the standard deviation.
Parameters:
X (NDArray[np.floating]): Input array to normalize. Should be a numerical array
(float or compatible type).
Returns:
NDArray[np.floating]: Normalized array with mean approximately 0 and standard
deviation approximately 1 along each axis.
"""
mean = np.mean(X, axis=0)
std = np.std(X, axis=0)
normalized_X = X - mean # Start with mean subtraction
mask = std != 0 # Create a mask for non-zero std
if np.any(mask):
normalized_X[:, mask] = normalized_X[:, mask] / std[mask]
return normalized_X
def matrix_multiply(X: NDArray[np.floating], W: NDArray[np.floating]) -> NDArray[np.floating]:
"""
Perform matrix multiplication between two arrays.
Args:
X: First input array/matrix of shape (m, n) with floating-point values
W: Second input array/matrix of shape (n, p) with floating-point values
Returns:
Result of matrix multiplication, shape (m, p) with floating-point values
"""
return np.matmul(X, W)
def sigmoid(Z: NDArray[np.floating]) -> NDArray[np.floating]:
"""
Compute the sigmoid activation function element-wise.
Args:
Z: Input array of any shape with floating-point values
Returns:
Array of the same shape with sigmoid applied element-wise, values in [0, 1]
"""
return 1 / (1 + np.exp(-Z))
def mse_loss(y_pred: NDArray[np.floating], y: NDArray[np.floating]) -> float:
"""
Compute the Mean Squared Error loss between predicted and true values.
Args:
y_pred: Predicted values, array of shape (n,) or (n,1) with floating-point values
y: True values, array of shape (n,) or (n,1) with floating-point values
Returns:
Mean squared error as a single float
"""
return np.mean((y_pred - y) ** 2)
def binary_cross_entropy(A: NDArray[np.floating], y: NDArray[np.floating]) -> float:
"""
Compute the Binary Cross-Entropy loss between predicted probabilities and true labels.
Args:
A: Predicted probabilities (after sigmoid), array of shape (n,) or (n,1), values in [0, 1]
y: True binary labels, array of shape (n,) or (n,1), values in {0, 1}
Returns:
Binary cross-entropy loss as a single float
"""
epsilon = 1e-15
return -np.mean(y * np.log(A + epsilon) + (1 - y) * np.log(1 - A + epsilon))
def gradient_descent(X: NDArray[np.floating], y: NDArray[np.floating], W: NDArray[np.floating],
b: NDArray[np.floating], lr: float, num_epochs: int, batch_size: int,
loss_fn: Callable[[NDArray[np.floating], NDArray[np.floating]], float],
activation_fn: Callable[[NDArray[np.floating]], NDArray[np.floating]] = lambda x: x) -> Tuple[NDArray[np.floating], NDArray[np.floating], List[float]]:
"""
Perform mini-batch gradient descent to minimize loss.
Args:
X: Input data, shape (n_samples, n_features)
y: True values, shape (n_samples, 1)
W: Initial weights, shape (n_features, 1)
b: Initial bias, shape (1,) or (1,1)
lr: Learning rate, step size for updates
num_epochs: Number of full passes through the dataset
batch_size: Size of each mini-batch
loss_fn: Loss function to compute error, e.g., mse_loss or binary_cross_entropy
activation_fn: Activation function to apply to linear output (default: identity)
Returns:
Tuple of (updated W, updated b, list of loss values over epochs)
"""
n_samples = X.shape[0]
loss_history = []
for epoch in range(num_epochs):
indices = np.random.permutation(n_samples)
X_shuffled = X[indices]
y_shuffled = y[indices]
for start_idx in range(0, n_samples, batch_size):
end_idx = min(start_idx + batch_size, n_samples)
X_batch = X_shuffled[start_idx:end_idx]
y_batch = y_shuffled[start_idx:end_idx]
batch_size_actual = X_batch.shape[0]
Z_batch = X_batch @ W + b
y_pred_batch = activation_fn(Z_batch)
error = y_pred_batch - y_batch
grad_W = (X_batch.T @ error) / batch_size_actual
grad_b = np.mean(error)
W = W - lr * grad_W
b = b - lr * grad_b
y_pred_full = activation_fn(X @ W + b)
loss = loss_fn(y_pred_full, y)
loss_history.append(loss)
print(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss:.4f}")
return W, b, loss_history
def numerical_gradient(X: NDArray[np.floating], y: NDArray[np.floating], params: Dict[str, NDArray[np.floating]],
loss_fn: Callable[[NDArray[np.floating], NDArray[np.floating]], float],
forward_fn: Callable[[NDArray[np.floating], Dict[str, NDArray[np.floating]]], NDArray[np.floating]],
h: float = 1e-4) -> Dict[str, NDArray[np.floating]]:
"""
Compute numerical gradients for parameters using central difference approximation.
Args:
X: Input data, shape (n_samples, n_features)
y: True values, shape (n_samples, 1)
params: Dictionary of parameters (e.g., {'W': ..., 'b': ...})
loss_fn: Loss function to compute error, e.g., mse_loss
forward_fn: Function to compute predictions from X and params
h: Step size for finite difference approximation (default: 1e-4)
Returns:
Dictionary of numerical gradients for each parameter
"""
num_grads = {}
for param_name, param_value in params.items():
num_grad = np.zeros_like(param_value)
it = np.nditer(param_value, flags=['multi_index'], op_flags=['readwrite'])
while not it.finished:
idx = it.multi_index
original_value = param_value[idx]
param_value[idx] = original_value + h
y_pred_plus = forward_fn(X, params)
loss_plus = loss_fn(y_pred_plus, y)
param_value[idx] = original_value - h
y_pred_minus = forward_fn(X, params)
loss_minus = loss_fn(y_pred_minus, y)
num_grad[idx] = (loss_plus - loss_minus) / (2 * h)
param_value[idx] = original_value
it.iternext()
num_grads[param_name] = num_grad
return num_grads
You can now import this debugging tool using
from neural_network import numerical_gradient
. It will be particularly useful
as we move into more complex models with intricate gradient computations in
upcoming modules.
To reinforce your understanding of numerical gradient checking, try these Python-focused coding exercises. They’ll prepare you for debugging gradient implementations in more complex models. Run the code and compare outputs to verify your solutions.
Verify Gradients for Linear Regression with Small Data
Use the synthetic data X = np.array([[1.0], [2.0]])
and
y = np.array([[3.0], [5.0]])
(true relationship ). Initialize
parameters as params = {'W': np.array([[1.5]]), 'b': np.array([[0.5]])}
.
Compute analytical gradients using grad_W = (X.T @ (y_pred - y)) / n
and
grad_b = np.mean(y_pred - y)
, and compare them to numerical gradients from
numerical_gradient()
with h = 1e-4
. Print the differences.
# Your code here
def linear_forward(X, params):
return X @ params['W'] + params['b']
X = np.array([[1.0], [2.0]])
y = np.array([[3.0], [5.0]])
n = X.shape[0]
params = {'W': np.array([[1.5]]), 'b': np.array([[0.5]])}
y_pred = linear_forward(X, params)
error = y_pred - y
analytical_grad_W = 2 * (X.T @ error) / n
analytical_grad_b = 2 * np.mean(error)
numerical_grads = numerical_gradient(X, y, params, mse_loss, linear_forward, h=1e-4)
print("Analytical Gradient for W:", analytical_grad_W)
print("Numerical Gradient for W:", numerical_grads['W'])
print("Difference for W:", np.abs(analytical_grad_W - numerical_grads['W']))
print("Analytical Gradient for b:", analytical_grad_b)
print("Numerical Gradient for b:", numerical_grads['b'])
print("Difference for b:", np.abs(analytical_grad_b - numerical_grads['b']))
Effect of Step Size h on Numerical Gradients
Using the same data and parameters as in Exercise 1, compute numerical
gradients with a larger step size h = 1e-2
and compare the differences to
the analytical gradients. Observe how a larger h
affects the accuracy of
the approximation compared to h = 1e-4
.
# Your code here
def linear_forward(X, params):
return X @ params['W'] + params['b']
X = np.array([[1.0], [2.0]])
y = np.array([[3.0], [5.0]])
n = X.shape[0]
params = {'W': np.array([[1.5]]), 'b': np.array([[0.5]])}
y_pred = linear_forward(X, params)
error = y_pred - y
analytical_grad_W = 2 * (X.T @ error) / n
analytical_grad_b = 2 * np.mean(error)
numerical_grads = numerical_gradient(X, y, params, mse_loss, linear_forward, h=1e-2)
print("Analytical Gradient for W:", analytical_grad_W)
print("Numerical Gradient for W:", numerical_grads['W'])
print("Difference for W:", np.abs(analytical_grad_W - numerical_grads['W']))
print("Analytical Gradient for b:", analytical_grad_b)
print("Numerical Gradient for b:", numerical_grads['b'])
print("Difference for b:", np.abs(analytical_grad_b - numerical_grads['b']))
Verify Gradients for Logistic Regression
Use synthetic binary classification data
X = np.array([[0.5], [1.5], [2.5]])
and
y = np.array([[0.0], [0.0], [1.0]])
. Define a forward function with sigmoid
activation (y_pred = sigmoid(X @ W + b)
), initialize
params = {'W': np.array([[0.0]]), 'b': np.array([[0.0]])}
, and compute
analytical gradients for BCE loss (grad_W = (X.T @ (y_pred - y)) / n
).
Compare to numerical gradients using binary_cross_entropy
as loss_fn
.
# Your code here
def logistic_forward(X, params):
return sigmoid(X @ params['W'] + params['b'])
X = np.array([[0.5], [1.5], [2.5]])
y = np.array([[0.0], [0.0], [1.0]])
n = X.shape[0]
params = {'W': np.array([[0.0]]), 'b': np.array([[0.0]])}
y_pred = logistic_forward(X, params)
error = y_pred - y
analytical_grad_W = (X.T @ error) / n
analytical_grad_b = np.mean(error)
numerical_grads = numerical_gradient(X, y, params, binary_cross_entropy, logistic_forward, h=1e-4)
print("Analytical Gradient for W:", analytical_grad_W)
print("Numerical Gradient for W:", numerical_grads['W'])
print("Difference for W:", np.abs(analytical_grad_W - numerical_grads['W']))
print("Analytical Gradient for b:", analytical_grad_b)
print("Numerical Gradient for b:", numerical_grads['b'])
print("Difference for b:", np.abs(analytical_grad_b - numerical_grads['b']))
Debugging a Buggy Gradient Implementation
Modify the analytical gradient for linear regression to be incorrect (e.g.,
grad_W = (X @ error) / n
instead of X.T @ error
). Use the data from
Exercise 1 and compare to numerical gradients. Observe how the large
difference indicates a bug in the analytical gradient.
# Your code here
def linear_forward(X, params):
return X @ params['W'] + params['b']
X = np.array([[1.0], [2.0]])
y = np.array([[3.0], [5.0]])
n = X.shape[0]
params = {'W': np.array([[1.5]]), 'b': np.array([[0.5]])}
y_pred = linear_forward(X, params)
error = y_pred - y
analytical_grad_W = (X @ error) / n # Buggy: should be X.T @ error
analytical_grad_b = np.mean(error) # Correct for b
numerical_grads = numerical_gradient(X, y, params, mse_loss, linear_forward, h=1e-4)
print("Analytical Gradient for W (buggy):", analytical_grad_W)
print("Numerical Gradient for W:", numerical_grads['W'])
print("Difference for W (large due to bug):", np.abs(analytical_grad_W - numerical_grads['W']))
print("Analytical Gradient for b:", analytical_grad_b)
print("Numerical Gradient for b:", numerical_grads['b'])
print("Difference for b:", np.abs(analytical_grad_b - numerical_grads['b']))
These exercises will help you build intuition for numerical gradient checking,
understand the impact of approximation parameters like h
, and practice
debugging incorrect gradient implementations.
Congratulations on mastering numerical gradient checking, a vital debugging
skill for deep learning! In this post, we’ve explored the finite difference
approximation to compute gradients numerically, implemented a reusable
numerical_gradient()
function, and verified analytical gradients for linear
regression. This tool ensures the correctness of our gradient implementations,
building confidence as we tackle more complex models.
With Module 2 complete, we’ve laid a strong foundation in optimization, from loss functions and gradient descent to mini-batch processing and debugging. In Module 3, starting with Part 3.1 (Single-Layer Perceptrons), we’ll begin building neural networks, applying these optimization techniques to train our first models.
Until then, experiment with the code and exercises above. If you have questions or want to share your solutions, drop a comment below—I’m excited to hear from you. Let’s keep building our deep learning toolkit together!
Next Up: Part 3.1 – Single-Layer Perceptrons