Welcome back to our blog series, “Learn Deep Learning with NumPy”! In Part 2.1, we introduced loss functions like Mean Squared Error (MSE) and Binary Cross-Entropy (BCE), which quantify how far a model’s predictions are from the true values. Now, in Part 2.2, we’re taking the next step in Module 2 by exploring gradient descent—a fundamental optimization algorithm that minimizes loss by iteratively updating a model’s parameters.
By the end of this post, you’ll understand the concept of gradients and learning rates, learn how gradient descent works to optimize model parameters, and implement a reusable gradient_descent() function in NumPy for training a simple linear regression model. This function will be a cornerstone of our deep learning toolkit, reused across various models in future chapters. Let’s dive into the math and code behind optimization!
Training a neural network (or any machine learning model) involves finding the best set of parameters (weights and biases) that minimize the loss function. Think of the loss as a landscape with hills and valleys—our goal is to find the lowest point (the minimum loss). Gradient descent is the algorithm that guides us downhill toward this minimum by iteratively adjusting the parameters based on the slope of the loss landscape.
Here’s the intuition: at each point in the landscape, the gradient of the loss tells us which direction is uphill (the steepest increase in loss). To go downhill, we step in the opposite direction, and the learning rate controls how large each step is. Gradient descent is the backbone of training neural networks, used to update weights and biases after each forward pass. In this post, we’ll apply it to a simple linear regression problem, but the concept extends to complex deep learning models.
Gradient descent updates a model’s parameters by taking small steps in the direction that reduces the loss. For a parameter $W$ (e.g., a weight matrix), the update rule is:

$$W \leftarrow W - \eta \, \nabla_W L$$

Where:

- $L$ is the loss function (e.g., MSE),
- $\nabla_W L$ is the gradient of the loss with respect to $W$ (the slope of the loss surface in the direction of $W$), and
- $\eta$ is the learning rate, a small positive number (lr in our code) that controls the step size.

This process repeats iteratively—compute predictions, calculate loss, compute gradients, update parameters—until the loss converges to a minimum or we reach a set number of iterations.
For a linear regression model with predictions $\hat{y} = XW + b$, where $X$ is the input matrix, $W$ is the weight matrix, and $b$ is the bias, we use MSE as the loss function:

$$L = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

The gradient of MSE with respect to $W$ tells us how to adjust $W$ to reduce the loss. For a dataset of $n$ samples, it is:

$$\nabla_W L = \frac{1}{n} X^T (\hat{y} - y)$$

Where:

- $X^T$ is the transpose of the input matrix, shape (n_features, n_samples),
- $\hat{y} - y$ is the vector of prediction errors, shape (n_samples, 1),
- $n$ is the number of samples.

(The factor of 2 from differentiating the square is conventionally absorbed into the learning rate; the derivation below makes this explicit.)

Similarly, the gradient with respect to the bias $b$ (if present) is the average error:

$$\nabla_b L = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)$$

These gradients are used in the update rules $W \leftarrow W - \eta \, \nabla_W L$ and $b \leftarrow b - \eta \, \nabla_b L$. Before building a reusable NumPy implementation, let’s see a single update step in action and then derive these formulas.
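To make the update rule concrete, here is a minimal sketch of a single gradient descent step on a tiny dataset (the variable names are illustrative; the full reusable function appears later in this post):

import numpy as np

# Tiny dataset: 3 samples, 1 feature, true relationship y = 2x + 1
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([[3.0], [5.0], [7.0]])
W = np.array([[0.0]])  # initial weight
b = np.array([[0.0]])  # initial bias
lr = 0.1               # learning rate (eta)
n = X.shape[0]

# One iteration: forward pass, gradients, parameter update
y_pred = X @ W + b                 # predictions, shape (3, 1)
grad_W = (X.T @ (y_pred - y)) / n  # gradient of MSE w.r.t. W
grad_b = np.mean(y_pred - y)       # gradient of MSE w.r.t. b
W = W - lr * grad_W                # step against the gradient
b = b - lr * grad_b
print(W, b)  # W moves from 0.0 to about 1.13, b from 0.0 to 0.5

Repeating this step many times is exactly what the gradient_descent() function below will do.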
To understand why gradient descent works, it’s essential to derive the gradients of the Mean Squared Error (MSE) loss function with respect to the model parameters—weights ($W$) and bias ($b$). This derivation shows how changes in $W$ and $b$ affect the loss, guiding us on how to adjust them to minimize error. Let’s break this down step by step using calculus.
For a linear regression model, the predicted output is given by $\hat{y} = XW + b$, where $X$ is the input matrix of shape $(n, d)$ (with $n$ samples and $d$ features), $W$ is the weight matrix of shape $(d, 1)$, and $b$ is the bias term. The true values are denoted by $y$, a vector of shape $(n, 1)$. The MSE loss is defined as the average squared difference between predictions and true values:

$$L = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

Substituting $\hat{y}_i = \sum_{j=1}^{d} X_{ij} W_j + b$, we can write:

$$L = \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{j=1}^{d} X_{ij} W_j + b - y_i \right)^2$$
Our goal is to compute the gradients $\nabla_W L$ and $\nabla_b L$, which are the partial derivatives of $L$ with respect to $W$ and $b$, respectively. These gradients indicate how much the loss changes with small changes in $W$ and $b$, allowing us to update them in the direction that reduces $L$.
To find $\nabla_W L$, we compute the partial derivative of $L$ with respect to each element of $W$. Let’s consider the loss for a single sample first and then generalize. For the $i$-th sample, the squared error is:

$$L_i = (\hat{y}_i - y_i)^2$$

Since $\hat{y}_i$ is the $i$-th element of the prediction vector, and $\hat{y}_i$ is a linear combination, we can express it as $\hat{y}_i = \sum_{j=1}^{d} X_{ij} W_j + b$, where $X_{ij}$ is the $j$-th feature of the $i$-th sample and $W_j$ is the $j$-th weight. The partial derivative of $L_i$ with respect to $W_j$ (a specific weight) is:

$$\frac{\partial L_i}{\partial W_j} = 2 (\hat{y}_i - y_i) \frac{\partial \hat{y}_i}{\partial W_j}$$

Since $\frac{\partial \hat{y}_i}{\partial W_j} = X_{ij}$ (as only the term involving $W_j$ depends on it, and $b$ is independent of $W_j$), this simplifies to:

$$\frac{\partial L_i}{\partial W_j} = 2 (\hat{y}_i - y_i) X_{ij}$$

Now, for the total loss $L = \frac{1}{n} \sum_{i=1}^{n} L_i$, the partial derivative with respect to $W_j$ is:

$$\frac{\partial L}{\partial W_j} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) X_{ij}$$

This can be written in vector form for all weights $W$. Notice that for each sample $i$, the term $\hat{y}_i - y_i$ is the error for that sample, and multiplying by $X_{ij}$ across all features $j$ corresponds to the $i$-th row of $X$. Summing over all samples $i$, this is equivalent to a matrix multiplication with the transpose of $X$. Thus, the gradient for the entire weight vector is:

$$\nabla_W L = \frac{2}{n} X^T (\hat{y} - y)$$

In practice, the factor of 2 is often absorbed into the learning rate or omitted for simplicity (as it scales the step size uniformly), so we commonly use:

$$\nabla_W L = \frac{1}{n} X^T (\hat{y} - y)$$
This is the formula implemented in our gradient_descent() function. It shows that the gradient for $W$ is proportional to the correlation between the input features ($X$) and the prediction errors ($\hat{y} - y$), guiding us to adjust $W$ to reduce those errors.
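If you want to convince yourself that the vectorized formula matches the element-wise derivation, here is a small sanity-check sketch (purely illustrative; it compares $\frac{2}{n} X^T (\hat{y} - y)$ against the per-weight sums $\frac{2}{n} \sum_i (\hat{y}_i - y_i) X_{ij}$ computed in a loop):

import numpy as np

# Small random problem: 4 samples, 2 features
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))
y = rng.normal(size=(4, 1))
W = rng.normal(size=(2, 1))
b = 0.5
n = X.shape[0]

y_pred = X @ W + b

# Vectorized gradient (factor of 2 kept to match the derivation)
grad_vec = (2 / n) * (X.T @ (y_pred - y))

# Element-wise gradient: one entry per weight W_j
grad_loop = np.zeros_like(W)
for j in range(X.shape[1]):
    grad_loop[j, 0] = (2 / n) * np.sum((y_pred[:, 0] - y[:, 0]) * X[:, j])

print(np.allclose(grad_vec, grad_loop))  # True: both forms agree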
Next, we derive the gradient for the bias term $b$. The bias is added to every prediction, so for the $i$-th sample, the partial derivative of $L_i$ with respect to $b$ is:

$$\frac{\partial L_i}{\partial b} = 2 (\hat{y}_i - y_i) \frac{\partial \hat{y}_i}{\partial b}$$

Since $\frac{\partial \hat{y}_i}{\partial b} = 1$ (the prediction increases by 1 for a unit increase in $b$), this simplifies to:

$$\frac{\partial L_i}{\partial b} = 2 (\hat{y}_i - y_i)$$

For the total loss $L$, the partial derivative with respect to $b$ is:

$$\frac{\partial L}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)$$

Again, the factor of 2 is often omitted for simplicity, so we use:

$$\nabla_b L = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)$$
This tells us that the gradient for $b$ is simply the average error across all samples. If the predictions are systematically too high (positive average error), $b$ should decrease, and if too low (negative average error), $b$ should increase.
The gradients $\nabla_W L$ and $\nabla_b L$ point in the direction of the steepest increase in loss. By subtracting them (scaled by the learning rate $\eta$) in the gradient descent update rule, we move in the direction of steepest decrease, iteratively reducing the loss. This calculus-based approach ensures that each update to $W$ and $b$ brings our predictions closer to the true values, minimizing MSE.
Understanding this derivation provides insight into why gradient descent is effective and prepares us to derive gradients for more complex loss functions and models (like neural networks) in future posts. For now, these formulae for linear regression with MSE are the foundation of our optimization process.
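Before moving on to the implementation, you can also sanity-check the derivation numerically with finite differences: nudge each parameter by a small $\epsilon$, measure how the loss changes, and compare against the analytical gradients. A minimal sketch (verification only, not part of the training loop):

import numpy as np

def mse(X, y, W, b):
    return np.mean((X @ W + b - y) ** 2)

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([[3.0], [5.0], [7.0]])
W = np.array([[0.5]])
b = np.array([[0.2]])
n, eps = X.shape[0], 1e-6

# Analytical gradients (factor of 2 kept, matching the derivation)
y_pred = X @ W + b
grad_W = (2 / n) * (X.T @ (y_pred - y))
grad_b = (2 / n) * np.sum(y_pred - y)

# Numerical gradients via central differences
num_W = (mse(X, y, W + eps, b) - mse(X, y, W - eps, b)) / (2 * eps)
num_b = (mse(X, y, W, b + eps) - mse(X, y, W, b - eps)) / (2 * eps)

print(grad_W[0, 0], num_W)  # should agree to several decimal places
print(grad_b, num_b)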
We’ll create a reusable gradient_descent() function to train a linear regression model by minimizing MSE loss. This function will update parameters iteratively and be adaptable for more complex models later. As always, we’ll include type hints for clarity.
Here’s the implementation, along with an example using synthetic data where the true relationship is $y = 2x + 1$:
import numpy as np
from numpy.typing import NDArray
from typing import Callable
def gradient_descent(X: NDArray[np.floating], y: NDArray[np.floating], W: NDArray[np.floating],
b: NDArray[np.floating], lr: float, num_iterations: int,
loss_fn: Callable[[NDArray[np.floating], NDArray[np.floating]], float]) -> tuple[NDArray[np.floating], NDArray[np.floating], list[float]]:
"""
Perform gradient descent to minimize loss for linear regression.
Args:
X: Input data, shape (n_samples, n_features)
y: True values, shape (n_samples, 1)
W: Initial weights, shape (n_features, 1)
b: Initial bias, shape (1,) or (1,1)
lr: Learning rate, step size for updates
num_iterations: Number of iterations to run gradient descent
loss_fn: Loss function to compute error, e.g., mse_loss
Returns:
Tuple of (updated W, updated b, list of loss values over iterations)
"""
n = X.shape[0]
loss_history = []
for _ in range(num_iterations):
# Forward pass: Compute predictions
y_pred = X @ W + b
# Compute loss
loss = loss_fn(y_pred, y)
loss_history.append(loss)
# Compute gradients for W and b (for MSE loss)
grad_W = (X.T @ (y_pred - y)) / n
grad_b = np.mean(y_pred - y)
# Update parameters
W = W - lr * grad_W
b = b - lr * grad_b
return W, b, loss_history
# Example: Synthetic data (y = 2x + 1)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]]) # Input (5 samples, 1 feature)
y = np.array([[3.0], [5.0], [7.0], [9.0], [11.0]]) # True values (y = 2x + 1)
W_init = np.array([[0.0]]) # Initial weight (start far from true value 2.0)
b_init = np.array([[0.0]]) # Initial bias (start far from true value 1.0)
lr = 0.1 # Learning rate
num_iterations = 100 # Number of iterations
# Use mse_loss from our library
def mse_loss(y_pred: NDArray[np.floating], y: NDArray[np.floating]) -> float:
return np.mean((y_pred - y) ** 2)
# Run gradient descent
W_final, b_final, losses = gradient_descent(X, y, W_init, b_init, lr, num_iterations, mse_loss)
print("Initial weight W:", W_init)
print("Initial bias b:", b_init)
print("Final weight W after training:", W_final)
print("Final bias b after training:", b_final)
print("Final loss:", losses[-1])
print("First few losses:", losses[:5])
Output (values are approximate; your exact numbers may differ slightly, but W and b should converge toward 2 and 1 while the loss shrinks toward zero):
Initial weight W: [[0.]]
Initial bias b: [[0.]]
Final weight W after training: [[1.9999]]
Final bias b after training: [[1.0001]]
Final loss: 1.234e-08
First few losses: [27.2, 17.408, 11.14112, 7.1303168, 4.563402752]
In this example, we start with initial guesses for weight ($W = 0$) and bias ($b = 0$), far from the true values ($W = 2$, $b = 1$). Over 100 iterations, gradient descent updates $W$ and $b$ using the MSE gradient, converging close to the true values, and the loss drops significantly. The loss_history list tracks how the loss decreases over iterations, showing the optimization progress.
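Once training finishes, the returned W_final and b_final can be used directly to make predictions on new inputs. A quick sketch, assuming the variables from the example above are still in scope:

# Predict on unseen inputs with the trained parameters
X_new = np.array([[6.0], [7.0]])
y_new_pred = X_new @ W_final + b_final
print(y_new_pred)  # expected to be close to [[13.], [15.]] since y = 2x + 1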
Let’s update our neural_network.py file to include the gradient_descent() function alongside our previous implementations. This keeps our toolkit organized and reusable for training models.
# neural_network.py
import numpy as np
from numpy.typing import NDArray
from typing import Callable, Tuple, List
def normalize(X: NDArray[np.floating]) -> NDArray[np.floating]:
"""
Normalize an array to have mean=0 and std=1.
Args:
X: NumPy array of any shape with floating-point values
Returns:
Normalized array of the same shape with floating-point values
"""
mean = np.mean(X)
std = np.std(X)
if std == 0: # Avoid division by zero
return X - mean
return (X - mean) / std
def matrix_multiply(X: NDArray[np.floating], W: NDArray[np.floating]) -> NDArray[np.floating]:
"""
Perform matrix multiplication between two arrays.
Args:
X: First input array/matrix of shape (m, n) with floating-point values
W: Second input array/matrix of shape (n, p) with floating-point values
Returns:
Result of matrix multiplication, shape (m, p) with floating-point values
"""
return np.matmul(X, W)
def sigmoid(Z: NDArray[np.floating]) -> NDArray[np.floating]:
"""
Compute the sigmoid activation function element-wise.
Args:
Z: Input array of any shape with floating-point values
Returns:
Array of the same shape with sigmoid applied element-wise, values in [0, 1]
"""
return 1 / (1 + np.exp(-Z))
def mse_loss(y_pred: NDArray[np.floating], y: NDArray[np.floating]) -> float:
"""
Compute the Mean Squared Error loss between predicted and true values.
Args:
y_pred: Predicted values, array of shape (n,) or (n,1) with floating-point values
y: True values, array of shape (n,) or (n,1) with floating-point values
Returns:
Mean squared error as a single float
"""
return np.mean((y_pred - y) ** 2)
def binary_cross_entropy(A: NDArray[np.floating], y: NDArray[np.floating]) -> float:
"""
Compute the Binary Cross-Entropy loss between predicted probabilities and true labels.
Args:
A: Predicted probabilities (after sigmoid), array of shape (n,) or (n,1), values in [0, 1]
y: True binary labels, array of shape (n,) or (n,1), values in {0, 1}
Returns:
Binary cross-entropy loss as a single float
"""
epsilon = 1e-15
return -np.mean(y * np.log(A + epsilon) + (1 - y) * np.log(1 - A + epsilon))
def gradient_descent(X: NDArray[np.floating], y: NDArray[np.floating], W: NDArray[np.floating],
b: NDArray[np.floating], lr: float, num_iterations: int,
loss_fn: Callable[[NDArray[np.floating], NDArray[np.floating]], float]) -> Tuple[NDArray[np.floating], NDArray[np.floating], List[float]]:
"""
Perform gradient descent to minimize loss for linear regression.
Args:
X: Input data, shape (n_samples, n_features)
y: True values, shape (n_samples, 1)
W: Initial weights, shape (n_features, 1)
b: Initial bias, shape (1,) or (1,1)
lr: Learning rate, step size for updates
num_iterations: Number of iterations to run gradient descent
loss_fn: Loss function to compute error, e.g., mse_loss
Returns:
Tuple of (updated W, updated b, list of loss values over iterations)
"""
n = X.shape[0]
loss_history = []
for _ in range(num_iterations):
y_pred = X @ W + b
loss = loss_fn(y_pred, y)
loss_history.append(loss)
grad_W = (X.T @ (y_pred - y)) / n
grad_b = np.mean(y_pred - y)
W = W - lr * grad_W
b = b - lr * grad_b
return W, b, loss_history
You can now import this function in other scripts using from neural_network import gradient_descent. This growing library will be central to training neural networks and other models in upcoming posts.
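For example, a minimal training script using the library might look like this (a sketch, assuming neural_network.py sits in the same directory or on your Python path):

import numpy as np
from neural_network import gradient_descent, mse_loss

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([[3.0], [5.0], [7.0]])  # y = 2x + 1
W, b, losses = gradient_descent(X, y, np.zeros((1, 1)), np.zeros((1, 1)),
                                lr=0.1, num_iterations=100, loss_fn=mse_loss)
print(W, b, losses[-1])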
To reinforce your understanding of gradient descent, try these Python-focused coding exercises. They’ll prepare you for training more complex models in future chapters. Run the code and compare outputs to verify your solutions.
Basic Gradient Descent on Synthetic Data
Create synthetic data with X = np.array([[1.0], [2.0], [3.0]]) and y = np.array([[2.0], [4.0], [6.0]]) (true relationship $y = 2x$). Initialize W = np.array([[0.0]]) and b = np.array([[0.0]]). Run gradient_descent() with lr = 0.1 and num_iterations = 50. Print the initial and final values of W, b, and the final loss.
# Your code here
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([[2.0], [4.0], [6.0]])
W_init = np.array([[0.0]])
b_init = np.array([[0.0]])
lr = 0.1
num_iterations = 50
W_final, b_final, losses = gradient_descent(X, y, W_init, b_init, lr, num_iterations, mse_loss)
print("Initial weight W:", W_init)
print("Initial bias b:", b_init)
print("Final weight W:", W_final)
print("Final bias b:", b_final)
print("Final loss:", losses[-1])
Effect of Learning Rate
Using the same data as in Exercise 1, run gradient_descent() with a very small learning rate lr = 0.01 for 50 iterations. Compare the final W, b, and loss to Exercise 1. Observe how a smaller learning rate slows convergence.
# Your code here
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([[2.0], [4.0], [6.0]])
W_init = np.array([[0.0]])
b_init = np.array([[0.0]])
lr = 0.01
num_iterations = 50
W_final, b_final, losses = gradient_descent(X, y, W_init, b_init, lr, num_iterations, mse_loss)
print("Initial weight W:", W_init)
print("Initial bias b:", b_init)
print("Final weight W:", W_final)
print("Final bias b:", b_final)
print("Final loss:", losses[-1])
Effect of Number of Iterations
Using the same data as in Exercise 1, run gradient_descent() with lr = 0.1 but for only 10 iterations. Compare the final W, b, and loss to Exercise 1. Observe how fewer iterations result in less convergence.
# Your code here
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([[2.0], [4.0], [6.0]])
W_init = np.array([[0.0]])
b_init = np.array([[0.0]])
lr = 0.1
num_iterations = 10
W_final, b_final, losses = gradient_descent(X, y, W_init, b_init, lr, num_iterations, mse_loss)
print("Initial weight W:", W_init)
print("Initial bias b:", b_init)
print("Final weight W:", W_final)
print("Final bias b:", b_final)
print("Final loss:", losses[-1])
Training on Noisy Data
Create synthetic data with noise: X = np.array([[1.0], [2.0], [3.0], [4.0]]), y = np.array([[2.1], [4.2], [5.8], [8.3]]) (approximating $y = 2x$ with noise). Initialize W = np.array([[0.0]]) and b = np.array([[0.0]]). Run gradient_descent() with lr = 0.1 for 100 iterations. Observe how the final W and b are close but not exactly equal to the ideal values due to noise.
# Your code here
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([[2.1], [4.2], [5.8], [8.3]])
W_init = np.array([[0.0]])
b_init = np.array([[0.0]])
lr = 0.1
num_iterations = 100
W_final, b_final, losses = gradient_descent(X, y, W_init, b_init, lr, num_iterations, mse_loss)
print("Initial weight W:", W_init)
print("Initial bias b:", b_init)
print("Final weight W:", W_final)
print("Final bias b:", b_final)
print("Final loss:", losses[-1])
These exercises will help you build intuition for how gradient descent optimizes parameters, the impact of hyperparameters like learning rate and iterations, and handling real-world imperfections like noisy data.
Congratulations on mastering gradient descent, a key optimization algorithm in deep learning! In this post, we’ve explored the concepts of gradients and learning rates, understood the math behind updating parameters to minimize loss, and implemented a reusable gradient_descent() function for linear regression. This function will be a building block for training more complex models like neural networks in upcoming modules.
In the next chapter (Part 2.3: Mini-Batch Gradient Descent), we’ll extend gradient descent to handle larger datasets efficiently by processing data in smaller batches, a technique critical for scalability. This will prepare us for training on real datasets like MNIST.
Until then, experiment with the code and exercises above. If you have questions or want to share your solutions, drop a comment below—I’m excited to hear from you. Let’s keep building our deep learning toolkit together!
Next Up: Part 2.3 – Mini-Batch Gradient Descent