Learn Deep Learning with NumPy, Part 2.2: Gradient Descent for Optimization

2025-06-08 · Artintellica

Introduction

Welcome back to our blog series, “Learn Deep Learning with NumPy”! In Part 2.1, we introduced loss functions like Mean Squared Error (MSE) and Binary Cross-Entropy (BCE), which quantify how far a model’s predictions are from the true values. Now, in Part 2.2, we’re taking the next step in Module 2 by exploring gradient descent—a fundamental optimization algorithm that minimizes loss by iteratively updating a model’s parameters.

By the end of this post, you’ll understand the concept of gradients and learning rates, learn how gradient descent works to optimize model parameters, and implement a reusable gradient_descent() function in NumPy for training a simple linear regression model. This function will be a cornerstone of our deep learning toolkit, reused across various models in future chapters. Let’s dive into the math and code behind optimization!


Why Gradient Descent Matters in Deep Learning

Training a neural network (or any machine learning model) involves finding the best set of parameters (weights and biases) that minimize the loss function. Think of the loss as a landscape with hills and valleys—our goal is to find the lowest point (the minimum loss). Gradient descent is the algorithm that guides us downhill toward this minimum by iteratively adjusting the parameters based on the slope of the loss landscape.

Here’s the intuition:

  • Gradients: The gradient of the loss function with respect to a parameter tells us the direction and magnitude of the steepest increase in loss. To minimize loss, we move in the opposite direction (downhill).
  • Learning Rate: This controls how big each step is. Too large, and we might overshoot the minimum; too small, and training takes forever.

Gradient descent is the backbone of training neural networks, used to update weights and biases after each forward pass. In this post, we’ll apply it to a simple linear regression problem, but the concept extends to complex deep learning models.
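To make this concrete before any formal math, here is a tiny sketch of gradient descent on a one-variable function (the function and numbers are invented for illustration): for $L(w) = (w - 3)^2$, the gradient is $\frac{dL}{dw} = 2(w - 3)$, and repeated updates walk $w$ toward the minimum at $w = 3$.

w = 0.0   # start far from the minimum at w = 3
lr = 0.1  # learning rate
for step in range(5):
    grad = 2 * (w - 3)  # dL/dw for L(w) = (w - 3)^2
    w = w - lr * grad   # step in the direction that lowers L
    print(step, round(w, 4))
# w approaches 3: 0.6, 1.08, 1.464, 1.7712, 2.017

Each step shrinks the distance to the minimum by a constant factor here; too large a learning rate would instead overshoot and oscillate.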


Mathematical Foundations: Gradient Descent and MSE Gradient

Gradient Descent Update Rule

Gradient descent updates a model’s parameters by taking small steps in the direction that reduces the loss. For a parameter $W$ (e.g., a weight matrix), the update rule is:

$$W \leftarrow W - \eta \nabla L$$

Where:

  • $W$ is the parameter to update.
  • $\eta$ (eta) is the learning rate, a small positive number (e.g., 0.01) controlling step size.
  • $\nabla L$ is the gradient of the loss function $L$ with respect to $W$, indicating the direction of steepest ascent. We subtract it to move toward the minimum.

This process repeats iteratively—compute predictions, calculate loss, compute gradients, update parameters—until the loss converges to a minimum or we reach a set number of iterations.
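As a one-step worked example (with invented numbers): if the current weight is $W = 0.5$, the learning rate is $\eta = 0.1$, and the gradient is $\nabla L = -3$, then

$$W \leftarrow 0.5 - 0.1 \times (-3) = 0.8$$

The gradient is negative, meaning the loss decreases as $W$ grows, so subtracting the gradient correctly pushes $W$ upward.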

Gradient of Mean Squared Error (MSE)

For a linear regression model with predictions $y_{\text{pred}} = XW + b$, where $X$ is the input matrix, $W$ is the weight matrix, and $b$ is the bias, we use MSE as the loss function:

$$L = \frac{1}{n} \sum_{i=1}^n (y_{\text{pred},i} - y_i)^2$$

The gradient of MSE with respect to $W$ tells us how to adjust $W$ to reduce the loss. For a dataset of $n$ samples, it is:

$$\nabla_W L = \frac{1}{n} X^T (y_{\text{pred}} - y)$$

Where:

  • $X^T$ is the transpose of the input matrix $X$.
  • $(y_{\text{pred}} - y)$ is the error vector (predictions minus true values).

Similarly, the gradient with respect to the bias $b$ (if present) is the average error:

$$\nabla_b L = \frac{1}{n} \sum_{i=1}^n (y_{\text{pred},i} - y_i)$$

These gradients are used in the update rules $W \leftarrow W - \eta \nabla_W L$ and $b \leftarrow b - \eta \nabla_b L$. Before implementing them in NumPy, let’s first derive these formulas to see exactly where they come from.
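First, to make the formulas concrete, here is a minimal sketch (with small made-up arrays) that evaluates both gradients directly in NumPy:

import numpy as np

X = np.array([[1.0], [2.0], [3.0]])  # 3 samples, 1 feature
y = np.array([[2.0], [4.0], [6.0]])  # true values (y = 2x)
W = np.array([[0.5]])                # current weight guess
b = np.array([[0.0]])                # current bias guess
n = X.shape[0]

y_pred = X @ W + b                   # predictions, shape (3, 1)
error = y_pred - y                   # error vector

grad_W = (X.T @ error) / n           # gradient w.r.t. W
grad_b = np.mean(error)              # gradient w.r.t. b

print(grad_W)   # [[-7.]] -- negative, so the update will push W up toward 2
print(grad_b)   # -3.0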


Deriving the Gradients for Mean Squared Error (MSE)

To understand why gradient descent works, it’s essential to derive the gradients of the Mean Squared Error (MSE) loss function with respect to the model parameters: weights ($W$) and bias ($b$). This derivation shows how changes in $W$ and $b$ affect the loss, guiding us on how to adjust them to minimize error. Let’s break this down step by step using calculus.

MSE Loss Function

For a linear regression model, the predicted output is given by $y_{\text{pred}} = XW + b$, where $X$ is the input matrix of shape $(n, d)$ (with $n$ samples and $d$ features), $W$ is the weight matrix of shape $(d, 1)$, and $b$ is the bias term. The true values are denoted by $y$, a vector of shape $(n, 1)$. The MSE loss is defined as the average squared difference between predictions and true values:

$$L = \frac{1}{n} \sum_{i=1}^n (y_{\text{pred},i} - y_i)^2$$

Substituting $y_{\text{pred},i} = (XW + b)_i$, we can write:

$$L = \frac{1}{n} \sum_{i=1}^n \left( (XW + b)_i - y_i \right)^2$$

Our goal is to compute the gradients $\nabla_W L$ and $\nabla_b L$, which are the partial derivatives of $L$ with respect to $W$ and $b$, respectively. These gradients indicate how much the loss changes with small changes in $W$ and $b$, allowing us to update them in the direction that reduces $L$.
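As a quick sanity check on these shapes, a minimal sketch (dimensions chosen arbitrarily, $n = 4$ and $d = 2$):

import numpy as np

n, d = 4, 2                    # 4 samples, 2 features (illustrative)
X = np.random.randn(n, d)      # input matrix, shape (4, 2)
W = np.random.randn(d, 1)      # weight matrix, shape (2, 1)
b = np.zeros((1, 1))           # bias, broadcast across all samples

y_pred = X @ W + b             # (4, 2) @ (2, 1) + (1, 1) -> (4, 1)
print(y_pred.shape)            # (4, 1): one prediction per sample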

Gradient with Respect to Weights ($W$)

To find $\nabla_W L$, we compute the partial derivative of $L$ with respect to each element of $W$. Let’s consider the loss for a single sample first and then generalize. For the $i$-th sample, the squared error is:

$$L_i = \left( (XW + b)_i - y_i \right)^2$$

Since $(XW + b)_i$ is the $i$-th element of the prediction vector, and $XW$ is a linear combination, we can express it as $\sum_{j=1}^d X_{i,j} W_j + b$, where $X_{i,j}$ is the $j$-th feature of the $i$-th sample and $W_j$ is the $j$-th weight. The partial derivative of $L_i$ with respect to $W_j$ (a specific weight) is:

$$\frac{\partial L_i}{\partial W_j} = 2 \left( (XW + b)_i - y_i \right) \cdot \frac{\partial}{\partial W_j} \left( (XW + b)_i \right)$$

Since $\frac{\partial}{\partial W_j} \left( (XW + b)_i \right) = X_{i,j}$ (as only the term involving $W_j$ depends on it, and $b$ is independent of $W_j$), this simplifies to:

$$\frac{\partial L_i}{\partial W_j} = 2 \left( (XW + b)_i - y_i \right) \cdot X_{i,j}$$

Now, for the total loss $L = \frac{1}{n} \sum_{i=1}^n L_i$, the partial derivative with respect to $W_j$ is:

$$\frac{\partial L}{\partial W_j} = \frac{1}{n} \sum_{i=1}^n 2 \left( (XW + b)_i - y_i \right) \cdot X_{i,j}$$

This can be written in vector form for all weights $W$. Notice that for each sample $i$, the term $2 \left( (XW + b)_i - y_i \right)$ is the error for that sample, and multiplying by $X_{i,j}$ across all features $j$ corresponds to the $i$-th row of $X$. Summing over all samples $i$, this is equivalent to a matrix multiplication. Thus, the gradient for the entire weight vector is:

$$\nabla_W L = \frac{2}{n} X^T (y_{\text{pred}} - y)$$

In practice, the factor of 2 is often absorbed into the learning rate or omitted for simplicity (as it scales the step size uniformly), so we commonly use:

$$\nabla_W L = \frac{1}{n} X^T (y_{\text{pred}} - y)$$

This is the formula implemented in our gradient_descent() function. It shows that the gradient for $W$ is proportional to the correlation between the input features ($X$) and the prediction errors ($y_{\text{pred}} - y$), guiding us to adjust $W$ to reduce those errors.
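If the jump from the per-element sum to the matrix form feels abrupt, this small sketch (with made-up values) confirms numerically that the explicit double loop from the derivation and the X.T @ error expression agree:

import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # 3 samples, 2 features
error = np.array([[0.5], [-1.0], [2.0]])            # (y_pred - y), shape (3, 1)
n = X.shape[0]

# Per-element sum, straight from the derivation (factor of 2 omitted)
grad_loop = np.zeros((2, 1))
for j in range(2):              # over features
    for i in range(n):          # over samples
        grad_loop[j, 0] += error[i, 0] * X[i, j]
grad_loop /= n

# Vectorized matrix form
grad_matrix = (X.T @ error) / n

print(np.allclose(grad_loop, grad_matrix))  # True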

Gradient with Respect to Bias ($b$)

Next, we derive the gradient for the bias term $b$. The bias is added to every prediction, so for the $i$-th sample, the partial derivative of $L_i$ with respect to $b$ is:

$$\frac{\partial L_i}{\partial b} = 2 \left( (XW + b)_i - y_i \right) \cdot \frac{\partial}{\partial b} \left( (XW + b)_i \right)$$

Since $\frac{\partial}{\partial b} \left( (XW + b)_i \right) = 1$ (the prediction increases by 1 for a unit increase in $b$), this simplifies to:

$$\frac{\partial L_i}{\partial b} = 2 \left( (XW + b)_i - y_i \right)$$

For the total loss $L = \frac{1}{n} \sum_{i=1}^n L_i$, the partial derivative with respect to $b$ is:

$$\frac{\partial L}{\partial b} = \frac{1}{n} \sum_{i=1}^n 2 \left( (XW + b)_i - y_i \right) = \frac{2}{n} \sum_{i=1}^n \left( y_{\text{pred},i} - y_i \right)$$

Again, the factor of 2 is often omitted for simplicity, so we use:

$$\nabla_b L = \frac{1}{n} \sum_{i=1}^n \left( y_{\text{pred},i} - y_i \right) = \text{mean}(y_{\text{pred}} - y)$$

This tells us that the gradient for $b$ is simply the average error across all samples. If the predictions are systematically too high (positive average error), $b$ should decrease, and if too low (negative average error), $b$ should increase.
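For instance (numbers invented for illustration), if the average error is $+0.5$ and $\eta = 0.1$, the update is

$$b \leftarrow b - 0.1 \times 0.5 = b - 0.05$$

so the bias drops slightly, pulling the systematically high predictions back down.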

Why These Gradients Work

The gradients $\nabla_W L$ and $\nabla_b L$ point in the direction of the steepest increase in loss. By subtracting them (scaled by the learning rate $\eta$) in the gradient descent update rule, we move in the direction of steepest decrease, iteratively reducing the loss. This calculus-based approach ensures that each update to $W$ and $b$ brings our predictions closer to the true values, minimizing MSE.

Understanding this derivation provides insight into why gradient descent is effective and prepares us to derive gradients for more complex loss functions and models (like neural networks) in future posts. For now, these formulae for linear regression with MSE are the foundation of our optimization process.
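A good way to convince yourself the derivation is correct is a finite-difference check: nudge a parameter by a tiny $\epsilon$, measure how the loss changes, and compare against the analytic gradient. A minimal sketch for the single-weight case (the data and $\epsilon$ are arbitrary choices; with more features you would perturb one element of $W$ at a time):

import numpy as np

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([[2.0], [4.0], [6.0]])
W = np.array([[0.5]])
b = np.array([[0.1]])
n = X.shape[0]

def mse(W, b):
    return np.mean((X @ W + b - y) ** 2)

# Analytic gradients, keeping the factor of 2 to match the exact derivative
error = X @ W + b - y
grad_W_analytic = 2 * (X.T @ error) / n
grad_b_analytic = 2 * np.mean(error)

# Central finite differences
eps = 1e-6
grad_W_numeric = (mse(W + eps, b) - mse(W - eps, b)) / (2 * eps)
grad_b_numeric = (mse(W, b + eps) - mse(W, b - eps)) / (2 * eps)

print(grad_W_analytic.item(), grad_W_numeric)  # both approximately -13.6
print(grad_b_analytic, grad_b_numeric)         # both approximately -5.8

The factor of 2 is kept here because we are checking the exact derivative; gradient_descent() below drops it, which only rescales the effective step size.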


Implementing Gradient Descent with NumPy

We’ll create a reusable gradient_descent() function to train a linear regression model by minimizing MSE loss. This function will update parameters iteratively and be adaptable for more complex models later. As always, we’ll include type hints for clarity.

Gradient Descent Implementation

Here’s the implementation, along with an example using synthetic data where the true relationship is $y = 2x + 1$:

import numpy as np
from numpy.typing import NDArray
from typing import Callable

def gradient_descent(X: NDArray[np.floating], y: NDArray[np.floating], W: NDArray[np.floating],
                     b: NDArray[np.floating], lr: float, num_iterations: int,
                     loss_fn: Callable[[NDArray[np.floating], NDArray[np.floating]], float]) -> tuple[NDArray[np.floating], NDArray[np.floating], list[float]]:
    """
    Perform gradient descent to minimize loss for linear regression.
    Args:
        X: Input data, shape (n_samples, n_features)
        y: True values, shape (n_samples, 1)
        W: Initial weights, shape (n_features, 1)
        b: Initial bias, shape (1,) or (1,1)
        lr: Learning rate, step size for updates
        num_iterations: Number of iterations to run gradient descent
        loss_fn: Loss function to compute error, e.g., mse_loss
    Returns:
        Tuple of (updated W, updated b, list of loss values over iterations)
    """
    n = X.shape[0]
    loss_history = []

    for _ in range(num_iterations):
        # Forward pass: Compute predictions
        y_pred = X @ W + b
        # Compute loss
        loss = loss_fn(y_pred, y)
        loss_history.append(loss)
        # Compute gradients for W and b (for MSE loss)
        grad_W = (X.T @ (y_pred - y)) / n
        grad_b = np.mean(y_pred - y)
        # Update parameters
        W = W - lr * grad_W
        b = b - lr * grad_b

    return W, b, loss_history

# Example: Synthetic data (y = 2x + 1)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # Input (5 samples, 1 feature)
y = np.array([[3.0], [5.0], [7.0], [9.0], [11.0]])  # True values (y = 2x + 1)
W_init = np.array([[0.0]])  # Initial weight (start far from true value 2.0)
b_init = np.array([[0.0]])  # Initial bias (start far from true value 1.0)
lr = 0.1  # Learning rate
num_iterations = 100  # Number of iterations

# Define mse_loss (same as in our library)
def mse_loss(y_pred: NDArray[np.floating], y: NDArray[np.floating]) -> float:
    return np.mean((y_pred - y) ** 2)

# Run gradient descent
W_final, b_final, losses = gradient_descent(X, y, W_init, b_init, lr, num_iterations, mse_loss)

print("Initial weight W:", W_init)
print("Initial bias b:", b_init)
print("Final weight W after training:", W_final)
print("Final bias b after training:", b_final)
print("Final loss:", losses[-1])
print("First few losses:", losses[:5])

Output (values are approximate):

Initial weight W: [[0.]]
Initial bias b: [[0.]]
Final weight W after training: [[2.0209]]
Final bias b after training: [[0.9247]]
Final loss: 0.00103
First few losses: [57.0, 1.94, 0.0932, 0.0303, 0.0273]

In this example, we start with initial guesses for weight ($W = 0$) and bias ($b = 0$), far from the true values ($W = 2$, $b = 1$). Over 100 iterations, gradient descent updates $W$ and $b$ using the MSE gradient: the loss falls from 57 to about $10^{-3}$, the weight lands very close to 2, and the bias climbs toward 1 (the bias converges more slowly, so more iterations would bring it closer still). The loss_history list tracks how the loss decreases over iterations, showing the optimization progress.
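To see this progress visually, you can plot the returned loss history. A minimal sketch, assuming matplotlib is installed (it is not part of our library):

import matplotlib.pyplot as plt

plt.plot(losses)        # loss_history returned by gradient_descent()
plt.xlabel("Iteration")
plt.ylabel("MSE loss")
plt.yscale("log")       # log scale makes the rapid early drop readable
plt.title("Gradient descent convergence")
plt.show()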


Organizing Our Growing Library

Let’s update our neural_network.py file to include the gradient_descent() function alongside our previous implementations. This keeps our toolkit organized and reusable for training models.

# neural_network.py
import numpy as np
from numpy.typing import NDArray
from typing import Callable, Tuple, List

def normalize(X: NDArray[np.floating]) -> NDArray[np.floating]:
    """
    Normalize an array to have mean=0 and std=1.
    Args:
        X: NumPy array of any shape with floating-point values
    Returns:
        Normalized array of the same shape with floating-point values
    """
    mean = np.mean(X)
    std = np.std(X)
    if std == 0:  # Avoid division by zero
        return X - mean
    return (X - mean) / std

def matrix_multiply(X: NDArray[np.floating], W: NDArray[np.floating]) -> NDArray[np.floating]:
    """
    Perform matrix multiplication between two arrays.
    Args:
        X: First input array/matrix of shape (m, n) with floating-point values
        W: Second input array/matrix of shape (n, p) with floating-point values
    Returns:
        Result of matrix multiplication, shape (m, p) with floating-point values
    """
    return np.matmul(X, W)

def sigmoid(Z: NDArray[np.floating]) -> NDArray[np.floating]:
    """
    Compute the sigmoid activation function element-wise.
    Args:
        Z: Input array of any shape with floating-point values
    Returns:
        Array of the same shape with sigmoid applied element-wise, values in [0, 1]
    """
    return 1 / (1 + np.exp(-Z))

def mse_loss(y_pred: NDArray[np.floating], y: NDArray[np.floating]) -> float:
    """
    Compute the Mean Squared Error loss between predicted and true values.
    Args:
        y_pred: Predicted values, array of shape (n,) or (n,1) with floating-point values
        y: True values, array of shape (n,) or (n,1) with floating-point values
    Returns:
        Mean squared error as a single float
    """
    return np.mean((y_pred - y) ** 2)

def binary_cross_entropy(A: NDArray[np.floating], y: NDArray[np.floating]) -> float:
    """
    Compute the Binary Cross-Entropy loss between predicted probabilities and true labels.
    Args:
        A: Predicted probabilities (after sigmoid), array of shape (n,) or (n,1), values in [0, 1]
        y: True binary labels, array of shape (n,) or (n,1), values in {0, 1}
    Returns:
        Binary cross-entropy loss as a single float
    """
    epsilon = 1e-15
    return -np.mean(y * np.log(A + epsilon) + (1 - y) * np.log(1 - A + epsilon))

def gradient_descent(X: NDArray[np.floating], y: NDArray[np.floating], W: NDArray[np.floating],
                     b: NDArray[np.floating], lr: float, num_iterations: int,
                     loss_fn: Callable[[NDArray[np.floating], NDArray[np.floating]], float]) -> Tuple[NDArray[np.floating], NDArray[np.floating], List[float]]:
    """
    Perform gradient descent to minimize loss for linear regression.
    Args:
        X: Input data, shape (n_samples, n_features)
        y: True values, shape (n_samples, 1)
        W: Initial weights, shape (n_features, 1)
        b: Initial bias, shape (1,) or (1,1)
        lr: Learning rate, step size for updates
        num_iterations: Number of iterations to run gradient descent
        loss_fn: Loss function to compute error, e.g., mse_loss
    Returns:
        Tuple of (updated W, updated b, list of loss values over iterations)
    """
    n = X.shape[0]
    loss_history = []

    for _ in range(num_iterations):
        y_pred = X @ W + b
        loss = loss_fn(y_pred, y)
        loss_history.append(loss)
        grad_W = (X.T @ (y_pred - y)) / n
        grad_b = np.mean(y_pred - y)
        W = W - lr * grad_W
        b = b - lr * grad_b

    return W, b, loss_history

You can now import this function in other scripts using from neural_network import gradient_descent. This growing library will be central to training neural networks and other models in upcoming posts.
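For example, a quick smoke test from another script (assuming neural_network.py is importable; the script name train_linear.py is hypothetical):

# train_linear.py (hypothetical file name)
import numpy as np
from neural_network import gradient_descent, mse_loss

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([[2.0], [4.0], [6.0]])
W, b, losses = gradient_descent(X, y, np.zeros((1, 1)), np.zeros((1, 1)),
                                lr=0.1, num_iterations=50, loss_fn=mse_loss)
print(W, b, losses[-1])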


Exercises: Practice with Gradient Descent

To reinforce your understanding of gradient descent, try these Python-focused coding exercises. They’ll prepare you for training more complex models in future chapters. Run the code and compare outputs to verify your solutions.

  1. Basic Gradient Descent on Synthetic Data
    Create synthetic data with X = np.array([[1.0], [2.0], [3.0]]) and y = np.array([[2.0], [4.0], [6.0]]) (true relationship $y = 2x$). Initialize W = np.array([[0.0]]) and b = np.array([[0.0]]). Run gradient_descent() with lr = 0.1 and num_iterations = 50. Print the initial and final values of W, b, and the final loss.

    # Your code here
    import numpy as np
    from neural_network import gradient_descent, mse_loss  # shared by all four exercises

    X = np.array([[1.0], [2.0], [3.0]])
    y = np.array([[2.0], [4.0], [6.0]])
    W_init = np.array([[0.0]])
    b_init = np.array([[0.0]])
    lr = 0.1
    num_iterations = 50
    W_final, b_final, losses = gradient_descent(X, y, W_init, b_init, lr, num_iterations, mse_loss)
    print("Initial weight W:", W_init)
    print("Initial bias b:", b_init)
    print("Final weight W:", W_final)
    print("Final bias b:", b_final)
    print("Final loss:", losses[-1])
    
  2. Effect of Learning Rate
    Using the same data as in Exercise 1, run gradient_descent() with a very small learning rate lr = 0.01 for 50 iterations. Compare the final W, b, and loss to Exercise 1. Observe how a smaller learning rate slows convergence.

    # Your code here
    X = np.array([[1.0], [2.0], [3.0]])
    y = np.array([[2.0], [4.0], [6.0]])
    W_init = np.array([[0.0]])
    b_init = np.array([[0.0]])
    lr = 0.01
    num_iterations = 50
    W_final, b_final, losses = gradient_descent(X, y, W_init, b_init, lr, num_iterations, mse_loss)
    print("Initial weight W:", W_init)
    print("Initial bias b:", b_init)
    print("Final weight W:", W_final)
    print("Final bias b:", b_final)
    print("Final loss:", losses[-1])
    
  3. Effect of Number of Iterations
    Using the same data as in Exercise 1, run gradient_descent() with lr = 0.1 but for only 10 iterations. Compare the final W, b, and loss to Exercise 1. Observe how fewer iterations result in less convergence.

    # Your code here
    X = np.array([[1.0], [2.0], [3.0]])
    y = np.array([[2.0], [4.0], [6.0]])
    W_init = np.array([[0.0]])
    b_init = np.array([[0.0]])
    lr = 0.1
    num_iterations = 10
    W_final, b_final, losses = gradient_descent(X, y, W_init, b_init, lr, num_iterations, mse_loss)
    print("Initial weight W:", W_init)
    print("Initial bias b:", b_init)
    print("Final weight W:", W_final)
    print("Final bias b:", b_final)
    print("Final loss:", losses[-1])
    
  4. Training on Noisy Data
    Create synthetic data with noise: X = np.array([[1.0], [2.0], [3.0], [4.0]]), y = np.array([[2.1], [4.2], [5.8], [8.3]]) (approximating $y = 2x$ with noise). Initialize W = np.array([[0.0]]) and b = np.array([[0.0]]). Run gradient_descent() with lr = 0.1 for 100 iterations. Observe how the final W and b are close but not exactly equal to the ideal values due to noise.

    # Your code here
    X = np.array([[1.0], [2.0], [3.0], [4.0]])
    y = np.array([[2.1], [4.2], [5.8], [8.3]])
    W_init = np.array([[0.0]])
    b_init = np.array([[0.0]])
    lr = 0.1
    num_iterations = 100
    W_final, b_final, losses = gradient_descent(X, y, W_init, b_init, lr, num_iterations, mse_loss)
    print("Initial weight W:", W_init)
    print("Initial bias b:", b_init)
    print("Final weight W:", W_final)
    print("Final bias b:", b_final)
    print("Final loss:", losses[-1])
    

These exercises will help you build intuition for how gradient descent optimizes parameters, the impact of hyperparameters like learning rate and iterations, and handling real-world imperfections like noisy data.


Closing Thoughts

Congratulations on mastering gradient descent, a key optimization algorithm in deep learning! In this post, we’ve explored the concepts of gradients and learning rates, understood the math behind updating parameters to minimize loss, and implemented a reusable gradient_descent() function for linear regression. This function will be a building block for training more complex models like neural networks in upcoming modules.

In the next chapter (Part 2.3: Mini-Batch Gradient Descent), we’ll extend gradient descent to handle larger datasets efficiently by processing data in smaller batches, a technique critical for scalability. This will prepare us for training on real datasets like MNIST.

Until then, experiment with the code and exercises above. If you have questions or want to share your solutions, drop a comment below—I’m excited to hear from you. Let’s keep building our deep learning toolkit together!

Next Up: Part 2.3 – Mini-Batch Gradient Descent


