Linear Algebra for Machine Learning, Part 15: Gradient Descent in Linear Models

2025-06-04 · Artintellica

Welcome back to our series on linear algebra for machine learning! In this post, we’re focusing on Gradient Descent in Linear Models, a fundamental optimization technique used to train models by iteratively updating parameters to minimize a loss function. Building on our previous exploration of least squares and linear regression, we’ll dive deeper into how gradient descent leverages matrix calculus and vectorization for efficient computation. Whether you’re fitting a simple linear model or scaling to larger datasets, mastering gradient descent is crucial. Let’s explore the math, intuition, and implementation with Python code, visualizations, and hands-on exercises.

What Is Gradient Descent?

Gradient Descent (GD) is an iterative optimization algorithm used to find the parameters of a model that minimize a loss function. In the context of linear models, we aim to find the weights $w$ (and optionally a bias $b$) for the equation $y = Xw + b$ that best predict the target variable $y$ given the input data $X \in \mathbb{R}^{n \times d}$ (with $n$ samples and $d$ features).

The typical loss function for linear regression is the Mean Squared Error (MSE):

$$\text{Loss}(w) = \frac{1}{n} \| y - Xw \|_2^2 = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - (Xw)_i \right)^2$$

Gradient Descent works by computing the gradient of the loss with respect to the parameters and updating the parameters in the direction that reduces the loss. The gradient of the MSE loss with respect to ww is derived using matrix calculus:

$$\nabla_w \text{Loss} = \frac{2}{n} X^T (Xw - y)$$

The update rule for gradient descent is:

$$w \leftarrow w - \eta \, \nabla_w \text{Loss}$$

where $\eta$ is the learning rate, a hyperparameter controlling the step size of each update. If a bias term is included (via a column of ones in $X$), it is updated as part of $w$ in the same way.
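
As a quick sanity check on the matrix-calculus result, we can compare the analytic gradient against a central finite-difference approximation on a tiny random problem. This is a minimal sketch; the problem size, seed, and tolerance are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w = rng.standard_normal(d)

def loss(w):
    return np.mean((y - X @ w) ** 2)

# Analytic gradient: (2/n) * X^T (Xw - y)
grad_analytic = (2 / n) * X.T @ (X @ w - y)

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_numeric = np.zeros(d)
for i in range(d):
    e = np.zeros(d)
    e[i] = eps
    grad_numeric[i] = (loss(w + e) - loss(w - e)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))  # True

Because the MSE loss is quadratic in $w$, the central difference is essentially exact here, so the two gradients agree to floating-point precision.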

Variants of Gradient Descent

  1. Batch Gradient Descent: Uses the entire dataset to compute the gradient at each step. The updates are accurate but expensive for large datasets.
  2. Stochastic Gradient Descent (SGD): Uses a single randomly chosen sample per update. Each step is cheap but noisy, yet in practice SGD often reaches a good solution in less wall-clock time.
  3. Mini-Batch Gradient Descent: A compromise that uses a small batch of samples for each update, balancing the speed of SGD with the stability of batch GD (see the sketch just below).
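
To make the mini-batch variant concrete, here is a minimal, self-contained sketch of one epoch of mini-batch updates on synthetic data. The batch size of 8 and all other constants are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 2
X = rng.standard_normal((n, d))
y = X @ np.array([1.5, -2.0]) + 0.1 * rng.standard_normal(n)

w = np.zeros(d)
eta, batch_size = 0.05, 8

# One epoch of mini-batch gradient descent: shuffle, then sweep the batches
perm = rng.permutation(n)
for start in range(0, n, batch_size):
    idx = perm[start:start + batch_size]
    Xb, yb = X[idx], y[idx]
    # Same MSE gradient as batch GD, averaged over the mini-batch only
    grad = (2 / len(idx)) * Xb.T @ (Xb @ w - yb)
    w -= eta * grad

print("Weights after one epoch:", w)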

Why Does Gradient Descent Matter in Machine Learning?

Gradient Descent is a cornerstone of machine learning optimization for several reasons:

  1. Scalability: Unlike closed-form solutions like the normal equations, gradient descent can handle large datasets by processing data in batches.
  2. Flexibility: It applies to a wide range of loss functions and models beyond linear regression, including neural networks.
  3. Foundation for Advanced Optimization: Many modern optimization algorithms (e.g., Adam, RMSprop) are variants of gradient descent.
  4. Matrix Efficiency: Vectorized implementations using linear algebra make gradient descent computationally efficient.

Understanding the linear algebra behind gradient computation and vectorization is key to implementing efficient and scalable ML models.
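To see point 4 in action, we can time the vectorized gradient computation against an equivalent Python loop over samples. This is a minimal sketch; exact timings will vary by machine, but the vectorized form is typically orders of magnitude faster.

import numpy as np
import time

rng = np.random.default_rng(0)
n, d = 50_000, 20
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w = rng.standard_normal(d)

# Vectorized gradient: one pass of matrix arithmetic
t0 = time.perf_counter()
g_vec = (2 / n) * X.T @ (X @ w - y)
t1 = time.perf_counter()

# Looped gradient: accumulate per-sample contributions
g_loop = np.zeros(d)
for i in range(n):
    g_loop += 2 * (X[i] @ w - y[i]) * X[i]
g_loop /= n
t2 = time.perf_counter()

print(np.allclose(g_vec, g_loop))  # True: same gradient either way
print(f"vectorized: {t1 - t0:.4f}s, loop: {t2 - t1:.4f}s")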

Implementing Gradient Descent in Python

Let’s implement batch gradient descent and stochastic gradient descent (SGD) for linear regression using NumPy. We’ll also visualize the loss over iterations to understand convergence behavior.

Example 1: Batch Gradient Descent with NumPy

import numpy as np
import matplotlib.pyplot as plt

# Generate a simple 2D dataset (1 feature + bias)
np.random.seed(42)
n_samples = 50
X = np.random.randn(n_samples, 1) * 2  # Feature
y = 3 * X[:, 0] + 2 + np.random.randn(n_samples) * 0.5  # Target with noise
print("Dataset shape:", X.shape, y.shape)

# Add a column of ones to X for the bias term
X_with_bias = np.hstack([np.ones((n_samples, 1)), X])

# Initialize weights
w_init = np.zeros(2)
eta = 0.01  # Learning rate
n_iterations = 100

# Batch Gradient Descent
w = w_init.copy()
losses = []
for _ in range(n_iterations):
    # Compute gradient: (2/n) * X^T * (Xw - y)
    gradient = (2 / n_samples) * X_with_bias.T @ (X_with_bias @ w - y)
    # Update weights
    w = w - eta * gradient
    # Compute and store loss (MSE)
    loss = np.mean((X_with_bias @ w - y) ** 2)
    losses.append(loss)

print("Final weights (bias, slope):", w)

# Plot loss over iterations
plt.figure(figsize=(8, 6))
plt.plot(range(n_iterations), losses, label='Loss (MSE)')
plt.xlabel('Iteration')
plt.ylabel('Mean Squared Error')
plt.title('Loss Over Iterations (Batch Gradient Descent)')
plt.legend()
plt.grid(True)
plt.show()

Output (abbreviated):

Dataset shape: (50, 1) (50,)
Final weights (bias, slope): [2.0115 2.9867]

This code implements batch gradient descent on a simple 2D dataset. It computes the gradient using the entire dataset at each step, updates the weights, and tracks the mean squared error (MSE) loss over iterations. The plot shows the loss decreasing as the algorithm converges to weights close to the true values (bias=2, slope=3).
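As a sanity check, the weights found by gradient descent should match the closed-form least-squares solution we derived earlier in the series. A minimal check using np.linalg.lstsq, assuming X_with_bias, y, and w from Example 1 are still in scope:

# Closed-form least-squares solution for comparison
w_closed, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)
print("Closed-form weights (bias, slope):", w_closed)
print("Gradient descent weights:         ", w)

With enough iterations and a well-chosen learning rate, the two should agree to several decimal places.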

Example 2: Stochastic Gradient Descent (SGD) with NumPy

# Stochastic Gradient Descent for the same dataset
w = w_init.copy()
losses_sgd = []
n_iterations_sgd = 500  # More iterations since updates are noisier

for _ in range(n_iterations_sgd):
    # Randomly select one sample
    idx = np.random.randint(0, n_samples)
    X_sample = X_with_bias[idx:idx+1]  # Shape (1, 2)
    y_sample = y[idx:idx+1]  # Shape (1,)
    # Compute gradient for single sample: 2 * X^T * (Xw - y)
    gradient = 2 * X_sample.T @ (X_sample @ w - y_sample)
    # Update weights
    w = w - eta * gradient
    # Compute and store loss on full dataset for monitoring
    loss = np.mean((X_with_bias @ w - y) ** 2)
    losses_sgd.append(loss)

print("Final weights with SGD (bias, slope):", w)

# Plot loss over iterations
plt.figure(figsize=(8, 6))
plt.plot(range(n_iterations_sgd), losses_sgd, label='Loss (MSE)', alpha=0.5)
plt.xlabel('Iteration')
plt.ylabel('Mean Squared Error')
plt.title('Loss Over Iterations (Stochastic Gradient Descent)')
plt.legend()
plt.grid(True)
plt.show()

Output (abbreviated):

Final weights with SGD (bias, slope): [2.0153 2.9841]

This implements stochastic gradient descent (SGD), updating the weights from a single randomly chosen sample per iteration. The loss curve is noisier than batch GD's because each update sees only one sample, but given more iterations SGD converges to similar weights.
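
A common refinement, not shown above, is to decay the learning rate over iterations so that SGD's noise shrinks as it approaches the minimum. A minimal sketch, reusing X_with_bias, y, w_init, n_samples, and n_iterations_sgd from the examples above; the $\eta_t = \eta_0 / (1 + k t)$ schedule and its constants are arbitrary illustrative choices.

# SGD with a decaying learning rate: eta_t = eta0 / (1 + k * t)
w = w_init.copy()
eta0, k = 0.05, 0.01
for t in range(n_iterations_sgd):
    idx = np.random.randint(0, n_samples)
    X_sample = X_with_bias[idx:idx+1]
    y_sample = y[idx:idx+1]
    gradient = 2 * X_sample.T @ (X_sample @ w - y_sample)
    eta_t = eta0 / (1 + k * t)  # step size shrinks as t grows
    w = w - eta_t * gradient

print("Final weights with decaying-rate SGD:", w)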

Exercises

Here are six exercises to deepen your understanding of gradient descent in linear models. Each exercise requires writing Python code to explore concepts and applications in machine learning.

  1. Batch Gradient Descent on 2D Data: Generate a new 2D dataset (1 feature, 40 samples) with NumPy, add a bias column, and implement batch gradient descent to fit a linear model. Plot the loss over 100 iterations and print the final weights.
  2. Stochastic Gradient Descent Comparison: Using the dataset from Exercise 1, implement stochastic gradient descent (SGD) with 500 iterations. Plot the loss over iterations and compare the final weights with batch gradient descent from Exercise 1.
  3. Learning Rate Impact: Using the dataset from Exercise 1, run batch gradient descent with three different learning rates (e.g., 0.001, 0.01, 0.1). Plot the loss curves for each rate on the same graph to observe convergence behavior.
  4. Mini-Batch Gradient Descent: Implement mini-batch gradient descent on the dataset from Exercise 1, using a batch size of 5. Run for 200 iterations, plot the loss, and compare the final weights with batch and stochastic GD from Exercises 1 and 2.
  5. Multiple Features with SGD: Generate a synthetic 4D dataset (3 features + bias, 100 samples) with NumPy. Implement stochastic gradient descent to fit a linear model, and print the final weights compared to the true weights used to generate the data.
  6. Real Dataset with Mini-Batch GD: Load the California Housing dataset from scikit-learn (sklearn.datasets.fetch_california_housing), select 3 features, and implement mini-batch gradient descent (batch size=32) to fit a linear model. Plot the loss over 100 epochs (full passes through the data) and print the final MSE on a test split.

Conclusion

Gradient descent gives linear models a powerful and scalable approach to optimization, leveraging matrix calculus and vectorized operations for efficient parameter updates. By implementing batch and stochastic gradient descent with NumPy, we’ve seen how these methods converge to optimal solutions and how their behavior varies with factors like learning rate and batch size. These concepts are foundational for training more complex models in machine learning.

In the next post, we’ll explore Neural Networks as Matrix Functions, connecting linear algebra to deep learning by examining how layers and parameter updates are expressed through matrix operations. Stay tuned, and happy learning!


