Welcome back to our series on linear algebra for machine learning! In this post, we’re focusing on Gradient Descent in Linear Models, a fundamental optimization technique used to train models by iteratively updating parameters to minimize a loss function. Building on our previous exploration of least squares and linear regression, we’ll dive deeper into how gradient descent leverages matrix calculus and vectorization for efficient computation. Whether you’re fitting a simple linear model or scaling to larger datasets, mastering gradient descent is crucial. Let’s explore the math, intuition, and implementation with Python code, visualizations, and hands-on exercises.
Gradient Descent (GD) is an iterative optimization algorithm used to find the parameters of a model that minimize a loss function. In the context of linear models, we aim to find the weights $\mathbf{w}$ (and optionally a bias $b$) for the equation $\hat{\mathbf{y}} = X\mathbf{w} + b$ that best predicts the target variable $\mathbf{y} \in \mathbb{R}^n$ given the input data $X \in \mathbb{R}^{n \times d}$ (with $n$ samples and $d$ features).
The typical loss function for linear regression is the Mean Squared Error (MSE):

$$L(\mathbf{w}) = \frac{1}{n} \lVert X\mathbf{w} - \mathbf{y} \rVert^2 = \frac{1}{n} (X\mathbf{w} - \mathbf{y})^\top (X\mathbf{w} - \mathbf{y})$$
Gradient Descent works by computing the gradient of the loss with respect to the parameters and updating the parameters in the direction that reduces the loss. The gradient of the MSE loss with respect to $\mathbf{w}$ is derived using matrix calculus:

$$\nabla_{\mathbf{w}} L = \frac{2}{n} X^\top (X\mathbf{w} - \mathbf{y})$$
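If the matrix-calculus step feels like a leap, here is a sketch of the derivation: expand the quadratic form, then apply the standard identities $\nabla_{\mathbf{w}}(\mathbf{a}^\top \mathbf{w}) = \mathbf{a}$ and $\nabla_{\mathbf{w}}(\mathbf{w}^\top A \mathbf{w}) = 2A\mathbf{w}$ for symmetric $A$ (here $A = X^\top X$):

$$\begin{aligned} L(\mathbf{w}) &= \frac{1}{n}(X\mathbf{w} - \mathbf{y})^\top (X\mathbf{w} - \mathbf{y}) = \frac{1}{n}\left(\mathbf{w}^\top X^\top X \mathbf{w} - 2\,\mathbf{y}^\top X \mathbf{w} + \mathbf{y}^\top \mathbf{y}\right) \\ \nabla_{\mathbf{w}} L &= \frac{1}{n}\left(2\,X^\top X \mathbf{w} - 2\,X^\top \mathbf{y}\right) = \frac{2}{n}\, X^\top (X\mathbf{w} - \mathbf{y}) \end{aligned}$$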
The update rule for gradient descent is:

$$\mathbf{w} \leftarrow \mathbf{w} - \eta \, \nabla_{\mathbf{w}} L$$
where $\eta$ is the learning rate, a hyperparameter controlling the step size of each update. If a bias term is included (via a column of ones in $X$), it is updated similarly as part of $\mathbf{w}$.
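To build intuition for how $\eta$ affects convergence, consider the one-dimensional loss $L(w) = w^2$, whose gradient is $2w$. The update becomes $w \leftarrow w - 2\eta w = (1 - 2\eta)\,w$, so the iterates shrink toward the minimum when $0 < \eta < 1$ and grow in magnitude (diverge) when $\eta > 1$. The same trade-off, scaled by the curvature of the loss, governs the multivariate case.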
Gradient Descent is a cornerstone of machine learning optimization for several reasons: it scales to datasets where computing a closed-form solution such as the normal equations is too expensive, it applies to models (such as neural networks) that have no closed-form solution at all, and its stochastic and mini-batch variants power virtually all modern deep learning training.
Understanding the linear algebra behind gradient computation and vectorization is key to implementing efficient and scalable ML models.
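To make the vectorization point concrete, here is a minimal sketch comparing a per-sample Python loop with the equivalent single matrix expression. Both compute the same MSE gradient, but the vectorized form delegates the work to optimized BLAS routines and is dramatically faster on large arrays:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 5))
y = rng.standard_normal(10_000)
w = np.zeros(5)

# Loop version: accumulate per-sample gradients one at a time
grad_loop = np.zeros_like(w)
for i in range(X.shape[0]):
    grad_loop += 2 * X[i] * (X[i] @ w - y[i])
grad_loop /= X.shape[0]

# Vectorized version: one matrix-vector product over the whole dataset
grad_vec = (2 / X.shape[0]) * X.T @ (X @ w - y)

print(np.allclose(grad_loop, grad_vec))  # True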
Let’s implement batch gradient descent and stochastic gradient descent (SGD) for linear regression using NumPy. We’ll also visualize the loss over iterations to understand convergence behavior.
import numpy as np
import matplotlib.pyplot as plt
# Generate a simple 2D dataset (1 feature + bias)
np.random.seed(42)
n_samples = 50
X = np.random.randn(n_samples, 1) * 2 # Feature
y = 3 * X[:, 0] + 2 + np.random.randn(n_samples) * 0.5 # Target with noise
print("Dataset shape:", X.shape, y.shape)
# Add a column of ones to X for the bias term
X_with_bias = np.hstack([np.ones((n_samples, 1)), X])
# Initialize weights
w_init = np.zeros(2)
eta = 0.01 # Learning rate
n_iterations = 100
# Batch Gradient Descent
w = w_init.copy()
losses = []
for _ in range(n_iterations):
    # Compute gradient: (2/n) * X^T * (Xw - y)
    gradient = (2 / n_samples) * X_with_bias.T @ (X_with_bias @ w - y)
    # Update weights
    w = w - eta * gradient
    # Compute and store loss (MSE)
    loss = np.mean((X_with_bias @ w - y) ** 2)
    losses.append(loss)
print("Final weights (bias, slope):", w)
# Plot loss over iterations
plt.figure(figsize=(8, 6))
plt.plot(range(n_iterations), losses, label='Loss (MSE)')
plt.xlabel('Iteration')
plt.ylabel('Mean Squared Error')
plt.title('Loss Over Iterations (Batch Gradient Descent)')
plt.legend()
plt.grid(True)
plt.show()
Output (abbreviated):
Dataset shape: (50, 1) (50,)
Final weights (bias, slope): [2.0115 2.9867]
This code implements batch gradient descent on a simple 2D dataset. It computes the gradient using the entire dataset at each step, updates the weights, and tracks the mean squared error (MSE) loss over iterations. The plot shows the loss decreasing as the algorithm converges to weights close to the true values (bias=2, slope=3).
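As a sanity check, we can compare the weights found by gradient descent against the closed-form least-squares solution. This is a minimal sketch that assumes the variables from the batch gradient descent code above are still in scope:

# Closed-form least-squares solution for comparison
w_closed, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)
print("Closed-form (bias, slope):", w_closed)
print("Max absolute difference:", np.max(np.abs(w - w_closed)))

With enough iterations and a suitable learning rate, the two should agree closely; a large gap usually means $\eta$ is too small (not yet converged) or too large (diverging).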
# Stochastic Gradient Descent for the same dataset
w = w_init.copy()
losses_sgd = []
n_iterations_sgd = 500 # More iterations since updates are noisier
for _ in range(n_iterations_sgd):
    # Randomly select one sample
    idx = np.random.randint(0, n_samples)
    X_sample = X_with_bias[idx:idx+1]  # Shape (1, 2)
    y_sample = y[idx:idx+1]  # Shape (1,)
    # Compute gradient for a single sample: 2 * X^T * (Xw - y)
    gradient = 2 * X_sample.T @ (X_sample @ w - y_sample)
    # Update weights
    w = w - eta * gradient
    # Compute and store loss on the full dataset for monitoring
    loss = np.mean((X_with_bias @ w - y) ** 2)
    losses_sgd.append(loss)
print("Final weights with SGD (bias, slope):", w)
# Plot loss over iterations
plt.figure(figsize=(8, 6))
plt.plot(range(n_iterations_sgd), losses_sgd, label='Loss (MSE)', alpha=0.5)
plt.xlabel('Iteration')
plt.ylabel('Mean Squared Error')
plt.title('Loss Over Iterations (Stochastic Gradient Descent)')
plt.legend()
plt.grid(True)
plt.show()
Output (abbreviated):
Final weights with SGD (bias, slope): [2.0153 2.9841]
This implements stochastic gradient descent (SGD), updating weights based on a single random sample per iteration. The loss plot is noisier compared to batch GD due to the randomness in updates, but it still converges to similar weights with more iterations.
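Between these two extremes sits mini-batch gradient descent, which averages the gradient over a small random batch per update, reducing the variance of SGD's steps while keeping each update cheap. Here is a minimal sketch on the same dataset (the batch size and iteration count are illustrative choices, not tuned values); the exercise below asks you to scale this idea up to a real dataset:

# Mini-batch gradient descent on the same dataset
w = w_init.copy()
batch_size = 10
for _ in range(200):
    # Sample a random mini-batch without replacement
    idx = np.random.choice(n_samples, size=batch_size, replace=False)
    X_batch, y_batch = X_with_bias[idx], y[idx]
    # Average gradient over the mini-batch: (2/B) * X_B^T (X_B w - y_B)
    gradient = (2 / batch_size) * X_batch.T @ (X_batch @ w - y_batch)
    w = w - eta * gradient
print("Final weights with mini-batch GD (bias, slope):", w)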
To deepen your understanding of gradient descent in linear models, try the following exercise, which requires writing Python code to apply these concepts to real data.
Load the California Housing dataset (sklearn.datasets.fetch_california_housing), select 3 features, and implement mini-batch gradient descent (batch size=32) to fit a linear model. Plot the loss over 100 epochs (full passes through the data) and print the final MSE on a test split.

Gradient Descent in Linear Models introduces a powerful and scalable approach to optimization, leveraging matrix calculus and vectorized operations for efficient parameter updates. By implementing batch and stochastic gradient descent with NumPy, we’ve seen how these methods converge to optimal solutions and how their behavior varies with factors like learning rate and batch size. These concepts are foundational for training more complex models in machine learning.
In the next post, we’ll explore Neural Networks as Matrix Functions, connecting linear algebra to deep learning by examining how layers and parameter updates are expressed through matrix operations. Stay tuned, and happy learning!