USER:
i need your help to write the next blog post in a series i am producing called “Learn the Training Loop with PyTorch”.
first, let me give you the outline of the series:
Module 1 goal: Build foundational intuition by hand-coding a simple training loop from scratch.
1. Introduction: What is a Training Loop?
2. The Simplest Case: Linear Regression
3. Batch vs. Stochastic Gradient Descent
4. Visualizing the Loss Landscape
5. Numerical vs. Analytical Gradients
6. Recap and Key Takeaways
Module 2 goal: Extend the basic loop to more powerful models; introduce neural network concepts.
1. From Linear to Nonlinear: Why Neural Networks?
2. Forward and Backward Passes
3. Implementing a Simple Neural Net from Scratch
4. The Role of Activations
5. Mini-batching and Data Pipelines
6. Regularization and Overfitting
7. Recap: Comparing Our Simple Network with Linear Regression
Module 3 goal: Connect foundational understanding to the large-scale models used in industry.
1. Optimization Algorithms Beyond SGD
2. Learning Rate Scheduling
3. Weight Initialization
4. Deeper Networks and Backprop Challenges
5. Large-Scale Training: Data Parallelism and Hardware
6. Monitoring and Debugging the Training Loop
7. Modern Regularization and Generalization Techniques
8. The Training Loop in Practice: Case Studies
9. Conclusion: What’s Next After the Training Loop?
great. now, for each blog post, i am hoping to follow the following structure:
that’s it, now can you please write blog post 1.3? the description is:
3. Batch vs. Stochastic Gradient Descent
please respond with the full blog post only, and no other information.
ASSISTANT:
When training machine learning models, how you update your parameters can make a big difference in how fast and how well your model learns. Two of the most fundamental methods for updating parameters are Batch Gradient Descent and Stochastic Gradient Descent (SGD).
In this lesson, you’ll learn what these terms mean, why they matter, and how they shape the way we train models in practice. We’ll break it down simply, build up some mathematical intuition, and then try them both out in PyTorch!
Imagine you’re learning to shoot basketball free throws. Every evening, you shoot 100 balls. You could wait until you’ve taken all 100 shots, look at your overall accuracy, and then adjust your technique once (the batch approach), or you could tweak your form a little after every single shot (the stochastic approach).
Batches make your learning stable, but slow to react. Stochastic updates are fast and noisy: they can wobble around a lot, but sometimes get you better results faster. There’s also a middle ground, mini-batch, which adjusts after small groups of shots instead of after one shot or all of them.
In machine learning, gradients tell us how to update our weights to reduce the error (loss). Given a loss function $L(\mathbf{w})$ over the parameters $\mathbf{w}$, the basic weight update rule is:

$$
\mathbf{w} \leftarrow \mathbf{w} - \eta \, \nabla L(\mathbf{w})
$$

where:

- $\mathbf{w}$ are the model parameters (here, the weight and bias),
- $\eta$ is the learning rate,
- $\nabla L(\mathbf{w})$ is the gradient of the loss with respect to the parameters.

Batch Gradient Descent uses all data to compute the gradient:

$$
\nabla L(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \nabla L_i(\mathbf{w})
$$

where $N$ is the total number of samples and $L_i$ is the loss on the $i$-th sample.

Stochastic Gradient Descent uses one sample at a time:

$$
\mathbf{w} \leftarrow \mathbf{w} - \eta \, \nabla L_i(\mathbf{w})
$$

for a randomly chosen $i$ at each step.

Mini-batch Gradient Descent uses a subset of $B$ samples (a mini-batch $\mathcal{B}$):

$$
\mathbf{w} \leftarrow \mathbf{w} - \eta \, \frac{1}{B} \sum_{i \in \mathcal{B}} \nabla L_i(\mathbf{w})
$$
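To make the difference between these update rules concrete, here is a tiny worked sketch. The model, the two data points, and the starting weight are arbitrary choices for illustration: a one-parameter model $\hat{y} = w x$ with squared-error loss, so you can compare the full-batch gradient with a single-sample gradient.

import torch

# Toy illustration (numbers chosen arbitrarily): model y_hat = w * x,
# squared-error loss, two data points.
x = torch.tensor([1.0, 2.0])
y = torch.tensor([3.0, 5.0])
w = torch.tensor(0.0, requires_grad=True)

# Batch gradient: average the squared errors over BOTH points, then differentiate.
loss_batch = ((w * x - y) ** 2).mean()
loss_batch.backward()
print("full-batch gradient:", w.grad.item())    # mean of the per-sample gradients (-6 and -20), i.e. -13

# Stochastic gradient: differentiate the loss on the FIRST point only.
w.grad.zero_()
loss_single = (w * x[0] - y[0]) ** 2
loss_single.backward()
print("single-sample gradient:", w.grad.item())  # just -6: cheaper, but a noisier estimate

The full-batch gradient averages every per-sample gradient before taking a step; the stochastic gradient uses just one of them, so each step is much cheaper but only a noisy estimate of the true direction.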
Batch Gradient Descent (BGD) computes the average gradient over the entire dataset and performs one update per epoch (one full sweep through the data). Each update is precise, but you must see all the data before making any change. It’s useful when the data fits in memory and you want smooth progress, but it can be slow for large datasets.
Stochastic Gradient Descent computes the gradient and updates the weights using only a single data point at a time. You make many small updates per epoch, which injects noise and randomness into your steps; this can help you escape bad local minima, but it can also make convergence bouncy.
Mini-batch Gradient Descent is the compromise: use small batches (say 16, 32, or 64 samples). It combines much of the stability of full-batch updates with the speed of stochastic updates, and it’s the default in modern deep learning.
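For reference, here is a minimal sketch of how that default mini-batch recipe is typically wired up with PyTorch’s built-in utilities (torch.utils.data.DataLoader, torch.optim.SGD, nn.Linear, and nn.MSELoss); the batch size, learning rate, and epoch count below are arbitrary. In the rest of this post we hand-code the equivalent loops with plain tensors, so you can see the mechanics these utilities wrap.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic data, same shape as the examples below: y = 2x + 1 + noise
X = torch.linspace(0, 1, 100).unsqueeze(1)
y = 2 * X + 1 + 0.1 * torch.randn_like(X)

loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)
model = nn.Linear(1, 1)                                   # y_hat = w * x + b
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

for epoch in range(20):
    for xb, yb in loader:              # one shuffled mini-batch at a time
        loss = loss_fn(model(xb), yb)
        optimizer.zero_grad()          # clear old gradients
        loss.backward()                # compute gradients for this mini-batch
        optimizer.step()               # one parameter update per mini-batch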
Let’s see these three methods in action with a simple linear regression problem.
First, to make sure you have PyTorch and matplotlib installed, run:
uv pip install torch matplotlib
import torch
import matplotlib.pyplot as plt
from typing import Tuple, List
# Generate dummy regression data
def generate_data(n: int = 100) -> Tuple[torch.Tensor, torch.Tensor]:
    torch.manual_seed(0)
    X = torch.linspace(0, 1, n).unsqueeze(1)
    y = 2 * X + 1 + 0.1 * torch.randn_like(X)  # y = 2x + 1 + noise
    return X, y
# Linear regression batch update
def train_batch(X: torch.Tensor, y: torch.Tensor, epochs: int = 100, lr: float = 0.1) -> List[float]:
    # Initialize weights and bias
    w = torch.randn(1, requires_grad=True)
    b = torch.randn(1, requires_grad=True)
    losses = []
    for epoch in range(epochs):
        # Forward pass on the FULL dataset
        y_pred = X * w + b
        loss = ((y_pred - y) ** 2).mean()
        losses.append(loss.item())
        # Backward pass: gradient of the mean loss over all samples
        loss.backward()
        # Update: exactly one step per epoch
        with torch.no_grad():
            w -= lr * w.grad
            b -= lr * b.grad
            # Zero gradients so they don't accumulate across epochs
            w.grad.zero_()
            b.grad.zero_()
    return losses
X, y = generate_data()
losses_batch = train_batch(X, y, epochs=50, lr=0.5)
plt.plot(losses_batch, label="Batch Gradient Descent")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Batch Gradient Descent Loss Curve")
plt.legend()
plt.show()
def train_stochastic(X: torch.Tensor, y: torch.Tensor, epochs: int = 100, lr: float = 0.1) -> List[float]:
    w = torch.randn(1, requires_grad=True)
    b = torch.randn(1, requires_grad=True)
    n = X.size(0)
    losses = []
    for epoch in range(epochs):
        # Visit the samples in a fresh random order each epoch
        perm = torch.randperm(n)
        epoch_loss = 0.0
        for i in perm:
            xi = X[i]
            yi = y[i]
            # Forward pass on a SINGLE sample
            y_pred = xi * w + b
            loss = (y_pred - yi) ** 2
            epoch_loss += loss.item()
            # Backward pass: gradient from this one sample only
            loss.backward()
            # Update: one step per sample
            with torch.no_grad():
                w -= lr * w.grad
                b -= lr * b.grad
                w.grad.zero_()
                b.grad.zero_()
        losses.append(epoch_loss / n)
    return losses
losses_sgd = train_stochastic(X, y, epochs=50, lr=0.05)
plt.plot(losses_sgd, label="Stochastic Gradient Descent")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Stochastic Gradient Descent Loss Curve")
plt.legend()
plt.show()
def train_minibatch(X: torch.Tensor, y: torch.Tensor, batch_size: int = 16, epochs: int = 100, lr: float = 0.1) -> List[float]:
    w = torch.randn(1, requires_grad=True)
    b = torch.randn(1, requires_grad=True)
    n = X.size(0)
    losses = []
    for epoch in range(epochs):
        # Shuffle, then walk through the data in chunks of batch_size
        perm = torch.randperm(n)
        epoch_loss = 0.0
        for i in range(0, n, batch_size):
            idx = perm[i:i + batch_size]
            xi = X[idx]
            yi = y[idx]
            # Forward pass on one mini-batch
            y_pred = xi * w + b
            loss = ((y_pred - yi) ** 2).mean()
            epoch_loss += loss.item() * xi.size(0)
            # Backward pass: gradient averaged over the mini-batch
            loss.backward()
            # Update: one step per mini-batch
            with torch.no_grad():
                w -= lr * w.grad
                b -= lr * b.grad
                w.grad.zero_()
                b.grad.zero_()
        losses.append(epoch_loss / n)
    return losses
losses_minibatch = train_minibatch(X, y, batch_size=16, epochs=50, lr=0.1)
plt.plot(losses_batch, label="Batch")
plt.plot(losses_sgd, label="Stochastic")
plt.plot(losses_minibatch, label="Mini-batch")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Loss Curves: Batch vs. Stochastic vs. Mini-batch")
plt.legend()
plt.show()
You’ll see that the batch curve is the smoothest, the stochastic curve bounces around the most, and the mini-batch curve sits in between, combining reasonably smooth progress with many more parameter updates per epoch than full-batch training.
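If you want to put a rough number on that bounciness, one ad-hoc way (just an illustration, not a standard metric) is the standard deviation of the epoch-to-epoch change in the recorded loss. The small helper below is hypothetical and simply reuses the loss lists we have already collected:

from typing import List

import torch

def bounciness(losses: List[float]) -> float:
    # Std. dev. of epoch-to-epoch loss changes: larger means a noisier curve
    diffs = torch.tensor(losses[1:]) - torch.tensor(losses[:-1])
    return diffs.std().item()

print("Batch:     ", bounciness(losses_batch))
print("Stochastic:", bounciness(losses_sgd))
print("Mini-batch:", bounciness(losses_minibatch))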
# Try batch_size = 1 (should behave like SGD), and batch_size = X.size(0) (should behave like Batch GD)
losses_bs1 = train_minibatch(X, y, batch_size=1, epochs=50, lr=0.1)
losses_bsfull = train_minibatch(X, y, batch_size=X.size(0), epochs=50, lr=0.1)
plt.plot(losses_bs1, label="Batch size = 1 (SGD)")
plt.plot(losses_bsfull, label="Full Batch")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Effect of Batch Size")
plt.legend()
plt.show()
import time

X_large, y_large = generate_data(10_000)

start = time.time()
losses_large_batch = train_batch(X_large, y_large, epochs=10, lr=0.5)
print("Batch GD Time:", time.time() - start)
start = time.time()
losses_large_mb = train_minibatch(X_large, y_large, batch_size=64, epochs=10, lr=0.5)
print("Mini-batch GD Time:", time.time() - start)
X_noisy, y_noisy = generate_data(100)
y_noisy += 0.5 * torch.randn_like(y_noisy)
losses_batch_noisy = train_batch(X_noisy, y_noisy, epochs=50, lr=0.5)
losses_sgd_noisy = train_stochastic(X_noisy, y_noisy, epochs=50, lr=0.05)
losses_minibatch_noisy = train_minibatch(X_noisy, y_noisy, batch_size=16, epochs=50, lr=0.1)
plt.plot(losses_batch_noisy, label="Batch")
plt.plot(losses_sgd_noisy, label="Stochastic")
plt.plot(losses_minibatch_noisy, label="Mini-batch")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Loss Curves with Noisy Data")
plt.legend()
plt.show()
Experiment, learn, and see how the training loop comes alive in PyTorch!
USER: