USER:
i need your help to write the next blog post in a series i am producing called “Learn the Training Loop with PyTorch”.
first, let me give you the outline of the series:
Goal: Build foundational intuition by hand-coding a simple training loop from scratch.
1. Introduction: What is a Training Loop?
2. The Simplest Case: Linear Regression
3. Batch vs. Stochastic Gradient Descent
4. Visualizing the Loss Landscape
5. Numerical vs. Analytical Gradients
6. Recap and Key Takeaways
Goal: Extend the basic loop to more powerful models; introduce neural network concepts.
1. From Linear to Nonlinear: Why Neural Networks?
2. Forward and Backward Passes
3. Implementing a Simple Neural Net from Scratch
4. The Role of Activations
5. Mini-batching and Data Pipelines
6. Regularization and Overfitting
7. Recap: Comparing Our Simple Network with Linear Regression
Goal: Connect foundational understanding to the large-scale models used in industry.
1. Optimization Algorithms Beyond SGD
2. Learning Rate Scheduling
3. Weight Initialization
4. Deeper Networks and Backprop Challenges
5. Large-Scale Training: Data Parallelism and Hardware
6. Monitoring and Debugging the Training Loop
7. Modern Regularization and Generalization Techniques
8. The Training Loop in Practice: Case Studies
9. Conclusion: What’s Next After the Training Loop?
great. now, for each blog post, i am hoping to follow the following structure:
that’s it, now can you please write blog post 3.1?
the description is:
1. Optimization Algorithms Beyond SGD
please respond with the full blog post only, and no other information.
ASSISTANT:
Welcome back to “Learn the Training Loop with PyTorch”! So far, we’ve trained models using the most fundamental optimizer: Stochastic Gradient Descent (SGD). But have you ever wondered how modern neural networks manage to train so well and so fast, even when they’re so large and complex? A big part of the secret lies in improved optimization algorithms.
Today, let’s unravel Adam, RMSProp, and their friends. How do they work, mathematically and in practice? And why are they so widely used in deep learning?
Imagine you’re descending a hill blindfolded. SGD is like taking steps directly downhill, but the size of your steps is always the same. That’s simple, but not always smart—if the ground is rough or slopes change, you might fumble or even miss the deepest valley!
Optimizers like RMSProp and Adam are like having a smart hiking buddy whisper in your ear: “careful, the ground here is rough, take smaller steps,” or “this stretch is smooth and straight, you can speed up.”
These algorithms adjust your steps automatically, for each parameter, to help you reach the bottom faster and more reliably.
Let’s formalize what an optimizer does:
Suppose we’re minimizing a loss $L(\theta)$ with respect to model parameters $\theta$. At each step $t$, the optimizer looks at the current gradient $g_t = \nabla_\theta L(\theta_t)$ (and possibly some history of past gradients) and decides how to move the parameters.

Stochastic Gradient Descent keeps the learning rate $\eta$ fixed and uses only $g_t$ (the current gradient) to update each parameter:

$$\theta_{t+1} = \theta_t - \eta \, g_t$$

RMSProp introduces a per-parameter learning rate, adapting it for each weight depending on past gradient magnitudes.

The RMSProp update is:

$$v_t = \alpha \, v_{t-1} + (1 - \alpha) \, g_t^2$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \, g_t$$

where $v_t$ is an exponentially decaying average of squared gradients, $\alpha$ is the decay rate (e.g. 0.9), and $\epsilon$ is a small constant for numerical stability.

Adam builds on RMSProp and adds momentum, i.e., keeps track of both the first moment $m_t$ (an exponentially decaying average of past gradients, the momentum term) and the second moment $v_t$ (an exponentially decaying average of past squared gradients, as in RMSProp).

Adam’s update rule:

$$m_t = \beta_1 \, m_{t-1} + (1 - \beta_1) \, g_t$$
$$v_t = \beta_2 \, v_{t-1} + (1 - \beta_2) \, g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \, \hat{m}_t$$

where $\beta_1$ and $\beta_2$ are decay rates (typically 0.9 and 0.999), and the hatted terms are bias-corrected estimates of the moments.
SGD is the most basic optimizer: it makes updates in the opposite direction of the current gradient, using the same step size for every parameter, every time. While elegant and often quite effective, it’s sensitive to the choice of learning rate (too large and training diverges, too small and it crawls), to the very different gradient scales that different parameters can have, and to noise in the gradient estimates.
RMSProp solves one of SGD’s big headaches: parameters with large, volatile gradients can end up taking steps that are too large (causing instability), or steps that are too small (slowing learning). By keeping a running average of each parameter’s squared gradients, RMSProp divides each step by the square root of that average, shrinking updates for parameters whose gradients are large or bouncy and letting quieter parameters take relatively bigger steps.
Adam piles on another trick: momentum. Imagine rolling a ball downhill: momentum means you keep some “inertia” from previous steps, helping you speed up when moving consistently in one direction and smoothing out shaky updates. Adam keeps an exponentially decaying average of both the gradients themselves (the first moment) and their squares (the second moment), and combines them, with bias correction, for faster and more reliable convergence.
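To make these update rules concrete, here is a small, self-contained sketch that applies each rule by hand to a single scalar parameter of a toy loss $L(\theta) = (\theta - 3)^2$. The toy loss and the helper names (toy_grad, run_by_hand) are invented purely for illustration; in real training we use PyTorch’s built-in optimizers, as we do below.

import math

def toy_grad(theta: float) -> float:
    """Gradient of the toy loss L(theta) = (theta - 3)**2."""
    return 2.0 * (theta - 3.0)

def run_by_hand(steps: int = 50, lr: float = 0.1) -> None:
    # One scalar parameter per optimizer, all starting at 0.
    th_sgd = th_rms = th_adam = 0.0
    v_rms = 0.0          # RMSProp running average of squared gradients
    m = v = 0.0          # Adam first and second moments
    alpha, beta1, beta2, eps = 0.9, 0.9, 0.999, 1e-8

    for t in range(1, steps + 1):
        # SGD: fixed step size times the raw gradient.
        th_sgd -= lr * toy_grad(th_sgd)

        # RMSProp: scale each step by the root of a running average of squared gradients.
        g = toy_grad(th_rms)
        v_rms = alpha * v_rms + (1 - alpha) * g**2
        th_rms -= lr * g / (math.sqrt(v_rms) + eps)

        # Adam: momentum (m) plus RMSProp-style scaling (v), with bias correction.
        g = toy_grad(th_adam)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        th_adam -= lr * m_hat / (math.sqrt(v_hat) + eps)

    print(f"After {steps} steps -> SGD: {th_sgd:.3f}, RMSProp: {th_rms:.3f}, "
          f"Adam: {th_adam:.3f} (true minimum: 3.0)")

run_by_hand()

The hyperparameters match the defaults we pass to the PyTorch optimizers later in this post, so the by-hand sketch and the real training code follow the same arithmetic.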
Why are these so useful? In practice, adaptive optimizers need far less manual learning-rate tuning, cope gracefully with parameters whose gradients have very different scales, and usually converge faster on the noisy, high-dimensional problems typical of deep learning. That’s why Adam is the default choice in many modern training recipes.
We’ll use a simple regression problem as our playground to see these optimizers in action.
Make sure you’ve installed PyTorch and matplotlib (using uv):
uv pip install torch matplotlib
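To quickly confirm both packages are importable, you can run a short version check (this assumes nothing beyond a working install):

import torch
import matplotlib

print("torch:", torch.__version__)
print("matplotlib:", matplotlib.__version__)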
We’ll fit a simple linear function with added noise, using a tiny neural net (just a single linear layer) and compare optimizers.
import torch
import matplotlib.pyplot as plt
from typing import Tuple
# -- Data Preparation
def make_data(n_samples: int = 100) -> Tuple[torch.Tensor, torch.Tensor]:
    torch.manual_seed(42)
    X = torch.linspace(0, 1, n_samples).unsqueeze(1)
    y = 2 * X + 3 + 0.2 * torch.randn_like(X)
    return X, y
X, y = make_data()
# -- Simple Linear Model
class SimpleLinear(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.linear = torch.nn.Linear(1, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x)
# -- Training function
def train_loop(optimizer_name: str, num_epochs: int = 100) -> Tuple[list[float], SimpleLinear]:
    model = SimpleLinear()
    criterion = torch.nn.MSELoss()
    if optimizer_name == "sgd":
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    elif optimizer_name == "rmsprop":
        optimizer = torch.optim.RMSprop(model.parameters(), lr=0.1, alpha=0.9)
    elif optimizer_name == "adam":
        optimizer = torch.optim.Adam(model.parameters(), lr=0.1, betas=(0.9, 0.999))
    else:
        raise ValueError("Unknown optimizer")
    losses = []
    for epoch in range(num_epochs):
        optimizer.zero_grad()
        y_pred = model(X)
        loss = criterion(y_pred, y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses, model
optimizers = ["sgd", "rmsprop", "adam"]
results = {}
for opt in optimizers:
    losses, model = train_loop(opt)
    results[opt] = {"losses": losses, "model": model}
# -- Plotting Loss Curves
plt.figure(figsize=(8,5))
for opt in optimizers:
    plt.plot(results[opt]["losses"], label=opt.upper())
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")
plt.title("Convergence of Different Optimizers")
plt.legend()
plt.show()
# -- Plotting Predictions
plt.figure(figsize=(8,5))
plt.scatter(X.numpy(), y.numpy(), label="Data", alpha=0.6)
x_plot = torch.linspace(0, 1, 100).unsqueeze(1)
for opt in optimizers:
    y_pred = results[opt]["model"](x_plot).detach().numpy()
    plt.plot(x_plot.numpy(), y_pred, label=opt.upper())
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.title("Learned Linear Fit by Optimizer")
plt.show()
What do you see? Typically, Adam and RMSProp drive the loss down much faster in the first handful of epochs, while plain SGD takes longer to settle. The final fits should be broadly similar, though SGD may still be slightly off if it hasn’t fully converged within 100 epochs.
Change the learning rate for each optimizer in the code above to 0.01, 0.05, and 0.2. Observe and plot the loss curves for each setting. What happens if your learning rate is too high or too low?
Solution: Replace lr=0.1 in the optimizer setup with your desired learning rates and plot the resulting curves to see the effect; one possible way to organize the sweep is sketched below.
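A minimal sketch of the sweep, assuming you are happy to duplicate the training logic in a small helper; train_loop_lr is a hypothetical variant of train_loop above that takes the learning rate as an argument, and it reuses the X, y, and SimpleLinear already defined.

def train_loop_lr(optimizer_name: str, lr: float, num_epochs: int = 100) -> list[float]:
    """Variant of train_loop that takes the learning rate as an argument."""
    model = SimpleLinear()
    criterion = torch.nn.MSELoss()
    builders = {
        "sgd": lambda p: torch.optim.SGD(p, lr=lr),
        "rmsprop": lambda p: torch.optim.RMSprop(p, lr=lr, alpha=0.9),
        "adam": lambda p: torch.optim.Adam(p, lr=lr, betas=(0.9, 0.999)),
    }
    optimizer = builders[optimizer_name](model.parameters())
    losses = []
    for _ in range(num_epochs):
        optimizer.zero_grad()
        loss = criterion(model(X), y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses

plt.figure(figsize=(8, 5))
for lr in [0.01, 0.05, 0.2]:
    for name in ["sgd", "rmsprop", "adam"]:
        plt.plot(train_loop_lr(name, lr), label=f"{name.upper()} lr={lr}")
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")
plt.legend()
plt.title("Loss Curves for Different Learning Rates")
plt.show()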
PyTorch’s SGD can use momentum! Modify the optimizer in the training loop to use momentum (e.g., momentum=0.9). Compare its performance to plain SGD.
Example:
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
Compare the loss curves with and without momentum.
Modify the make_data function to use 0.5 * torch.randn_like(X) instead of the original 0.2 * torch.randn_like(X). Which optimizer is most robust to this increased noise?
Solution:
def make_data(n_samples: int = 100) -> Tuple[torch.Tensor, torch.Tensor]:
    torch.manual_seed(42)
    X = torch.linspace(0, 1, n_samples).unsqueeze(1)
    y = 2 * X + 3 + 0.5 * torch.randn_like(X)
    return X, y
Re-run training and compare optimizers.
At the end of training, print out the values of model weights and biases for each optimizer:
for opt in optimizers:
    w = results[opt]["model"].linear.weight.item()
    b = results[opt]["model"].linear.bias.item()
    print(f"{opt.upper()} learns w = {w:.2f}, b = {b:.2f}")
Do they all learn similar parameters? Why/why not?
In this post, we explored what makes modern optimizers like Adam and RMSProp so powerful compared to plain SGD. Instead of blindly taking steps based only on the current gradient, these optimizers keep a running memory of how large, volatile, or noisy each parameter’s gradients have been, and tweak each parameter’s step size accordingly. Adam goes a step further, combining momentum with adaptive learning rates to accelerate and stabilize learning.
Up next: We’ll continue exploring powerful training tricks to get the most out of your neural networks!
USER: