USER:
i need your help to write the next blog post in a series i am producing called “Learn the Training Loop with PyTorch”.
first, let me give you the outline of the series:
Goal: Build foundational intuition by hand-coding a simple training loop from scratch.
1. Introduction: What is a Training Loop?
2. The Simplest Case: Linear Regression
3. Batch vs. Stochastic Gradient Descent
4. Visualizing the Loss Landscape
5. Numerical vs. Analytical Gradients
6. Recap and Key Takeaways
Goal: Extend the basic loop to more powerful models; introduce neural network concepts.
1. From Linear to Nonlinear: Why Neural Networks?
2. Forward and Backward Passes
3. Implementing a Simple Neural Net from Scratch
4. The Role of Activations
5. Mini-batching and Data Pipelines
6. Regularization and Overfitting
7. Recap: Comparing Our Simple Network with Linear Regression
Goal: Connect foundational understanding to the large-scale models used in industry.
1. Optimization Algorithms Beyond SGD
2. Learning Rate Scheduling
3. Weight Initialization
4. Deeper Networks and Backprop Challenges
5. Large-Scale Training: Data Parallelism and Hardware
6. Monitoring and Debugging the Training Loop
7. Modern Regularization and Generalization Techniques
8. The Training Loop in Practice: Case Studies
9. Conclusion: What’s Next After the Training Loop?
great. now, for each blog post, i am hoping to follow the following structure:
that’s it, now can you please write blog post 3.2?
the description is:
2. Learning Rate Scheduling
please respond with the full blog post only, and no other information.
ASSISTANT:
When training a neural network, the “learning rate” is a critical value that controls how much we update the network’s weights at each step. If the learning rate is too high, the updates overshoot and the model may never settle on a good solution. If it’s too low, training can take a very long time or get stuck in a poor spot.
But here’s the catch: the best learning rate might change during training! That’s where learning rate scheduling comes in. We can start training with a higher learning rate (for big, fast steps), then lower it later (for smaller, fine-tuning steps) to help the model settle into the best possible solution.
Imagine you’re trying to find the lowest point in a foggy valley: at first you take big strides to cover ground quickly, but as you sense you’re nearing the bottom, you take smaller and smaller steps so you don’t overshoot it.
Learning rate scheduling is the machine learning version of gradually shrinking those steps, helping us train smarter and faster.
Let’s dive into the math.
A neural network learns by updating its parameters $\theta$ (these are its weights and biases) at each step to minimize some loss function $\mathcal{L}(\theta)$. The vanilla update step looks like:

$$
\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \mathcal{L}(\theta_t)
$$

where $\eta$ is the learning rate.

With learning rate scheduling, we let the learning rate change over time:

$$
\theta_{t+1} = \theta_t - \eta_t \, \nabla_\theta \mathcal{L}(\theta_t)
$$

where $\eta_t$ is the learning rate used at step $t$.
There are several popular ways to schedule $\eta_t$:
Step Decay: Reduce the learning rate by a constant factor every $k$ steps:

$$
\eta_t = \eta_0 \cdot \gamma^{\lfloor t / k \rfloor}
$$

where $\eta_0$ is the initial learning rate and $\gamma$ is the decay rate (e.g. $\gamma = 0.1$).
Exponential Decay: Decrease the learning rate exponentially with time:

$$
\eta_t = \eta_0 \cdot \gamma^{t}
$$
Plateau/Performance-based: Lower the learning rate when the validation loss stops improving.
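To make the first two formulas concrete, here is a tiny sketch that just computes $\eta_t$ for a few steps. The specific values of $\eta_0$, $\gamma$, and $k$ are illustrative assumptions, not prescriptions:

```python
# Illustrative schedule values (assumptions for this sketch only)
eta0 = 0.1

def step_decay(t: int, gamma: float = 0.1, k: int = 50) -> float:
    # eta_t = eta_0 * gamma ** floor(t / k)
    return eta0 * gamma ** (t // k)

def exponential_decay(t: int, gamma: float = 0.95) -> float:
    # eta_t = eta_0 * gamma ** t
    return eta0 * gamma ** t

print(step_decay(0), step_decay(49), step_decay(50))  # 0.1, 0.1, ~0.01
print(exponential_decay(0), exponential_decay(10))    # 0.1, ~0.0599
```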
PyTorch makes implementing learning rate schedules extremely easy with torch.optim.lr_scheduler. You wrap your optimizer with a scheduler, and step the scheduler once every epoch.
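In code, the pattern is just a few lines. Here is a bare-bones sketch with a toy model and made-up values, purely to show where scheduler.step() goes; the full, plotted comparison follows below:

```python
import torch

# A minimal sketch of the wrap-and-step pattern (toy model and data, illustrative values only)
model = torch.nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)  # wrap the optimizer

X_toy, y_toy = torch.randn(16, 1), torch.randn(16, 1)
for epoch in range(90):
    optimizer.zero_grad()
    torch.nn.functional.mse_loss(model(X_toy), y_toy).backward()
    optimizer.step()   # update the weights using the current learning rate
    scheduler.step()   # then advance the schedule (once per epoch)
```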
Some useful built-in schedulers:
- StepLR: reduces the LR by a factor every fixed number of epochs.
- ExponentialLR: decays the LR exponentially, multiplying it by a constant factor every epoch.
- ReduceLROnPlateau: reduces the LR when a monitored metric (e.g. validation loss) stops improving.

Let’s see learning rate scheduling in action! We’ll train a simple neural network on a synthetic dataset (y = 2x + 1 plus noise) and compare training with:
- a constant learning rate, and
- a step-decay schedule.
If you haven’t already, install the necessary libraries:
uv pip install torch matplotlib
import torch
from torch import nn
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR
import matplotlib.pyplot as plt
# Set random seeds for reproducibility
torch.manual_seed(0)
# Generate a simple synthetic dataset: y = 2x + 1 + noise
def make_data(n_samples: int = 200) -> tuple[torch.Tensor, torch.Tensor]:
X = torch.linspace(-1, 1, n_samples).unsqueeze(1)
y = 2 * X + 1 + 0.2 * torch.randn_like(X)
return X, y
X, y = make_data()
# Define a simple one-layer neural net (just linear regression)
class SimpleNet(nn.Module):
def __init__(self) -> None:
super().__init__()
self.fc = nn.Linear(1, 1)
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.fc(x)
# Training loop template
def train(
model: nn.Module,
X: torch.Tensor,
y: torch.Tensor,
lr: float,
epochs: int,
scheduler: torch.optim.lr_scheduler._LRScheduler | None = None
) -> list[float]:
criterion = nn.MSELoss()
optimizer = SGD(model.parameters(), lr=lr)
losses = []
for epoch in range(epochs):
model.train()
optimizer.zero_grad()
out = model(X)
loss = criterion(out, y)
loss.backward()
optimizer.step()
if scheduler is not None:
scheduler.step()
losses.append(loss.item())
return losses
# Settings
epochs = 100
lr = 0.1
# Training with constant learning rate
model_a = SimpleNet()
losses_const = train(model_a, X, y, lr=lr, epochs=epochs)
# Training with a StepLR scheduler (multiply the LR by 0.1 at epoch 50)
model_b = SimpleNet()
optimizer_b = SGD(model_b.parameters(), lr=lr)
scheduler_b = StepLR(optimizer_b, step_size=50, gamma=0.1)
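# The scheduler must wrap the same optimizer instance it adjusts, so this helper takes both the optimizer and the scheduler as arguments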
def train_with_optimizer(
model: nn.Module,
optimizer: torch.optim.Optimizer,
scheduler: torch.optim.lr_scheduler._LRScheduler,
X: torch.Tensor,
y: torch.Tensor,
epochs: int
) -> list[float]:
criterion = nn.MSELoss()
losses = []
for epoch in range(epochs):
model.train()
optimizer.zero_grad()
out = model(X)
loss = criterion(out, y)
loss.backward()
optimizer.step()
scheduler.step()
losses.append(loss.item())
return losses
losses_sched = train_with_optimizer(model_b, optimizer_b, scheduler_b, X, y, epochs)
# Plotting
plt.figure(figsize=(8,6))
plt.plot(losses_const, label="Constant LR (0.1)")
plt.plot(losses_sched, label="Step Decay LR (0.1 → 0.01 at 50)")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Effect of Learning Rate Scheduling")
plt.legend()
plt.grid(True)
plt.show()
You should see that:
- both runs reduce the loss quickly during the early epochs, while the learning rate is still large;
- after the decay at epoch 50, the step-decay run takes much smaller steps, so its loss curve flattens out and settles smoothly near its final value.

This simple example shows how a good scheduler can help your model settle into a better solution!
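If you want to double-check that a schedule is doing what you expect, you can also track the learning rate itself. Here is a minimal sketch that repeats the StepLR setup from above and records the LR read from the optimizer’s parameter groups each epoch (scheduler.get_last_lr() reports the same value):

```python
# Sketch: record the learning rate actually used at each epoch
model_e = SimpleNet()
optimizer_e = SGD(model_e.parameters(), lr=lr)
scheduler_e = StepLR(optimizer_e, step_size=50, gamma=0.1)
criterion = nn.MSELoss()

lrs = []
for epoch in range(epochs):
    lrs.append(optimizer_e.param_groups[0]["lr"])  # LR used for this epoch's update
    optimizer_e.zero_grad()
    loss = criterion(model_e(X), y)
    loss.backward()
    optimizer_e.step()
    scheduler_e.step()

print(lrs[0], lrs[-1])  # 0.1 at the start, 0.01 after the decay at epoch 50

plt.figure(figsize=(8, 4))
plt.plot(lrs)
plt.xlabel("Epoch")
plt.ylabel("Learning rate")
plt.title("StepLR Schedule Over Training")
plt.show()
```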
Exercise 1: Train a model using torch.optim.lr_scheduler.ExponentialLR and compare its loss curve to the two approaches above. Use a gamma value of 0.95.
# Exercise 1 Solution
model_c = SimpleNet()
optimizer_c = SGD(model_c.parameters(), lr=lr)
scheduler_c = torch.optim.lr_scheduler.ExponentialLR(optimizer_c, gamma=0.95)
# train_with_optimizer() defined above works with any scheduler, so we simply reuse it here
losses_exp = train_with_optimizer(model_c, optimizer_c, scheduler_c, X, y, epochs)
plt.figure(figsize=(8,6))
plt.plot(losses_const, label="Constant LR (0.1)")
plt.plot(losses_sched, label="Step Decay LR")
plt.plot(losses_exp, label=r"Exponential Decay LR ($\gamma=0.95$)")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Comparing Learning Rate Schedules")
plt.legend()
plt.grid(True)
plt.show()
Exercise 2: Use ReduceLROnPlateau to drop the learning rate when the loss isn’t improving. Set patience=10 and factor=0.5.
# Exercise 2 Solution
model_d = SimpleNet()
optimizer_d = SGD(model_d.parameters(), lr=lr)
scheduler_d = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer_d, patience=10, factor=0.5)  # halve the LR after 10 epochs without improvement
criterion = nn.MSELoss()
losses_plateau = []
for epoch in range(epochs):
model_d.train()
optimizer_d.zero_grad()
out = model_d(X)
loss = criterion(out, y)
loss.backward()
optimizer_d.step()
losses_plateau.append(loss.item())
scheduler_d.step(loss.item())
plt.figure(figsize=(8,6))
plt.plot(losses_const, label="Constant LR (0.1)")
plt.plot(losses_plateau, label="ReduceLROnPlateau")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Effect of ReduceLROnPlateau")
plt.legend()
plt.grid(True)
plt.show()
Learning rate scheduling is a simple but powerful tool for optimizing how your model learns. By dynamically adjusting the step size over the course of training, you can achieve both faster convergence and better accuracy.
Stay tuned for the next lesson, where we dive into weight initialization and why your training loop can benefit from choosing smarter starting points!
USER: