USER:
i need your help to write the next blog post in a series i am producing called “Learn the Training Loop with PyTorch”.
first, let me give you the outline of the series:
Goal: Build foundational intuition by hand-coding a simple training loop from scratch.
1. Introduction: What is a Training Loop?
2. The Simplest Case: Linear Regression
3. Batch vs. Stochastic Gradient Descent
4. Visualizing the Loss Landscape
5. Numerical vs. Analytical Gradients
6. Recap and Key Takeaways
Goal: Extend the basic loop to more powerful models; introduce neural network concepts.
1. From Linear to Nonlinear: Why Neural Networks?
2. Forward and Backward Passes
3. Implementing a Simple Neural Net from Scratch
4. The Role of Activations
5. Mini-batching and Data Pipelines
6. Regularization and Overfitting
7. Recap: Comparing Our Simple Network with Linear Regression
Goal: Connect foundational understanding to the large-scale models used in industry.
1. Optimization Algorithms Beyond SGD
2. Learning Rate Scheduling
3. Weight Initialization
4. Deeper Networks and Backprop Challenges
5. Large-Scale Training: Data Parallelism and Hardware
6. Monitoring and Debugging the Training Loop
7. Modern Regularization and Generalization Techniques
8. The Training Loop in Practice: Case Studies
9. Conclusion: What’s Next After the Training Loop?
great. now, for each blog post, i am hoping to follow the following structure:
that’s it, now can you please write blog post 2.6?
the description is:
6. Regularization and Overfitting
please respond with the full blog post only, and no other information.
ASSISTANT:
In this lesson, we dive into two crucial concepts for anyone training neural networks: overfitting and regularization. You’ll learn what overfitting is, why it happens, and how regularization techniques like L2 regularization and Dropout help make your models more robust. We’ll look at the intuition, the math, and practical PyTorch implementations to cement your understanding.
Imagine you’re preparing for a spelling bee. Each night, you memorize a list of words, and your friend quizzes you on exactly those. You ace the list every time! But in the contest, the words are different, and suddenly you’re not so great.
Your “model” (your brain) learned the training words perfectly, but didn’t really understand how to spell in general.
That’s overfitting.
Your neural network got so good at remembering the training data, it failed to generalize to new, unseen examples.
Regularization is like practicing with different word lists, or being penalized for only memorizing. It helps you learn to spell—not just recall answers.
Overfitting occurs when a model learns not only the underlying patterns in the training data but also the noise, leading to poor performance on new data.
Suppose we have a model $\hat{y} = f(x; \theta)$, which outputs predictions $\hat{y}$ based on input $x$ and parameters $\theta$ (weights and biases).
We want to minimize the loss over the whole data distribution ($\mathcal{L}_{\text{true}}$), but we only have access to a finite training set, so we can only minimize the empirical training loss ($\mathcal{L}_{\text{train}}$).
If the model capacity is too high, we may end up with $\mathcal{L}_{\text{train}} \ll \mathcal{L}_{\text{true}}$: the model fits the training points (noise included) almost perfectly, yet performs poorly on new data.
L2 Regularization discourages the model weights from becoming too large, making the model simpler and less prone to memorization.
The standard loss (e.g., Mean Squared Error for regression) is:

$$\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - f(x_i;\theta)\right)^2$$

With L2 regularization, the new objective becomes:

$$\mathcal{L}_{\text{reg}}(\theta) = \mathcal{L}(\theta) + \lambda \sum_j \theta_j^2$$

where $\lambda > 0$ controls the strength of the penalty.
You can think of this as penalizing the model for having large weights.
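Before we lean on PyTorch's built-in support, here is a minimal hand-rolled sketch of that penalty (the names toy_model and lambda_l2 are just for this illustration); later in the post we'll get the same effect via the optimizer's weight_decay argument:

import torch
from torch import nn

toy_model = nn.Linear(1, 1)            # stand-in model for this sketch
loss_fn = nn.MSELoss()
lambda_l2 = 0.01                       # the lambda in the objective above

x = torch.randn(8, 1)
y = 2.0 * x + 0.1 * torch.randn(8, 1)  # fake targets, purely illustrative

mse = loss_fn(toy_model(x), y)
l2_penalty = sum((p ** 2).sum() for p in toy_model.parameters())
loss = mse + lambda_l2 * l2_penalty    # regularized objective
loss.backward()                        # gradients now also pull weights toward zero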
Dropout is a technique where, during training, some neurons are randomly “dropped out” (set to zero) for each mini-batch. This prevents neurons from co-adapting and forces the network to distribute its learning.
Mathematically, for activations $a_i$ in a layer, with dropout probability $p$, each activation during training becomes

$$\tilde{a}_i = m_i \, a_i, \qquad m_i \sim \text{Bernoulli}(1-p)$$

so a neuron's output is kept with probability $1-p$ and zeroed out with probability $p$.
During inference (model evaluation), dropout is turned off, and activations are scaled so that their expected magnitude matches what the network saw during training (PyTorch handles this scaling for you; more on it below).
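To see this behavior concretely, here is a tiny sketch of nn.Dropout applied to a vector of ones, first in training mode and then in eval mode:

import torch
from torch import nn

drop = nn.Dropout(p=0.5)
a = torch.ones(8)

drop.train()      # training mode: roughly half the entries are zeroed,
print(drop(a))    # and survivors are scaled by 1 / (1 - p) = 2.0

drop.eval()       # eval mode: dropout is the identity
print(drop(a))    # prints the input unchanged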
Neural networks, especially deep ones, have many parameters (weights and biases). When given limited data, they can “cheat” by memorizing exact training examples, instead of learning true underlying patterns.
As a result, predictions on new (unseen) data are often unreliable.
By adding a penalty on large weights (via the $\lambda \sum_j \theta_j^2$ term), L2 regularization encourages the network to keep weights small. Smaller weights often result in simpler network behavior, which tends to generalize better. If we make $\lambda$ too large, the network can’t learn much and may underfit (do poorly on all data).
L2 works by modifying the gradients during backpropagation, slightly nudging weights toward zero as training progresses.
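Concretely, differentiating the regularized objective shows where that nudge comes from: the gradient picks up an extra term proportional to the weight itself, so each SGD step shrinks the weights a little:

$$\frac{\partial \mathcal{L}_{\text{reg}}}{\partial \theta_j} = \frac{\partial \mathcal{L}}{\partial \theta_j} + 2\lambda\,\theta_j \quad\Longrightarrow\quad \theta_j \leftarrow \theta_j - \eta\left(\frac{\partial \mathcal{L}}{\partial \theta_j} + 2\lambda\,\theta_j\right)$$

This is what the weight_decay argument of PyTorch optimizers does: it adds a multiple of each weight to its gradient before the update (with the penalty written as $\lambda \sum_j \theta_j^2$, weight_decay plays the role of $2\lambda$).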
Dropout randomly “turns off” neurons during each training step. This means the network can’t rely on any one neuron, and must spread the learning throughout the network.
At inference time, all neurons are active. In classic dropout their outputs are scaled by $1-p$ to account for the neurons that were missing during training; PyTorch's nn.Dropout instead scales by $1/(1-p)$ during training (inverted dropout), so no extra scaling is needed at evaluation time.
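To see why this bookkeeping keeps the two modes consistent, take the expectation of a dropped-out activation using the notation from above:

$$\mathbb{E}[\tilde{a}_i] = \mathbb{E}[m_i]\,a_i = (1-p)\,a_i$$

Dividing the surviving activations by $1-p$ during training (or, in classic dropout, multiplying by $1-p$ at inference) keeps the expected activation the same in both modes.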
Let’s train a simple neural network with and without regularization techniques to see the effect.
uv pip install torch matplotlib
We’ll make a simple one-dimensional regression problem with noisy labels, and intentionally keep the dataset small (to make overfitting easy).
import torch
import matplotlib.pyplot as plt
from torch import nn
from typing import Tuple
def make_toy_data(n_samples: int=30, seed: int=42) -> Tuple[torch.Tensor, torch.Tensor]:
torch.manual_seed(seed)
X = torch.linspace(-3, 3, n_samples).reshape(-1, 1)
# True function: y = sin(x), noise added
y = torch.sin(X) + 0.3 * torch.randn_like(X)
return X, y
X_train, y_train = make_toy_data()
X_test = torch.linspace(-3, 3, 100).reshape(-1, 1)
y_test = torch.sin(X_test)
plt.scatter(X_train.numpy(), y_train.numpy(), label='Train data')
plt.plot(X_test.numpy(), y_test.numpy(), label='True function', color='green')
plt.legend()
plt.show()
class TinyNet(nn.Module):
def __init__(self):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(1, 64),
nn.Tanh(),
nn.Linear(64, 64),
nn.Tanh(),
nn.Linear(64, 1)
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.layers(x)
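As a quick sanity check (assuming the data and model definitions above have been run), it's worth counting how many parameters this “tiny” network has relative to our 30 training points; the probe name below is just a throwaway instance for counting:

probe = TinyNet()
n_params = sum(p.numel() for p in probe.parameters())
print(f"TinyNet has {n_params} trainable parameters for {X_train.shape[0]} training points")

With a few thousand weights chasing 30 noisy targets, the network has more than enough capacity to memorize the training set, which is exactly the regime where regularization matters.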
def train(model: nn.Module, X: torch.Tensor, y: torch.Tensor,
n_epochs: int, lr: float, l2: float=0.0, dropout: bool=False) -> list:
if dropout:
# Replace activation layers with Dropout+activation
new_layers = []
for layer in model.layers:
new_layers.append(layer)
if isinstance(layer, nn.Tanh): # Insert Dropout after activation
new_layers.append(nn.Dropout(p=0.3))
model.layers = nn.Sequential(*new_layers)
optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=l2)
loss_fn = nn.MSELoss()
losses = []
model.train()
for epoch in range(n_epochs):
y_pred = model(X)
loss = loss_fn(y_pred, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
losses.append(loss.item())
return losses
# Training without regularization
model_no_reg = TinyNet()
losses_no_reg = train(model_no_reg, X_train, y_train, n_epochs=500, lr=0.01)
plt.plot(losses_no_reg)
plt.title('Training Loss (No Regularization)')
plt.xlabel('Epoch')
plt.ylabel('MSE Loss')
plt.show()
# Plot predictions
model_no_reg.eval()
with torch.no_grad():
y_pred_train = model_no_reg(X_train)
y_pred_test = model_no_reg(X_test)
plt.scatter(X_train.numpy(), y_train.numpy(), label='Train')
plt.plot(X_test.numpy(), y_test.numpy(), label='True', color='green')
plt.plot(X_test.numpy(), y_pred_test.numpy(), label='Predicted', color='red')
plt.legend()
plt.title('Model Fit (No Regularization)')
plt.show()
model_l2 = TinyNet()
losses_l2 = train(model_l2, X_train, y_train, n_epochs=500, lr=0.01, l2=0.1)
plt.plot(losses_l2, label='Train Loss (L2)')
plt.xlabel('Epoch')
plt.ylabel('MSE Loss')
plt.title('Training Loss (L2 Regularization)')
plt.show()
# Plot predictions
model_l2.eval()
with torch.no_grad():
y_pred_test_l2 = model_l2(X_test)
plt.scatter(X_train.numpy(), y_train.numpy(), label='Train')
plt.plot(X_test.numpy(), y_test.numpy(), label='True', color='green')
plt.plot(X_test.numpy(), y_pred_test_l2.numpy(), label='Predicted (L2)', color='red')
plt.legend()
plt.title('Model Fit (L2 Regularization)')
plt.show()
model_dropout = TinyNet()
losses_drop = train(model_dropout, X_train, y_train, n_epochs=500, lr=0.01, dropout=True)
plt.plot(losses_drop, label='Train Loss (Dropout)')
plt.xlabel('Epoch')
plt.ylabel('MSE Loss')
plt.title('Training Loss (Dropout)')
plt.show()
model_dropout.eval()
with torch.no_grad():
y_pred_test_drop = model_dropout(X_test)
plt.scatter(X_train.numpy(), y_train.numpy(), label='Train')
plt.plot(X_test.numpy(), y_test.numpy(), label='True', color='green')
plt.plot(X_test.numpy(), y_pred_test_drop.numpy(), label='Predicted (Dropout)', color='red')
plt.legend()
plt.title('Model Fit (Dropout)')
plt.show()
Observe how the unregularized, L2-regularized, and dropout models fit the data differently.
plt.figure(figsize=(8,6))
plt.plot(X_test.numpy(), y_test.numpy(), label='True', color='green')
with torch.no_grad():
plt.plot(X_test.numpy(), model_no_reg(X_test).numpy(), label='No Reg', linestyle='--')
plt.plot(X_test.numpy(), model_l2(X_test).numpy(), label='L2', linestyle='-.')
plt.plot(X_test.numpy(), model_dropout(X_test).numpy(), label='Dropout', linestyle=':')
plt.scatter(X_train.numpy(), y_train.numpy(), label='Train', alpha=0.5)
plt.legend()
plt.title('Regularization Comparison')
plt.show()
Notice that the unregularized model tends to chase the noisy training points, while the L2-regularized and dropout models typically produce smoother curves that stay closer to the true sine function.
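As a quick numeric check of that observation (reusing the three models trained above), we can also compare each model's mean squared error against the noise-free test curve:

test_loss_fn = nn.MSELoss()
with torch.no_grad():
    for name, m in [('No Reg', model_no_reg), ('L2', model_l2), ('Dropout', model_dropout)]:
        m.eval()  # make sure dropout is off
        print(f"{name}: test MSE = {test_loss_fn(m(X_test), y_test).item():.4f}")

Exact numbers depend on the seed and hyperparameters, but the regularized fits typically track the true function more closely.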
Exercise 1: Vary L2 Regularization Strength
Modify the l2 argument when training your model. Try l2=0.001, 0.01, 0.1, and 1.0. Plot all the predicted functions.
What happens as you increase L2 regularization?
Solution:
for l2 in [0.001, 0.01, 0.1, 1.0]:
model = TinyNet()
train(model, X_train, y_train, n_epochs=500, lr=0.01, l2=l2)
with torch.no_grad():
y_pred = model(X_test)
plt.plot(X_test.numpy(), y_pred.numpy(), label=f'l2={l2}')
plt.plot(X_test.numpy(), y_test.numpy(), label='True', color='green', linewidth=2)
plt.scatter(X_train.numpy(), y_train.numpy(), color='black', alpha=0.3)
plt.legend()
plt.title('Effect of L2 Regularization')
plt.show()
Exercise 2: Try a Deeper Network With Dropout
Change TinyNet to have 4 hidden layers instead of 2. Retrain the model with dropout enabled.
What do you observe versus the shallow network?
Solution:
class DeepNet(nn.Module):
def __init__(self):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(1, 64),
nn.Tanh(),
nn.Linear(64, 64),
nn.Tanh(),
nn.Linear(64, 64),
nn.Tanh(),
nn.Linear(64, 64),
nn.Tanh(),
nn.Linear(64, 1)
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.layers(x)
deep_model = DeepNet()
train(deep_model, X_train, y_train, n_epochs=500, lr=0.01, dropout=True)
deep_model.eval()  # switch off dropout before predicting
with torch.no_grad():
y_pred_deep = deep_model(X_test)
plt.plot(X_test.numpy(), y_test.numpy(), label='True', color='green')
plt.plot(X_test.numpy(), y_pred_deep.numpy(), label='Dropout DeepNet', color='purple')
plt.scatter(X_train.numpy(), y_train.numpy(), alpha=0.4)
plt.legend()
plt.title('Deeper Net with Dropout')
plt.show()
Exercise 3: Add Both L2 and Dropout
Modify the train function to accept both l2 and dropout as arguments and run them together.
What do you see when both are used?
Solution: (Function defined above already supports both.)
both_model = TinyNet()
train(both_model, X_train, y_train, n_epochs=500, lr=0.01, l2=0.01, dropout=True)
both_model.eval()  # switch off dropout before predicting
with torch.no_grad():
y_pred_both = both_model(X_test)
plt.plot(X_test.numpy(), y_test.numpy(), label='True', color='green')
plt.plot(X_test.numpy(), y_pred_both.numpy(), label='L2 + Dropout', color='orange')
plt.scatter(X_train.numpy(), y_train.numpy(), alpha=0.4)
plt.legend()
plt.title('L2 + Dropout Combined')
plt.show()
Next up: We’ll recap Module 2 and see how our neural networks stack up against linear regression!
USER:
in this code i’m getting an error: model.layers is not iterable. how do i fix this?
for layer in model.layers:
new_layers.append(layer)
if isinstance(layer, nn.Tanh): # Insert Dropout after activation
new_layers.append(nn.Dropout(p=0.3))
ASSISTANT:
Great catch! The problem arises because in your code, model.layers is an instance of nn.Sequential, not a plain Python list.
The most explicit and reliable way to walk through its submodules is to iterate over its children using model.layers.children(), which yields each child module in order.
So, you should replace this:
for layer in model.layers:
new_layers.append(layer)
if isinstance(layer, nn.Tanh):
new_layers.append(nn.Dropout(p=0.3))
with this:
for layer in model.layers.children():
new_layers.append(layer)
if isinstance(layer, nn.Tanh): # Insert Dropout after activation
new_layers.append(nn.Dropout(p=0.3))
Alternatively, you can use model.layers as a list by accessing its _modules attribute values (though this is a bit more “private”):
for layer in model.layers._modules.values():
# ... as above
But the recommended way is:
for layer in model.layers.children():
new_layers.append(layer)
if isinstance(layer, nn.Tanh):
new_layers.append(nn.Dropout(p=0.3))
This will resolve the error!
USER:
can you please give me the full source code for the final demonstration including all three approaches?
ASSISTANT:
Absolutely! Below is a complete, runnable script that generates the toy dataset, defines the small network (TinyNet), trains three copies of it (no regularization, L2 weight decay, and Dropout), and plots both the loss curves and the resulting fits.
Copy-paste this code into a .py file and run it; you can tweak parameters as you wish!
import torch
from torch import nn
import matplotlib.pyplot as plt
from typing import Tuple, List
# ==== 1. Data Generation ====
def make_toy_data(n_samples: int = 30, seed: int = 42) -> Tuple[torch.Tensor, torch.Tensor]:
torch.manual_seed(seed)
X = torch.linspace(-3, 3, n_samples).reshape(-1, 1)
y = torch.sin(X) + 0.3 * torch.randn_like(X)
return X, y
X_train, y_train = make_toy_data() # 30 points with noise
X_test = torch.linspace(-3, 3, 100).reshape(-1, 1)
y_test = torch.sin(X_test) # true signal without noise
# ==== 2. Model Definition ====
class TinyNet(nn.Module):
def __init__(self):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(1, 64),
nn.Tanh(),
nn.Linear(64, 64),
nn.Tanh(),
nn.Linear(64, 1)
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.layers(x)
# ==== 3. Training Function ====
def train(
model: nn.Module,
X: torch.Tensor,
y: torch.Tensor,
n_epochs: int,
lr: float,
l2: float = 0.0,
dropout: bool = False,
dropout_p: float = 0.3
) -> List[float]:
if dropout:
# insert Dropout after each activation
new_layers = []
for layer in model.layers.children(): # FIX: use .children()
new_layers.append(layer)
if isinstance(layer, nn.Tanh):
new_layers.append(nn.Dropout(p=dropout_p))
model.layers = nn.Sequential(*new_layers)
optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=l2)
loss_fn = nn.MSELoss()
losses = []
model.train()
for epoch in range(n_epochs):
optimizer.zero_grad()
y_pred = model(X)
loss = loss_fn(y_pred, y)
loss.backward()
optimizer.step()
losses.append(loss.item())
return losses
# ==== 4. Training Three Models ====
# (A) No Regularization
model_no_reg = TinyNet()
losses_no_reg = train(model_no_reg, X_train, y_train, n_epochs=500, lr=0.01)
# (B) L2 Regularization (weight decay)
model_l2 = TinyNet()
losses_l2 = train(model_l2, X_train, y_train, n_epochs=500, lr=0.01, l2=0.1)
# (C) Dropout
model_dropout = TinyNet()
losses_dropout = train(model_dropout, X_train, y_train, n_epochs=500, lr=0.01, dropout=True, dropout_p=0.3)
# ==== 5. Plotting Loss Curves ====
plt.figure(figsize=(12,4))
plt.subplot(1,3,1)
plt.plot(losses_no_reg)
plt.title("No Regularization")
plt.xlabel("Epoch"); plt.ylabel("MSE Loss")
plt.subplot(1,3,2)
plt.plot(losses_l2)
plt.title("L2 Regularization")
plt.xlabel("Epoch"); plt.ylabel("MSE Loss")
plt.subplot(1,3,3)
plt.plot(losses_dropout)
plt.title("Dropout")
plt.xlabel("Epoch"); plt.ylabel("MSE Loss")
plt.tight_layout()
plt.show()
# ==== 6. Plotting Predictions ====
# Put all models in eval mode so the dropout layers are disabled for prediction
model_no_reg.eval()
model_l2.eval()
model_dropout.eval()
plt.figure(figsize=(8,6))
plt.plot(X_test.numpy(), y_test.numpy(), label='True Function', color='green', linewidth=2)
with torch.no_grad():
plt.plot(X_test.numpy(), model_no_reg(X_test).numpy(), label='No Reg', linestyle='--', color='orange')
plt.plot(X_test.numpy(), model_l2(X_test).numpy(), label='L2', linestyle='-.', color='blue')
plt.plot(X_test.numpy(), model_dropout(X_test).numpy(), label='Dropout', linestyle=':', color='red')
plt.scatter(X_train.numpy(), y_train.numpy(), label='Train Data', color='black', alpha=0.5)
plt.legend()
plt.title("Regularization Comparison: Model Predictions")
plt.xlabel("x"); plt.ylabel("y")
plt.show()
Notes: Requires torch and matplotlib. Install with uv pip install torch matplotlib.
Let me know if you want exercises or function refactors!
USER: