USER:
i need your help to write the next blog post in a series i am producing called “Learn the Training Loop with PyTorch”.
first, let me give you the outline of the series:
Goal: Build foundational intuition by hand-coding a simple training loop from scratch.
1. Introduction: What is a Training Loop?
2. The Simplest Case: Linear Regression
3. Batch vs. Stochastic Gradient Descent
4. Visualizing the Loss Landscape
5. Numerical vs. Analytical Gradients
6. Recap and Key Takeaways
Goal: Extend the basic loop to more powerful models; introduce neural network concepts.
1. From Linear to Nonlinear: Why Neural Networks?
2. Forward and Backward Passes
3. Implementing a Simple Neural Net from Scratch
4. The Role of Activations
5. Mini-batching and Data Pipelines
6. Regularization and Overfitting
7. Recap: Comparing Our Simple Network with Linear Regression
Goal: Connect foundational understanding to the large-scale models used in industry.
1. Optimization Algorithms Beyond SGD
2. Learning Rate Scheduling
3. Weight Initialization
4. Deeper Networks and Backprop Challenges
5. Large-Scale Training: Data Parallelism and Hardware
6. Monitoring and Debugging the Training Loop
7. Modern Regularization and Generalization Techniques
8. The Training Loop in Practice: Case Studies
9. Conclusion: What’s Next After the Training Loop?
great. now, for each blog post, i am hoping to follow the following structure:
that’s it, now can you please write blog post 2.3?
the description is:
3. Implementing a Simple Neural Net from Scratch
please respond with the full blog post only, and no other information.
ASSISTANT:
Welcome back to Learn the Training Loop with PyTorch! So far, we’ve understood the basic training loop and dived into the world of neural networks. In this lesson, we’re going to get our hands dirty and build a simple neural network from scratch—implementing each part ourselves. We’ll code a one-hidden-layer “multilayer perceptron” (MLP), covering the forward pass, the loss calculation, the backward pass (i.e., gradients!), and manual parameter updates.
Imagine you’re building a robot brain that can recognize handwritten numbers. This brain is a neural network—a bunch of little “switches” (neurons) that talk to each other. Each neuron takes some numbers, squishes them together, and passes them on. By chaining a few of these together (layers), our robot can learn all sorts of patterns.
In this lesson, we’re building a baby neural network with just one hidden layer. We’ll see, step by step, how data flows through the network, how it learns by looking at its mistakes, and how it updates itself to do better next time.
Let’s formalize what we mean by a “single hidden layer network.”
Suppose our input data is a matrix $X \in \mathbb{R}^{N \times D_{\text{in}}}$, where:
- $N$ is the number of samples
- $D_{\text{in}}$ is the number of input features

We want to map $X$ to an output $\hat{Y} \in \mathbb{R}^{N \times D_{\text{out}}}$ via a hidden layer of width $D_{\text{hidden}}$.

The computations are:

$$
H = X W_1 + b_1, \qquad H_{\text{relu}} = \mathrm{ReLU}(H), \qquad \hat{Y} = H_{\text{relu}} W_2 + b_2
$$

Where:
- $W_1 \in \mathbb{R}^{D_{\text{in}} \times D_{\text{hidden}}}$ and $b_1 \in \mathbb{R}^{D_{\text{hidden}}}$ are the hidden layer's weights and biases
- $W_2 \in \mathbb{R}^{D_{\text{hidden}} \times D_{\text{out}}}$ and $b_2 \in \mathbb{R}^{D_{\text{out}}}$ are the output layer's weights and biases
- $\mathrm{ReLU}(z) = \max(0, z)$ is applied elementwise

For a regression task, we often use mean squared error (MSE):

$$
\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2
$$

We need to compute the gradients of the loss with respect to each parameter ($W_1$, $b_1$, $W_2$, $b_2$), then update in the opposite direction of the gradient:

$$
\theta \leftarrow \theta - \eta \, \frac{\partial \mathcal{L}}{\partial \theta}
$$

Where $\eta$ is the learning rate.
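To make the shapes concrete, here's a quick sanity check using some arbitrary example sizes (these numbers are just for illustration — the real network below uses different ones):

```python
import torch

# Example sizes, chosen only to illustrate the shapes
N, D_in, D_hidden, D_out = 4, 3, 5, 1

X = torch.randn(N, D_in)           # input batch
W1 = torch.randn(D_in, D_hidden)   # hidden-layer weights
b1 = torch.zeros(D_hidden)         # hidden-layer bias
W2 = torch.randn(D_hidden, D_out)  # output-layer weights
b2 = torch.zeros(D_out)            # output-layer bias

H = X @ W1 + b1                    # (N, D_hidden)
H_relu = torch.relu(H)             # (N, D_hidden)
Y_hat = H_relu @ W2 + b2           # (N, D_out)

print(H.shape, H_relu.shape, Y_hat.shape)
# torch.Size([4, 5]) torch.Size([4, 5]) torch.Size([4, 1])
```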
Let's expand on what's going on:
1. Initialization: We start by randomly initializing all weights ($W_1$, $W_2$) and biases ($b_1$, $b_2$).
2. Forward Pass: For each batch of inputs $X$, compute the values at the hidden layer (`h`) via a linear transform, then apply an activation function (ReLU). Take that output and compute the final output (`y_pred`) via another linear transform.
3. Compute Loss: Calculate how far your predictions are from the ground truth, usually via MSE.
4. Backward Pass (Backpropagation): Use the chain rule to calculate the derivatives (gradients) of the loss with respect to every parameter. In PyTorch this is a single call to `loss.backward()`—under the hood, PyTorch builds a computation graph and works everything out for you.
5. Parameter Update: For each parameter, adjust it "a little" in the direction that reduces loss, using gradient descent.
6. Repeat: Go back to the start! With enough iterations, the model learns patterns in the data.
Let’s put this all into practice by building a simple one-hidden-layer neural network with manual parameter updates (no nn.Module, no torch.optim!).
We’ll do a basic regression: the network will learn to fit a simple function.
If you haven't installed PyTorch or matplotlib yet, run:

```bash
uv pip install torch matplotlib
```
```python
import torch
from torch import Tensor
import matplotlib.pyplot as plt
```
Let's generate a simple regression task: predict $y = \sin(x)$ for $x \in [-2\pi, 2\pi]$.
```python
def generate_sine_data(n_samples: int = 100) -> tuple[Tensor, Tensor]:
    # Inputs: shape (N, 1)
    x = torch.linspace(-2 * torch.pi, 2 * torch.pi, n_samples).unsqueeze(1)
    # Outputs: shape (N, 1)
    y = torch.sin(x)
    return x, y
```
```python
X, y_true = generate_sine_data(300)

D_in = 1       # Input size
D_hidden = 10  # Hidden layer width
D_out = 1      # Output size

# Manual parameter initialization with requires_grad=True
W1 = torch.randn(D_in, D_hidden, requires_grad=True)
b1 = torch.zeros(D_hidden, requires_grad=True)
W2 = torch.randn(D_hidden, D_out, requires_grad=True)
b2 = torch.zeros(D_out, requires_grad=True)

learning_rate = 1e-2
n_epochs = 1000
losses: list[float] = []

for epoch in range(n_epochs):
    # Forward pass: linear, activation, linear
    h: Tensor = X @ W1 + b1            # shape: (N, D_hidden)
    h_relu: Tensor = torch.relu(h)
    y_pred: Tensor = h_relu @ W2 + b2  # shape: (N, D_out)

    # Compute and record loss
    loss: Tensor = ((y_pred - y_true) ** 2).mean()
    losses.append(loss.item())

    # Backward pass
    loss.backward()

    # Update parameters using SGD (gradient descent)
    with torch.no_grad():
        W1 -= learning_rate * W1.grad
        b1 -= learning_rate * b1.grad
        W2 -= learning_rate * W2.grad
        b2 -= learning_rate * b2.grad

        # Zero gradients for next iteration
        W1.grad.zero_()
        b1.grad.zero_()
        W2.grad.zero_()
        b2.grad.zero_()

    # Print progress every 100 epochs
    if (epoch + 1) % 100 == 0:
        print(f"Epoch {epoch+1:4d}, Loss: {loss.item():.4f}")
```
```python
plt.figure(figsize=(8, 4))

# Model predictions
with torch.no_grad():
    h = X @ W1 + b1
    h_relu = torch.relu(h)
    y_pred = h_relu @ W2 + b2

plt.plot(X.numpy(), y_true.numpy(), label="True Function")
plt.plot(X.numpy(), y_pred.numpy(), label="Neural Net Prediction")
plt.legend()
plt.title("Neural Network Fit to sin(x)")
plt.xlabel("x")
plt.ylabel("y")
plt.show()

# Plot training loss
plt.figure()
plt.plot(losses)
plt.title("Training Loss Curve")
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")
plt.show()
```
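As an aside — and a preview of tools we'll lean on in later posts — here's a rough sketch of how the same model could be written with the `nn.Module` containers and `torch.optim` optimizers we're deliberately avoiding today (one possible version, not the only way):

```python
import torch
from torch import nn

# The same one-hidden-layer MLP, expressed with PyTorch's built-in layers
model = nn.Sequential(
    nn.Linear(D_in, D_hidden),
    nn.ReLU(),
    nn.Linear(D_hidden, D_out),
)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for epoch in range(n_epochs):
    y_pred = model(X)               # forward pass
    loss = loss_fn(y_pred, y_true)  # compute loss
    optimizer.zero_grad()           # clear old gradients
    loss.backward()                 # backward pass
    optimizer.step()                # parameter update
```

Notice that every line of this sketch corresponds to a step we just wrote by hand.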
Try the following exercises to deepen your understanding. Solutions are provided for reference!
Task: Replace the ReLU activation with a Tanh activation. How does the model’s prediction change?
```python
# Replace torch.relu with torch.tanh everywhere in the training loop
learning_rate = 1e-2
n_epochs = 1000

# Re-initialize weights
W1 = torch.randn(D_in, D_hidden, requires_grad=True)
b1 = torch.zeros(D_hidden, requires_grad=True)
W2 = torch.randn(D_hidden, D_out, requires_grad=True)
b2 = torch.zeros(D_out, requires_grad=True)

losses: list[float] = []
for epoch in range(n_epochs):
    h = X @ W1 + b1
    h_act = torch.tanh(h)  # <-- changed here
    y_pred = h_act @ W2 + b2
    loss = ((y_pred - y_true) ** 2).mean()
    losses.append(loss.item())

    loss.backward()
    with torch.no_grad():
        W1 -= learning_rate * W1.grad
        b1 -= learning_rate * b1.grad
        W2 -= learning_rate * W2.grad
        b2 -= learning_rate * b2.grad
        W1.grad.zero_()
        b1.grad.zero_()
        W2.grad.zero_()
        b2.grad.zero_()
```
Task: Increase `D_hidden` from 10 to 50. What happens to the learned function?
```python
D_hidden = 50  # Increased hidden layer width

# Redefine and rerun the entire model
W1 = torch.randn(D_in, D_hidden, requires_grad=True)
b1 = torch.zeros(D_hidden, requires_grad=True)
W2 = torch.randn(D_hidden, D_out, requires_grad=True)
b2 = torch.zeros(D_out, requires_grad=True)

losses: list[float] = []
for epoch in range(n_epochs):
    h = X @ W1 + b1
    h_relu = torch.relu(h)
    y_pred = h_relu @ W2 + b2
    loss = ((y_pred - y_true) ** 2).mean()
    losses.append(loss.item())

    loss.backward()
    with torch.no_grad():
        W1 -= learning_rate * W1.grad
        b1 -= learning_rate * b1.grad
        W2 -= learning_rate * W2.grad
        b2 -= learning_rate * b2.grad
        W1.grad.zero_()
        b1.grad.zero_()
        W2.grad.zero_()
        b2.grad.zero_()
```
Task: Add code to stop training when the loss is below 0.01.
```python
early_stop_threshold = 0.01

for epoch in range(n_epochs):
    h = X @ W1 + b1
    h_relu = torch.relu(h)
    y_pred = h_relu @ W2 + b2
    loss = ((y_pred - y_true) ** 2).mean()
    losses.append(loss.item())

    if loss.item() < early_stop_threshold:
        print(f"Early stop at epoch {epoch+1}, Loss: {loss.item():.4f}")
        break

    loss.backward()
    with torch.no_grad():
        W1 -= learning_rate * W1.grad
        b1 -= learning_rate * b1.grad
        W2 -= learning_rate * W2.grad
        b2 -= learning_rate * b2.grad
        W1.grad.zero_()
        b1.grad.zero_()
        W2.grad.zero_()
        b2.grad.zero_()
```
Task: Add Gaussian noise to the targets `y_true` and train the network again. How does the learned function change?
```python
X, y_true = generate_sine_data(300)
noise = 0.1 * torch.randn_like(y_true)
y_noisy = y_true + noise
# Use y_noisy instead of y_true in your training loop
```
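Optionally, before retraining, you can visualize how the noisy targets scatter around the true curve (a quick sketch):

```python
plt.figure(figsize=(8, 4))
plt.scatter(X.numpy(), y_noisy.numpy(), s=10, alpha=0.5, label="Noisy targets")
plt.plot(X.numpy(), y_true.numpy(), color="black", label="True sin(x)")
plt.legend()
plt.title("Training Targets with Gaussian Noise")
plt.show()
```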
We've built a fully functioning, one-hidden-layer neural network from scratch using PyTorch tensors and manual parameter updates. Along the way, we've seen:
- How the forward pass, loss, backward pass (`.backward()`), and parameter updates (inside `torch.no_grad()`) fit together into a complete training loop
- What `nn.Module` and `torch.optim` will be doing for us once we let them take over this boilerplate

Next up, we'll see why activation functions matter, and how they let neural networks discover richer patterns than a single linear function. See you there!
USER:
this line gets a type error because “loss refers to itself”:
losses.append(loss.item()) # type: ignore
any idea how to fix that type error without ignoring it?
ASSISTANT:
Absolutely! The type error most likely happens because your variable name `loss` is being interpreted as the module/function `loss` from another import (which sometimes happens when a static type checker or a linter gets confused by shadowing or forward references). In the code shown above, however, `loss` is just a local variable of type `torch.Tensor`. If you want to help static type checkers understand this, you can annotate `loss` explicitly in your loop:

```python
loss: torch.Tensor = ((y_pred - y_true) ** 2).mean()
```

This tells tools like mypy (and IDEs with type inference) that `loss` is a `torch.Tensor`, so `loss.item()` is valid.
If you think there may be confusion with other `loss` objects in outer scopes, rename the variable:

```python
loss_value: torch.Tensor = ((y_pred - y_true) ** 2).mean()
losses.append(loss_value.item())
```
If you prefer not to annotate inline, use a type comment:

```python
loss = ((y_pred - y_true) ** 2).mean()  # type: torch.Tensor
```
Putting it together inside the training loop:

```python
for epoch in range(n_epochs):
    # ... your code ...
    loss: torch.Tensor = ((y_pred - y_true) ** 2).mean()
    losses.append(loss.item())
    # ... rest of training loop ...
```
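Another option is to wrap the loss computation in a small typed helper so the checker infers the type from the return annotation — a sketch (the helper name `compute_mse` is just an example):

```python
from torch import Tensor

def compute_mse(y_pred: Tensor, y_true: Tensor) -> Tensor:
    """Mean squared error as a Tensor, so .item() type-checks cleanly."""
    return ((y_pred - y_true) ** 2).mean()

# Inside the loop:
# loss = compute_mse(y_pred, y_true)
# losses.append(loss.item())
```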
This should satisfy type checkers and avoid the need for `# type: ignore`.
USER:
in the second exercise, when i increase d_hidden from 10 to 50, i no longer get a match at all. loss increases. is this expected?
ASSISTANT:
Great observation! Increasing the width of your hidden layer from 10 to 50 does not guarantee better learning in this simple training setup, and sometimes it can make convergence harder or even worse. Here’s why, and what you can do about it:
Why it can get harder:
- Parameter Initialization Becomes More Important: with 50 hidden units, weights drawn from a standard normal produce larger pre-activations and gradients, which can destabilize training.
- Learning Rate Sensitivity: a step size that worked for 10 units can easily overshoot with 50.
- Overfitting Is More Likely (in some cases): more parameters can fit noise, though on a clean sine wave instability is usually the bigger problem.
- Optimization Dynamics: plain SGD in a larger parameter space can stall or bounce around from a poor starting point.

What to try:
- Try Lowering the Learning Rate: reduce `learning_rate` by 10x or 100x (e.g., try `1e-3` or `1e-4`).
- Check Weight Initialization: use `torch.randn(...) * 0.1` instead of `torch.randn(...)` for smaller initial weights, which usually helps:

  ```python
  W1 = torch.randn(D_in, D_hidden) * 0.1
  W1.requires_grad_()
  ```

  And the same for `W2`.
- Increase Number of Epochs: a wider network may simply need more iterations to converge.
- Monitor Activations: print or plot `h_relu` to see if units get stuck (e.g., all zeros for ReLU) — see the sketch after the code below.

It's not just about more neurons = more power; it's about the right initialization, learning rate, and sometimes advanced training tricks. Modern deep learning code built on PyTorch's `nn.Linear` gets careful initialization by default (and usually adaptive optimizers on top), which is why larger networks tend to train robustly there. In your from-scratch setup, these details matter a lot!

A suggested restart with smaller weights and a lower learning rate:
```python
D_hidden = 50
learning_rate = 1e-3  # try lower

W1 = (torch.randn(D_in, D_hidden) * 0.1).requires_grad_()
b1 = torch.zeros(D_hidden, requires_grad=True)
W2 = (torch.randn(D_hidden, D_out) * 0.1).requires_grad_()
b2 = torch.zeros(D_out, requires_grad=True)
```
And the rest of the loop as before.
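And here's a minimal sketch of the "monitor activations" idea: the same loop as before, plus a check on how many hidden units are stuck at zero for every input in the batch.

```python
for epoch in range(n_epochs):
    h = X @ W1 + b1
    h_relu = torch.relu(h)
    y_pred = h_relu @ W2 + b2
    loss = ((y_pred - y_true) ** 2).mean()

    loss.backward()
    with torch.no_grad():
        W1 -= learning_rate * W1.grad
        b1 -= learning_rate * b1.grad
        W2 -= learning_rate * W2.grad
        b2 -= learning_rate * b2.grad
        W1.grad.zero_()
        b1.grad.zero_()
        W2.grad.zero_()
        b2.grad.zero_()

    if (epoch + 1) % 100 == 0:
        # A hidden unit is "dead" if it outputs zero for every sample in the batch
        dead_fraction = (h_relu == 0).all(dim=0).float().mean().item()
        print(f"Epoch {epoch+1:4d}, Loss: {loss.item():.4f}, "
              f"dead ReLU units: {dead_fraction:.0%}")
```

If a large fraction of units stays dead for many epochs, smaller initial weights or a lower learning rate usually helps.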
Let me know if you try these suggestions and see improved training!
USER: