USER:
i need your help to write the next blog post in a series i am producing called “Learn the Training Loop with PyTorch”.
first, let me give you the outline of the series:
Goal: Build foundational intuition by hand-coding a simple training loop from scratch.
1. Introduction: What is a Training Loop?
2. The Simplest Case: Linear Regression
3. Batch vs. Stochastic Gradient Descent
4. Visualizing the Loss Landscape
5. Numerical vs. Analytical Gradients
6. Recap and Key Takeaways
Goal: Extend the basic loop to more powerful models; introduce neural network concepts.
1. From Linear to Nonlinear: Why Neural Networks?
2. Forward and Backward Passes
3. Implementing a Simple Neural Net from Scratch
4. The Role of Activations
5. Mini-batching and Data Pipelines
6. Regularization and Overfitting
7. Recap: Comparing Our Simple Network with Linear Regression
Goal: Connect foundational understanding to the large-scale models used in industry.
1. Optimization Algorithms Beyond SGD
2. Learning Rate Scheduling
3. Weight Initialization
4. Deeper Networks and Backprop Challenges
5. Large-Scale Training: Data Parallelism and Hardware
6. Monitoring and Debugging the Training Loop
7. Modern Regularization and Generalization Techniques
8. The Training Loop in Practice: Case Studies
9. Conclusion: What’s Next After the Training Loop?
great. now, for each blog post, i am hoping to follow the following structure:
that’s it, now can you please write blog post 1.2? the description is:
2. The Simplest Case: Linear Regression
please respond with the full blog post only, and no other information.
ASSISTANT:
Welcome to the second lesson in our “Learn the Training Loop with PyTorch” series! In this post, we’ll dive into the most classic example of machine learning: linear regression. By understanding this simple case, you’ll build a rock-solid foundation for everything to come—neural networks, deep learning, and beyond.
We’ll answer: What is linear regression? How do we measure how “off” our predictions are? And how does gradient descent find the best line?
We’ll also put all this into practice with clear PyTorch code and exercises so you can see theory in action.
Imagine you want to predict someone’s height just by knowing their age. If you plot a bunch of people’s ages and heights on graph paper and the dots look like they roughly follow a straight line, you can guess someone’s height by “drawing a line through the dots.” Linear regression is just a method for drawing the best straight line through the data, so you can predict Y (height) from X (age).
Let’s formalize linear regression and see where loss and updates come from.
Suppose you have $N$ data points, each with one feature $x_i$ and target $y_i$. The simplest linear regression model predicts:

$$\hat{y}_i = w x_i + b$$

Where:

- $w$ is the weight (the slope of the line)
- $b$ is the bias (the intercept)

For many datapoints at once, with vectors:

$$\hat{\mathbf{y}} = w \mathbf{x} + b$$

or, with matrices (for more than 1 feature):

$$\hat{\mathbf{y}} = X \mathbf{w} + b$$

Where:

- $X$ is the $N \times D$ matrix of inputs (one row per data point, one column per feature)
- $\mathbf{w}$ is the vector of weights (one per feature)
- $b$ is the bias, added to every row

The most common way to measure how “off” our predictions are is Mean Squared Error (MSE):

$$L = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$

We want to find $w$ and $b$ that make the loss as small as possible. To do this, we use gradient descent: compute the gradient of the loss with respect to each parameter, then step in the opposite direction. Let’s compute the gradients for $w$ and $b$ (for 1 feature, 1 output):

For $w$:

$$\frac{\partial L}{\partial w} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)\, x_i$$

For $b$:

$$\frac{\partial L}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)$$

Gradient descent update: at each iteration,

$$w \leftarrow w - \eta \frac{\partial L}{\partial w}, \qquad b \leftarrow b - \eta \frac{\partial L}{\partial b}$$

where $\eta$ is the learning rate – a small positive value controlling our step size.
Let’s break it down with a concrete step. Suppose $w = 0$, $b = 0$, and we have a single data point $(x, y) = (1, 1)$. The prediction is $\hat{y} = 0$, so $\frac{\partial L}{\partial w} = 2(0 - 1)(1) = -2$ and $\frac{\partial L}{\partial b} = 2(0 - 1) = -2$. With $\eta = 0.1$, one update moves $w$ from $0$ to $0.2$ and $b$ from $0$ to $0.2$: a small step toward the true line, repeated until the loss stops shrinking.
Even though computers usually solve linear regression with one matrix formula (the “normal equation”), gradient descent is fundamental. It’s the backbone of how we train neural networks!
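For comparison, here’s a minimal sketch of that closed-form solution using torch.linalg.lstsq (it assumes the x and y tensors we generate in the next section):

X_aug = torch.cat([x, torch.ones_like(x)], dim=1)  # append a column of ones so the bias is solved for too
solution = torch.linalg.lstsq(X_aug, y).solution   # least-squares solve of X_aug @ params = y
w_closed, b_closed = solution[0].item(), solution[1].item()
print(f"Closed-form: w = {w_closed:.4f}, b = {b_closed:.4f}")

Gradient descent should converge to (roughly) these same values.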
Let’s put theory into practice. We’ll generate some fake data and fit a single-feature linear regression model using PyTorch—and we’ll actually manually implement the loss and update rules, just like the math above!
First, if you need to install PyTorch and matplotlib, use:
uv pip install torch matplotlib
import torch
import matplotlib.pyplot as plt
# Make reproducible
torch.manual_seed(42)
true_w = 2.0 # True slope
true_b = -1.0 # True intercept
N = 100 # Number of data points
# x: Uniformly spaced between -1 and 1
x: torch.Tensor = torch.linspace(-1, 1, N).unsqueeze(1) # Shape (N, 1)
# y: Linear relation plus some noise
y: torch.Tensor = true_w * x + true_b + 0.1 * torch.randn(N, 1)
plt.scatter(x.numpy(), y.numpy(), label='Data')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Fake Linear Data')
plt.legend()
plt.show()
# Initialize parameters (weights and bias) randomly
w: torch.Tensor = torch.randn(1, requires_grad=False)
b: torch.Tensor = torch.randn(1, requires_grad=False)
learning_rate: float = 0.1
num_epochs: int = 50
losses = []
for epoch in range(num_epochs):
    # 1. Forward pass: compute predictions
    y_pred: torch.Tensor = w * x + b

    # 2. Compute loss (mean squared error)
    loss: torch.Tensor = ((y_pred - y) ** 2).mean()
    losses.append(loss.item())

    # 3. Compute gradients (dL/dw and dL/db) using the math above
    grad_w: torch.Tensor = (2 / N) * torch.sum((y_pred - y) * x)
    grad_b: torch.Tensor = (2 / N) * torch.sum(y_pred - y)

    # 4. Update parameters manually
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

    # Print progress
    if (epoch + 1) % 10 == 0 or epoch == 0:
        print(f"Epoch {epoch+1:2d}: loss = {loss.item():.4f}, w = {w.item():.4f}, b = {b.item():.4f}")
print(f"\nLearned parameters: w = {w.item():.4f}, b = {b.item():.4f}")
print(f"True parameters: w = {true_w:.4f}, b = {true_b:.4f}")
# Plot data and learned regression line
plt.scatter(x.numpy(), y.numpy(), label='Data')
plt.plot(x.numpy(), (w * x + b).detach().numpy(), color='red', label='Learned Line')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.title('Linear Regression Fit (PyTorch)')
plt.show()
# Plot loss curve
plt.plot(losses)
plt.xlabel('Epoch')
plt.ylabel('MSE Loss')
plt.title('Loss During Training')
plt.show()
You should see the line gradually fit the data, and the loss drop smoothly.
Exercise 1: Try setting learning_rate = 1.0 and then learning_rate = 0.01 in the training loop above, and observe how the fit and the loss curve change.
Solution:
# Just change 'learning_rate' to 1.0 or 0.01 in the above loop and observe results.
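To compare several learning rates side by side, here’s a minimal sketch that wraps the loop in a helper function (train_linreg is a name introduced just for this sketch; it reuses the x, y, and N defined above):

def train_linreg(lr: float, num_epochs: int = 50):
    torch.manual_seed(0)  # same random init for every run
    w = torch.randn(1)
    b = torch.randn(1)
    for _ in range(num_epochs):
        y_pred = w * x + b
        grad_w = (2 / N) * torch.sum((y_pred - y) * x)
        grad_b = (2 / N) * torch.sum(y_pred - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w.item(), b.item()

for lr in (1.0, 0.1, 0.01):
    w_fit, b_fit = train_linreg(lr)
    print(f"lr = {lr}: w = {w_fit:.4f}, b = {b_fit:.4f}")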
Exercise 2: Instead of $y = wx + b$, generate new data with $y = w_1 x_1 + w_2 x_2 + b$ for random $x_1, x_2$, and fit a model with two weights.
Solution:
# Generate data with two features
torch.manual_seed(42)
true_w = torch.tensor([1.5, -3.0]) # True weights
true_b = 0.5
N = 100
X: torch.Tensor = torch.rand(N, 2) * 2 - 1 # random values in [-1, 1]
y: torch.Tensor = X @ true_w + true_b + 0.1 * torch.randn(N)
# Initialize weights
w: torch.Tensor = torch.randn(2, requires_grad=False)
b: torch.Tensor = torch.randn(1, requires_grad=False)
learning_rate = 0.1
for epoch in range(50):
    y_pred = X @ w + b
    loss = ((y_pred - y) ** 2).mean()
    grad_w = (2 / N) * (X * (y_pred - y).unsqueeze(1)).sum(dim=0)
    grad_b = (2 / N) * (y_pred - y).sum()
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
print(f"Learned w: {w}")
print(f"True w:    {true_w}")
print(f"Learned b: {b.item():.4f} (true b: {true_b})")
Exercise 3: Try rewriting the training loop using PyTorch’s auto-differentiation (autograd), so you don’t hand-code the gradients.
Solution:
# Parameters: set requires_grad=True so autograd tracks operations on them
w = torch.randn(1, requires_grad=True)
b = torch.randn(1, requires_grad=True)
optimizer = torch.optim.SGD([w, b], lr=0.1)
for epoch in range(50):
    optimizer.zero_grad()              # clear gradients from the previous step
    y_pred = w * x + b                 # forward pass
    loss = ((y_pred - y) ** 2).mean()  # MSE loss
    loss.backward()                    # autograd computes dL/dw and dL/db
    optimizer.step()                   # apply the SGD update
print(f"Learned parameters: w = {w.item():.4f}, b = {b.item():.4f}")
Notice how loss.backward() replaces our hand-coded gradient formulas: autograd derives the same dL/dw and dL/db for us, and optimizer.step() applies the update.
We’ve seen how linear regression works: the model $\hat{y} = wx + b$, the mean squared error loss, and gradient descent updates that nudge $w$ and $b$ toward the best fit.
Next time, we’ll dig deeper into the details of how we compute gradients—by hand and with PyTorch’s “autograd”—and see why this matters for more powerful models!
USER:
there are two weird things that happen in the first exercise. when i lower the learning rate to 0.01, it no longer fits. and when i raise it to 1.0, the entire line “fits”, but offset, like the weights changed, but the bias is wrong. are these results to be expected?
ASSISTANT:
Great observations! Let’s analyze each case in the context of the manual gradient descent linear regression code provided. Yes – both results are expected, and each one illustrates a classic failure mode of a fixed learning rate.

Case 1: learning_rate = 0.01 (too small). The updates still point in the right direction, but each step is tiny, and 50 epochs simply isn’t enough to get close to the optimum, so the line “no longer fits.” If you also increase num_epochs (say, to 5000), it converges to the same solution as before.

Case 2: learning_rate = 1.0 (too large for the bias). For MSE, the loss curvature with respect to $b$ is exactly 2, and plain gradient descent on a quadratic only converges when $\eta \times \text{curvature} < 2$. At $\eta = 1.0$, the bias update overshoots the optimum by the same amount every step, so $b$ oscillates forever and never settles. The curvature with respect to $w$ is much smaller (about $\frac{2}{N}\sum_i x_i^2 \approx 2/3$ for our $x$ in $[-1, 1]$), so $w$ still converges. That’s exactly why the slope looks right while the line sits at the wrong offset.

This is a common phenomenon with large learning rates, especially in simple hand-coded loops where we don’t use tricks like momentum or learning rate schedules. With the original learning_rate = 0.1, the fitting should be reasonably fast and accurate.

Advanced note: popular optimizers like Adam or learning rate schedules dampen these effects, but in our hand-coded example they’re much more visible.
Bonus: plot w and b over time for different learning rates. You’ll see that with too high a learning rate, b jumps back and forth or diverges, while with too low a learning rate, both parameters update slowly.
Here’s a minimal sketch to visualize how w and b change over training for different learning rates (it regenerates the same data as the blog post, so it runs standalone):
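import torch
import matplotlib.pyplot as plt

# Regenerate the same data as in the blog post
torch.manual_seed(42)
N = 100
x = torch.linspace(-1, 1, N).unsqueeze(1)
y = 2.0 * x - 1.0 + 0.1 * torch.randn(N, 1)

for lr in (1.0, 0.1, 0.01):
    w, b = torch.zeros(1), torch.zeros(1)
    w_hist, b_hist = [], []
    for _ in range(50):
        y_pred = w * x + b
        grad_w = (2 / N) * torch.sum((y_pred - y) * x)
        grad_b = (2 / N) * torch.sum(y_pred - y)
        w -= lr * grad_w
        b -= lr * grad_b
        w_hist.append(w.item())
        b_hist.append(b.item())
    plt.plot(w_hist, label=f"w (lr={lr})")
    plt.plot(b_hist, linestyle="--", label=f"b (lr={lr})")

plt.xlabel("Epoch")
plt.ylabel("Parameter value")
plt.title("Parameter trajectories for different learning rates")
plt.legend()
plt.show()

With lr=1.0 you should see b zigzagging around its target while w settles; with lr=0.01 both curves crawl.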
USER:
i tried doing exercise 3 with the autograd, but it doesn’t work. here is the code:
import torch
import matplotlib.pyplot as plt
from typing import List
# Make reproducible
torch.manual_seed(42)
true_w = 2.0 # True slope
true_b = -1.0 # True intercept
N = 100 # Number of data points
# x: Uniformly spaced between -1 and 1
x: torch.Tensor = torch.linspace(-1, 1, N).unsqueeze(1) # Shape (N, 1)
# y: Linear relation plus some noise
y: torch.Tensor = true_w * x + true_b + 0.1 * torch.randn(N, 1)
plt.scatter(x.numpy(), y.numpy(), label="Data")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Fake Linear Data")
plt.legend()
plt.show()
# Initialize parameters (weights and bias) randomly
w: torch.Tensor = torch.randn(1, requires_grad=False)
b: torch.Tensor = torch.randn(1, requires_grad=False)
learning_rate: float = 0.1
num_epochs: int = 50
losses: List[float] = []
optimizer = torch.optim.SGD([w, b], lr=0.1)
for epoch in range(50):
    optimizer.zero_grad()
    y_pred = w * x + b
    loss = ((y_pred - y) ** 2).mean()
    loss.backward()
    optimizer.step()
    # Print progress
    if (epoch + 1) % 10 == 0 or epoch == 0:
        print(
            f"Epoch {epoch+1:2d}: loss = {loss.item():.4f}, w = {w.item():.4f}, b = {b.item():.4f}"
        )
print(f"\nLearned parameters: w = {w.item():.4f}, b = {b.item():.4f}")
print(f"True parameters: w = {true_w:.4f}, b = {true_b:.4f}")
# Plot data and learned regression line
plt.scatter(x.numpy(), y.numpy(), label="Data")
plt.plot(x.numpy(), (w * x + b).detach().numpy(), color="red", label="Learned Line")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.title("Linear Regression Fit (PyTorch)")
plt.show()
# Plot loss curve
plt.plot(losses)
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")
plt.title("Loss During Training")
plt.show()
here is the error:
Traceback (most recent call last):
File "/Users/ryan/dev/artintellica/py/blog-0097-training-loop-12/ex_3.py", line 40, in <module>
loss.backward()
File "/Users/ryan/dev/artintellica/.venv/lib/python3.11/site-packages/torch/_tensor.py", line 648, in backward
torch.autograd.backward(
File "/Users/ryan/dev/artintellica/.venv/lib/python3.11/site-packages/torch/autograd/__init__.py", line 353, in backward
_engine_run_backward(
File "/Users/ryan/dev/artintellica/.venv/lib/python3.11/site-packages/torch/autograd/graph.py", line 824, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
did i do something wrong? or can this be fixed?
ASSISTANT:
Great question – you’re very close! The error:

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

means that PyTorch can only compute gradients for tensors created with requires_grad=True.
But in your code, you have:
w: torch.Tensor = torch.randn(1, requires_grad=False)
b: torch.Tensor = torch.randn(1, requires_grad=False)
So autograd can’t compute gradients for these parameters, and the optimizer has nothing to update.
The fix: create w and b with requires_grad=True:
w = torch.randn(1, requires_grad=True)
b = torch.randn(1, requires_grad=True)
That’s it! Everything else stays the same. Now autograd tracks w and b, and the optimizer can use their gradients to update them. Here’s the full corrected script:
import torch
import matplotlib.pyplot as plt
from typing import List
torch.manual_seed(42)
true_w = 2.0
true_b = -1.0
N = 100
x: torch.Tensor = torch.linspace(-1, 1, N).unsqueeze(1)
y: torch.Tensor = true_w * x + true_b + 0.1 * torch.randn(N, 1)
plt.scatter(x.numpy(), y.numpy(), label="Data")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Fake Linear Data")
plt.legend()
plt.show()
# ✅ Requires grad for autograd!
w: torch.Tensor = torch.randn(1, requires_grad=True)
b: torch.Tensor = torch.randn(1, requires_grad=True)
optimizer = torch.optim.SGD([w, b], lr=0.1)
losses: List[float] = []
for epoch in range(50):
    optimizer.zero_grad()
    y_pred = w * x + b
    loss = ((y_pred - y) ** 2).mean()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
    if (epoch + 1) % 10 == 0 or epoch == 0:
        print(
            f"Epoch {epoch+1:2d}: loss = {loss.item():.4f}, w = {w.item():.4f}, b = {b.item():.4f}"
        )
print(f"\nLearned parameters: w = {w.item():.4f}, b = {b.item():.4f}")
print(f"True parameters: w = {true_w:.4f}, b = {true_b:.4f}")
plt.scatter(x.numpy(), y.numpy(), label="Data")
plt.plot(x.numpy(), (w * x + b).detach().numpy(), color="red", label="Learned Line")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.title("Linear Regression Fit (PyTorch)")
plt.show()
plt.plot(losses)
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")
plt.title("Loss During Training")
plt.show()
Why this works:

- requires_grad=True tells PyTorch to track operations so it can compute gradients for w and b during .backward().
- With requires_grad=False, you get the error you saw, because PyTorch can’t compute the gradient.
- Whenever you want PyTorch to optimize a tensor, set requires_grad=True on it!
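If you want to see the difference in isolation, here’s a minimal standalone sketch (not part of the regression script):

import torch

a = torch.tensor([1.0], requires_grad=True)  # tracked by autograd
c = torch.tensor([1.0])                      # requires_grad defaults to False

(a * 2).sum().backward()  # works: autograd recorded the multiply
print(a.grad)             # tensor([2.])

try:
    (c * 2).sum().backward()  # fails: no graph was recorded for c
except RuntimeError as e:
    print(e)  # "element 0 of tensors does not require grad and does not have a grad_fn"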
If you have follow-up questions (e.g. about .detach() when plotting, or how parameters are stored in more complex models), just ask!
USER: