USER:
i need your help to write the next blog post in a series i am producing called “Learn the Training Loop with PyTorch”.
first, let me give you the outline of the series:
Goal: Build foundational intuition by hand-coding a simple training loop from scratch.
1. Introduction: What is a Training Loop?
2. The Simplest Case: Linear Regression
3. Batch vs. Stochastic Gradient Descent
4. Visualizing the Loss Landscape
5. Numerical vs. Analytical Gradients
6. Recap and Key Takeaways
Goal: Extend the basic loop to more powerful models; introduce neural network concepts.
1. From Linear to Nonlinear: Why Neural Networks?
2. Forward and Backward Passes
3. Implementing a Simple Neural Net from Scratch
4. The Role of Activations
5. Mini-batching and Data Pipelines
6. Regularization and Overfitting
7. Recap: Comparing Our Simple Network with Linear Regression
Goal: Connect foundational understanding to the large-scale models used in industry.
1. Optimization Algorithms Beyond SGD
2. Learning Rate Scheduling
3. Weight Initialization
4. Deeper Networks and Backprop Challenges
5. Large-Scale Training: Data Parallelism and Hardware
6. Monitoring and Debugging the Training Loop
7. Modern Regularization and Generalization Techniques
8. The Training Loop in Practice: Case Studies
9. Conclusion: What’s Next After the Training Loop?
great. now, for each blog post, i am hoping to follow the following structure:
that’s it, now can you please write blog post 1.4? the description is:
4. Visualizing the Loss Landscape
please respond with the full blog post only, and no other information.
ASSISTANT:
Welcome back to Learn the Training Loop with PyTorch! In our last posts, we explored what a training loop is, built a basic linear regression model, understood the role of loss functions, and looked at different ways to perform gradient descent. But so far, it might feel abstract: what does all this mean, visually? Today, we’ll bridge the gap by visualizing the “loss landscape” — the surface that tells us how good or bad a model’s predictions are, depending on its parameters.
Imagine you’re hiking in a hilly landscape, looking for the lowest valley. Each point you stand on has a height: low points mean you’re close to your goal, high points mean you need to keep searching. For machine learning, the “height” at each point represents the loss — how far off our predictions are. The “place” you stand on is actually a set of choices for your model’s parameters (like the “weight” and “bias” of a line).
Visualizing the loss landscape is like drawing a map of all the hills and valleys — it helps us understand where the loss is high, where it is low, and how gradient descent finds its way toward the bottom.
Let’s consider simple linear regression with a single feature, so we can plot things in 2D or 3D.
A linear regression with one feature:

$$\hat{y} = w x + b$$

where:

- $x$ is the input feature,
- $w$ is the weight (the slope of the line),
- $b$ is the bias (the intercept),
- $\hat{y}$ is the model's prediction.

We'll use mean squared error (MSE):

$$L(w, b) = \frac{1}{n} \sum_{i=1}^{n} \big( y_i - (w x_i + b) \big)^2$$

where:

- $n$ is the number of data points,
- $x_i, y_i$ are the $i$-th input and its true target,
- $L(w, b)$ is the average squared difference between predictions and targets.

Explanation: the loss depends only on the two parameters $w$ and $b$, so we can treat $L$ as a function of $(w, b)$ and draw it as a surface.
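As a tiny worked example (with made-up numbers, not from our dataset): suppose we have $n = 3$ points with targets $y = (1, 2, 3)$ and predictions $\hat{y} = (1.5, 1.5, 2.5)$. Then

$$L = \frac{1}{3}\Big[(1 - 1.5)^2 + (2 - 1.5)^2 + (3 - 2.5)^2\Big] = \frac{0.25 + 0.25 + 0.25}{3} = 0.25$$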
For simple models (like a single-feature linear regression), the loss landscape can literally be plotted as a surface above the $(w, b)$ plane: the height at each point is the loss $L(w, b)$.
By moving $w$ and $b$, we can watch how the loss changes. The lowest point is where our model fits the data best.
When we “watch convergence by hand,” we’re seeing how gradient descent walks step by step down the loss surface toward the bottom of a valley — the minimum-loss solution.
We'll generate simple data, compute the loss over a grid of $(w, b)$ values, and then watch the optimization steps.
uv pip install torch matplotlib numpy
Let’s make a dataset of points with a little noise.
import torch
import numpy as np
import matplotlib.pyplot as plt
# Set seed for reproducibility
torch.manual_seed(42)
# Generate data
def create_data(n_samples: int = 30) -> tuple[torch.Tensor, torch.Tensor]:
x = torch.linspace(-3, 3, n_samples)
y = 2 * x + 1 + torch.randn(n_samples) * 0.5
return x, y
x, y = create_data()
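(Optional) Before computing anything else, it can help to eyeball the data. Here's a quick sketch, reusing the `x`, `y`, and imports from above, that plots the noisy points against the true line $y = 2x + 1$:

plt.figure(figsize=(6, 4))
plt.scatter(x.numpy(), y.numpy(), label='Noisy data')
plt.plot(x.numpy(), (2 * x + 1).numpy(), color='red', label='True line: y = 2x + 1')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Generated Data')
plt.legend()
plt.show()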
Create a grid of plausible $(w, b)$ values and compute the loss for each pair.
# Set grid ranges
w_range = torch.linspace(0.0, 4.0, 100)
b_range = torch.linspace(-1.0, 3.0, 100)
W, B = torch.meshgrid(w_range, b_range, indexing='ij')
loss_surface = torch.zeros_like(W)
for i in range(W.shape[0]):
for j in range(B.shape[1]):
w, b = W[i, j], B[i, j]
y_pred = w * x + b
loss = torch.mean((y - y_pred) ** 2)
loss_surface[i, j] = loss
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(W.numpy(), B.numpy(), loss_surface.numpy(), cmap='viridis', alpha=0.8)
ax.set_xlabel('Weight (w)')
ax.set_ylabel('Bias (b)')
ax.set_zlabel('MSE Loss')
ax.set_title('Loss Surface of Linear Regression')
plt.show()
Let's pick a starting point for $(w, b)$ and “walk” down the landscape.
def gradient_descent_steps(
x: torch.Tensor,
y: torch.Tensor,
w_start: float = 3.5,
b_start: float = 2.5,
lr: float = 0.1,
steps: int = 15
) -> list[tuple[float, float]]:
w = torch.tensor(w_start, requires_grad=True)
b = torch.tensor(b_start, requires_grad=True)
path = [(w.item(), b.item())]
for _ in range(steps):
y_pred = w * x + b
loss = torch.mean((y - y_pred) ** 2)
loss.backward()
with torch.no_grad():
w -= lr * w.grad
b -= lr * b.grad
w.grad.zero_()
b.grad.zero_()
path.append((w.item(), b.item()))
return path
# Run optimization
path = gradient_descent_steps(x, y)
ws, bs = zip(*path)
# Overlay the path on contour plot
plt.figure(figsize=(8, 6))
plt.contourf(W.numpy(), B.numpy(), loss_surface.numpy(), levels=50, cmap='viridis')
plt.plot(ws, bs, marker='o', color='red', label='Gradient Descent Path')
plt.scatter(ws[0], bs[0], color='white', edgecolor='black', s=100, label='Start')
plt.scatter(ws[-1], bs[-1], color='yellow', edgecolor='black', s=100, label='End')
plt.xlabel('Weight (w)')
plt.ylabel('Bias (b)')
plt.title('Gradient Descent Path on the Loss Landscape')
plt.legend()
plt.show()
Task: Pick a new starting point for gradient descent (try something like $w = 0$, $b = -1$). Plot the resulting path on the loss landscape.
# Example solution
path_new = gradient_descent_steps(x, y, w_start=0.0, b_start=-1.0)
ws_new, bs_new = zip(*path_new)
plt.figure(figsize=(8, 6))
plt.contourf(W.numpy(), B.numpy(), loss_surface.numpy(), levels=50, cmap='magma')
plt.plot(ws_new, bs_new, marker='o', color='limegreen', label='Gradient Descent Path')
plt.scatter(ws_new[0], bs_new[0], color='white', edgecolor='black', s=100, label='Start')
plt.scatter(ws_new[-1], bs_new[-1], color='orange', edgecolor='black', s=100, label='End')
plt.xlabel('Weight (w)')
plt.ylabel('Bias (b)')
plt.title('GD Path with New Starting Point')
plt.legend()
plt.show()
Task: Hold $b$ fixed (say $b = 1.0$) and plot the loss as a function of $w$ only.
b_fixed: float = 1.0
losses = []
for w_val in w_range:
loss = torch.mean((y - (w_val * x + b_fixed)) ** 2)
losses.append(loss.item())
plt.plot(w_range.numpy(), losses)
plt.xlabel('Weight (w)')
plt.ylabel('MSE Loss')
plt.title('Loss Curve with b fixed at 1.0')
plt.show()
Task: Try to create a “movie” where a point moves along the gradient descent path over the surface. For simplicity, just use a loop of plt.show() calls (or save as images).
(This is optional for further exploration.)
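For example, here's one possible sketch of the save-as-images route, reusing `path`, `ws`, `bs`, `W`, `B`, and `loss_surface` from above. It writes one PNG per step into a hypothetical `frames/` directory, which you could later stitch into a GIF with any external tool:

import os

os.makedirs('frames', exist_ok=True)  # hypothetical output directory for the frames

for step in range(len(path)):
    plt.figure(figsize=(8, 6))
    plt.contourf(W.numpy(), B.numpy(), loss_surface.numpy(), levels=50, cmap='viridis')
    plt.plot(ws[:step + 1], bs[:step + 1], marker='o', color='red')            # path so far
    plt.scatter(ws[step], bs[step], color='yellow', edgecolor='black', s=100)  # current point
    plt.xlabel('Weight (w)')
    plt.ylabel('Bias (b)')
    plt.title(f'Gradient Descent, step {step}')
    plt.savefig(f'frames/step_{step:02d}.png')
    plt.close()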
In the next post, we’ll dig into how gradients are computed, both numerically and analytically. Stay curious and keep exploring!
USER:
when writing a linear equation with scalars, we might write: $y = wx + b$. however, when writing a linear equation with vectors, we might write: $\mathbf{y} = X\mathbf{w} + b$. in the first case, $y$ is a scalar, while in the second case, $\mathbf{y}$ is a vector. furthermore, in the first case, $w$ is a scalar, while in the second case, $\mathbf{w}$ is a vector. finally, in the first case, $x$ is a scalar, while in the second case, $X$ is a matrix. in the second case, why is it that the weights are the vector, and the input data is the matrix? this is the opposite of what i would expect.
ASSISTANT:
That’s a great and important question! It often feels counterintuitive at first, but there’s a logical reason for why, in the standard vectorized linear regression setup, the data is a matrix and the weights are a vector.
Let’s break it down step by step:
First, the scalar case. Equation: $y = wx + b$.
Here: $x$ (the input), $w$ (the weight), $b$ (the bias), and $y$ (the output) are all single numbers (scalars).
Next, the vectorized case. Equation: $\mathbf{y} = X\mathbf{w} + b$.
Here: $X$ is a matrix of inputs (one row per sample), $\mathbf{w}$ is a vector of weights (one entry per feature), and $\mathbf{y}$ is a vector of predictions (one entry per sample).
Let’s look at what each one represents:
$X$ is usually a data matrix of shape $(n, d)$:
For example, say you have a dataset of $n$ samples, and each sample has $d$ features:

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix}$$

Each row is one sample, each column is one feature.
$\mathbf{w}$ is a vector of length $d$: one weight per feature.
E.g., $\mathbf{w} = [w_1, w_2, \ldots, w_d]^\top$
Matrix-vector multiplication:

$$\mathbf{y} = X\mathbf{w} + b, \qquad y_i = \mathbf{x}_i \cdot \mathbf{w} + b$$

In other words, for each row/sample $\mathbf{x}_i$ in $X$ (which is a vector of length $d$), we compute the dot product with $\mathbf{w}$, just like in the scalar case, but for all samples simultaneously.
It looks “opposite” only if you expect the weights to be laid out across many samples. But remember, we learn weights per feature, not per sample.
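To see this in code, here's a small sketch (with made-up numbers) checking that the matrix-vector product gives the same result as computing the dot product row by row:

import torch

# 3 samples, 2 features each
X = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])
w = torch.tensor([0.5, -1.0])   # one weight per feature
b = 0.1

y_all = X @ w + b   # all predictions at once, shape: (3,)

# Same computation, one sample (row) at a time
y_rows = torch.stack([X[i] @ w + b for i in range(X.shape[0])])

print(torch.allclose(y_all, y_rows))  # True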
Case | Input(s) | Weight(s) | Output(s) | Equation |
---|---|---|---|---|
Scalar | $x$ (scalar) | $w$ (scalar) | $y$ (scalar) | $y = wx + b$ |
Vectorized | $X$ ($n \times d$ matrix) | $\mathbf{w}$ ($d$-vector) | $\mathbf{y}$ ($n$-vector) | $\mathbf{y} = X\mathbf{w} + b$ |
This might be counter to what you expect if you mix up samples and features, but with this view, hopefully it’s clear why in the vectorized version the data is a matrix and the weights remain a vector!
USER:
i understand the simple case for a line, because $(x, y)$ is a point on the line, and $w$ is the slope of the line, and $b$ is the y-intercept. can you give me a simple example of the next case, where $X$ is a 2-d matrix, $\mathbf{w}$ is a 2-d vector, and $\mathbf{y}$ is a 2-d vector? mostly, i'm interested in why the data points are now in $X$, where $X$ is a matrix and $\mathbf{y}$ is a vector, and why the weights $\mathbf{w}$ are now a vector.
ASSISTANT:
Absolutely, and this is a key bridge between thinking about lines (1D) and hyperplanes or multidimensional linear models!
Let’s take it step by step with a concrete example, using real numbers so you can see how everything fits together.
If you have one data point $x$ and your model is $y = wx + b$, then $x$, $w$, $b$, and $y$ are all just single numbers (scalars).
Suppose instead each data point has TWO features.
Think of a data point describing a home by two numbers: its square footage and its number of bedrooms.
And the target is the price.
Now one data point is a vector:

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} \text{square footage} \\ \text{number of bedrooms} \end{bmatrix}$$

And the linear model is:

$$y = w_1 x_1 + w_2 x_2 + b = \mathbf{x} \cdot \mathbf{w} + b$$

Here: $\mathbf{w} = [w_1, w_2]$ holds one weight per feature, and $b$ is still a single bias number.
But usually you have MULTIPLE data points (rows)!
Let

$$X = \begin{bmatrix} 1000 & 2 \\ 1500 & 3 \\ 2000 & 3 \end{bmatrix}$$

(each row is one home: [square footage, bedrooms])

Let

$$\mathbf{y} = \begin{bmatrix} 255 \\ 370 \\ 480 \end{bmatrix}$$

(e.g., prices in $K)

And the weights:

$$\mathbf{w} = \begin{bmatrix} 0.2 \\ 25 \end{bmatrix}$$

(so, $w_1 = 0.2$ means \$0.2K per sq. ft, $w_2 = 25$ means \$25K per bedroom)

Let's set $b = 0$ for clarity.

The linear model for ALL data points at once is:

$$\hat{\mathbf{y}} = X\mathbf{w} + b$$

Let's compute $X\mathbf{w}$:

$$X\mathbf{w} = \begin{bmatrix} 1000 \cdot 0.2 + 2 \cdot 25 \\ 1500 \cdot 0.2 + 3 \cdot 25 \\ 2000 \cdot 0.2 + 3 \cdot 25 \end{bmatrix} = \begin{bmatrix} 250 \\ 375 \\ 475 \end{bmatrix}$$

So:

$$\hat{\mathbf{y}} = \begin{bmatrix} 250 \\ 375 \\ 475 \end{bmatrix}$$

which lands close to the true prices, one prediction per home.
| | Input Data ($X$) | Weights ($\mathbf{w}$) | Output ($\hat{\mathbf{y}}$) |
|---|---|---|---|
| Shape | $n \times d$ (number of data, features) | $d$ (features) | $n$ (number of data) |
As soon as you have multiple features and/or multiple data points, it's most efficient to “stack” them into a matrix ($X$).
The weights remain a vector, one per feature. When you multiply, you get predictions for all data points at once: $\hat{\mathbf{y}} = X\mathbf{w} + b$.
Let me know if you’d like to see a visual or a small PyTorch demo of this!
USER:
for large neural networks such as in deep learning which have many layers, is it ALWAYS the case that the data is in the form of a matrix $X$, and the weights are a vector $\mathbf{w}$, and the predictions are a vector $\mathbf{y}$? or can the data be a vector, and the weights be a matrix?
ASSISTANT:
Great question! As neural networks get larger and deeper, the shapes of data and weights actually evolve as information passes through the layers. Sometimes data is a matrix and weights are a vector, but in neural networks, most often the weights are themselves matrices (or even higher-dimensional tensors), especially in fully connected (dense) layers.
Let’s break this down:
So, if you have $n$ samples, each with $d$ features: the data $X$ is an $n \times d$ matrix, the weights $\mathbf{w}$ form a $d$-vector, and the predictions $\hat{\mathbf{y}}$ form an $n$-vector.
In deep learning, almost all layers use weights that are matrices (or more complex tensors).
Suppose you have:

- an input vector $\mathbf{x}$ with $d$ features, and
- a fully connected layer with $h$ output units (neurons).

The operation is:

$$\mathbf{y} = W\mathbf{x} + \mathbf{b}$$

where:

- $W$ is an $h \times d$ weight matrix,
- $\mathbf{x}$ is a $d$-dimensional input vector,
- $\mathbf{b}$ is an $h$-dimensional bias vector,
- $\mathbf{y}$ is the $h$-dimensional output vector.

For batched input:

$$Y = XW^\top + \mathbf{b}$$

where $X$ has shape $(n, d)$ and $Y$ has shape $(n, h)$.
So, is it always “data = matrix, weights = vector”? No!
a) First layer (single sample): the input $\mathbf{x}$ is a vector of length $d$, the weights $W$ are an $h \times d$ matrix, and the output is a vector of length $h$.
b) Hidden layer (single sample): the input is the previous layer's output vector, and the weights are again a matrix that maps one layer's width to the next.
c) Input batch: the data becomes a matrix $X$ of shape $(n, d)$, the weights stay a matrix, and the output becomes a matrix of shape $(n, h)$.
d) Convolutional/Transformer layers: the weights are higher-dimensional tensors (for example, 4D tensors for convolution filters), and the data is usually a higher-dimensional tensor as well.
So, in deep learning, the weights of each layer are typically matrices (or tensors). Here's a quick PyTorch example of a fully connected layer:
import torch
# batch_size = 2, input_dim = 3, output_dim = 4
X = torch.tensor([[1., 2., 3.], [4., 5., 6.]]) # shape: (2, 3)
W = torch.randn(4, 3) # shape: (4, 3)
b = torch.randn(4) # shape: (4,)
Y = X @ W.T + b
# Y shape: (2, 4)
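For what it's worth, this is the same computation that `torch.nn.Linear` performs (it stores its weight with shape `(out_features, in_features)`). A quick sketch with the shapes from above:

import torch
import torch.nn as nn

layer = nn.Linear(in_features=3, out_features=4)   # weight: (4, 3), bias: (4,)
X = torch.tensor([[1., 2., 3.], [4., 5., 6.]])     # shape: (2, 3)

Y_layer = layer(X)                                  # shape: (2, 4)
Y_manual = X @ layer.weight.T + layer.bias          # same result, computed by hand

print(torch.allclose(Y_layer, Y_manual))            # True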
Case | Data | Weights | Predictions |
---|---|---|---|
Scalar regression | scalar | scalar | scalar |
Vector regression | vector | vector | scalar |
Multiple samples | matrix | vector | vector |
Fully connected | matrix | matrix | matrix |
Deep/Large models | matrix | matrix/tensor | matrix/tensor |
In deep neural networks, weights are usually matrices (or higher-dimensional tensors), not just vectors, because they map between different sizes of input and output at each layer. The data can be a vector or a matrix (if using batches).
You do NOT always have data as a matrix and weights as a vector!
Most deep learning layers use weights as matrices (or tensors) to enable flexible, powerful transformations from one layer to the next.
Let me know if you’d like to see more specific examples — e.g., for CNNs or transformers!
USER:
so if i understand correctly, we can use either equation: $\mathbf{y} = X\mathbf{w} + b$ or $\mathbf{y} = W\mathbf{x} + \mathbf{b}$. whether $X$ or $W$ is a matrix depends on how we approach a given problem. is that correct? and when it comes to deep neural networks, it is possible to take either approach, although we normally use $\mathbf{x}$ as a vector and $W$ as a matrix?
ASSISTANT:
Exactly, you’re right! Let’s clarify both the notation and the reasoning behind these two equivalent-looking equations, and how they’re used in deep learning:
The standard notation for a single sample is:

$$\mathbf{y} = W\mathbf{x} + \mathbf{b}$$

where:

- $\mathbf{x}$ is the input vector (length $d$),
- $W$ is the layer's weight matrix (shape $h \times d$),
- $\mathbf{b}$ is the bias vector (length $h$),
- $\mathbf{y}$ is the output vector (length $h$).

For a batch: you often see $Y = XW^\top + \mathbf{b}$, where $X$ stacks one sample per row.
Input | Weights | Output | Typical Equation | Used for |
---|---|---|---|---|
Vector | Matrix | Vector | $\mathbf{y} = W\mathbf{x} + \mathbf{b}$ | One sample, deep NN |
Matrix | Vector | Vector | $\mathbf{y} = X\mathbf{w} + b$ | Batch, linear reg. |
Matrix | Matrix | Matrix | $Y = XW^\top + \mathbf{b}$ | Batch, deep NN |
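To make the shapes concrete, here's a small sketch (with arbitrary sizes: $d = 3$ features, $h = 4$ outputs, $n = 5$ samples) showing all three conventions from the table:

import torch

d, h, n = 3, 4, 5   # features, outputs, batch size

# One sample as a vector, weights as a matrix (deep NN style)
x = torch.randn(d)         # shape: (3,)
W = torch.randn(h, d)      # shape: (4, 3)
b_vec = torch.randn(h)     # shape: (4,)
y_single = W @ x + b_vec   # shape: (4,)

# Batch as a matrix, weights as a vector (linear regression style)
X = torch.randn(n, d)      # shape: (5, 3)
w = torch.randn(d)         # shape: (3,)
y_batch = X @ w + 0.5      # shape: (5,), one prediction per sample

# Batch as a matrix, weights as a matrix (deep NN layer applied to a batch)
Y = X @ W.T + b_vec        # shape: (5, 4)

print(y_single.shape, y_batch.shape, Y.shape)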
In deep learning, we usually use $\mathbf{x}$ as an input vector and $W$ as a weight matrix at each layer ($\mathbf{y} = W\mathbf{x} + \mathbf{b}$). For batches, the data input becomes a matrix, and the weights remain a matrix or tensor. The shapes just need to align for proper multiplication!
Let me know if you'd like a visual walkthrough of these shapes as well!
USER:
what does this line mean?
W, B = torch.meshgrid(w_range, b_range, indexing='ij')
ASSISTANT:
Great question! That line is essential for building the loss landscape grid in your visualization example.
Let’s break it down:
What does `torch.meshgrid` do?
Given two 1D tensors, `torch.meshgrid` creates two 2D grids (matrices), where each coordinate in the grid is made from every possible pair of values from the input vectors.
Think of it as laying two axes ($w$ and $b$) out in 2D, and then producing a full grid of all possible combinations.
W, B = torch.meshgrid(w_range, b_range, indexing='ij')
- `w_range`: 1D tensor of possible weight values.
- `b_range`: 1D tensor of possible bias values.
- `torch.meshgrid(w_range, b_range, indexing='ij')`: returns two 2D tensors, `W` and `B`, each of shape `(len(w_range), len(b_range))`.
- `W[i, j]` is set to `w_range[i]` (the $w$ value for row $i$).
- `B[i, j]` is set to `b_range[j]` (the $b$ value for column $j$).
- `indexing='ij'`: uses matrix-style ("ij") indexing, so rows follow `w_range` and columns follow `b_range`.
Suppose:
w_range = torch.tensor([1, 2])
b_range = torch.tensor([10, 20, 30])
W, B = torch.meshgrid(w_range, b_range, indexing='ij')
Then:
`W` is:
[[1, 1, 1],
[2, 2, 2]]
and `B` is:
[[10, 20, 30],
[10, 20, 30]]
Each grid cell represents one particular pair $(w, b)$.
`torch.meshgrid(w_range, b_range, indexing='ij')` creates 2D grids of all $(w, b)$ pairs, so you can evaluate and plot the loss for any combination of weight and bias. The `indexing='ij'` argument ensures the “row = $w$, column = $b$” layout matches the usual math/matrix convention.
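As a side note, because `W`, `B`, `x`, and `y` are all tensors, you could also compute the loss surface without the double `for` loop from the blog post by using broadcasting. A minimal sketch that should reproduce the same values:

# W, B: shape (100, 100); x, y: shape (30,)
preds = W.unsqueeze(-1) * x + B.unsqueeze(-1)             # shape: (100, 100, 30)
loss_surface_vec = torch.mean((y - preds) ** 2, dim=-1)   # shape: (100, 100)
print(torch.allclose(loss_surface_vec, loss_surface))     # should print True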
Let me know if you’d like a code demo or visualization!
USER:
this code doesn’t work:
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(W.numpy(), B.numpy(), loss_surface.numpy(), cmap='viridis', alpha=0.8)
ax.set_xlabel('Weight (w)')
ax.set_ylabel('Bias (b)')
ax.set_zlabel('MSE Loss')
ax.set_title('Loss Surface of Linear Regression')
plt.show()
the function `plot_surface` is not defined. what should i do?
ASSISTANT:
No worries! `plot_surface` is only available on 3D axes in Matplotlib (that is, on an `ax` created with `projection='3d'`), and it requires the 3D toolkit. Since you already created the axis correctly, the most likely problem is that the `mpl_toolkits.mplot3d` module, which enables 3D plotting in Matplotlib, hasn't been imported.
To fix this:
At the top of your Python script, add this import:
from mpl_toolkits.mplot3d import Axes3D # This line can be required in some setups for 3D plotting
Even though you don't seem to use `Axes3D` directly, importing it ensures Matplotlib registers the 3D projection. (Newer Matplotlib versions do this automatically, so the import mainly matters on older setups, but it's harmless either way.)
Your variables `W`, `B`, and `loss_surface` should be convertible to NumPy arrays, which you're already doing with the `.numpy()` calls.
Here’s a minimal version for 3D plotting:
import torch
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D # <-- Add this line
# ...your code to define w_range, b_range, W, B, loss_surface...
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
surf = ax.plot_surface(W.numpy(), B.numpy(), loss_surface.numpy(), cmap='viridis', alpha=0.8)
ax.set_xlabel('Weight (w)')
ax.set_ylabel('Bias (b)')
ax.set_zlabel('MSE Loss')
ax.set_title('Loss Surface of Linear Regression')
fig.colorbar(surf, shrink=0.5, aspect=5)
plt.show()
- If you still see `AttributeError: 'AxesSubplot' object has no attribute 'plot_surface'`, check that you created the axis with `projection='3d'` (which you did) and that you imported `Axes3D`.
- Also confirm your Matplotlib installation and version (for example with `pip show matplotlib` or `uv pip list`).
- In short: import `Axes3D` from `mpl_toolkits.mplot3d` to properly set up 3D plots, create the axis with `projection='3d'`, and call `.plot_surface(...)` on that axis.

Let me know if you still have an error message, and paste it here—I'll help debug!
USER: