Learn the Training Loop with PyTorch, Part 1.4: Visualizing the Loss Landscape

2025-06-15 · Artintellica

USER:

i need your help to write the next blog post in a series i am producing called “Learn the Training Loop with PyTorch”.

first, let me give you the outline of the series:


Module 1: The Elementary Training Loop

Goal: Build foundational intuition by hand-coding a simple training loop from scratch.

1. Introduction: What is a Training Loop?

  • Definition and importance
  • Loss, optimization, and learning in ML

2. The Simplest Case: Linear Regression

  • What is linear regression?
  • The mean squared error loss
  • Deriving the update rules manually

3. Batch vs. Stochastic Gradient Descent

  • Differences and practical implications
  • Hands-on: See both in Python/numpy

4. Visualizing the Loss Landscape

  • Plotting loss as a function of parameters
  • Demo: Watching convergence by hand

5. Numerical vs. Analytical Gradients

  • How are gradients computed?
  • Manual calculation and confirmation with numerical gradient

6. Recap and Key Takeaways


Module 2: The Training Loop for Neural Networks

Goal: Extend the basic loop to more powerful models; introduce neural network concepts.

1. From Linear to Nonlinear: Why Neural Networks?

  • Multi-layer perceptrons (MLPs)
  • Activation functions (ReLU, sigmoid, tanh)

2. Forward and Backward Passes

  • The chain rule of differentiation
  • Backpropagation intuition (no need for full rigor yet)

3. Implementing a Simple Neural Net from Scratch

  • Single hidden layer network in numpy/Python
  • Coding forward, loss, backward, and parameter updates

4. The Role of Activations

  • Intuitive explanation and code examples
  • Effect on learning and expressiveness

5. Mini-batching and Data Pipelines

  • How batching speeds up and stabilizes learning

6. Regularization and Overfitting

  • Techniques: L2, dropout (conceptual intro)

7. Recap: Comparing Our Simple Network with Linear Regression


Module 3: Advanced Training Loops and Modern Tricks

Goal: Connect foundational understanding to the large-scale models used in industry.

1. Optimization Algorithms Beyond SGD

  • Adam, RMSProp, etc.: How they work, why they’re useful

2. Learning Rate Scheduling

  • How/why learning rates are changed during training
  • Demo: Effect of learning rate on convergence and results

3. Weight Initialization

  • Why it matters, common strategies

4. Deeper Networks and Backprop Challenges

  • Vanishing/exploding gradients
  • Solutions: normalization, skip connections

5. Large-Scale Training: Data Parallelism and Hardware

  • How the same training loop runs efficiently on GPUs and across many machines

6. Monitoring and Debugging the Training Loop

  • Visualizing loss/accuracy, diagnosing common problems

7. Modern Regularization and Generalization Techniques

  • Early stopping, batch/layer normalization, data augmentation (conceptual intro)

8. The Training Loop in Practice: Case Studies

  • High-level overview of how OpenAI/Google/Meta might train LLMs: what’s the same, what’s different.
  • What “scaling up” really means: dataset, parameters, hardware, tricks.

9. Conclusion: What’s Next After the Training Loop?

  • Discussion: fine-tuning, transfer learning, and unsupervised/self-supervised methods.

General Tips for Each Module:

  • Start simple, build incrementally: Keep code simple at first, then show how to layer complexity.
  • Accompany every lesson with practical code examples.
  • Focus on intuition: Use diagrams, plots, and hands-on demo to make concepts tangible.
  • Summarize and review regularly: Add checklists or quizzes at the end of each module.

great. now, for each blog post, i am hoping to follow the following structure:

  • the course is called “Learn the Training Loop with PyTorch”
  • always provide a brief introduction to the topic at hand
  • then always provide a very simple “ELI5” explanation of the topic, that theoretically anyone could understand
  • then provide the mathematical foundations. for math, always use dollar delimited LaTeX, and always provide a brief explanation of the math. for instance, $\mathbf{y} = \mathbf{X}\mathbf{w}$ is the equation for linear regression, where $\mathbf{y}$ is the output vector, $\mathbf{X}$ is the input matrix, and $\mathbf{w}$ is the weight vector.
  • for block-level math, always use two dollar signs, like this: $$\mathbf{y} = \mathbf{X}\mathbf{w}$$
  • after the mathematical foundations, such as definitions, then always provide an in-depth explanation, covering the topic in detail, and expanding on the explanation of the math if necessary.
  • for other formatting, use markdown like headings, bullet points, and code blocks to make the content readable. do not use HTML tags.
  • after the math is complete, then provide one or more demonstrations of the topic using PyTorch. if possible, always use PyTorch, not numpy, just to keep it simple and consistent. you can also use matplotlib to visualize the results.
  • for all python code, always use type hints, and, if necessary, describe installation instructions using “uv” (not pip).
  • then, create a series of exercises, somewhere from 1 - 4 total exercises that cover the material. please provide the source code for each exercise to make sure the student can easily compare their solution to the correct answer. do not hide the exercises and do not put HTML in the exercises.
  • finally, provide a summary of the topic, and a list of key takeaways.

that’s it, now can you please write blog post 1.4? the description is:

4. Visualizing the Loss Landscape

  • Plotting loss as a function of parameters
  • Demo: Watching convergence by hand

please respond with the full blog post only, and no other information.

ASSISTANT:

Learn the Training Loop with PyTorch

Module 1.4 – Visualizing the Loss Landscape


Introduction

Welcome back to Learn the Training Loop with PyTorch! In our last posts, we explored what a training loop is, built a basic linear regression model, understood the role of loss functions, and looked at different ways to perform gradient descent. But so far, it might feel abstract: what does all this mean, visually? Today, we’ll bridge the gap by visualizing the “loss landscape” — the surface that tells us how good or bad a model’s predictions are, depending on its parameters.


ELI5: What’s a Loss Landscape, and Why Visualize It?

Imagine you’re hiking in a hilly landscape, looking for the lowest valley. Each point you stand on has a height: low points mean you’re close to your goal, high points mean you need to keep searching. For machine learning, the “height” at each point represents the loss — how far off our predictions are. The “place” you stand on is actually a set of choices for your model’s parameters (like the “weight” and “bias” of a line).

Visualizing the loss landscape is like drawing a map of all the hills and valleys — it helps us understand:

  • Where are the lowest points (best models)?
  • How does changing the parameters impact our predictions?
  • Why does gradient descent move step-by-step towards a valley?

Mathematical Foundations

Let’s consider simple linear regression with a single feature, so we can plot things in 2D or 3D.

The Model

A linear regression with one feature:

$$\hat{y}_i = w x_i + b$$

where:

  • $x_i$ is the input (a scalar)
  • $w$ is the weight (slope)
  • $b$ is the bias (intercept)
  • $\hat{y}_i$ is the predicted output

The Loss Function

We’ll use mean squared error (MSE):

$$\mathcal{L}(w, b) = \frac{1}{N} \sum_{i=1}^N \left(y_i - (w x_i + b)\right)^2$$

where:

  • $N$ is the number of data points,
  • $y_i$ is the true value.

Explanation:

  • The MSE loss tells us, for any choice of $(w, b)$, how “wrong” the model is on average.
  • If we vary $w$ and $b$, we get different loss values, which together create the “landscape”.
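
To make this concrete, here is a minimal, self-contained sketch (using a tiny made-up dataset, not the one from the demo below) that evaluates the MSE loss at two different $(w, b)$ choices:

import torch

# Tiny toy dataset: three points that roughly follow y = 2x + 1
x_toy = torch.tensor([0.0, 1.0, 2.0])
y_toy = torch.tensor([1.0, 3.1, 4.9])

def mse(w: float, b: float) -> float:
    """Mean squared error of the line y = w*x + b on the toy data."""
    return torch.mean((y_toy - (w * x_toy + b)) ** 2).item()

print(mse(2.0, 1.0))  # near-optimal (w, b): small loss
print(mse(0.0, 0.0))  # poor (w, b): much larger loss

Each $(w, b)$ pair gives one "height" on the landscape; the full surface is just this evaluation repeated over a grid of pairs.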

In-Depth Explanation: Loss Surfaces Demystified

For simple models (like a single-feature linear regression), the loss landscape can literally be plotted as a surface above the $(w, b)$ plane:

  • Smooth valleys: means gradient descent can find its way to the minimum easily.
  • Steep slopes: indicate rapid changes in loss — we need careful step sizes.
  • Flat areas: can mean slow progress.

By moving $w$ and $b$, we can watch how loss changes. The lowest point is where our model fits the data best.

When we “watch convergence by hand,” we’re seeing how gradient descent walks step by step down the loss surface toward the bottom of a valley — the minimum-loss solution.


Demonstration: Visualizing the Loss Landscape in PyTorch

We’ll generate simple data, compute the loss over a grid of $(w, b)$ values, and then watch the optimization steps.

1. Install Dependencies

uv pip install torch matplotlib numpy

2. Code Walkthrough

Generate Simple Data

Let’s make a dataset of points $y = 2x + 1$ with a little noise.

import torch
import numpy as np
import matplotlib.pyplot as plt

# Set seed for reproducibility
torch.manual_seed(42)

# Generate data
def create_data(n_samples: int = 30) -> tuple[torch.Tensor, torch.Tensor]:
    x = torch.linspace(-3, 3, n_samples)
    y = 2 * x + 1 + torch.randn(n_samples) * 0.5
    return x, y

x, y = create_data()

Compute the Loss Surface

Create a grid of plausible $(w, b)$ values, and compute the loss for each pair.

# Set grid ranges
w_range = torch.linspace(0.0, 4.0, 100)
b_range = torch.linspace(-1.0, 3.0, 100)

W, B = torch.meshgrid(w_range, b_range, indexing='ij')
loss_surface = torch.zeros_like(W)

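# Evaluate the MSE loss at every (w, b) pair on the grid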
for i in range(W.shape[0]):
    for j in range(B.shape[1]):
        w, b = W[i, j], B[i, j]
        y_pred = w * x + b
        loss = torch.mean((y - y_pred) ** 2)
        loss_surface[i, j] = loss

Plot the Loss Landscape

fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(W.numpy(), B.numpy(), loss_surface.numpy(), cmap='viridis', alpha=0.8)
ax.set_xlabel('Weight (w)')
ax.set_ylabel('Bias (b)')
ax.set_zlabel('MSE Loss')
ax.set_title('Loss Surface of Linear Regression')
plt.show()

Visualize Gradient Descent Steps

Let’s randomly initialize $(w, b)$ and “walk” down the landscape.

def gradient_descent_steps(
    x: torch.Tensor,
    y: torch.Tensor,
    w_start: float = 3.5,
    b_start: float = 2.5,
    lr: float = 0.1,
    steps: int = 15
) -> list[tuple[float, float]]:
    w = torch.tensor(w_start, requires_grad=True)
    b = torch.tensor(b_start, requires_grad=True)
    path = [(w.item(), b.item())]
    for _ in range(steps):
        y_pred = w * x + b
        loss = torch.mean((y - y_pred) ** 2)
        loss.backward()
        with torch.no_grad():
            w -= lr * w.grad
            b -= lr * b.grad
        w.grad.zero_()
        b.grad.zero_()
        path.append((w.item(), b.item()))
    return path

# Run optimization
path = gradient_descent_steps(x, y)
ws, bs = zip(*path)

# Overlay the path on contour plot
plt.figure(figsize=(8, 6))
plt.contourf(W.numpy(), B.numpy(), loss_surface.numpy(), levels=50, cmap='viridis')
plt.plot(ws, bs, marker='o', color='red', label='Gradient Descent Path')
plt.scatter(ws[0], bs[0], color='white', edgecolor='black', s=100, label='Start')
plt.scatter(ws[-1], bs[-1], color='yellow', edgecolor='black', s=100, label='End')
plt.xlabel('Weight (w)')
plt.ylabel('Bias (b)')
plt.title('Gradient Descent Path on the Loss Landscape')
plt.legend()
plt.show()

Exercises

1. Try New Starting Points

Task: Pick a new starting point $(w, b)$ for gradient descent (try something like $w = 0.0, b = -1.0$). Plot the resulting path on the loss landscape.

# Example solution
path_new = gradient_descent_steps(x, y, w_start=0.0, b_start=-1.0)
ws_new, bs_new = zip(*path_new)

plt.figure(figsize=(8, 6))
plt.contourf(W.numpy(), B.numpy(), loss_surface.numpy(), levels=50, cmap='magma')
plt.plot(ws_new, bs_new, marker='o', color='limegreen', label='Gradient Descent Path')
plt.scatter(ws_new[0], bs_new[0], color='white', edgecolor='black', s=100, label='Start')
plt.scatter(ws_new[-1], bs_new[-1], color='orange', edgecolor='black', s=100, label='End')
plt.xlabel('Weight (w)')
plt.ylabel('Bias (b)')
plt.title('GD Path with New Starting Point')
plt.legend()
plt.show()

2. Visualize Loss as a Function of One Parameter

Task: Hold $b$ fixed (say $b = 1.0$) and plot loss as a function of $w$ only.

b_fixed: float = 1.0
losses: list[float] = []
# Sweep w across the grid while keeping b fixed
for w_val in w_range:
    loss = torch.mean((y - (w_val * x + b_fixed)) ** 2)
    losses.append(loss.item())
plt.plot(w_range.numpy(), losses)
plt.xlabel('Weight (w)')
plt.ylabel('MSE Loss')
plt.title('Loss Curve with b fixed at 1.0')
plt.show()

3. 3D Animation (Bonus)

Task: Try to create a “movie” where a point moves along the gradient descent path over the surface. For simplicity, just use a loop of plt.show() calls (or save as images).

(This is optional for further exploration.)
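
Here is one possible solution sketch, assuming the W, B, loss_surface, path, ws, and bs variables from the demo above are still defined. It saves one PNG per gradient descent step, which you can flip through or stitch into a GIF with an external tool:

# Example solution sketch: save one frame per gradient descent step
for step, (w_val, b_val) in enumerate(path):
    plt.figure(figsize=(8, 6))
    plt.contourf(W.numpy(), B.numpy(), loss_surface.numpy(), levels=50, cmap='viridis')
    plt.plot(ws[:step + 1], bs[:step + 1], marker='o', color='red')          # path so far
    plt.scatter(w_val, b_val, color='yellow', edgecolor='black', s=100)      # current position
    plt.xlabel('Weight (w)')
    plt.ylabel('Bias (b)')
    plt.title(f'Gradient Descent: Step {step}')
    plt.savefig(f'gd_step_{step:02d}.png')
    plt.close()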


Summary

  • The loss landscape is the “terrain” that our optimizer navigates while training a model; valleys are low-loss (good fit), hills are high-loss (bad fit).
  • For linear regression with one input, the loss surface can be visualized in 3D as a smooth bowl.
  • Gradient descent follows the slope downwards towards the minimum — you can literally see every step on the loss surface!
  • Visualization gives powerful intuition for why training works, and how parameter choices impact learning.

Key Takeaways

  • Loss landscapes are a graphical way to understand model fitting and optimizer behavior.
  • Even for simple models, seeing the loss bowl makes gradient descent much less mysterious.
  • PyTorch makes it easy to compute and plot the loss surface and optimization paths.
  • In deeper models, landscapes get complicated — but the intuition remains invaluable.

In the next post, we’ll dig into how gradients are computed, both numerically and analytically. Stay curious and keep exploring!

USER:

when writing a linear equation with scalars, we might write: $y = wx + b$. however, when writing a linear equation with vectors, we might write: $\mathbf{y} = \mathbf{X}\mathbf{w} + \mathbf{b}$. in the first case, $y$ is a scalar, while in the second case, $\mathbf{y}$ is a vector. furthermore, in the first case, $w$ is a scalar, while in the second case, $\mathbf{w}$ is a vector. finally, in the first case, $x$ is a scalar, while in the second case, $\mathbf{X}$ is a matrix. in the second case, why is it that the weights are the vector, and the input data is the matrix? this is the opposite of what i would expect.

ASSISTANT:

That’s a great and important question! It often feels counterintuitive at first, but there’s a logical reason for why, in the standard vectorized linear regression setup, the data is a matrix and the weights are a vector.

Let’s break it down step by step:


Scalar Case

Equation:
$$y = wx + b$$

Here:

  • $x$ is a single input scalar (just a number, e.g., $3.7$).
  • $w$ is a single weight (also a number).
  • $b$ is a single bias.
  • $y$ is the output.

Vector (Batch) Case

Equation:
$$\mathbf{y} = \mathbf{X}\mathbf{w} + \mathbf{b}$$

Here:

  • $\mathbf{X}$ is a matrix of input data.
  • $\mathbf{w}$ is a vector of weights.
  • $\mathbf{b}$ is a vector (or sometimes a scalar, but often a vector if using multiple outputs).
  • $\mathbf{y}$ is the output vector.

Let’s look at what each one represents:

What does $\mathbf{X}$ represent?

$\mathbf{X}$ is usually a data matrix of shape $N \times D$:

  • $N$ = number of data points (samples)
  • $D$ = number of features/dimensions per sample

For example, say you have a dataset of $N = 3$ samples, and each sample has $D = 2$ features:

$$\mathbf{X} = \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \end{bmatrix}$$

Each row is one sample, each column is one feature.

What about $\mathbf{w}$?

$\mathbf{w}$ is a vector of length $D$, with one weight per feature.
E.g.,

$$\mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}$$

What happens when we calculate $\mathbf{X}\mathbf{w}$?

Matrix-vector multiplication:

  • $\mathbf{X}$: shape $N \times D$
  • $\mathbf{w}$: shape $D \times 1$
  • Result: $\mathbf{y}$ with shape $N \times 1$ (one output per data point)

In other words, for each row/sample in $\mathbf{X}$ (which is a vector of length $D$), we compute the dot product with $\mathbf{w}$, just like $wx$ in the scalar case, but for all samples simultaneously.
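
To see that the matrix-vector product really is just a stack of per-row dot products, here is a minimal PyTorch sketch with made-up numbers (the shapes and values are purely illustrative):

import torch

X = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])   # shape (N=3, D=2): 3 samples, 2 features
w = torch.tensor([10.0, 0.5])    # shape (D=2,): one weight per feature

y_matmul = X @ w                                 # all samples at once
y_rowwise = torch.stack([row @ w for row in X])  # dot product row by row

print(y_matmul)                              # tensor([11., 32., 53.])
print(torch.allclose(y_matmul, y_rowwise))   # True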


Why is the input data the matrix and the weights the vector?

  • Input Data $\mathbf{X}$ is a Matrix: Because you have multiple samples (rows), each with multiple features (columns).
  • Weights $\mathbf{w}$ is a Vector: Because you need one weight for each feature; each weight multiplies a column (feature).
  • Multiplying $\mathbf{X}$ by $\mathbf{w}$: For each data point, sum up (dot product) all features times their corresponding weights.

It looks “opposite” only if you expect the weights to be laid out across many samples. But remember, we learn weights per feature, not per sample.


Intuitive Mapping

  • In the scalar case: $y = wx + b$
    • $w$ matches to the feature $x$.
  • In the vector case: Each output $y_i$ for the $i$-th sample is $y_i = x_{i1}w_1 + x_{i2}w_2 + \dots + x_{iD}w_D + b$. But in matrix form, this is all samples at once: $$\begin{bmatrix} x_{11} & x_{12} & \dots & x_{1D} \\ x_{21} & x_{22} & \dots & x_{2D} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \dots & x_{ND} \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_D \end{bmatrix}$$ which gives: $$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$$
  • So, for each sample, you’re just getting the sum over (feature $\times$ weight) plus bias.

Summary Table

| Case | Input(s) | Weight(s) | Output(s) | Equation |
| --- | --- | --- | --- | --- |
| Scalar | $x$ (scalar) | $w$ (scalar) | $y$ (scalar) | $y = wx + b$ |
| Vectorized | $\mathbf{X}$ ($N \times D$ matrix) | $\mathbf{w}$ ($D$-vector) | $\mathbf{y}$ ($N$-vector) | $\mathbf{y} = \mathbf{X}\mathbf{w} + \mathbf{b}$ |

Key Takeaway

  • You use a weight for each feature, not each sample. The weights “act on” feature columns, not sample rows.
  • The matrix $\mathbf{X}$ holds all inputs (all samples and all features), while $\mathbf{w}$ holds one value per feature.

This might be counter to what you expect if you mix up samples and features, but with this view, hopefully it’s clear why in the vectorized version the data is a matrix and the weights remain a vector!

USER:

i understand the simple case for a line, because $(x, y)$ is a point on the line, and $w$ is the slope of the line, and $b$ is the y-intercept. can you give me a simple example of the next case, where $X$ is a 2-d matrix, $w$ is a 2-d vector, and $y$ is a 2-d vector? mostly, i’m interested in why the data points are now $(X, y)$ where $X$ is a matrix, and $y$ is a vector, and why the weights are now a vector.

ASSISTANT:

Absolutely, and this is a key bridge between thinking about lines (1D) and hyperplanes or multidimensional linear models!

Let’s take it step by step with a concrete example, using real numbers so you can see how everything fits together.


1. The Simple (1D) Case — Recap

If you have one data point $(x, y)$ and your model is $y = wx + b$:

  • $x$ is a scalar (e.g., $3$)
  • $y$ is a scalar (e.g., $7$)
  • $w$ is a scalar (e.g., $2$)
  • $b$ is a scalar (e.g., $1$)

2. The Next Case — Two Features Per Data Point

Suppose instead each data point has TWO features.
Think of a data point describing a home:

  • $x_1$ = square footage
  • $x_2$ = number of bedrooms

And the target $y$ is the price.

Now one data point is a vector:

  • $\mathbf{x} = [x_1, x_2]$

And the linear model is:

$$y = w_1 x_1 + w_2 x_2 + b$$

Here:

  • $\mathbf{w} = [w_1, w_2]$ (weights for each feature)

But usually you have MULTIPLE data points (rows)!

Let’s walk through a mini-dataset with 2 data points:

Let

$$\mathbf{X} = \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \end{bmatrix} = \begin{bmatrix} 1000 & 3 \\ 800 & 2 \end{bmatrix}$$
  • Each row is one data point (e.g., a house)
    • Row 1: [1000 sq. ft, 3 beds]
    • Row 2: [800 sq. ft, 2 beds]
  • Each column is a feature
    • Column 1: Square footage
    • Column 2: Bedrooms

Let

$$\mathbf{y} = \begin{bmatrix} 300 \\ 200 \end{bmatrix}$$

(e.g., prices in $K)

And the weights:

$$\mathbf{w} = \begin{bmatrix} 0.2 \\ 10 \end{bmatrix}$$

(so, $w_1 = 0.2$ means 0.2K per sq. ft, and $w_2 = 10$ means 10K per bedroom)

Let’s set $b = 0$ for clarity.


Matrix Multiplication Step-by-Step

The linear model for ALL data points at once is:

$$\mathbf{y} = \mathbf{X}\mathbf{w} + b$$

Let’s compute $\mathbf{X}\mathbf{w}$:

$$\mathbf{X}\mathbf{w} = \begin{bmatrix} 1000 & 3 \\ 800 & 2 \end{bmatrix} \begin{bmatrix} 0.2 \\ 10 \end{bmatrix} = \begin{bmatrix} (1000 \times 0.2) + (3 \times 10) \\ (800 \times 0.2) + (2 \times 10) \end{bmatrix} = \begin{bmatrix} 200 + 30 \\ 160 + 20 \end{bmatrix} = \begin{bmatrix} 230 \\ 180 \end{bmatrix}$$

So:

  • For the first house, predicted price: 230
  • For the second house, predicted price: 180
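
If you want to check this with code, here is a quick PyTorch sketch of exactly the same computation (same numbers as above):

import torch

# Two houses, two features each: [square footage, bedrooms]
X = torch.tensor([[1000.0, 3.0],
                  [800.0, 2.0]])
# One weight per feature: 0.2 (K per sq. ft) and 10 (K per bedroom)
w = torch.tensor([0.2, 10.0])

y_pred = X @ w  # matrix-vector product, with bias b = 0
print(y_pred)   # tensor([230., 180.])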

Why is the data a matrix?

  • You have multiple data points (rows).
  • Each data point has multiple features (columns).
  • Thus: Matrix!

Why are the weights a vector?

  • There’s one weight per feature.
  • In our case: two features $\rightarrow$ two weights, arranged in a column vector.

Why is the target $y$ a vector?

  • There’s one target per data point (the output variable for each row/data point).
  • Thus: Vector with as many entries as there are data points.

Summary Table

|       | Input Data ($\mathbf{X}$) | Weights ($\mathbf{w}$) | Output ($\mathbf{y}$) |
| ----- | --- | --- | --- |
| Shape | (number of data, features) $(N, D)$ | (features) $(D, 1)$ | (number of data) $(N, 1)$ |
  • Each data point ⟶ row of the matrix
  • Each feature ⟶ column of the matrix
  • Weights: one per feature
  • Targets: one per data point

Key Point

As soon as you have multiple features, and/or multiple data points, it’s most efficient to “stack” them into a matrix ($\mathbf{X}$).
The weights remain a vector, one per feature. When you multiply, you get predictions for all data points at once:

$$\mathbf{y}_{\text{pred}} = \mathbf{X}\mathbf{w} + b$$

Let me know if you’d like to see a visual or a small PyTorch demo of this!

USER:

for large neural networks such as in deep learning which have many layers, is it ALWAYS the case that the data is in the form of a matrix $X$, and the weights are a vector $w$, and the predictions are a vector $y$? or can the data be a vector, and the weights be a matrix?

ASSISTANT:

Great question! As neural networks get larger and deeper, the shapes of data and weights actually evolve as information passes through the layers. Sometimes data is a matrix and weights are a vector, but in neural networks, most often the weights are themselves matrices (or even higher-dimensional tensors), especially in fully connected (dense) layers.

Let’s break this down:


1. Input Data:

  • A single sample: Usually represented as a vector (e.g., an image flattened into a vector, a list of features, etc.).
  • A batch of samples: Represented as a matrix. Each row is one data sample; columns are features.

So, if you have $B$ samples, each with $D$ features:

  • Data: $\mathbf{X}$, shape $(B, D)$

2. Weights in Neural Networks:

In deep learning, almost all layers use weights that are matrices (or more complex tensors).

Fully Connected (Dense) Layer:

Suppose you have:

  • Input: vector $\mathbf{x}$ of size $D$
  • Output: vector $\mathbf{y}$ of size $H$ ($H$ can be any size; it’s the number of output neurons)

The operation is:

$$\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}$$

where:

  • $\mathbf{W}$ is a matrix of shape $(H, D)$ (each row corresponds to the weights for one output feature/neuron)
  • $\mathbf{x}$ is a vector of shape $(D,)$
  • $\mathbf{y}$ is a vector of shape $(H,)$

For batched input:

$$\mathbf{Y} = \mathbf{X}\mathbf{W}^T + \mathbf{b}$$
  • $\mathbf{X}$: matrix, $(B, D)$
  • $\mathbf{W}$: matrix, $(H, D)$ (sometimes transposed, depending on library convention)
  • $\mathbf{Y}$: matrix, $(B, H)$

3. Key Point: Is Data Always a Matrix and Weights Always a Vector?

No!

  • The data can be a vector (single input) or a matrix (input batch).
  • Weights are usually a matrix in fully connected layers, and in even more dimensions (tensors) for convolutional layers, transformers, etc.

Examples:

a) First layer (single sample):

  • $\mathbf{x}$ is a vector, $\mathbf{W}$ is a matrix, $\mathbf{y}$ is a vector.

b) Hidden layer (single sample):

  • Input: vector of activations from previous layer.
  • Weights: matrix.
  • Output: vector.

c) Input batch:

  • $\mathbf{X}$ is a matrix (samples × features)
  • $\mathbf{W}$ is a matrix (output_features × input_features)
  • Output is also a matrix (samples × output_features)

d) Convolutional/Transformer layers:

  • Weights can be 3D or 4D tensors!
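
As a quick sanity check of these shapes, you can inspect the weight tensors of PyTorch’s built-in layers (a minimal sketch; the layer sizes here are arbitrary):

import torch.nn as nn

dense = nn.Linear(in_features=3, out_features=4)
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)

print(dense.weight.shape)  # torch.Size([4, 3])  -> (output_features, input_features), a matrix
print(conv.weight.shape)   # torch.Size([16, 3, 3, 3]) -> a 4D weight tensor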

4. Why? The Generalized Principle

  • Linear layer (dense):
    • $[\text{output vector}] = [\text{weights matrix}] \times [\text{input vector}] + [\text{biases}]$
  • With batch:
    • Every row in the input matrix is multiplied by weights to produce outputs for all samples.

So, in deep learning:

  • Data: vector (for single sample) or matrix (for batch)
  • Weights: almost always a matrix (or higher-order tensor)
  • Output/prediction: vector (single) or matrix (batch)

5. Quick Example (PyTorch):

import torch

# batch_size = 2, input_dim = 3, output_dim = 4
X = torch.tensor([[1., 2., 3.], [4., 5., 6.]])   # shape: (2, 3)
W = torch.randn(4, 3)                            # shape: (4, 3)
b = torch.randn(4)                               # shape: (4,)
Y = X @ W.T + b                                  
# Y shape: (2, 4)

Summary Table

| Case | Data | Weights | Predictions |
| --- | --- | --- | --- |
| Scalar regression | scalar | scalar | scalar |
| Vector regression | vector | vector | scalar |
| Multiple samples | matrix | vector | vector |
| Fully connected | matrix | matrix | matrix |
| Deep/Large models | matrix | matrix/tensor | matrix/tensor |

Key Takeaway

In deep neural networks, weights are usually matrices (or higher-dimensional tensors), not just vectors, because they map between different sizes of input and output at each layer. The data can be a vector or a matrix (if using batches).

You do NOT always have data as a matrix and weights as a vector!
Most deep learning layers use weights as matrices (or tensors) to enable flexible, powerful transformations from one layer to the next.

Let me know if you’d like to see more specific examples — e.g., for CNNs or transformers!

USER:

so if i understand correctly, we can use either equation: $y = Xw + b$ or $y = Wx + b$. whether $X$ or $W$ is a matrix depends on how we approach a given problem. is that correct? and when it comes to deep neural networks, it is possible to take either approach, although we normally use $x$ as a vector and $W$ as a matrix?

ASSISTANT:

Exactly, you’re right! Let’s clarify both the notation and the reasoning behind these two equivalent-looking equations, and how they’re used in deep learning:


1. Both Equations Are Valid, Depending on the Context

  • $y = Xw + b$

    • Typically:
      • $X$ is a matrix of input data (e.g., shape $[N, D]$: $N$ samples, $D$ features),
      • $w$ is a weight vector of shape $(D,)$,
      • $b$ is a scalar or vector (for bias),
      • $y$ is the output vector of shape $(N,)$ (one for each sample).
  • $y = Wx + b$

    • Typically:
      • $W$ is a weight matrix of shape $(\text{output dim}, \text{input dim})$,
      • $x$ is a single input vector (length input dim),
      • $b$ is a bias vector (output dim),
      • $y$ is a vector of outputs (output dim).

2. Which is Which?

  • When you are talking about applying your model to a batch of data (many samples at once), your input is a data matrix $X$ (each row is a sample), and your weights are a vector $w$ (for plain linear regression) or a matrix $W$ (for multi-output).
  • When you’re talking about one data point at a time, your input is a vector $x$, and the weights are a matrix $W$ (for multi-output) or a vector $w$ (for single-output), and $y$ is a scalar or vector.

3. In Deep Neural Networks

  • The standard notation is:

    $$\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}$$

    where:

    • $\mathbf{x}$ is a vector input (from previous layer/inputs),
    • $\mathbf{W}$ is a matrix of learnable weights (one row per output neuron; one column per input),
    • $\mathbf{b}$ is a bias vector,
    • $\mathbf{y}$ is a vector (the activations before the activation function for the next layer).
  • For a batch: You often see

    $$\mathbf{Y} = \mathbf{X}\mathbf{W}^T + \mathbf{b}$$
    • Here, $\mathbf{X}$ is a batch of input vectors (a matrix),
    • $\mathbf{W}^T$ may be used to ensure shapes align (e.g., in PyTorch, where weight matrices are $(\text{output dim}, \text{input dim})$).

4. Which Do We “Usually” Use in Deep Learning?

  • Single data point: $\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}$
  • Batch: $\mathbf{Y} = \mathbf{X}\mathbf{W}^T + \mathbf{b}$
  • In code, most common is: input vectors ($\mathbf{x}$) fed into layers with weight matrices ($\mathbf{W}$).
  • It’s not about “correct or incorrect” — it’s about what is most convenient for description and computation in your context.

5. Summary Table

| Input | Weights | Output | Typical Equation | Used for |
| --- | --- | --- | --- | --- |
| Vector $x$ | Matrix $W$ | Vector $y$ | $y = Wx + b$ | One sample, deep NN |
| Matrix $X$ | Vector $w$ | Vector $y$ | $y = Xw + b$ | Batch, linear reg. |
| Matrix $X$ | Matrix $W$ | Matrix $Y$ | $Y = XW^T + b$ | Batch, deep NN |

Key Takeaway

  • Which symbol is data and which is weights is a matter of convention.
  • In deep learning, the most standard convention is: $\mathbf{x}$ is an input vector and $\mathbf{W}$ is a weight matrix.
  • For batching, data becomes a matrix ($\mathbf{X}$), and multiplication is adapted accordingly.
  • Either $y = Xw$ or $y = Wx$ can be used depending on your setup, as long as the dimensions match (see the short sketch below)!
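
Here is a minimal PyTorch sketch (with made-up numbers) showing the two conventions side by side, and that they agree when the shapes line up:

import torch

# One sample with 3 features, and a layer mapping 3 -> 2 outputs
x = torch.tensor([1.0, 2.0, 3.0])          # single input vector, shape (3,)
W = torch.tensor([[1.0, 0.0, 1.0],
                  [0.5, 0.5, 0.5]])        # weight matrix, shape (2, 3)
b = torch.tensor([0.0, 1.0])               # bias vector, shape (2,)

# Convention 1: one sample at a time, y = Wx + b
y_single = W @ x + b                       # shape (2,)

# Convention 2: a batch of samples, Y = X W^T + b
X = torch.stack([x, 2 * x])                # batch of 2 samples, shape (2, 3)
Y_batch = X @ W.T + b                      # shape (2, 2)

print(y_single)      # tensor([4., 4.])
print(Y_batch[0])    # first row of the batched result matches y_single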

TL;DR

In deep learning, we usually use $\mathbf{x}$ as an input vector and $\mathbf{W}$ as a weight matrix at each layer ($\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}$). For batches, the data input $\mathbf{X}$ becomes a matrix, and the weights remain a matrix or tensor. The shapes just need to align for proper multiplication!

Let me know if you’d like a visual walkthrough of these conventions as well!

USER:

what does this line mean?

W, B = torch.meshgrid(w_range, b_range, indexing='ij')

ASSISTANT:

Great question! That line is essential for building the loss landscape grid in your visualization example.

Let’s break it down:


What does torch.meshgrid do?

  • Given two 1D tensors, torch.meshgrid creates two 2D grids (matrices), where each coordinate in the grid is made from every possible pair of values from the input vectors.

  • Think of it as laying two axes ($w$ and $b$) out in 2D, and then producing a full grid of all possible $(w, b)$ combinations.


Breaking Down the Line

W, B = torch.meshgrid(w_range, b_range, indexing='ij')
  • w_range: 1D tensor of possible weight values.
  • b_range: 1D tensor of possible bias values.
  • torch.meshgrid(w_range, b_range, indexing='ij'):
    • Produces two 2D tensors, $W$ and $B$, both with shape (len(w_range), len(b_range)).
    • W[i, j] is set to w_range[i] (the $w$ value for row $i$).
    • B[i, j] is set to b_range[j] (the $b$ value for column $j$).
  • indexing='ij':
    • Specifies “matrix indexing” (like NumPy’s default), which is more natural for mathematical operations and plotting (row index, then column index).

Example

Suppose:

w_range = torch.tensor([1, 2])
b_range = torch.tensor([10, 20, 30])
W, B = torch.meshgrid(w_range, b_range, indexing='ij')

Then:

  • W is:
    [[1, 1, 1],
     [2, 2, 2]]
    
  • B is:
    [[10, 20, 30],
     [10, 20, 30]]
    

Each grid cell represents one particular pair $(w, b)$.


Why is this useful?

  • You can conveniently compute the loss for every combination of $w$ and $b$ across your specified ranges. Later, you use $W$ and $B$ (along with your data) to calculate a loss surface for plotting or analysis.

TL;DR

torch.meshgrid(w_range, b_range, indexing='ij') creates 2D grids of all $(w, b)$ pairs, so you can evaluate and plot the loss for any combination of weight and bias. The indexing='ij' ensures the “row = $w$” and “column = $b$” layout matches math/matrix convention.

Let me know if you’d like a code demo or visualization!

USER:

this code doesn’t work:

fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(W.numpy(), B.numpy(), loss_surface.numpy(), cmap='viridis', alpha=0.8)
ax.set_xlabel('Weight (w)')
ax.set_ylabel('Bias (b)')
ax.set_zlabel('MSE Loss')
ax.set_title('Loss Surface of Linear Regression')
plt.show()

the function plot_surface is not defined. what should i do?

ASSISTANT:

No worries! plot_surface is a method that only exists on 3D axes in Matplotlib (that is, on an ax created with projection='3d'), and 3D plotting requires the 3D toolkit. Since you already created the axis correctly, the most likely problem is that you haven’t imported the mpl_toolkits.mplot3d module, which enables 3D plotting in some Matplotlib setups.

To fix this:


1. Make Sure to Import the 3D Toolkit

At the top of your Python script, add this import:

from mpl_toolkits.mplot3d import Axes3D  # This line can be required in some setups for 3D plotting

Even though you don’t seem to use Axes3D directly, importing it ensures Matplotlib properly initializes 3D plotting.


2. Ensure Your Data is All Numpy Arrays

Your variables W, B, and loss_surface should be convertible to NumPy arrays, which you’re already doing with .numpy() calls.


3. Complete Example

Here’s a minimal version for 3D plotting:

import torch
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # <-- Add this line

# ...your code to define w_range, b_range, W, B, loss_surface...

fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
surf = ax.plot_surface(W.numpy(), B.numpy(), loss_surface.numpy(), cmap='viridis', alpha=0.8)
ax.set_xlabel('Weight (w)')
ax.set_ylabel('Bias (b)')
ax.set_zlabel('MSE Loss')
ax.set_title('Loss Surface of Linear Regression')
fig.colorbar(surf, shrink=0.5, aspect=5)
plt.show()

4. Additional Troubleshooting

  • If you get an error like AttributeError: 'AxesSubplot' object has no attribute 'plot_surface', check that you created the axis with projection='3d' (which you did) and that you imported Axes3D.
  • Make sure your Matplotlib version is 3.1 or higher (check with uv pip show matplotlib or uv pip list).

Summary

  • Import Axes3D from mpl_toolkits.mplot3d to properly set up 3D plots.
  • Make sure you use a 3D axis (projection='3d').
  • Call .plot_surface(...) on that axis.

Let me know if you still have an error message, and paste it here—I’ll help debug!
