Learn the Training Loop with PyTorch, Part 2.2: Forward and Backward Passes

2025-06-16 · Artintellica

USER:

i need your help to write the next blog post in a series i am producing called “Learn the Training Loop with PyTorch”.

first, let me give you the outline of the series:


Module 1: The Elementary Training Loop

Goal: Build foundational intuition by hand-coding a simple training loop from scratch.

1. Introduction: What is a Training Loop?

  • Definition and importance
  • Loss, optimization, and learning in ML

2. The Simplest Case: Linear Regression

  • What is linear regression?
  • The mean squared error loss
  • Deriving the update rules manually

3. Batch vs. Stochastic Gradient Descent

  • Differences and practical implications
  • Hands-on: See both in Python/numpy

4. Visualizing the Loss Landscape

  • Plotting loss as a function of parameters
  • Demo: Watching convergence by hand

5. Numerical vs. Analytical Gradients

  • How are gradients computed?
  • Manual calculation and confirmation with numerical gradient

6. Recap and Key Takeaways


Module 2: The Training Loop for Neural Networks

Goal: Extend the basic loop to more powerful models; introduce neural network concepts.

1. From Linear to Nonlinear: Why Neural Networks?

  • Multi-layer perceptrons (MLPs)
  • Activation functions (ReLU, sigmoid, tanh)

2. Forward and Backward Passes

  • The chain rule of differentiation
  • Backpropagation intuition (no need for full rigor yet)

3. Implementing a Simple Neural Net from Scratch

  • Single hidden layer network in numpy/Python
  • Coding forward, loss, backward, and parameter updates

4. The Role of Activations

  • Intuitive explanation and code examples
  • Effect on learning and expressiveness

5. Mini-batching and Data Pipelines

  • How batching speeds up and stabilizes learning

6. Regularization and Overfitting

  • Techniques: L2, dropout (conceptual intro)

7. Recap: Comparing Our Simple Network with Linear Regression


Module 3: Advanced Training Loops and Modern Tricks

Goal: Connect foundational understanding to the large-scale models used in industry.

1. Optimization Algorithms Beyond SGD

  • Adam, RMSProp, etc.: How they work, why they’re useful

2. Learning Rate Scheduling

  • How/why learning rates are changed during training
  • Demo: Effect of learning rate on convergence and results

3. Weight Initialization

  • Why it matters, common strategies

4. Deeper Networks and Backprop Challenges

  • Vanishing/exploding gradients
  • Solutions: normalization, skip connections

5. Large-Scale Training: Data Parallelism and Hardware

  • How the same training loop runs efficiently on GPUs and across many machines

6. Monitoring and Debugging the Training Loop

  • Visualizing loss/accuracy, diagnosing common problems

7. Modern Regularization and Generalization Techniques

  • Early stopping, batch/layer normalization, data augmentation (conceptual intro)

8. The Training Loop in Practice: Case Studies

  • High-level overview of how OpenAI/Google/Meta might train LLMs: what’s the same, what’s different.
  • What “scaling up” really means: dataset, parameters, hardware, tricks.

9. Conclusion: What’s Next After the Training Loop?

  • Discussion: fine-tuning, transfer learning, and unsupervised/self-supervised methods.

General Tips for Each Module:

  • Start simple, build incrementally: Keep code simple at first, then show how to layer complexity.
  • Accompany every lesson with practical code examples.
  • Focus on intuition: Use diagrams, plots, and hands-on demo to make concepts tangible.
  • Summarize and review regularly: Add checklists or quizzes at the end of each module.

great. now, for each blog post, i am hoping to follow the following structure:

  • the course is called “Learn the Training Loop with PyTorch”
  • always provide a brief introduction to the topic at hand
  • then always provide a very simple “ELI5” explanation of the topic, that theoretically anyone could understand
  • then provide the mathematical foundations. for math, always use dollar delimited LaTeX, and always provide a brief explanation of the math. for instance, $\mathbf{y} = \mathbf{X}\mathbf{w} + \mathbf{b}$ is the equation for linear regression, where $\mathbf{y}$ is the output vector, $\mathbf{X}$ is the input matrix, $\mathbf{w}$ is the weight vector, and $\mathbf{b}$ is the bias vector.
  • for block-level math, always use two dollar signs, like this: $$\mathbf{y} = \mathbf{X}\mathbf{w} + \mathbf{b}$$
  • after the mathematical foundations, such as definitions, then always provide an in-depth explanation, covering the topic in detail, and expanding on the explanation of the math if necessary.
  • for other formatting, use markdown like headings, bullet points, and code blocks to make the content readable. do not use HTML tags.
  • after the math is complete, then provide one or more demonstrations of the topic using PyTorch. if possible, always use PyTorch, not numpy, just to keep it simple and consistent. you can also use matplotlib to visualize the results.
  • for all python code, always use type hints, and, if necessary, describe installation instructions using “uv” (not pip).
  • then, create a series of exercises, somewhere from 1 - 4 total exercises that cover the material. please provide the source code for each exercise to make sure the student can easily compare their solution to the correct answer. do not hide the exercises and do not put HTML in the exercises.
  • finally, provide a summary of the topic, and a list of key takeaways.

that’s it, now can you please write blog post 2.2?

the description is:

2. Forward and Backward Passes

  • The chain rule of differentiation
  • Backpropagation intuition (no need for full rigor yet)

please respond with the full blog post only, and no other information.

ASSISTANT:

Learn the Training Loop with PyTorch

Module 2.2: Forward and Backward Passes

Introduction

Welcome back to Learn the Training Loop with PyTorch!

In this post, we’ll explore two essential concepts at the heart of neural network training: the forward pass and the backward pass. These steps are the backbone of how neural networks learn from data. We’ll also introduce the chain rule—a simple but crucial mathematical tool that powers backpropagation, the algorithm that helps neural networks adjust their weights and get smarter with each step.


ELI5: Forward and Backward Passes

  • Forward Pass: Imagine a factory assembly line—data comes in, and at each station (layer), it’s processed, until we get an output (like a finished product). In neural networks, this means feeding an input through the network to produce a prediction.

  • Backward Pass: Now, after seeing how good or bad the prediction is (using a loss function), we want to improve the factory. The backward pass is like tracking back through each station and figuring out how much each needs to be tweaked to reduce errors next time. This is done by calculating gradients—essentially, how much each parameter (weight) affects the error.


Mathematical Foundations

The Forward Pass

For a simple neural network with one hidden layer, the forward pass computes:

$$\mathbf{z}^{(1)} = \mathbf{X}\mathbf{W}^{(1)} + \mathbf{b}^{(1)}$$

$$\mathbf{a}^{(1)} = \sigma(\mathbf{z}^{(1)})$$

$$\mathbf{z}^{(2)} = \mathbf{a}^{(1)}\mathbf{W}^{(2)} + \mathbf{b}^{(2)}$$

$$\hat{\mathbf{y}} = \mathbf{z}^{(2)}$$

  • $\mathbf{X}$: input data
  • $\mathbf{W}^{(1)},\ \mathbf{W}^{(2)}$: weights for layers 1 and 2
  • $\mathbf{b}^{(1)},\ \mathbf{b}^{(2)}$: bias vectors
  • $\sigma$: activation function (like ReLU or sigmoid)
  • $\hat{\mathbf{y}}$: predicted output

The Chain Rule

Suppose you have functions “nested” inside each other:

$$y = f(g(x))$$

The chain rule allows us to compute the derivative of $y$ with respect to $x$:

$$\frac{dy}{dx} = f'(g(x)) \cdot g'(x)$$

This rule allows us to work backwards through each layer to compute how the loss changes with each model parameter.
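Here’s a tiny PyTorch sanity check of the chain rule (a minimal sketch; the full network demo comes later in this post):

```python
import torch

# y = f(g(x)) with f(u) = u^2 and g(x) = 3x + 1
x = torch.tensor(2.0, requires_grad=True)
g = 3.0 * x + 1.0   # g(2) = 7, and g'(x) = 3
y = g ** 2          # f'(u) = 2u
y.backward()        # autograd applies the chain rule for us

# By hand: dy/dx = f'(g(x)) * g'(x) = (2 * 7) * 3 = 42
print(x.grad)  # tensor(42.)
```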

The Backward Pass (Backpropagation)

The backward pass computes gradients of the loss function with respect to each weight and bias (using the chain rule), so parameters can be updated (typically by gradient descent). For a loss function $\mathcal{L}$:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \; ?$$

For each layer, gradients are backpropagated from the output towards the input, using partial derivatives.
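For our two-layer network above, for example, the chain rule unrolls the output-layer gradient as

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(2)}} = \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}} \cdot \frac{\partial \hat{\mathbf{y}}}{\partial \mathbf{z}^{(2)}} \cdot \frac{\partial \mathbf{z}^{(2)}}{\partial \mathbf{W}^{(2)}}$$

and the gradient for $\mathbf{W}^{(1)}$ simply picks up more factors as the chain extends back through $\mathbf{a}^{(1)}$ and $\mathbf{z}^{(1)}$.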

Why Does This Matter?

To improve a model, we need to know how changing each parameter will affect its predictions. The chain rule gives us an efficient way to systematically calculate all these effects, no matter how deep the network.


In-Depth Explanation

The Forward Pass: Data Flows In

  • Input vectors are fed into the network.
  • Every layer processes its inputs: multiplies by weights, adds biases, passes through activation functions.
  • Final output (the prediction) is produced.

Example: Suppose we have an image, and a neural net predicts whether it’s a cat or a dog. The forward pass is the act of generating that prediction from pixel values.

The Backward Pass: Gradients Flow Out

  • After we compute the loss (a measure of how “bad” our prediction was), we want to adjust our weights to do better next time.
  • The backward pass calculates how much each weight and bias contributed to the error.
  • This process repeatedly applies the chain rule:
    • For each layer (starting from the output layer), we compute the gradient of the loss with respect to its parameters.
    • We use these gradients to update model parameters using an optimizer (like SGD or Adam).

A Simple Example: Two-Layer Network

Imagine you have the following sequence:

$$\text{Input} \xrightarrow{\mathbf{W}^{(1)},\ \mathbf{b}^{(1)}} \text{Hidden} \xrightarrow{\text{ReLU}} \text{Activation} \xrightarrow{\mathbf{W}^{(2)},\ \mathbf{b}^{(2)}} \text{Output}$$

During forward, all we do is plug the input in and calculate the output.

During backward, let’s say our output was too high. The backward pass figures out:

  • Was it because $\mathbf{W}^{(2)}$ was too large? Or maybe $\mathbf{W}^{(1)}$?
  • It breaks down the error gradient for each parameter.
  • Optimizer nudges weights and biases in the right direction to reduce the loss next time.

Backpropagation is just the systematic application of the chain rule so we can efficiently update all parameters, no matter how many layers we have.


PyTorch Demonstration

Let’s see this in practice with PyTorch. We’ll build a tiny neural network, make a dummy prediction, and watch the gradients flow!

Installation (if needed)

uv pip install torch matplotlib

Example Code: Forward and Backward in PyTorch

import torch
from torch import nn
from typing import Tuple

# Set manual seed for reproducible results
torch.manual_seed(0)

# Define a simple feedforward neural network with 1 hidden layer
class SimpleNN(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int):
        super().__init__()
        self.layer1 = nn.Linear(input_dim, hidden_dim)
        self.activation = nn.ReLU()
        self.layer2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.layer1(x)
        x = self.activation(x)
        x = self.layer2(x)
        return x

# Dummy data: 2 samples, 3 features each
X = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
y_true = torch.tensor([[10.0], [20.0]])

# Instantiate the network
model = SimpleNN(input_dim=3, hidden_dim=4, output_dim=1)

# Forward Pass
y_pred = model(X)
print('Predictions:', y_pred.detach().numpy())

# Compute loss (Mean Squared Error)
criterion = nn.MSELoss()
loss = criterion(y_pred, y_true)
print('Loss:', loss.item())

# Backward Pass
model.zero_grad()     # Clears old gradients
loss.backward()       # Computes gradients

# Inspect gradients
print('Gradient of layer1 weights:')
print(model.layer1.weight.grad)
print('Gradient of layer2 weights:')
print(model.layer2.weight.grad)

What’s happening above?

  • First, we create a mini neural net with one hidden layer (layer1).
  • We run two simple data points through the model (forward pass).
  • We compute the loss.
  • .backward() backpropagates the error, populating .grad attributes of the parameters via the chain rule!
  • We print the gradients—these tell us how to update our weights during training.
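To complete the picture, an optional extra step (not shown above) is to let an optimizer consume those gradients; with a small enough learning rate, the loss typically drops after the update:

```python
from torch import optim

# One illustrative update using the gradients computed by loss.backward()
optimizer = optim.SGD(model.parameters(), lr=0.01)
optimizer.step()  # applies theta <- theta - lr * grad to every parameter

print('Loss after one step:', criterion(model(X), y_true).item())
```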

Exercises

Exercise 1: Manually Step through Forward and Backward Pass

Write your own code to:

  1. Create an input tensor of your choice (shape (1, 3)).
  2. Pass it through the model.
  3. Compute the loss with respect to your chosen target.
  4. Perform backward pass.
  5. Print the gradients.

Solution:

# 1. Input and target
X = torch.tensor([[7.0, 8.0, 9.0]])
y_true = torch.tensor([[30.0]])

# 2. Forward pass
y_pred = model(X)

# 3. Compute loss
loss = criterion(y_pred, y_true)

# 4. Backward pass
model.zero_grad()
loss.backward()

# 5. Print gradients
print('Grad for layer1:', model.layer1.weight.grad)
print('Grad for layer2:', model.layer2.weight.grad)

Exercise 2: Visualize How Loss Changes with Manual Weight Tweaks

Pick a single parameter in layer 2, slightly adjust its value, run a forward pass, and plot how the loss changes.

Solution:

import matplotlib.pyplot as plt

# Grab original weight and clone for restoration
original_weight = model.layer2.weight.data.clone()
weights = []
losses = []

for delta in torch.linspace(-2, 2, steps=50):
    # change just the first entry in layer2 weights
    model.layer2.weight.data[0,0] = original_weight[0,0] + delta
    y_pred = model(X)
    loss = criterion(y_pred, y_true)
    weights.append(model.layer2.weight.data[0,0].item())
    losses.append(loss.item())

# Restore original weight
model.layer2.weight.data[0,0] = original_weight[0,0]

plt.plot(weights, losses)
plt.xlabel('layer2.weight[0,0] value')
plt.ylabel('Loss')
plt.title('Effect of changing a single weight on loss')
plt.show()

Summary

  • The forward pass sends input data through the network to get predictions.
  • The backward pass starts from the loss and measures how much each network parameter contributed to the error, by calculating gradients.
  • The chain rule of calculus lets us efficiently compute these gradients even for deep, complex networks—a process called backpropagation.
  • PyTorch automates these steps using its computational graph and the .backward() method.

Key Takeaways

  • Forward pass: data flows forward to make a prediction.
  • Backward pass: errors flow backward; gradients measure sensitivity of loss to each parameter.
  • Chain rule: the fuel behind backpropagation; breaks complex derivatives into manageable pieces.
  • PyTorch’s autograd: automatically handles all math for derivatives and backward passes!
  • Understanding these steps gives you deep insight into how and why neural networks learn.

Next time, we’ll build on this knowledge and hand-code a simple neural network from scratch!

USER:

i have a question about weights and biases. how are the initial weights and biases determined? are they random?

ASSISTANT:

Great question!

Initialization of Weights and Biases in Neural Networks

TL;DR:
Weights in neural networks are usually initialized randomly (but in a controlled way), and biases are often initialized to zeros or small constants. This randomness is crucial for training, and the specific method can impact learning speed and model performance.


ELI5 Explanation

  • Think of initial weights like the starting positions in a “treasure hunt”:
    If everyone started at the same spot (say, all weights = 0), they’d all take the same path and never learn uniquely useful information.
  • By starting from different (random) positions, the neurons in the network learn to detect different patterns in the data.

Common Methods

Weights

  • Random Initialization:
    Each weight is set to a small random value, typically drawn from a carefully scaled distribution (see the schemes below).

  • Why not all zeros?
    If all weights are zero, every neuron in a layer behaves identically. This is called the symmetry problem—no individual neuron can learn something unique.

  • Popular Schemes:

    • Uniform or normal distribution:
      $$w \sim \mathcal{U}(-a, a) \quad \text{or} \quad w \sim \mathcal{N}(0, \sigma^2)$$
    • Xavier/Glorot initialization (for tanh, sigmoid):
      $$w \sim \mathcal{U}\left(-\frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}},\ \frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}\right)$$
    • Kaiming/He initialization (for ReLU):
      $$w \sim \mathcal{N}\left(0,\ \frac{2}{n_{in}}\right)$$
      • $n_{in}$ = number of input connections, $n_{out}$ = number of output connections

Biases

  • Usually Initialized to Zero:
    Biases don’t have the symmetry issue that weights have, so it’s safe (and common) to initialize all biases to $0$ or to a small constant.

PyTorch Example

Here’s how PyTorch handles this under the hood:

layer = torch.nn.Linear(in_features=10, out_features=5)
print("Initial weights:", layer.weight)
print("Initial biases:", layer.bias)
  • torch.nn.Linear layers use Kaiming Uniform by default for weights.
  • Biases are drawn from a small uniform range, $\mathcal{U}(-1/\sqrt{n_{in}},\ 1/\sqrt{n_{in}})$, rather than strictly zeros (zero bias initialization is common in other frameworks and hand-rolled code).
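If you want different behavior, you can re-initialize parameters yourself with torch.nn.init. A minimal sketch for a ReLU network (He weights, zero biases) might look like this:

```python
import torch
from torch import nn

layer = nn.Linear(in_features=10, out_features=5)

# Override the defaults: Kaiming/He normal weights for ReLU, zero biases
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
nn.init.zeros_(layer.bias)

print("Custom weights:", layer.weight)
print("Custom biases:", layer.bias)
```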

Key Takeaways

  • Weights: Randomized (with statistical care!) to break symmetry and ensure diversity in learning.
  • Biases: Often zeros or small constants; no need for randomness.
  • Initialization strategy matters: Good initial values make training faster and more stable, while poor initialization can slow or even prevent learning.

If you’re curious about more advanced initialization strategies in PyTorch, just ask!

USER:

on each backwards pass, are all weights and biases updated?

ASSISTANT:

Great question!
In nearly all standard neural network training (including with PyTorch), yes:

On each backward pass, all trainable weights and biases in the network receive gradient updates.


Step-by-Step Breakdown

  1. Forward Pass:

    • The network computes the output and the loss for the current batch of data.
  2. Backward Pass:

    • The network computes gradients of the loss with respect to every trainable parameter (weights and biases) using backpropagation.
    • Mathematically, for each parameter $\theta$, the gradient $\frac{\partial \mathcal{L}}{\partial \theta}$ gets calculated, where $\mathcal{L}$ is the loss.
  3. Parameter Update:

    • The optimizer (e.g., SGD, Adam) uses these gradients to update every parameter (weight and bias) that requires gradients.
    • Example update rule (for gradient descent):
      $$\theta \leftarrow \theta - \eta \cdot \frac{\partial \mathcal{L}}{\partial \theta}$$
      where $\eta$ is the learning rate.
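As a concrete illustration, here is that update rule applied by hand; this is a sketch equivalent to optimizer.step() for plain SGD (no momentum), assuming model is a network whose gradients were just computed by loss.backward():

```python
import torch

learning_rate = 0.01
with torch.no_grad():  # parameter updates should not be tracked by autograd
    for param in model.parameters():
        if param.grad is not None:
            param -= learning_rate * param.grad
```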

Are There Any Exceptions?

  • If you freeze parameters:
    Parameters with requires_grad=False (e.g., in transfer learning) won’t receive gradients and won’t be updated.
  • Some layers (like BatchNorm/Dropout):
    Have parameters that may or may not be updated depending on how they’re set up.
  • Optimizers can target only some parameters:
    But in a typical freshly built network, all weights and biases are included.

PyTorch Example

for param in model.parameters():
    print(param.requires_grad)  # Default is True (so all will be updated)
  • All True? All will be updated.

Key Takeaway

By default, every weight and bias in the network is updated on every backward pass.

If you want to “freeze” some layers or parameters (not update them), you set .requires_grad = False for those parameters.
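Here is a minimal sketch of freezing in practice, assuming the SimpleNN model from the blog post above:

```python
# Freeze layer1 so only layer2's weights and biases receive updates
for param in model.layer1.parameters():
    param.requires_grad = False

# Hand the optimizer only the parameters that still require gradients
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.01)
```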

Let me know if you want to see freezing used inside a complete training loop!

USER:

i’m still confused about stochastic gradient descent vs. mini-batch gradient descent. in both cases, are all weights and biases updated on each backward pass?

ASSISTANT:

Fantastic question! This is something a lot of learners wonder about.

Stochastic Gradient Descent vs. Mini-Batch Gradient Descent

Definitions

  • Stochastic Gradient Descent (SGD):
    Each update uses one training sample.

  • Mini-batch Gradient Descent:
    Each update uses a small batch of samples (e.g., 32 or 64 at a time).

What These Algorithms Control

The main difference is how much data they use to compute the loss and gradients for each parameter update.

  • SGD: computes gradients based on just ONE data point.
  • Mini-batch: computes gradients as the average over several data points.

Do All Weights and Biases Get Updated in Each Case?

Yes! In both cases:

On each backward pass, all weights and biases in the model (that require gradients) are updated, regardless of how many samples are in the batch.

The only difference:

  • The batch size determines how the loss and gradients are computed,
  • But the result is still a gradient for every parameter, and every parameter gets updated on that step.

A Visual Analogy

  • Imagine every update as tugging every knob (parameter) a little bit—sometimes based on noisy, single-sample feedback (SGD), sometimes on a more stable average opinion (mini-batch).
  • But you’re tugging every knob every time!

How It Looks in PyTorch

# Data loader (mini-batch of size 32)
for X_batch, y_batch in dataloader:
    optimizer.zero_grad()
    y_pred = model(X_batch)
    loss = criterion(y_pred, y_batch)
    loss.backward()
    optimizer.step()
    # -> All parameters get updated here

# For true SGD, you set batch_size=1 in the DataLoader.

Summary Table

| Method | Samples per update | Gradients/Step | Parameters updated per step |
|---|---|---|---|
| SGD | 1 | per-sample | All |
| Mini-batch GD | N (e.g., 32) | average | All |

Key Point

No matter the batch size, every weight and bias gets updated each backward pass—just with gradients computed from different amounts of data.

If you’re curious about the real effect of batch size, it’s mostly about how “noisy” those gradients are, and the speed/efficiency of training—not about what gets updated!


Still have questions about batching, data loaders, or parameter updates? I’m happy to keep exploring!

USER:

so, in mini-batch and SGD, we are not using the full matrix X of data, correct? but we are nonetheless updating all weights and biases in the model, right? and in full batch, we are using the full matrix X, and we are also updating all weights and biases in the model, right?

ASSISTANT:

Absolutely correct! Here’s a concise breakdown:


1. Stochastic Gradient Descent (SGD)

  • Data used per update: One sample (a single row/vector from $\mathbf{X}$).
  • Weights/biases updated: All weights and biases are updated each step.
  • $\mathbf{X}$ used: Only one row at a time, not the full matrix.

2. Mini-batch Gradient Descent

  • Data used per update: A small batch of samples from $\mathbf{X}$ (e.g., 32 or 64 rows at a time).
  • Weights/biases updated: All weights and biases are updated each step.
  • $\mathbf{X}$ used: Only a slice (a batch) at a time, not the full matrix.

3. Full-Batch Gradient Descent

  • Data used per update: The full dataset $\mathbf{X}$ (all rows).
  • Weights/biases updated: All weights and biases are updated each step.
  • $\mathbf{X}$ used: The entire matrix is used for one giant step.

Table Summary

| Method | Data used per update | All weights/biases updated per step? |
|---|---|---|
| SGD | One sample | YES |
| Mini-batch GD | Small batch (e.g., 32) | YES |
| Full-batch GD | All data ($\mathbf{X}$) | YES |

Key Point

No matter the batch size (1, mini-batch, or full batch):

  • You always update all weights and all biases after each backward pass.
  • The difference is just how much of your data you use to compute the gradients for that step.
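In PyTorch terms, all three methods can share the exact same training loop; only the DataLoader’s batch_size changes. A minimal sketch with made-up toy data:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 100 samples, 3 features each (illustrative values)
dataset = TensorDataset(torch.randn(100, 3), torch.randn(100, 1))

# The only thing that differs between the three methods is batch_size:
sgd_loader = DataLoader(dataset, batch_size=1, shuffle=True)          # SGD
minibatch_loader = DataLoader(dataset, batch_size=32, shuffle=True)   # mini-batch
fullbatch_loader = DataLoader(dataset, batch_size=len(dataset))       # full batch
```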

Let me know if you want a fuller visual demo comparing these methods during actual training!

USER:

how many weights and biases are there for large models, like modern LLMs? presumably the data, X, is truly enormous, such as all internet text data. but also, if all weights and biases are updated on each backward pass, this presumably fits into a single computer, or maybe a single cluster of computers. how big are the weights and biases in these models?

ASSISTANT:

Outstanding series of questions! Let’s break it down step by step:


1. How Many Weights and Biases in Large LLMs?

Modern large language models (LLMs) like GPT-3, GPT-4, or Google’s PaLM have billions to trillions of parameters.
Parameters = weights + biases (and sometimes other learnable values, like in layer norm).

  • GPT-3 (175B):
    ~175 billion parameters
  • GPT-4:
    Number not fully public, but possibly trillions according to some estimates
  • PaLM (540B):
    ~540 billion parameters

A small BERT base model for reference:

  • BERT base: ~110 million parameters

2. How Large is That in Memory?

Let’s estimate for GPT-3:

  • 175,000,000,000 parameters
  • Typically stored as float32 or float16:
    • float32: 4 bytes each
    • float16: 2 bytes each

In float16 (common in large models):

$$175 \text{ billion} \times 2 \text{ bytes} = 350 \text{ GB}$$

In float32:

$$175 \text{ billion} \times 4 \text{ bytes} = 700 \text{ GB}$$

That’s just for weights—other memory is needed for activations, gradients, optimizer states, etc.
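As a quick sanity check, you can reproduce this arithmetic in a few lines of Python:

```python
params = 175_000_000_000  # GPT-3's approximate parameter count

print(f"float16: {params * 2 / 1e9:,.0f} GB")  # -> 350 GB
print(f"float32: {params * 4 / 1e9:,.0f} GB")  # -> 700 GB
```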


3. How Big is $\mathbf{X}$ (the Data)?

  • The training dataset for models like GPT-3 is on the order of hundreds of billions of words—essentially a filtered snapshot of the publicly available internet, books, Wikipedia, coding repositories, etc.
  • The entire raw dataset could be terabytes to tens of terabytes in size, but you don’t feed all of $\mathbf{X}$ into the model at once. You use mini-batches.

4. How Do They Fit the Model and Training on Hardware?

  • All model weights and biases must be accessible for updating on each backward pass.

  • For LLMs, the set of weights doesn’t fit on a single GPU (which typically has 80GB or less of memory).

  • Solution: Model parallelism and distributed training:

    • Parameters are split across multiple GPUs/TPUs/servers—each only stores and updates a “slice” of the entire set.
    • Communication machinery: Specialized software coordinates sharing gradients and applying weight updates consistently across devices.
  • Data parallelism:

    • Multiple batches are trained simultaneously on different slices of data, then their gradients are averaged.
  • Training LLMs often requires thousands of high-end GPUs working together for weeks or months.


5. Key Points

  • Weights/biases in LLMs number in the hundreds of billions or more.
  • Raw dataset $\mathbf{X}$ is huge, but only small batches (mini-batches) are used per training step.
  • All weights/biases are updated every step (for every mini-batch), but they are often split up (sharded) across many devices for parallel training.
  • You cannot fit all weights and all of $\mathbf{X}$ into a single GPU or standard computer in these scenarios.
    Instead, large clusters act as a single “virtual super-computer.”

Quick Table

| Model | # Parameters | Float16 Memory |
|---|---|---|
| BERT Base | 110 million | 220 MB |
| GPT-2 | 1.5 billion | 3 GB |
| GPT-3 | 175 billion | 350 GB |
| PaLM | 540 billion | 1.08 TB |


Summary

  • LLMs have hundreds of billions or more weights/biases.
  • All are updated each step, but are distributed across huge clusters.
  • The training data $\mathbf{X}$ is enormous, but only a tiny slice is used for each backward pass.

Want to know more about distributed/model parallelism or how this is implemented in PyTorch? Just ask!

USER:

ok, so although all weights and biases might not fit on a single computer at a time, they are still held in memory on a computing cluster of some kind, because on every backwards pass, every weight and bias is updated, correct? presumably, every update, they are saved to disk, so that they are saved in case the computers crash.

ASSISTANT:

You’ve got it exactly right!
Let’s clarify the process for large models like LLMs:


Where Are Weights and Biases Stored?

  • During training:
    • All weights and biases must be present in memory (RAM or GPU memory) somewhere in the computing cluster so that gradients can be calculated and updates performed.
    • In the case of LLMs, these parameters are partitioned (“sharded”) across many GPUs and/or nodes, so that each machine is responsible for a slice of the total parameters.
    • Each backward pass: The necessary synchronization/communication ensures that every parameter gets its gradient computed and updated, even though no single node may have all parameters at once.

What about persistent storage (saving to disk)?

  • Checkpoints:

    • During training, models are routinely saved to disk (“checkpointing”). This often happens every few minutes, hours, or after a certain number of steps/epochs.
    • Checkpoints include:
      • All weights and biases (and sometimes optimizer state, learning rate schedule, etc).
      • This way, if training halts (crash, power outage, maintenance), it can resume from the last checkpoint.
  • Why not update disk every step?

    • Updating to disk every backward pass would be way too slow—writing hundreds of GB or even TBs to disk at each step would grind training to a halt.
    • So, weight/bias updates are performed “in memory” (across the cluster) each step, and periodically written to disk.
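In PyTorch, a basic checkpoint looks something like the sketch below; real LLM pipelines layer sharded, distributed checkpointing on top of this idea:

```python
import torch
from torch import nn

# Tiny stand-in model/optimizer so the sketch runs; in real training
# these come from your training loop
model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
step = 1000  # example: current training step

# Save a checkpoint (done every N steps, not every step)
torch.save({
    "step": step,
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
}, "checkpoint.pt")

# ...later, to resume after a crash:
checkpoint = torch.load("checkpoint.pt")
model.load_state_dict(checkpoint["model_state"])
optimizer.load_state_dict(checkpoint["optimizer_state"])
step = checkpoint["step"]
```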

What happens if something crashes?

  • If a crash occurs, training resumes from the most recent checkpoint.
    Some progress may be lost, but not everything.

Key Points

  • All weights/biases are present and updated in memory during each backward pass, even if they’re split across hundreds or thousands of machines.
  • They’re not saved to disk every update, but are checkpointed regularly to protect against catastrophic failure.
  • Each update requires all parameters to be available somewhere in memory (spread across the computing cluster), so gradients and updates can happen for all parameters.

Summary Statement

On every backward pass, all weights and biases are in memory across the computing cluster and are updated in place.
Checkpointing to disk is done at intervals (not every step), so training can resume if there’s a failure.


Let me know if you’d like to see a diagram or code-related illustrations of this process!

USER:

when i do the final exercise, changing one weight has no discernable effect on loss. is this intentional?

ASSISTANT:

Fantastic observation!
This can indeed be surprising as a beginner. Here’s what’s happening and why you’re seeing this effect:


Why Does Tweaking a Single Weight Barely Affect Loss?

1. Network Scale:

Your exercise is on a small, randomly initialized neural network, likely with several weights in each layer (e.g., hidden_dim=4 or more), and with arbitrary input and target values.

2. Magnitude of Change:

If you only tweak a single weight by a very small amount (say, ±0.01), the effect on the final output and on the loss can be almost imperceptible—especially if:

  • The weight isn’t in a dominant position (e.g., not the only nonzero weight).
  • The activations passing through that weight are small or zero (as can happen with ReLU).

3. Layers and Nonlinearities:

In a multilayer network, the effect of changing one weight can become “diluted” by subsequent computations and activation functions.

4. Random Initializations:

Depending on current weight values, some neurons may not contribute much (e.g., “dead” ReLU units with zero activation for a given input). Adjusting weights going into such units might not affect loss much for the current example.

5. Batch Size:

If you run the exercise with a single data point (as in the provided code), and the network is untrained, outputs may be far from the target and highly insensitive to a tiny change.


How To See an Effect?

  • Try changing the selected weight by a larger amount (e.g., ±0.5, ±1.0, or more).
  • Try multiple forward passes with very different input data or target values.
  • Increase plotting resolution: Plot the loss for a much wider range of weight adjustments (as in the demonstration code, which varies a weight over a range like -2 to +2).
  • Print the output along with the loss: if the output itself doesn’t move when you tweak the weight, you know that weight simply has no influence for this input.

Is This Intentional?

Yes, in a sense—this is real neural network behavior!

  • Not all weights have a strong effect on the loss for a given batch/input and network state.
  • That’s why neural networks use all gradients together, and rely on multiple steps to gradually move all parameters towards better values.

Takeaway

Individual parameters often have a subtle or imperceptible effect on the total loss—especially early in training, with random inputs and initializations.
This is why neural networks need many updates and many data points to learn effectively.


Want to see a more sensitive example or how gradient size relates to parameter effect? I can show you more code or plots if you like!

USER:

how do i change this code to do, say, 10 forward passes?

for delta in torch.linspace(-200, 200, steps=50):
    # change just the first entry in layer2 weights
    model.layer2.weight.data[0,0] = original_weight[0,0] + delta
    y_pred = model(X)
    loss = criterion(y_pred, y_true)
    weights.append(model.layer2.weight.data[0,0].item())
    losses.append(loss.item())

ASSISTANT:

Great question!
To perform 10 forward passes (e.g., on 10 different inputs/targets) while sweeping a weight, you have a few options. Here’s the most direct way:

Option 1: Loop Over Deltas and Over 10 Different Inputs

First, create a batch of 10 input vectors and 10 corresponding targets:

# Example: 10 random input rows (shape: [10, input_dim])
num_samples = 10
input_dim = 3

X_10 = torch.randn(num_samples, input_dim)
y_true_10 = torch.randn(num_samples, 1) * 10  # arbitrary targets

Now, for each delta, apply it to the weight, do a forward pass on all 10 samples at once, and compute the average loss (or sum, as you prefer):

weights = []
losses = []

for delta in torch.linspace(-200, 200, steps=50):
    model.layer2.weight.data[0,0] = original_weight[0,0] + delta
    y_pred = model(X_10)                 # Forward on all 10 inputs
    loss = criterion(y_pred, y_true_10)  # MSELoss averages by default
    weights.append(model.layer2.weight.data[0,0].item())
    losses.append(loss.item())

This will show you how changing the weight affects loss across a mini-batch of 10 samples.


Option 2: Loop Over Each Delta and Each Input (Collect All Results)

If you want to see the effect of weight changes on each input individually (i.e., plot 10 loss curves for each input):

all_losses = []

for i in range(num_samples):
    # Keep each sample's losses separately
    sample_losses = []
    for delta in torch.linspace(-200, 200, steps=50):
        model.layer2.weight.data[0,0] = original_weight[0,0] + delta
        y_pred = model(X_10[i:i+1])  # Forward on one sample
        loss = criterion(y_pred, y_true_10[i:i+1])
        sample_losses.append(loss.item())
    all_losses.append(sample_losses)

# Now, all_losses[i][j] is the loss for sample i at delta j

Example Plot for Option 1:

import matplotlib.pyplot as plt

plt.plot(weights, losses)
plt.xlabel('layer2.weight[0,0] value')
plt.ylabel('Loss (avg over 10 samples)')
plt.title('Effect of changing a single weight on loss (10 forward passes)')
plt.show()

Summary

  • To do “10 forward passes” at each weight, either use a batch of 10 inputs/targets (as shown above), or loop individually over 10 samples if you want details per sample.
  • You’ll get a much smoother, less noisy loss curve by using a batch.

Let me know if you’d like a notebook-ready, complete example!

USER:

that didn’t work. let’s go back to the original code. this code plots a perfectly straight line, AFAICT, indicating that changing the weight has no effect on loss. first, do you see anything wrong with this code? and second, what is the minimal way to change it to see an effect on loss?

import torch
from torch import nn
from typing import Tuple

# Set manual seed for reproducible results
torch.manual_seed(0)

# Define a simple feedforward neural network with 1 hidden layer
class SimpleNN(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int):
        super().__init__()
        self.layer1 = nn.Linear(input_dim, hidden_dim)
        self.activation = nn.ReLU()
        self.layer2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.layer1(x)
        x = self.activation(x)
        x = self.layer2(x)
        return x

# Dummy data: 2 samples, 3 features each
X = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
y_true = torch.tensor([[10.0], [20.0]])

# Instantiate the network
model = SimpleNN(input_dim=3, hidden_dim=4, output_dim=1)

criterion = nn.MSELoss()

# 1. Input and target
X = torch.tensor([[7.0, 8.0, 9.0]])
y_true = torch.tensor([[30.0]])

# 2. Forward pass
y_pred = model(X)

# 3. Compute loss
loss = criterion(y_pred, y_true)

# 4. Backward pass
model.zero_grad()
loss.backward()

# 5. Print gradients
print('Grad for layer1:', model.layer1.weight.grad)
print('Grad for layer2:', model.layer2.weight.grad)

import matplotlib.pyplot as plt

# Grab original weight and clone for restoration
original_weight = model.layer2.weight.data.clone()
weights = []
losses = []

for delta in torch.linspace(-2, 2, steps=50):
    # change just the first entry in layer2 weights
    model.layer2.weight.data[0,0] = original_weight[0,0] + delta
    y_pred = model(X)
    loss = criterion(y_pred, y_true)
    weights.append(model.layer2.weight.data[0,0].item())
    losses.append(loss.item())

# Restore original weight
model.layer2.weight.data[0,0] = original_weight[0,0]

plt.plot(weights, losses)
plt.xlabel('layer2.weight[0,0] value')
plt.ylabel('Loss')
plt.title('Effect of changing a single weight on loss')
plt.show()

ASSISTANT:

Great debugging! You’re right—the plot appears nearly flat or even exactly flat.

Let’s break down what’s actually happening in your code, why you see this surprising behavior, and the minimal fix.


1. Diagnosis: What’s Actually Happening and Why a Flat Line?

You have:

  • A 2-layer network (ReLU after layer 1).
  • Single input sample: X = [[7.0, 8.0, 9.0]]
  • Only layer2.weight[0,0] is being modified.

But here’s the key:
layer2.weight[0,0] affects only the output of unit 0 in the hidden layer.
But depending on your network’s random initialization and the specific input, that unit’s activation might be zero after the ReLU—meaning changing this weight has no effect at all on the output!

This is called the “dead ReLU” problem. If the pre-activation to that neuron is negative, its output is zero, and the weight leading from it is multiplied by zero in the output, so it doesn’t matter what you set it to.

You can check this with:

with torch.no_grad():
    h = model.layer1(X)
    print("layer1 output:", h)
    print("ReLU output:", torch.relu(h))

You’ll probably see that the first value in the hidden layer after ReLU is zero.


2. Minimal Fix: Make Sure the First Hidden Unit Is Active

To see an effect, you want to change:

  • The input so that the first hidden neuron’s pre-activation value is positive, OR
  • The initialization to make it likely to be positive, OR
  • Change which hidden unit’s output you connect to output with your sweep, i.e., change [0,0] to [0,k] where k is an active hidden unit.

Try this minimal fix:
After building the model, force its first-layer bias to a large positive value, so ReLU will activate the first hidden neuron:

with torch.no_grad():
    model.layer1.bias[0] = 100.0  # make first neuron always fire positive

# Now rerun the sweep

Or, you can check which hidden units are positive after ReLU and pick accordingly, like:

with torch.no_grad():
    h = model.layer1(X)
    print("pre-activations:", h)
    print("post-ReLU:", torch.relu(h))

3. Minimal Working Example:

Here’s the code with the minimal change:

import torch
from torch import nn
import matplotlib.pyplot as plt

torch.manual_seed(0)

class SimpleNN(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int):
        super().__init__()
        self.layer1 = nn.Linear(input_dim, hidden_dim)
        self.activation = nn.ReLU()
        self.layer2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.layer1(x)
        x = self.activation(x)
        x = self.layer2(x)
        return x

X = torch.tensor([[7.0, 8.0, 9.0]])
y_true = torch.tensor([[30.0]])
model = SimpleNN(input_dim=3, hidden_dim=4, output_dim=1)
criterion = nn.MSELoss()

# Force the first hidden unit to be positive after ReLU:
with torch.no_grad():
    model.layer1.bias[0] = 100.0  # ensure activation > 0

original_weight = model.layer2.weight.data.clone()
weights = []
losses = []

for delta in torch.linspace(-2, 2, steps=50):
    model.layer2.weight.data[0,0] = original_weight[0,0] + delta
    y_pred = model(X)
    loss = criterion(y_pred, y_true)
    weights.append(model.layer2.weight.data[0,0].item())
    losses.append(loss.item())

model.layer2.weight.data[0,0] = original_weight[0,0]
plt.plot(weights, losses)
plt.xlabel('layer2.weight[0,0] value')
plt.ylabel('Loss')
plt.title('Effect of changing a single weight on loss')
plt.show()

Summary

  • The flat line was due to the first hidden neuron’s output being always zero after ReLU (inactive).
  • By ensuring that unit is active (forcing bias > 0), you make the mapped output depend on layer2.weight[0,0].
  • Now, when you sweep this weight, the loss curve will respond as expected.

Let me know if you want further insight or variants on this experiment!

USER:

great, thanks. that helped me debug. i changed the element to 0,2 instead of 0,0 and that has a noticeable effect. i take this to mean that the ReLU activation in the 0,0 position was not active for the input you were using, but the 0,2 position was. is that correct?

ASSISTANT:

Yes, your understanding is exactly right!

Let’s summarize:

  • In your code, layer2.weight[0, k] corresponds to the weight connecting the $k$-th unit in the hidden layer (after ReLU) to the output.
  • If the $k$-th hidden unit’s activation is zero for your chosen input (X), then changing layer2.weight[0, k] will have no effect on the network output for that input, because
    $$\text{output} = \sum_{j} \left( \text{weight}_{j} \cdot \text{hidden}_j \right) + \text{bias}$$
    and if $\text{hidden}_k = 0$, then changing $\text{weight}_k$ does nothing.
  • This is a key characteristic of ReLU: if the pre-activation is negative, the output is zero (a “dead” unit for this input).

So by changing from 0,0 to 0,2, you accidentally (or cleverly!) picked a weight attached to a hidden neuron that was actually active for your input, so adjusting this weight changes the prediction and loss.


In summary:

The noticeable effect when you changed to 0,2 confirms that the third hidden unit (index 2) was active (nonzero output after ReLU), while the first hidden unit (index 0) was inactive (output zero) for your chosen X.


If you want to be systematic, you can always print out the values after the ReLU:

with torch.no_grad():
    hidden = model.activation(model.layer1(X))
    print("Hidden activations after ReLU:", hidden)

Then pick a weight (column in layer2.weight) corresponding to any nonzero value you see.


Let me know if you’d like to dig deeper or try more experiments!
