Learn the Training Loop with PyTorch, Part 3.3: Weight Initialization

2025-06-18 · Artintellica

USER:

i need your help to write the next blog post in a series i am producing called “Learn the Training Loop with PyTorch”.

first, let me give you the outline of the series:


Module 1: The Elementary Training Loop

Goal: Build foundational intuition by hand-coding a simple training loop from scratch.

1. Introduction: What is a Training Loop?

  • Definition and importance
  • Loss, optimization, and learning in ML

2. The Simplest Case: Linear Regression

  • What is linear regression?
  • The mean squared error loss
  • Deriving the update rules manually

3. Batch vs. Stochastic Gradient Descent

  • Differences and practical implications
  • Hands-on: See both in Python/numpy

4. Visualizing the Loss Landscape

  • Plotting loss as a function of parameters
  • Demo: Watching convergence by hand

5. Numerical vs. Analytical Gradients

  • How are gradients computed?
  • Manual calculation and confirmation with numerical gradient

6. Recap and Key Takeaways


Module 2: The Training Loop for Neural Networks

Goal: Extend the basic loop to more powerful models; introduce neural network concepts.

1. From Linear to Nonlinear: Why Neural Networks?

  • Multi-layer perceptrons (MLPs)
  • Activation functions (ReLU, sigmoid, tanh)

2. Forward and Backward Passes

  • The chain rule of differentiation
  • Backpropagation intuition (no need for full rigor yet)

3. Implementing a Simple Neural Net from Scratch

  • Single hidden layer network in numpy/Python
  • Coding forward, loss, backward, and parameter updates

4. The Role of Activations

  • Intuitive explanation and code examples
  • Effect on learning and expressiveness

5. Mini-batching and Data Pipelines

  • How batching speeds up and stabilizes learning

6. Regularization and Overfitting

  • Techniques: L2, dropout (conceptual intro)

7. Recap: Comparing Our Simple Network with Linear Regression


Module 3: Advanced Training Loops and Modern Tricks

Goal: Connect foundational understanding to the large-scale models used in industry.

1. Optimization Algorithms Beyond SGD

  • Adam, RMSProp, etc.: How they work, why they’re useful

2. Learning Rate Scheduling

  • How/why learning rates are changed during training
  • Demo: Effect of learning rate on convergence and results

3. Weight Initialization

  • Why it matters, common strategies

4. Deeper Networks and Backprop Challenges

  • Vanishing/exploding gradients
  • Solutions: normalization, skip connections

5. Large-Scale Training: Data Parallelism and Hardware

  • How the same training loop runs efficiently on GPUs and across many machines

6. Monitoring and Debugging the Training Loop

  • Visualizing loss/accuracy, diagnosing common problems

7. Modern Regularization and Generalization Techniques

  • Early stopping, batch/layer normalization, data augmentation (conceptual intro)

8. The Training Loop in Practice: Case Studies

  • High-level overview of how OpenAI/Google/Meta might train LLMs: what’s the same, what’s different.
  • What “scaling up” really means: dataset, parameters, hardware, tricks.

9. Conclusion: What’s Next After the Training Loop?

  • Discussion: fine-tuning, transfer learning, and unsupervised/self-supervised methods.

General Tips for Each Module:

  • Start simple, build incrementally: Keep code simple at first, then show how to layer complexity.
  • Accompany every lesson with practical code examples.
  • Focus on intuition: Use diagrams, plots, and hands-on demo to make concepts tangible.
  • Summarize and review regularly: Add checklists or quizzes at the end of each module.

great. now, for each blog post, i am hoping to follow the following structure:

  • the course is called “Learn the Training Loop with PyTorch”
  • always provide a brief introduction to the topic at hand
  • then always provide a very simple “ELI5” explanation of the topic, that theoretically anyone could understand
  • then provide the mathematical foundations. for math, always use dollar delimited LaTeX, and always provide a brief explanation of the math. for instance, $\mathbf{y} = \mathbf{X}\mathbf{w} + \mathbf{b}$ is the equation for linear regression, where $\mathbf{y}$ is the output vector, $\mathbf{X}$ is the input matrix, $\mathbf{w}$ is the weight vector, and $\mathbf{b}$ is the bias vector.
  • for block-level math, always use two dollar signs, like this: $$\mathbf{y} = \mathbf{X}\mathbf{w} + \mathbf{b}$$
  • after the mathematical foundations, such as definitions, then always provide an in-depth explanation, covering the topic in detail, and expanding on the explanation of the math if necessary.
  • for other formatting, use markdown like headings, bullet points, and code blocks to make the content readable. do not use HTML tags.
  • after the math is complete, then provide one or more demonstrations of the topic using PyTorch. if possible, always use PyTorch, not numpy, just to keep it simple and consistent. you can also use matplotlib to visualize the results.
  • for all python code, always use type hints, and, if necessary, describe installation instructions using “uv” (not pip).
  • then, create a series of exercises, somewhere from 1 - 4 total exercises that cover the material. please provide the source code for each exercise to make sure the student can easily compare their solution to the correct answer. do not hide the exercises and do not put HTML in the exercises.
  • finally, provide a summary of the topic, and a list of key takeaways.

that’s it, now can you please write blog post 3.3?

the description is:

3. Weight Initialization

  • Why it matters, common strategies

please respond with the full blog post only, and no other information.

ASSISTANT:

Learn the Training Loop with PyTorch

Module 3.3: Weight Initialization

Why It Matters, and Common Strategies


Welcome back to the “Learn the Training Loop with PyTorch” series! In today’s post, we’ll dive into a deceptively simple—yet incredibly important—aspect of neural network training: weight initialization.

Properly initializing the weights of your neural network can mean the difference between smooth, effective learning and a model that gets “stuck,” never learning at all. We’ll explore the why, the how, and the hands-on PyTorch code to get you started.


ELI5: Why Does Weight Initialization Matter?

Imagine you are trying to find your way through a dark maze. If you start very close to the exit, it’s easy. If you start somewhere terrible—deep in a confusing section—you might wander aimlessly, never getting out.

Weights in a neural network are like the starting points in this maze. If you choose bad starting weights, your network (guided by gradient descent) could get lost, learn very slowly, or maybe never find the right solution.

Good weight initialization gives your learning process the best possible head start—so your model can learn fast and well, and avoid big problems like “vanishing” (or “exploding”) gradients.


Mathematical Foundations

Let’s formalize what’s happening.

Suppose you have a neural network layer with weights $\mathbf{W}$, biases $\mathbf{b}$, taking inputs $\mathbf{x}$:

$$
\mathbf{y} = f(\mathbf{W}\mathbf{x} + \mathbf{b})
$$

where:

  • $\mathbf{x}$ is the input vector,
  • $\mathbf{W}$ is the weight matrix,
  • $\mathbf{b}$ is the bias vector,
  • $f$ is the activation function (like ReLU or sigmoid).

Problem: Bad Initialization

If the elements of $\mathbf{W}$ are too large, the outputs of each layer can blow up from one layer to the next (“exploding” signals).
If the elements are too small, every output shrinks toward zero (“vanishing” signals).

The gradients of the loss with respect to the weights involve products over many layers’ weights. If those factors are consistently larger or smaller than one, the training signal explodes or dies out as it propagates across layers; these are the classic exploding and vanishing gradient problems.
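
Here is a minimal sketch (the layer count and weight scales are just illustrative) of how the scale of random weights makes signals explode or vanish as they pass through many layers:

import torch

torch.manual_seed(0)
dim: int = 128
x_small = torch.randn(dim)
x_large = x_small.clone()

for layer_idx in range(10):
    W_small = torch.randn(dim, dim) * 0.01  # weights too small
    W_large = torch.randn(dim, dim) * 1.0   # weights too large
    x_small = torch.relu(W_small @ x_small)
    x_large = torch.relu(W_large @ x_large)
    print(f"layer {layer_idx + 1}: "
          f"small-weights norm = {x_small.norm().item():.2e}, "
          f"large-weights norm = {x_large.norm().item():.2e}")

Running this, the small-weight signal shrinks toward zero while the large-weight signal grows rapidly, which is exactly the behavior good initialization schemes are designed to avoid.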


Common Initialization Strategies

Researchers developed initialization recipes to give each neuron the right amount of starting signal by controlling the variance of weights.

1. Zero Initialization (don’t do this!)

Setting all weights to $0$:

$$
\mathbf{W} = \mathbf{0}
$$

This makes every neuron in a layer compute the same output and receive the same gradient update, so the neurons never differentiate; the network fails to “break symmetry” and cannot learn diverse features.

2. Random Initialization (Normal/Uniform)

Weights are drawn at random, but the scale of the distribution has to be chosen carefully.

  • Standard normal: $W_{ij} \sim \mathcal{N}(0, 1)$
  • Uniform: $W_{ij} \sim \mathcal{U}(-a, a)$ for some $a$

But, as networks get deeper, this can still lead to vanishing or exploding gradients.
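
In PyTorch, these plain random schemes look like the following minimal sketch (the layer sizes and the uniform bound are just illustrative):

import torch.nn as nn

layer = nn.Linear(128, 128)

# Standard normal initialization of the weights:
nn.init.normal_(layer.weight, mean=0.0, std=1.0)

# Or a uniform initialization in [-a, a], here with a = 0.05:
nn.init.uniform_(layer.weight, a=-0.05, b=0.05)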

3. Xavier/Glorot Initialization

Designed for sigmoid/tanh activations.

Weights are sampled as:

$$
W_{ij} \sim \mathcal{U}\left( -\frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}, \ \frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}} \right)
$$

where $n_{in}$ and $n_{out}$ are the number of input and output units.
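
As a quick sanity check (a small sketch with illustrative layer sizes), the bound used by PyTorch’s nn.init.xavier_uniform_ should match $\sqrt{6 / (n_{in} + n_{out})}$:

import math
import torch
import torch.nn as nn

n_in, n_out = 256, 128
bound: float = math.sqrt(6.0 / (n_in + n_out))

w = torch.empty(n_out, n_in)  # PyTorch stores Linear weights as (out_features, in_features)
nn.init.xavier_uniform_(w)
print(f"theoretical bound:  {bound:.4f}")
print(f"max |w| after init: {w.abs().max().item():.4f}")  # should be just below the bound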

4. Kaiming/He Initialization

Designed for ReLU activations.

Sampled as:

$$
W_{ij} \sim \mathcal{N}\left(0, \ \frac{2}{n_{in}} \right)
$$

This keeps the variance of the activations approximately constant from layer to layer when ReLU is used.
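
Again as a small sanity-check sketch (layer sizes are illustrative), nn.init.kaiming_normal_ with the ReLU gain should produce weights whose empirical standard deviation is close to $\sqrt{2 / n_{in}}$:

import math
import torch
import torch.nn as nn

n_in, n_out = 256, 128
w = torch.empty(n_out, n_in)
nn.init.kaiming_normal_(w, nonlinearity="relu")  # uses fan_in (= n_in) by default
print(f"theoretical std: {math.sqrt(2.0 / n_in):.4f}")
print(f"empirical std:   {w.std().item():.4f}")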


In-Depth: How Initialization Influences Training

If you initialize all your weights the same (say, all zeros), symmetry means every neuron receives the same updates—they can’t learn different features!

If weights are too large or too small, then during the forward and backward passes (because of the repeated multiplications in the chain rule) the signals explode or vanish exponentially with the number of layers:

  • If $|w| > 1$ on average, activations and gradients grow from layer to layer
  • If $|w| < 1$ on average, activations and gradients shrink from layer to layer

The goal is to balance the signal so that every layer receives gradients that are neither too large nor too small. Xavier and He initializations do this based on mathematical analysis of how variance propagates through layers, tailored to specific activation functions.


Demonstration: Initializing Weights in PyTorch

Let’s see initialization in action, and visualize what can go wrong — and how to make it go right!

If you don’t have PyTorch or matplotlib, install with:

uv pip install torch matplotlib

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from typing import Callable, Type

def plot_activations_hist(
    layer_cls: Type[nn.Module],
    init_fn: Callable[[torch.Tensor], None],
    input_shape: tuple[int, int],
    n_layers: int,
    N: int = 1000
) -> None:
    """Visualize distribution of activations across layers after different initializations."""
    torch.manual_seed(42)
    layers = []
    for _ in range(n_layers):
        layer = layer_cls(*input_shape)
        init_fn(layer.weight)
        nn.init.zeros_(layer.bias)
        layers.append(layer)
    x = torch.randn(N, input_shape[0])
    activations = [x]
    for layer in layers:
        x = torch.relu(layer(x))
        activations.append(x)
    plt.figure(figsize=(12, 4))
    for i, a in enumerate(activations[1:], 1):
        plt.subplot(1, n_layers, i)
        plt.hist(a.detach().numpy().ravel(), bins=40, color='skyblue')
        plt.title(f'Layer {i}')
        plt.xlabel('Activation')
        plt.ylabel('Count')
    plt.tight_layout()
    plt.show()

# 1. Poor initialization (Normal with large std)
print("Normal initialization (std=1):")
plot_activations_hist(
    layer_cls=nn.Linear,
    init_fn=lambda w: nn.init.normal_(w, std=1.0),
    input_shape=(128, 128),
    n_layers=4
)

# 2. Xavier (Glorot) initialization
print("Xavier/Glorot initialization:")
plot_activations_hist(
    layer_cls=nn.Linear,
    init_fn=lambda w: nn.init.xavier_uniform_(w),
    input_shape=(128, 128),
    n_layers=4
)

# 3. Kaiming (He) initialization for ReLU
print("Kaiming/He initialization (ReLU):")
plot_activations_hist(
    layer_cls=nn.Linear,
    init_fn=lambda w: nn.init.kaiming_normal_(w, nonlinearity="relu"),
    input_shape=(128, 128),
    n_layers=4
)

What to look for:

  • Normal, std=1: Values explode, later layers give mostly very high/low values.
  • Xavier/Glorot: Values remain moderate, distribution stays spread but not extreme.
  • Kaiming/He: Best for ReLU—keeps variance steady across layers.
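
To double-check the pictures numerically, here is a small complementary sketch (the helper name layer_stds and the sizes are illustrative) that prints the standard deviation of the activations at each layer for the same three schemes:

import torch
import torch.nn as nn
from typing import Callable

def layer_stds(init_fn: Callable[[torch.Tensor], None], n_layers: int = 4) -> list[float]:
    """Return the std of ReLU activations after each of n_layers Linear layers."""
    torch.manual_seed(0)
    x = torch.randn(1000, 128)
    stds: list[float] = []
    for _ in range(n_layers):
        layer = nn.Linear(128, 128)
        init_fn(layer.weight)
        nn.init.zeros_(layer.bias)
        x = torch.relu(layer(x))
        stds.append(x.std().item())
    return stds

print("normal std=1:", layer_stds(lambda w: nn.init.normal_(w, std=1.0)))
print("xavier:      ", layer_stds(lambda w: nn.init.xavier_uniform_(w)))
print("kaiming:     ", layer_stds(lambda w: nn.init.kaiming_normal_(w, nonlinearity="relu")))

You should see the first column of numbers blow up, the Xavier column slowly shrink (it is not tuned for ReLU), and the Kaiming column stay roughly stable.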

How To Set Initialization in PyTorch

By default, PyTorch’s layers ship with sensible initializations (for example, nn.Linear uses a scaled Kaiming-style uniform scheme for its weights), but you can set your own:

import torch
import torch.nn as nn

class MyMLP(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.fc1 = nn.Linear(128, 128)
        self.fc2 = nn.Linear(128, 64)

        nn.init.kaiming_uniform_(self.fc1.weight, nonlinearity='relu')
        nn.init.xavier_uniform_(self.fc2.weight)
        # Note: nn.Linear biases are not zero by default (they use a small uniform
        # distribution), so zero them explicitly if that is what you want:
        nn.init.zeros_(self.fc1.bias)
        nn.init.zeros_(self.fc2.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return x
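
Another common pattern (shown here as a minimal sketch; the helper name init_weights is illustrative) is to write a single initialization function and apply it recursively to every submodule with nn.Module.apply:

def init_weights(module: nn.Module) -> None:
    """Apply He (Kaiming) initialization to every Linear layer in a model."""
    if isinstance(module, nn.Linear):
        nn.init.kaiming_uniform_(module.weight, nonlinearity="relu")
        nn.init.zeros_(module.bias)

model = MyMLP()
model.apply(init_weights)  # visits every submodule, including fc1 and fc2

This keeps the initialization logic in one place, which is handy when a network has many layers.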

Exercises

1. Visualize the Problem with Zero Initialization

Modify the plotting code to use zero weights. What happens?

plot_activations_hist(
    layer_cls=nn.Linear,
    init_fn=lambda w: nn.init.constant_(w, 0),
    input_shape=(128, 128),
    n_layers=4
)

Expected outcome: With both weights and biases at zero, every activation collapses to exactly zero, so all neurons in a layer behave identically and the network cannot learn anything useful.


2. Implement a Custom Initialization

Write a function that initializes every weight in a layer by sampling from $\mathcal{U}(-0.1, 0.1)$, and plot the activations.

def my_uniform_init(w: torch.Tensor) -> None:
    nn.init.uniform_(w, a=-0.1, b=0.1)

plot_activations_hist(
    layer_cls=nn.Linear,
    init_fn=my_uniform_init,
    input_shape=(128, 128),
    n_layers=4
)

3. Experiment: Change Number of Layers

Use Xavier initialization, but increase the number of layers to 8. Does the distribution of activations stay reasonable as depth increases?

plot_activations_hist(
    layer_cls=nn.Linear,
    init_fn=lambda w: nn.init.xavier_uniform_(w),
    input_shape=(128, 128),
    n_layers=8
)

4. Build a Network and Set All Initializations Manually

Construct a 3-layer network (using nn.Linear) where you use Kaiming/He for layers 1 and 2 (with ReLU nonlinearity) and Xavier for the last layer (no nonlinearity). Confirm the architecture with print(model), and check the initializations by inspecting weight statistics such as model.fc1.weight.std().

class ManualInitNet(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.fc1 = nn.Linear(64, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)

        nn.init.kaiming_uniform_(self.fc1.weight, nonlinearity='relu')
        nn.init.kaiming_uniform_(self.fc2.weight, nonlinearity='relu')
        nn.init.xavier_uniform_(self.fc3.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

model = ManualInitNet()
print(model)

Summary

Weight initialization may look trivial, but it’s foundational for neural network success. Correct initialization ensures that learning signals don’t vanish or explode as they travel through deep models, setting your training process up for success. PyTorch makes it easy to apply best-practice initialization methods with its built-in utilities.

Key Takeaways

  • Good initialization matters: Bad initial weights can cripple learning.
  • Common strategies: Xavier/Glorot for sigmoid/tanh, Kaiming/He for ReLU, tailored to maintain stable variance.
  • PyTorch makes it easy: Use nn.init.xavier_uniform_, nn.init.kaiming_normal_, etc., directly on layer weights.
  • Don’t initialize everything to zero: Always break symmetry.
  • Understanding initialization helps debug tricky training issues (vanishing/exploding gradients).

Next Up: We’ll explore deeper networks and why even with good initialization it’s still tricky to train very deep models—leading us to tricks like normalization and skip connections!

USER:

in your first section, you use the equation $y = f(Wx + b)$, and suggest that $W$ is a matrix. but at other times, $w$ is a vector and the matrix is $X$, so the equation is $y = f(Xw + b)$. under what circumstances is $W$ a matrix and when is it a vector? Likewise, under what circumstances is $X$ a matrix and when is it a vector? Also, considering a neural network may have many layers, and the output of each layer is a vector, and the output of that layer is $x$ in subsequent layers, doesn’t this imply $x$ is usually a vector, not a matrix, at least internally to most neural networks?

ASSISTANT:

Great question! This is a key point of confusion, especially when switching contexts between classical linear regression and neural networks. Let’s break it down carefully.


1. Vectors vs Matrices: What’s What?

Linear Regression/Batched Data

  • $\mathbf{X}$ (Matrix):
    Usually represents a batch of input data, where each row is a data sample, and each column is a feature.
    Shape: $(\text{batch\_size}, \text{num\_features})$

  • $\mathbf{w}$ (Vector):
    The weights for a single output (e.g., in scalar linear regression, just one output).
    Shape: $(\text{num\_features},)$

  • Equation:

    $$\mathbf{y} = \mathbf{X}\mathbf{w} + b$$

    Here, $\mathbf{X}$ is a matrix, $\mathbf{w}$ is a vector, $b$ is a scalar, and $\mathbf{y}$ is a vector of predicted values (one per input in the batch).


Neural Networks/Fully-Connected (Dense) Layers

  • $\mathbf{x}$ (Vector):
    The input to a layer (sometimes a minibatch of vectors, but let’s focus on one input for clarity).
    Shape: $(\text{input\_dim},)$

  • $\mathbf{W}$ (Matrix):
    The weight matrix for one layer, mapping an input vector to an output vector (possibly with different dimensions).
    Shape: $(\text{output\_dim}, \text{input\_dim})$

  • $\mathbf{b}$ (Vector):
    Bias for each output unit.
    Shape: $(\text{output\_dim},)$

  • Equation:

    $$\mathbf{y} = f(\mathbf{W}\mathbf{x} + \mathbf{b})$$

    $\mathbf{W}$ is a matrix transforming input to output; $\mathbf{x}$ and $\mathbf{y}$ are (column) vectors.

    If working with a batch, then you’d use $\mathbf{X}$ as a matrix of multiple input vectors (shape: $(\text{batch}, \text{input\_dim})$), and the operation becomes matrix multiplication over the batch: $\mathbf{Y} = f(\mathbf{X} \mathbf{W}^\top + \mathbf{b})$, taking care with dimensions.


2. So, When is $W$ a Matrix, and When a Vector?

  • $w$ is a vector:

    • In simple linear regression (single output), $\mathbf{w}$ is a vector mapping features to a single output.
    • In notation: $\mathbf{y} = \mathbf{X}\mathbf{w} + b$ ($\mathbf{w}$: shape $(\text{num\_features},)$, $\mathbf{X}$: shape $(\text{batch}, \text{num\_features})$).
  • $W$ is a matrix:

    • In multi-output regression or neural networks (layers with multiple outputs), $\mathbf{W}$ is a matrix mapping from input dim to output dim.
    • In notation: $\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}$ ($\mathbf{W}$: shape $(\text{output\_dim}, \text{input\_dim})$, $\mathbf{x}$: shape $(\text{input\_dim},)$).

3. When is $X$ a Matrix, and When a Vector?

  • $X$ is a matrix:

    • Whenever representing multiple data samples at once (e.g., during batch processing): shape $(\text{batch}, \text{num\_features})$.
  • $x$ is a vector:

    • When representing a single data sample (input to a layer), or the output of a neural network layer (for one sample).
4. Inside a Neural Network: $x$ is (Usually) a Vector

  • The output of each layer is typically a vector (or, for a batch, a matrix where each row is a vector for a sample).
  • This vector then becomes the input to the next layer.
  • So, inside a neural network, $x$ and $y$ are usually vectors, and $W$ is a matrix.

Illustration (Single Forward Pass):

x: input vector (dim = input_dim)
W1: weight matrix (shape: hidden_dim x input_dim)
b1: bias (shape: hidden_dim)
=> h = relu(W1 x + b1)           # h is now (hidden_dim,)
W2: (output_dim x hidden_dim)
b2: (output_dim,)
=> y = W2 h + b2

5. Batch Processing in Neural Networks

When processing batches, the same math applies, but now $x$ and $y$ become matrices (one row per input):

  • $\mathbf{X}$: shape $(\text{batch}, \text{input\_dim})$
  • $\mathbf{W}$: shape $(\text{output\_dim}, \text{input\_dim})$

The forward pass becomes:

$$
\mathbf{Y} = f(\mathbf{X} \mathbf{W}^\top + \mathbf{b})
$$
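
To make the shapes concrete, here is a minimal PyTorch sketch (the layer sizes are just illustrative):

import torch
import torch.nn as nn

layer = nn.Linear(in_features=8, out_features=3)  # weight: (3, 8), bias: (3,)

x_single = torch.randn(8)      # one sample: vector of shape (8,)
y_single = layer(x_single)
print(y_single.shape)          # torch.Size([3])

X_batch = torch.randn(32, 8)   # batch of 32 samples: matrix of shape (32, 8)
Y_batch = layer(X_batch)
print(Y_batch.shape)           # torch.Size([32, 3])

print(layer.weight.shape)      # torch.Size([3, 8]) -> W has shape (output_dim, input_dim)
print(layer.bias.shape)        # torch.Size([3])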

Summary Table

| Context                    | $X$ or $x$ | $W$ or $w$ | $y$ or $Y$ |
| -------------------------- | ---------- | ---------- | ---------- |
| Linear regression (single) | matrix $X$ | vector $w$ | vector $y$ |
| Single neural net layer    | vector $x$ | matrix $W$ | vector $y$ |
| Neural net, batch          | matrix $X$ | matrix $W$ | matrix $Y$ |

Key Takeaways

  • In linear regression: $X$ (matrix, data), $w$ (vector, parameters), $y$ (vector, output).
  • In a neural net layer: $x$ (vector, input), $W$ (matrix, parameters), $y$ (vector, output).
  • During batch processing, $x$ and $y$ become matrices ($X$, $Y$), but inside a network, activations are typically vectors (per sample).
  • $W$ is almost always a matrix inside a neural network layer (unless the layer has a single output).

If you want to see this play out in code, let me know and I can show PyTorch tensor shapes in a real NN!
