Learn the Training Loop with PyTorch, Part 3.5: Large-Scale Training: Data Parallelism and Hardware

2025-06-18 · Artintellica

USER:

i need your help to write the next blog post in a series i am producing called “Learn the Training Loop with PyTorch”.

first, let me give you the outline of the series:


Module 1: The Elementary Training Loop

Goal: Build foundational intuition by hand-coding a simple training loop from scratch.

1. Introduction: What is a Training Loop?

  • Definition and importance
  • Loss, optimization, and learning in ML

2. The Simplest Case: Linear Regression

  • What is linear regression?
  • The mean squared error loss
  • Deriving the update rules manually

3. Batch vs. Stochastic Gradient Descent

  • Differences and practical implications
  • Hands-on: See both in Python/numpy

4. Visualizing the Loss Landscape

  • Plotting loss as a function of parameters
  • Demo: Watching convergence by hand

5. Numerical vs. Analytical Gradients

  • How are gradients computed?
  • Manual calculation and confirmation with numerical gradient

6. Recap and Key Takeaways


Module 2: The Training Loop for Neural Networks

Goal: Extend the basic loop to more powerful models; introduce neural network concepts.

1. From Linear to Nonlinear: Why Neural Networks?

  • Multi-layer perceptrons (MLPs)
  • Activation functions (ReLU, sigmoid, tanh)

2. Forward and Backward Passes

  • The chain rule of differentiation
  • Backpropagation intuition (no need for full rigor yet)

3. Implementing a Simple Neural Net from Scratch

  • Single hidden layer network in numpy/Python
  • Coding forward, loss, backward, and parameter updates

4. The Role of Activations

  • Intuitive explanation and code examples
  • Effect on learning and expressiveness

5. Mini-batching and Data Pipelines

  • How batching speeds up and stabilizes learning

6. Regularization and Overfitting

  • Techniques: L2, dropout (conceptual intro)

7. Recap: Comparing Our Simple Network with Linear Regression


Module 3: Advanced Training Loops and Modern Tricks

Goal: Connect foundational understanding to the large-scale models used in industry.

1. Optimization Algorithms Beyond SGD

  • Adam, RMSProp, etc.: How they work, why they’re useful

2. Learning Rate Scheduling

  • How/why learning rates are changed during training
  • Demo: Effect of learning rate on convergence and results

3. Weight Initialization

  • Why it matters, common strategies

4. Deeper Networks and Backprop Challenges

  • Vanishing/exploding gradients
  • Solutions: normalization, skip connections

5. Large-Scale Training: Data Parallelism and Hardware

  • How the same training loop runs efficiently on GPUs and across many machines

6. Monitoring and Debugging the Training Loop

  • Visualizing loss/accuracy, diagnosing common problems

7. Modern Regularization and Generalization Techniques

  • Early stopping, batch/layer normalization, data augmentation (conceptual intro)

8. The Training Loop in Practice: Case Studies

  • High-level overview of how OpenAI/Google/Meta might train LLMs: what’s the same, what’s different.
  • What “scaling up” really means: dataset, parameters, hardware, tricks.

9. Conclusion: What’s Next After the Training Loop?

  • Discussion: fine-tuning, transfer learning, and unsupervised/self-supervised methods.

General Tips for Each Module:

  • Start simple, build incrementally: Keep code simple at first, then show how to layer complexity.
  • Accompany every lesson with practical code examples.
  • Focus on intuition: Use diagrams, plots, and hands-on demo to make concepts tangible.
  • Summarize and review regularly: Add checklists or quizzes at the end of each module.

great. now, for each blog post, i am hoping to follow the following structure:

  • the course is called “Learn the Training Loop with PyTorch”
  • always provide a brief introduction to the topic at hand
  • then always provide a very simple “ELI5” explanation of the topic that theoretically anyone could understand
  • then provide the mathematical foundations. for math, always use dollar delimited LaTeX, and always provide a brief explanation of the math. for instance, $\mathbf{y} = \mathbf{X}\mathbf{w} + \mathbf{b}$ is the equation for linear regression, where $\mathbf{y}$ is the output vector, $\mathbf{X}$ is the input matrix, $\mathbf{w}$ is the weight vector, and $\mathbf{b}$ is the bias vector.
  • for block-level math, always use two dollar signs, like this: $$\mathbf{y} = \mathbf{X}\mathbf{w} + \mathbf{b}$$
  • after the mathematical foundations, such as definitions, then always provide an in-depth explanation, covering the topic in detail, and expanding on the explanation of the math if necessary.
  • for other formatting, use markdown like headings, bullet points, and code blocks to make the content readable. do not use HTML tags.
  • after the math is complete, then provide one or more demonstrations of the topic using PyTorch. if possible, always use PyTorch, not numpy, just to keep it simple and consistent. you can also use matplotlib to visualize the results.
  • for all python code, always use type hints, and, if necessary, describe installation instructions using “uv” (not pip).
  • if you need to use a gpu, note that the user may use apple, or nvidia, or no gpu at all, meaning in pytorch you should support cuda or mps or cpu.
  • fall back from cuda to mps to cpu.
  • then, create a series of exercises, somewhere from 1 - 4 total exercises that cover the material. please provide the source code for each exercise to make sure the student can easily compare their solution to the correct answer. do not hide the exercises and do not put HTML in the exercises.
  • finally, provide a summary of the topic, and a list of key takeaways.

that’s it, now can you please write blog post 3.5?

the description is:

5. Large-Scale Training: Data Parallelism and Hardware

  • How the same training loop runs efficiently on GPUs and across many machines

please respond with the full blog post only, and no other information.

ASSISTANT:

Learn the Training Loop with PyTorch

Module 3, Post 5: Large-Scale Training — Data Parallelism and Hardware

Introduction

So far in our journey through the training loop, everything has happened on a single computer, often just using the CPU (or a single GPU, if available). But as machine learning models and datasets have grown, researchers and engineers have needed to run training faster, across multiple GPUs or even multiple machines. The most common way to do this is data parallelism, and it’s at the core of modern AI training practices.

In this post, you’ll learn:

  • Why and how we use multiple GPUs/CPUs (and even multiple machines!) to train models
  • What “data parallelism” means in practice
  • The hardware involved in large-scale training, from GPUs to clusters
  • How you can leverage simple data parallelism in PyTorch with just a few lines of code

ELI5: Why Train on More Than One Computer (or GPU)?

Imagine you need to teach a class of 1,000 students, but you’re just one teacher—it’s going to take a long time! What if you had 10 teachers, all teaching a separate group of 100 students? The job would get done in a fraction of the time, as each teacher works in parallel and shares updates with the rest.

The same idea applies to training machine learning models. If you have a HUGE dataset, you can split it up and let multiple GPUs (the “teachers”) each process a slice. Then, they combine what they learned, move on to the next slice, and repeat. This approach is called data parallelism.


Mathematical Foundations

Let’s formalize this intuition.

Suppose you have a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ and a model parameterized by $\theta$. The standard mini-batch gradient descent update is:

$$\theta \leftarrow \theta - \eta \cdot \frac{1}{|\mathcal{B}|} \sum_{(x_i, y_i)\in \mathcal{B}} \nabla_\theta \mathcal{L}(f(x_i; \theta), y_i)$$

Here,

  • $\mathcal{B}$ is a mini-batch, a subset of $\mathcal{D}$
  • $f(x_i; \theta)$ is your model’s prediction for input $x_i$
  • $\mathcal{L}$ is the loss function
  • $\eta$ is the learning rate

In data parallelism, we divide each batch among $P$ parallel workers (GPUs or machines). Suppose, for example, we have 4 GPUs:

  • Each GPU gets its own mini-batch: $\mathcal{B}_1,\, \mathcal{B}_2,\, \mathcal{B}_3,\, \mathcal{B}_4$

  • Each computes the gradient on its own slice of the data: $\mathbf{g}_j = \frac{1}{|\mathcal{B}_j|} \sum_{(x_i, y_i)\in \mathcal{B}_j} \nabla_\theta \mathcal{L}(f(x_i; \theta), y_i)$

  • All GPUs synchronize and average their gradients:

    $$\mathbf{g}_{\text{avg}} = \frac{1}{4} \sum_{j=1}^4 \mathbf{g}_j$$

  • The parameter update is performed using this average gradient:

    $$\theta \leftarrow \theta - \eta \cdot \mathbf{g}_{\text{avg}}$$

This averaging ensures the model updates just as if a single giant batch had been processed on one device. This is the essence of synchronous data-parallel SGD.
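
To see this equivalence in code, here is a small sketch (the worker count, model, and batch size are arbitrary choices for illustration): each simulated “worker” computes a gradient on its own shard of a batch, the per-worker gradients are averaged, and the result matches the gradient computed on the full batch up to floating-point error.

import torch

def simulate_data_parallel_step(num_workers: int = 4) -> None:
    torch.manual_seed(0)
    w = torch.randn(5, requires_grad=True)   # a shared "model" parameter
    X = torch.randn(32, 5)                   # one global batch of 32 examples
    y = torch.randn(32)

    # Gradient computed on the full batch (what a single device would do)
    loss_full = ((X @ w - y) ** 2).mean()
    grad_full = torch.autograd.grad(loss_full, w)[0]

    # Each "worker" computes a gradient on its own equally sized shard
    worker_grads = []
    for X_j, y_j in zip(X.chunk(num_workers), y.chunk(num_workers)):
        loss_j = ((X_j @ w - y_j) ** 2).mean()
        worker_grads.append(torch.autograd.grad(loss_j, w)[0])

    # Averaging the per-worker gradients recovers the full-batch gradient
    grad_avg = torch.stack(worker_grads).mean(dim=0)
    print("max difference:", (grad_full - grad_avg).abs().max().item())

if __name__ == "__main__":
    simulate_data_parallel_step()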


In-Depth Explanation: Data Parallelism & Hardware

Why GPUs?

CPU cores are great at handling a few complex tasks. GPUs, however, are made for running thousands of similar calculations at once—perfect for deep learning, where multiplying huge matrices and applying nonlinearities can all be done in parallel.

Large models and datasets push even one GPU to its limits. That’s where multiple GPUs (or even multiple machines, each with its own GPUs) come in.
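
As a rough illustration of this difference (the matrix size and repeat count below are arbitrary), the following sketch times a large matrix multiplication on the CPU and, if one is available, on a CUDA or MPS device. The exact numbers depend entirely on your hardware.

import time
import torch

def _sync(device: torch.device) -> None:
    # Wait for queued GPU work to finish so the timing is honest
    if device.type == "cuda":
        torch.cuda.synchronize()
    elif device.type == "mps" and hasattr(torch, "mps"):
        torch.mps.synchronize()

def time_matmul(device: torch.device, size: int = 2048, repeats: int = 10) -> float:
    A = torch.randn(size, size, device=device)
    B = torch.randn(size, size, device=device)
    _ = A @ B  # warm-up so one-time setup cost is not measured
    _sync(device)
    start = time.time()
    for _ in range(repeats):
        _ = A @ B
    _sync(device)
    return time.time() - start

if __name__ == "__main__":
    devices = [torch.device("cpu")]
    if torch.cuda.is_available():
        devices.append(torch.device("cuda"))
    elif torch.backends.mps.is_available():
        devices.append(torch.device("mps"))
    for device in devices:
        print(f"{device.type}: {time_matmul(device):.3f}s for 10 matmuls")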

Types of Parallelism

  • Data Parallelism:
    • Dataset is split among workers; each computes gradients on different data slices.
    • Gradients are aggregated, then weights are updated synchronously.
  • Model Parallelism (for really huge models):
    • Different layers or parts of the model are run on different devices (a toy sketch of this idea follows after this list).
    • We’ll focus on data parallelism in this post.
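
For contrast, here is a toy sketch of the model-parallel idea (not something we build on in this post): the two layers of a small network live on different devices, and the activation is moved between them during the forward pass. If two CUDA GPUs are not available, both layers fall back to the CPU, so the code still runs but without any real parallelism.

import torch
import torch.nn as nn

def model_parallel_forward() -> torch.Tensor:
    # Place the two halves of the model on different devices if we have them
    if torch.cuda.device_count() >= 2:
        dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
    else:
        dev0 = dev1 = torch.device("cpu")  # fallback: everything on one device

    layer1 = nn.Linear(20, 64).to(dev0)
    layer2 = nn.Linear(64, 1).to(dev1)

    x = torch.randn(8, 20, device=dev0)
    h = torch.relu(layer1(x)).to(dev1)  # move the activation to the second device
    return layer2(h)

if __name__ == "__main__":
    print(model_parallel_forward().shape)  # torch.Size([8, 1])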

Communication & Synchronization

When multiple GPUs work together, they must communicate—usually by sharing gradients after each mini-batch. PyTorch handles this automatically in most settings.

  • Single-machine, multi-GPU: Fast communication using PCIe/NVLink
  • Multi-node (across multiple computers): Communication via high-speed networking (Ethernet, InfiniBand)

PyTorch & Data Parallelism

PyTorch provides two primary ways to train on multiple GPUs:

  1. torch.nn.DataParallel (easiest to use; single process, single machine)
  2. torch.nn.parallel.DistributedDataParallel (DDP) (more scalable; works on one machine or across many, and is what PyTorch recommends for serious multi-GPU training, but requires more setup)

We’ll focus on DataParallel here, since it’s the easiest way to get started; keep in mind that DDP is the standard choice for production-scale training.


PyTorch Demo: Data Parallel Training

Let’s see how to use multiple GPUs (if you have them) to train a simple neural network.

Installation:
Ensure you have PyTorch and matplotlib:

uv pip install torch matplotlib

1. Device Selection

First, detect available hardware:

import torch

def get_best_device() -> torch.device:
    if torch.cuda.is_available():
        print("Using CUDA GPU!")
        return torch.device("cuda")
    elif torch.backends.mps.is_available():
        print("Using Apple M1/M2 GPU (MPS)!")
        return torch.device("mps")
    else:
        print("Using CPU.")
        return torch.device("cpu")

2. Define a Simple Model

import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim)
        )
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

3. Dummy Dataset

from torch.utils.data import Dataset, DataLoader
import torch

class RandomDataset(Dataset):
    def __init__(self, n_samples: int, input_dim: int) -> None:
        self.X = torch.randn(n_samples, input_dim)
        self.y = (self.X.sum(dim=1, keepdim=True) > 0).float()  # Simple binary target

    def __len__(self) -> int:
        return self.X.shape[0]

    def __getitem__(self, idx: int) -> tuple[torch.Tensor, torch.Tensor]:
        return self.X[idx], self.y[idx]

4. Training Loop with Data Parallel

import torch.optim as optim

def train_parallel(
    n_gpus: int = 2,  # Max number of GPUs to parallelize over (capped at what is available)
    epochs: int = 5,
    batch_size: int = 128
) -> None:
    device = get_best_device()
    dataset = RandomDataset(10_000, input_dim=20)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

    model = SimpleNet(input_dim=20, hidden_dim=64, output_dim=1)
    if device.type == "cuda" and torch.cuda.device_count() > 1:
        n_devices = min(n_gpus, torch.cuda.device_count())
        print(f"Using DataParallel on {n_devices} GPUs!")
        model = nn.DataParallel(model, device_ids=list(range(n_devices)))
    model = model.to(device)

    criterion = nn.BCEWithLogitsLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.01)

    for epoch in range(epochs):
        running_loss = 0.0
        for X_batch, y_batch in dataloader:
            X_batch = X_batch.to(device)
            y_batch = y_batch.to(device)
            optimizer.zero_grad()
            outputs = model(X_batch)
            loss = criterion(outputs, y_batch)
            loss.backward()
            optimizer.step()
            running_loss += loss.item() * X_batch.size(0)
        avg_loss = running_loss / len(dataset)
        print(f"Epoch {epoch+1} | Loss: {avg_loss:.4f}")

Run the Training

if __name__ == "__main__":
    train_parallel()

Comparing Training Speed Across Devices

It’s tricky to demo true speedup unless you have multiple GPUs, but you can try timing runs on CPU vs. GPU (if available):

import time

def time_training_on_device(device: torch.device) -> float:
    dataset = RandomDataset(5000, 20)
    dataloader = DataLoader(dataset, batch_size=256)
    model = SimpleNet(20, 64, 1).to(device)
    criterion = nn.BCEWithLogitsLoss()
    optimizer = optim.Adam(model.parameters())
    model.train()
    start = time.time()
    for _ in range(2):  # Just a couple epochs
        for X_batch, y_batch in dataloader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            optimizer.zero_grad()
            loss = criterion(model(X_batch), y_batch)
            loss.backward()
            optimizer.step()
    end = time.time()
    return end - start

if __name__ == "__main__":
    for device_str in ["cpu", "cuda", "mps"]:
        try:
            device = torch.device(device_str)
            if (device_str == "cuda" and not torch.cuda.is_available()) or \
               (device_str == "mps" and not torch.backends.mps.is_available()):
                continue
            time_taken = time_training_on_device(device)
            print(f"{device_str.upper()}: Training took {time_taken:.2f}s")
        except Exception as e:
            print(f"Skipping {device_str}: {e}")

Exercises

Exercise 1: Test Parallel Training on Your Machine

  • Modify the batch size and number of epochs in train_parallel() to see how it affects training speed and convergence.
  • Try running on CPU and (if you have it) GPU. Compare loss vs. time taken.

Reference

Use the train_parallel() and time_training_on_device() functions above.


Exercise 2: Visualize Gradient Synchronization

  • Add a print statement inside the loop, just after loss.backward(), showing the .grad of the first layer’s weights. What do you observe about synchronization when using multiple GPUs (vs. CPU)?

Reference

Insert:

print(model.module.net[0].weight.grad if hasattr(model, "module") else model.net[0].weight.grad)

right after loss.backward().


Exercise 3: Try a Larger Model

  • Modify SimpleNet to add a second hidden layer. Does DataParallel still work? (Hint: It should—PyTorch abstracts the parallelism away for you!)

Reference

class BiggerNet(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim)
        )
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

Replace SimpleNet with BiggerNet in train_parallel().


Exercise 4: Bonus — Explore DistributedDataParallel

  • If you want an extra challenge and have access to multiple machines or wish to simulate multi-node training, follow the official PyTorch DDP tutorial.
  • Try to set up a minimal script that launches 2 workers (--nproc_per_node=2) on a single machine.
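
Reference (a minimal sketch, not the only correct solution)

The script below assumes a hypothetical file name ddp_minimal.py and reuses the SimpleNet and RandomDataset classes defined earlier in this post (copy them into the file). Launch it with torchrun --nproc_per_node=2 ddp_minimal.py. It uses the "gloo" backend so it also works on CPU-only machines; switch to "nccl" if every worker has a CUDA GPU.

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# NOTE: copy the SimpleNet and RandomDataset classes from earlier in this post
# into this file before running it.

def main() -> None:
    # torchrun sets RANK, WORLD_SIZE, and the master address for each worker.
    dist.init_process_group(backend="gloo")  # use "nccl" if every worker has a CUDA GPU
    rank = dist.get_rank()

    dataset = RandomDataset(10_000, input_dim=20)
    sampler = DistributedSampler(dataset)  # gives each worker a distinct shard of the data
    dataloader = DataLoader(dataset, batch_size=128, sampler=sampler)

    model = DDP(SimpleNet(input_dim=20, hidden_dim=64, output_dim=1))
    criterion = nn.BCEWithLogitsLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.01)

    for epoch in range(5):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for X_batch, y_batch in dataloader:
            optimizer.zero_grad()
            loss = criterion(model(X_batch), y_batch)
            loss.backward()  # DDP averages gradients across all workers here
            optimizer.step()
        if rank == 0:
            print(f"Epoch {epoch + 1} done, last loss: {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()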

Summary

In this post, you learned how training loops can be scaled up to handle massive datasets and larger models by splitting work across multiple GPUs or machines. This is data parallelism, where each processor handles a piece of the data and synchronizes updates every step, making training faster and more scalable.

PyTorch’s tools make this easy, letting you use DataParallel on a single machine or DistributedDataParallel for clusters, with minimal code changes.


Key Takeaways

  • Data parallelism splits batches among parallel processors (GPUs, CPUs, or nodes), synchronizing gradients each iteration.
  • Hardware matters: More GPUs means faster model training (up to a point).
  • PyTorch’s DataParallel makes multi-GPU use nearly seamless for single-machine setups.
  • Distributed training (across multiple servers) is standard for the world’s largest models, leveraging sophisticated distributed hardware and networks.
  • The core training loop logic stays almost exactly the same, thanks to the abstractions provided.
  • Always be hardware-aware: not all machines have GPUs (or the same kind), so code should gracefully fall back to CPU if needed.

Congratulations! You now grasp the backbone of large-scale machine learning and how the same training loop can scale from your laptop to gigantic clusters powering today’s AI breakthroughs.
