USER:
i need your help to write the next blog post in a series i am producing called “Learn the Training Loop with PyTorch”.
first, let me give you the outline of the series:
Goal: Build foundational intuition by hand-coding a simple training loop from scratch.
1. Introduction: What is a Training Loop?
2. The Simplest Case: Linear Regression
3. Batch vs. Stochastic Gradient Descent
4. Visualizing the Loss Landscape
5. Numerical vs. Analytical Gradients
6. Recap and Key Takeaways
Goal: Extend the basic loop to more powerful models; introduce neural network concepts.
1. From Linear to Nonlinear: Why Neural Networks?
2. Forward and Backward Passes
3. Implementing a Simple Neural Net from Scratch
4. The Role of Activations
5. Mini-batching and Data Pipelines
6. Regularization and Overfitting
7. Recap: Comparing Our Simple Network with Linear Regression
Goal: Connect foundational understanding to the large-scale models used in industry.
1. Optimization Algorithms Beyond SGD
2. Learning Rate Scheduling
3. Weight Initialization
4. Deeper Networks and Backprop Challenges
5. Large-Scale Training: Data Parallelism and Hardware
6. Monitoring and Debugging the Training Loop
7. Modern Regularization and Generalization Techniques
8. The Training Loop in Practice: Case Studies
9. Conclusion: What’s Next After the Training Loop?
great. now, for each blog post, i am hoping to follow the following structure:
that’s it, now can you please write blog post 3.5?
the description is:
5. Large-Scale Training: Data Parallelism and Hardware
please respond with the full blog post only, and no other information.
ASSISTANT:
So far in our journey through the training loop, everything has happened on a single computer, often just using the CPU (or a single GPU, if available). But as machine learning models and datasets have grown, researchers and engineers have needed to run training faster, across multiple GPUs or even multiple machines. The most common way to do this is data parallelism, and it's at the core of modern AI training practices.
In this post, you'll learn:

- What data parallelism is and why it speeds up training on large datasets
- Why GPUs are so well suited to deep learning, and what happens when one GPU isn't enough
- How to use PyTorch's DataParallel to train across multiple GPUs, with a graceful fallback to a single GPU or the CPU
Imagine you need to teach a class of 1,000 students, but you’re just one teacher—it’s going to take a long time! What if you had 10 teachers, all teaching a separate group of 100 students? The job would get done in a fraction of the time, as each teacher works in parallel and shares updates with the rest.
The same idea applies to training machine learning models. If you have a HUGE dataset, you can split it up and let multiple GPUs (the “teachers”) each process a slice. Then, they combine what they learned, move on to the next slice, and repeat. This approach is called data parallelism.
Let’s formalize this intuition.
Suppose you have a dataset $\{(x_i, y_i)\}_{i=1}^{N}$ and a model parameterized by $\theta$. The standard mini-batch gradient descent update is:

$$
\theta \leftarrow \theta - \eta \cdot \frac{1}{B} \sum_{i \in \mathcal{B}} \nabla_\theta \, \ell\big(f(x_i; \theta),\, y_i\big)
$$

Here,

- $\eta$ is the learning rate,
- $\mathcal{B}$ is a mini-batch of $B$ examples,
- $\ell$ is the per-example loss, and
- $\nabla_\theta$ is the gradient with respect to the parameters.

In data parallelism, we divide each batch among $P$ parallel workers (GPUs or machines). Suppose $P = 4$ GPUs:

1. Each GPU $p$ gets its own mini-batch shard $\mathcal{B}_p$ of size $B/P$.
2. Each computes its own gradient:
   $$
   g_p = \frac{1}{B/P} \sum_{i \in \mathcal{B}_p} \nabla_\theta \, \ell\big(f(x_i; \theta),\, y_i\big)
   $$
3. All GPUs synchronize and average their gradients:
   $$
   \bar{g} = \frac{1}{P} \sum_{p=1}^{P} g_p
   $$
4. The parameter update is performed using this average gradient:
   $$
   \theta \leftarrow \theta - \eta \, \bar{g}
   $$
This averaging ensures the model updates just as if a single giant batch was processed. This is the essence of data-parallel synchronous SGD.
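To make this concrete, here's a small, CPU-only sketch that simulates the four workers on one machine: it splits a single batch into equal shards, computes one gradient per shard, averages them, and checks that the result matches the full-batch gradient. The tiny linear model and random data here are invented purely for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A tiny model and one "global" batch of made-up data (illustration only).
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
X = torch.randn(64, 10)
y = torch.randn(64, 1)

# 1) Full-batch gradient, computed in one go.
model.zero_grad()
loss_fn(model(X), y).backward()
full_grad = model.weight.grad.clone()

# 2) Simulated data parallelism: split the batch into P = 4 equal shards,
#    compute one gradient per "worker", then average the results.
P = 4
shard_grads = []
for X_shard, y_shard in zip(X.chunk(P), y.chunk(P)):
    model.zero_grad()
    loss_fn(model(X_shard), y_shard).backward()
    shard_grads.append(model.weight.grad.clone())
avg_grad = torch.stack(shard_grads).mean(dim=0)

# The averaged per-shard gradient matches the full-batch gradient.
print(torch.allclose(full_grad, avg_grad, atol=1e-5))  # True
```

In real data-parallel training the shards live on different GPUs and the averaging happens over a fast interconnect, but the arithmetic is exactly this.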
CPU cores are great at handling a few complex tasks. GPUs, however, are made for running thousands of similar calculations at once—perfect for deep learning, where multiplying huge matrices and applying nonlinearities can all be done in parallel.
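If you'd like to see this difference for yourself, here's a small, optional benchmark that times one large matrix multiplication on the CPU and then on a GPU, if one is available. The matrix size (4096) is an arbitrary choice for illustration; on most GPUs the second number is dramatically smaller.

```python
import time
import torch

def time_matmul(device: torch.device, size: int = 4096) -> float:
    """Time one large matrix multiplication on the given device."""
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    _ = (a @ b).sum().item()  # Warm-up; also absorbs one-time device setup costs
    start = time.time()
    c = a @ b
    _ = c.sum().item()        # Pulling the result back forces the device to finish
    return time.time() - start

print(f"CPU: {time_matmul(torch.device('cpu')):.3f}s")
if torch.cuda.is_available():
    print(f"CUDA GPU: {time_matmul(torch.device('cuda')):.3f}s")
elif torch.backends.mps.is_available():
    print(f"Apple MPS: {time_matmul(torch.device('mps')):.3f}s")
```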
Large models and datasets push even one GPU to its limits. That’s where multiple GPUs (or even multiple machines, each with its own GPUs) come in.
When multiple GPUs work together, they must communicate—usually by sharing gradients after each mini-batch. PyTorch handles this automatically in most settings.
PyTorch provides two primary ways to train on multiple GPUs:

- torch.nn.DataParallel (easiest, single machine)
- torch.nn.parallel.DistributedDataParallel (DDP) (more scalable, for multi-machine setups, but requires more setup)

We'll focus on DataParallel here, since it's easy to start with.
Let’s see how to use multiple GPUs (if you have them) to train a simple neural network.
Installation: ensure you have PyTorch and matplotlib installed:

uv pip install torch matplotlib
First, detect available hardware:
import torch
def get_best_device() -> torch.device:
if torch.cuda.is_available():
print("Using CUDA GPU!")
return torch.device("cuda")
elif torch.backends.mps.is_available():
print("Using Apple M1/M2 GPU (MPS)!")
return torch.device("mps")
else:
print("Using CPU.")
return torch.device("cpu")
import torch.nn as nn
class SimpleNet(nn.Module):
def __init__(self, input_dim: int, hidden_dim: int, output_dim: int) -> None:
super().__init__()
self.net = nn.Sequential(
nn.Linear(input_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, output_dim)
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.net(x)
from torch.utils.data import Dataset, DataLoader
import torch
class RandomDataset(Dataset):
def __init__(self, n_samples: int, input_dim: int) -> None:
self.X = torch.randn(n_samples, input_dim)
self.y = (self.X.sum(dim=1, keepdim=True) > 0).float() # Simple binary target
def __len__(self) -> int:
return self.X.shape[0]
def __getitem__(self, idx: int) -> tuple[torch.Tensor, torch.Tensor]:
return self.X[idx], self.y[idx]
import torch.optim as optim
def train_parallel(
    n_gpus: int = 2,  # Max number of GPUs to parallelize over (capped at what's actually available)
epochs: int = 5,
batch_size: int = 128
) -> None:
device = get_best_device()
dataset = RandomDataset(10_000, input_dim=20)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
model = SimpleNet(input_dim=20, hidden_dim=64, output_dim=1)
    if torch.cuda.device_count() > 1 and device.type == "cuda":
        n_devices = min(n_gpus, torch.cuda.device_count())
        print(f"Using DataParallel on {n_devices} GPUs!")
        model = nn.DataParallel(model, device_ids=list(range(n_devices)))
model = model.to(device)
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)
for epoch in range(epochs):
running_loss = 0.0
for X_batch, y_batch in dataloader:
X_batch = X_batch.to(device)
y_batch = y_batch.to(device)
optimizer.zero_grad()
outputs = model(X_batch)
loss = criterion(outputs, y_batch)
loss.backward()
optimizer.step()
running_loss += loss.item() * X_batch.size(0)
avg_loss = running_loss / len(dataset)
print(f"Epoch {epoch+1} | Loss: {avg_loss:.4f}")
if __name__ == "__main__":
train_parallel()
- On machines with two or more CUDA GPUs, this will use all of them (up to n_gpus). Otherwise, it falls back gracefully to a single GPU, Apple's MPS, or the CPU.
- For a serious multi-node/multi-machine job, see the PyTorch DistributedDataParallel documentation; a rough sketch of the DDP pattern follows below.
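For the curious, here is a rough sketch of what a DDP version of our training loop looks like. It is not a drop-in script, just the general pattern: process group setup, a DistributedSampler so each process sees a different shard of the data, and the DDP wrapper. It assumes NVIDIA GPUs and that the script is launched with torchrun (which sets environment variables such as LOCAL_RANK for each process), and it reuses the SimpleNet and RandomDataset classes defined above.

```python
# Sketch only: the general shape of a DistributedDataParallel (DDP) script.
# Launch with, e.g.:  torchrun --nproc_per_node=2 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def main() -> None:
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")  # nccl assumes NVIDIA GPUs
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)

    dataset = RandomDataset(10_000, input_dim=20)   # defined earlier in this post
    sampler = DistributedSampler(dataset)           # each process gets a different shard
    dataloader = DataLoader(dataset, batch_size=128, sampler=sampler)

    model = SimpleNet(input_dim=20, hidden_dim=64, output_dim=1).to(device)
    model = DDP(model, device_ids=[local_rank])     # gradients averaged across processes

    criterion = torch.nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

    for epoch in range(5):
        sampler.set_epoch(epoch)                    # reshuffle differently each epoch
        for X_batch, y_batch in dataloader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            optimizer.zero_grad()
            loss = criterion(model(X_batch), y_batch)
            loss.backward()                         # DDP syncs gradients here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Every process runs this same script; DDP averages gradients across processes during loss.backward(), which is exactly the synchronous gradient averaging we formalized earlier.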
It’s tricky to demo true speedup unless you have multiple GPUs, but you can try timing runs on CPU vs. GPU (if available):
import time
def time_training_on_device(device: torch.device) -> float:
dataset = RandomDataset(5000, 20)
dataloader = DataLoader(dataset, batch_size=256)
model = SimpleNet(20, 64, 1).to(device)
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters())
model.train()
start = time.time()
for _ in range(2): # Just a couple epochs
for X_batch, y_batch in dataloader:
X_batch, y_batch = X_batch.to(device), y_batch.to(device)
optimizer.zero_grad()
loss = criterion(model(X_batch), y_batch)
loss.backward()
optimizer.step()
end = time.time()
return end - start
if __name__ == "__main__":
for device_str in ["cpu", "cuda", "mps"]:
try:
device = torch.device(device_str)
if (device_str == "cuda" and not torch.cuda.is_available()) or \
(device_str == "mps" and not torch.backends.mps.is_available()):
continue
time_taken = time_training_on_device(device)
print(f"{device_str.upper()}: Training took {time_taken:.2f}s")
except Exception as e:
print(f"Skipping {device_str}: {e}")
To make these ideas stick, try a few experiments:

1. Experiment with the batch size in train_parallel() to see how it affects training speed and convergence. Use the train_parallel() and time_training_on_device() functions above.
2. Print the gradients right after loss.backward(), showing the .grad of the first layer's weights. What do you observe about synchronization when using multiple GPUs (vs. CPU)? Insert
print(model.module.net[0].weight.grad if hasattr(model, "module") else model.net[0].weight.grad)
right after loss.backward().
3. Modify SimpleNet to add a second hidden layer. Does DataParallel still work? (Hint: it should; PyTorch abstracts the parallelism away for you!)

class BiggerNet(nn.Module):
def __init__(self, input_dim: int, hidden_dim: int, output_dim: int) -> None:
super().__init__()
self.net = nn.Sequential(
nn.Linear(input_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, output_dim)
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.net(x)
Replace SimpleNet with BiggerNet in train_parallel().
4. Explore DistributedDataParallel (DDP): try launching a training script with multiple processes (e.g., torchrun --nproc_per_node=2) on a single machine.

In this post, you learned how training loops can be scaled up to handle massive datasets and larger models by splitting work across multiple GPUs or machines. This is data parallelism, where each processor handles a piece of the data and synchronizes updates every step, making training both faster and more scalable.
PyTorch’s tools make this easy, letting you use DataParallel on a single machine or DistributedDataParallel for clusters, with minimal code changes.
Congratulations! You now grasp the backbone of large-scale machine learning and how the same training loop can scale from your laptop to gigantic clusters powering today’s AI breakthroughs.
USER: