USER:
i need your help to write the next blog post in a series i am producing called “Learn the Training Loop with PyTorch”.
first, let me give you the outline of the series:
Module 1:
Goal: Build foundational intuition by hand-coding a simple training loop from scratch.
1. Introduction: What is a Training Loop?
2. The Simplest Case: Linear Regression
3. Batch vs. Stochastic Gradient Descent
4. Visualizing the Loss Landscape
5. Numerical vs. Analytical Gradients
6. Recap and Key Takeaways
Module 2:
Goal: Extend the basic loop to more powerful models; introduce neural network concepts.
1. From Linear to Nonlinear: Why Neural Networks?
2. Forward and Backward Passes
3. Implementing a Simple Neural Net from Scratch
4. The Role of Activations
5. Mini-batching and Data Pipelines
6. Regularization and Overfitting
7. Recap: Comparing Our Simple Network with Linear Regression
Module 3:
Goal: Connect foundational understanding to the large-scale models used in industry.
1. Optimization Algorithms Beyond SGD
2. Learning Rate Scheduling
3. Weight Initialization
4. Deeper Networks and Backprop Challenges
5. Large-Scale Training: Data Parallelism and Hardware
6. Monitoring and Debugging the Training Loop
7. Modern Regularization and Generalization Techniques
8. The Training Loop in Practice: Case Studies
9. Conclusion: What’s Next After the Training Loop?
great. now, for each blog post, i am hoping to follow the following structure:
that’s it, now can you please write blog post 2.1?
the description is:
1. From Linear to Nonlinear: Why Neural Networks?
please respond with the full blog post only, and no other information.
ASSISTANT:
Welcome back to our series, “Learn the Training Loop with PyTorch”! In Module 1, we learned how simple models like linear regression learn from data. Now we’re taking the next leap: from linear models to the world of neural networks. But why do we need neural networks? What makes them special? In this post, we’ll explore the motivation, introduce the basic architecture (multi-layer perceptron, or MLP), and demystify one of the key reasons neural networks work: activation functions.
Imagine you have a toy with wheels. You can push it in a straight line (that’s what a linear model does—it finds straight-line patterns in data). But what if you want your toy to turn corners, loop around, or follow a complex path? That’s where neural networks come in!
Neural networks are like supercharged toys that can twist, turn, and wiggle their way through complicated patterns. They can learn not just straight lines, but curves, zig-zags, and all sorts of shapes. At the heart of this ability are hidden layers and special activation functions—think of them as clever bends and joints you add to your toy so it can flex and move in new ways.
Recall the linear regression model:

$$\hat{y} = Xw + b$$

where $X$ holds the input examples, $w$ the learned weights, and $b$ the bias.

Limitation: This setup only learns straight-line relationships between input and output.
A neural network (specifically, a feedforward neural network or multi-layer perceptron, MLP) adds one or more hidden layers between input and output.
The simplest MLP with one hidden layer looks like this:

$$h = \sigma(XW_1 + b_1)$$
$$\hat{y} = hW_2 + b_2$$

More generally, for $L$ layers, each layer applies a linear transformation followed by an activation:

$$h^{(l)} = \sigma\!\left(h^{(l-1)} W_l + b_l\right), \qquad h^{(0)} = X$$

The activation function $\sigma$ introduces nonlinearity; without it, stacking layers is no better than a single linear layer. The most common activation functions are ReLU, sigmoid, and tanh. These allow the network to bend and shape data in ways linear models cannot.
Linear models can only learn relationships like “if variable X goes up, variable Y goes up (or down) by a fixed amount.” However, most interesting patterns in the real world are messy: think of predicting house prices, recognizing objects in photos, or turning speech into text. These relationships are nonlinear (they don’t look like a straight line).
A neural network with hidden layers and nonlinear activation functions can learn decision boundaries and patterns that are wavy, irregular, and complex.
Without activation functions, stacking linear layers just results in another big linear transformation—a waste! It’s the nonlinear activations that let neural networks become super-flexible.
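To see why, compose two linear layers (using the same notation as above) and simplify:

$$\hat{y} = (XW_1 + b_1)W_2 + b_2 = X(W_1 W_2) + (b_1 W_2 + b_2) = XW' + b'$$

However many linear layers you stack, the result collapses to a single linear map $XW' + b'$; only a nonlinearity between the layers prevents this collapse.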
Let’s quickly summarize three popular ones:

- ReLU: $\text{ReLU}(z) = \max(0, z)$. Passes positive values through unchanged and zeroes out negatives; cheap to compute and the usual default for hidden layers.
- Sigmoid: $\sigma(z) = 1 / (1 + e^{-z})$. Squashes any input into the range $(0, 1)$, which is handy when the output should look like a probability.
- Tanh: $\tanh(z)$. Squashes inputs into $(-1, 1)$ and is centered at zero.
Let’s see how to implement (and visualize) a simple MLP in PyTorch that can fit data that a linear model cannot.
We’ll:

- Generate a synthetic dataset (two interleaving “moons”) that no straight line can separate.
- Train a plain linear model and watch it fail.
- Train a small MLP on the same data and visualize the nonlinear decision boundary it learns.
If you don’t have PyTorch and matplotlib yet, install them:
uv pip install torch matplotlib scikit-learn
from typing import Tuple
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
import torch
def generate_moons(n_samples: int = 200, noise: float = 0.20) -> Tuple[torch.Tensor, torch.Tensor]:
    X_np, y_np = make_moons(n_samples=n_samples, noise=noise, random_state=42)
    X = torch.tensor(X_np, dtype=torch.float32)
    y = torch.tensor(y_np, dtype=torch.float32).unsqueeze(1)
    return X, y
X, y = generate_moons()
plt.scatter(X[:, 0], X[:, 1], c=y.squeeze(), cmap="coolwarm", s=40)
plt.title("Synthetic Nonlinear Dataset: Two Moons")
plt.xlabel("x1")
plt.ylabel("x2")
plt.show()
import torch.nn as nn
class LinearModel(nn.Module):
    def __init__(self, input_dim: int, output_dim: int) -> None:
        super().__init__()
        self.fc = nn.Linear(input_dim, output_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.fc(x))  # For binary classification
model = LinearModel(input_dim=2, output_dim=1)
loss_fn = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Training loop (very simple!)
for epoch in range(1000):
    optimizer.zero_grad()
    y_pred = model(X)
    loss = loss_fn(y_pred, y)
    loss.backward()
    optimizer.step()
print(f"Final training loss: {loss.item():.4f}")
import numpy as np
def plot_decision_boundary(model: nn.Module, X: torch.Tensor, y: torch.Tensor,
                           title: str = "Decision Boundary") -> None:
    x_min, x_max = float(X[:, 0].min()) - 0.5, float(X[:, 0].max()) + 0.5
    y_min, y_max = float(X[:, 1].min()) - 0.5, float(X[:, 1].max()) + 0.5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    grid = torch.tensor(np.c_[xx.ravel(), yy.ravel()], dtype=torch.float32)
    with torch.no_grad():
        probs = model(grid).reshape(xx.shape).numpy()
    plt.contourf(xx, yy, probs, alpha=0.3, cmap="coolwarm")
    plt.scatter(X[:, 0], X[:, 1], c=y.squeeze(), cmap="coolwarm", s=40)
    plt.title(title)
    plt.show()
plot_decision_boundary(model, X, y, title="Linear Model: Can't Capture Nonlinearity")
class MLP(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int) -> None:
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.act = nn.ReLU()  # Try swapping to nn.Tanh() or nn.Sigmoid()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.act(self.fc1(x))
        return torch.sigmoid(self.fc2(h))  # For binary classification
mlp = MLP(input_dim=2, hidden_dim=8, output_dim=1)
optimizer = torch.optim.Adam(mlp.parameters(), lr=0.05)
loss_fn = nn.BCELoss()
# Training loop
for epoch in range(1000):
    optimizer.zero_grad()
    y_pred = mlp(X)
    loss = loss_fn(y_pred, y)
    loss.backward()
    optimizer.step()
print(f"Final training loss: {loss.item():.4f}")
plot_decision_boundary(mlp, X, y, title="MLP: Learns a Nonlinear Boundary")
You should now see that the neural network successfully separates the two moons, learning a complex, nonlinear boundary—something a linear model simply cannot do.
Exercises:

1. Swap the activation: edit the MLP class above and replace nn.ReLU() with nn.Tanh(), and then nn.Sigmoid(). Retrain the model each time and visualize the boundary. What changes?

# Swap activation function in the MLP class:
# self.act = nn.Tanh()
# or
# self.act = nn.Sigmoid()

2. Increase the hidden dimension of the MLP class:

mlp = MLP(input_dim=2, hidden_dim=32, output_dim=1)

3. Train a model with only a single Linear layer. Confirm that it can't fit the nonlinear data.

class SimpleLinear(nn.Module):
    def __init__(self, input_dim: int, output_dim: int) -> None:
        super().__init__()
        self.fc = nn.Linear(input_dim, output_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.fc(x))

model = SimpleLinear(2, 1)
In this post, you learned why linear models are limited to straight-line relationships, how adding hidden layers turns a linear model into a multi-layer perceptron (MLP), why nonlinear activation functions are essential, and how to build and train a simple MLP in PyTorch that fits data a linear model cannot.
In the next lesson, we’ll dig deeper into forward and backward passes—the gears that power neural network training. Stay tuned!
USER:
is there a way to make a neural network that can learn any function, no matter how complex? or do we need different numbers of layers and different architectures for different functions?
ASSISTANT:
Great question! This touches on a core concept in neural networks called the Universal Approximation Theorem—and a lot of the practical realities of modern deep learning.
A neural network, if it’s big enough (with enough neurons and the right activation functions), can in theory learn any function, no matter how squiggly or complex that function is! Imagine you had a magical Lego set: with enough blocks, you could build any shape—but if your blocks are too big or you don’t have the right connectors, you might have a hard time building something really detailed.
The Universal Approximation Theorem says that:

A feedforward network with just one hidden layer, a nonlinear activation, and enough hidden units can approximate any continuous function on a bounded domain as closely as you like.

In math terms:
If $f(x)$ is any continuous function, and your neural net $\hat{f}(x)$ is big enough, then

$$|\hat{f}(x) - f(x)| < \epsilon$$

for all $x$ in your domain, for any tolerance $\epsilon > 0$ you pick.
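To make this concrete, here is a small standalone sketch (not from the original post; sizes chosen arbitrarily) that fits a wiggly 1D function with a single hidden layer. Widening hidden_dim lets the fit get closer and closer:

import torch
import torch.nn as nn

# Target: a "squiggly" continuous function on [-3, 3]
x = torch.linspace(-3, 3, 400).unsqueeze(1)   # shape [400, 1]
y = torch.sin(3 * x) + 0.5 * x                # the function we want to approximate

# One hidden layer with a nonlinear activation: exactly the setting of the theorem
hidden_dim = 64                               # try 4 vs. 64 to see the effect of width
net = nn.Sequential(nn.Linear(1, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, 1))

optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(2000):
    optimizer.zero_grad()
    loss = loss_fn(net(x), y)
    loss.backward()
    optimizer.step()

print(f"Final MSE: {loss.item():.5f}")        # should be small for a wide enough net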
BUT:

- The theorem doesn't tell you how many hidden units you need; for complicated functions, "big enough" can be absurdly big.
- It doesn't promise that gradient descent will actually find those weights, or that the network will generalize from limited data.
- In practice, we also use specialized architectures for structured data, such as convolutional networks for images and recurrent networks or Transformers for sequences.

These architectures are designed to take advantage of patterns and structures (like the fact that nearby pixels in images matter) that a plain MLP can’t “see” as easily.
TL;DR:
Neural networks are universal function fitters—in theory. In practice, depth, size, and architecture matter for learning real-world complex functions!
USER:
if just one hidden layer can learn any continuous function, why do we need deep networks with many layers? can’t we just use a single layer with enough neurons?
ASSISTANT:
Fantastic question! This confusion is very common—let’s clear it up:
Imagine you want to build a super-detailed Lego sculpture. You could use millions of tiny blocks in a single layer to copy every contour. But what if, instead, you built it in stages—with blocks forming big shapes, then smaller shapes, then details? You’d need far fewer blocks and it would be easier, too!
That’s why we build “deep” (many-layered) networks: they learn in stages—from simple to complex patterns—making them much more efficient.
The Universal Approximation Theorem tells us a neural network with one hidden layer can approximate any continuous function—if it has enough hidden units (neurons).
But in practice:

- "Enough hidden units" for a single layer can mean an impractically huge number of neurons.
- The theorem only says such a network exists; it doesn't say gradient descent will find it, or that it will generalize well from the data you have.

Modern deep learning research shows that deep networks (many hidden layers, each with manageable width):

- Build up features in stages, composing simple patterns into more complex ones layer by layer.
- Can represent many functions with far fewer total parameters than an equally expressive shallow network.
- Tend to be easier to train to good solutions in practice.
There are functions that a deep network can represent compactly (with a reasonable number of neurons/layers), but a shallow network would need an exponentially larger number of neurons to achieve similar performance.
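As a rough illustration (a sketch I'm adding here, with arbitrary layer sizes), you can compare parameter counts directly in PyTorch; a deep, narrow stack often uses far fewer parameters than one very wide hidden layer:

import torch.nn as nn

def count_params(model: nn.Module) -> int:
    """Total number of learnable parameters (weights + biases)."""
    return sum(p.numel() for p in model.parameters())

# One very wide hidden layer
shallow = nn.Sequential(nn.Linear(2, 4096), nn.ReLU(), nn.Linear(4096, 1))

# Several modest hidden layers
deep = nn.Sequential(
    nn.Linear(2, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

print(f"Shallow-but-wide parameters: {count_params(shallow):,}")  # ~16,000
print(f"Deep-but-narrow parameters:  {count_params(deep):,}")     # ~8,600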
TL;DR:
Deep networks aren’t just a fad—they’re essential because they can learn more with less, build up understanding in stages, and make practical deep learning possible!
USER:
is there one activation function that most deep learning models use, such as ReLU, or do they use different ones for different layers?
ASSISTANT:
Great question! The activation function is a crucial part of neural networks—and choosing the right one can make a big difference.
Most deep learning models use the same activation function (usually ReLU) for most of their hidden layers. Think of ReLU as a simple switch that only lets “positive energy” through—which makes it fast and easy for your model to learn.
But for the output layer, we often use different activation functions depending on the problem (for example, using sigmoid for probabilities).
ReLU (Rectified Linear Unit) is overwhelmingly the default choice for hidden layers in most deep neural networks:

- It's very cheap to compute (just a max with zero).
- Its gradient doesn't shrink toward zero for positive inputs, which helps deeper networks train.
- It works well in practice across a huge range of tasks (variants like LeakyReLU and GELU appear in some architectures, but they're close cousins).

For the output layer, the activation function is chosen based on the task:

- Sigmoid for binary classification, so the output can be read as a probability between 0 and 1.
- Softmax for multi-class classification, so the outputs form a probability distribution over classes.
- No activation (identity) for regression, so the output can be any real number.
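Here is a small sketch of that pattern (my own example, with arbitrary input sizes and widths); note that nn.CrossEntropyLoss applies softmax internally, so the multi-class model outputs raw logits:

import torch.nn as nn

hidden = 32  # arbitrary width, just for illustration

# Binary classification: sigmoid output paired with BCELoss
binary_clf = nn.Sequential(nn.Linear(10, hidden), nn.ReLU(), nn.Linear(hidden, 1), nn.Sigmoid())
binary_loss = nn.BCELoss()

# Multi-class classification: raw logits paired with CrossEntropyLoss (softmax happens inside the loss)
multiclass_clf = nn.Sequential(nn.Linear(10, hidden), nn.ReLU(), nn.Linear(hidden, 5))
multiclass_loss = nn.CrossEntropyLoss()

# Regression: no output activation, paired with MSELoss
regressor = nn.Sequential(nn.Linear(10, hidden), nn.ReLU(), nn.Linear(hidden, 1))
regression_loss = nn.MSELoss()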
TL;DR:
Today’s deep nets usually “stack” many layers with ReLU, and select the output activation to fit the task. 🎯
USER:
ok, back to the article. in this equation:

$$h = \sigma(XW_1 + b_1)$$

both X and W are capitalized. does that mean they are both matrices? what are they?
ASSISTANT:
Fantastic observation! Yes: the capitalization in math is usually intentional. In this equation:

$$h = \sigma(XW_1 + b_1)$$

let's break down each term:

- $X$ (capital X): the input data matrix, with one example per row and one feature per column.
- $W_1$ (capital W): the weight matrix of the first layer; capital because it's also a matrix (one column per hidden neuron).
- $b_1$ (lowercase b): the bias vector for the first layer, added to (broadcast across) every example.
- $\sigma$: the activation function (e.g. ReLU), applied elementwise.
- $h$ (lowercase h): the resulting matrix of hidden activations, one row per example.

If you have:

- 200 training examples, each with 2 features,
- and a hidden layer with 8 neurons,

Then:

- $X$ has shape $(200, 2)$,
- $W_1$ has shape $(2, 8)$,
- $b_1$ has shape $(8,)$ (one bias per hidden neuron),
- and $h = \sigma(XW_1 + b_1)$ has shape $(200, 8)$.
Symbol | Shape | What it means |
---|---|---|
$X$ | $(n, d)$ | Input data (examples x features) |
$W_1$ | $(d, h)$ | Weights in first layer |
$b_1$ | $(h,)$ or $(1, h)$ | Bias for first layer |
$h$ | $(n, h)$ | Hidden activations |
Bottom line: Capital letters ($X$, $W_1$) denote matrices: $X$ is your whole batch of inputs and $W_1$ is the layer's weight matrix. Lowercase $b_1$ is a vector that gets broadcast across the batch, and lowercase $h$ is the matrix of hidden activations you get out.
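If it helps, here is a quick shape check (my own snippet, using the same sizes as the post: 2 input features, 8 hidden units):

import torch
import torch.nn as nn

X = torch.randn(200, 2)        # 200 examples, 2 features each -> capital X is a matrix
fc1 = nn.Linear(2, 8)          # conceptually W_1 is (2 x 8); b_1 has 8 entries
h = torch.relu(fc1(X))         # hidden activations

print(X.shape)                 # torch.Size([200, 2])
print(fc1.weight.shape)        # torch.Size([8, 2]) (PyTorch stores the transpose internally)
print(fc1.bias.shape)          # torch.Size([8])
print(h.shape)                 # torch.Size([200, 8])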
USER:
ok, another question. to be clear, each layer in a multi-layer neural network has separate weights and biases, correct? does this mean we need to solve for optimal weights and biases for every layer separately during training?
ASSISTANT:
Great question! Yes, each layer in a multi-layer neural network has its own separate weights and biases.
But the important thing is how those get “solved” or updated during training—let’s break it down:
Think of a neural network as a team of workers passing along a message. Each worker (layer) has their own set of secret instructions (weights and biases). Together, they’re learning how to transform the input in the best way possible to get the right answer at the end. During training, the whole team works together to adjust all their instructions so the final answer improves—not just one worker at a time!
Because the outputs of earlier layers become the inputs to later layers. Changing an early layer’s weights affects the whole chain! The gradients are computed through the entire network (“chain rule” of calculus) so that each weight/bias gets the best update for the final prediction.
TL;DR:
Each layer has its own parameters, but all are learned together in the training loop using backpropagation and gradient descent!
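As a quick sanity check, here is a sketch I'm adding (using a tiny stand-in network and fake data, not the article's exact MLP): a single backward pass produces gradients for every layer's weights and biases, and a single optimizer.step() updates them all together.

import torch
import torch.nn as nn

# A tiny two-layer network, similar in spirit to the MLP from the article
model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

X = torch.randn(16, 2)                      # a small fake batch
y = torch.randint(0, 2, (16, 1)).float()    # fake binary labels

optimizer.zero_grad()
loss = loss_fn(model(X), y)
loss.backward()                             # one backward pass fills .grad for EVERY parameter

for name, param in model.named_parameters():
    # Each layer has its own weight and bias, and each gets its own gradient
    print(name, tuple(param.shape), "grad norm:", f"{param.grad.norm().item():.4f}")

optimizer.step()                            # one step updates all layers' parameters at once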
USER:
ok, but then what is the purpose of the difference between batch, mini-batch, and stochastic gradient descent? i thought these all updates different weights and biases, but now it seems like they all update the same weights and biases, just at different times. so what is the difference between these three methods?
ASSISTANT:
Excellent question! This trips up a lot of newcomers. The difference between batch, mini-batch, and stochastic gradient descent (SGD) is not about which weights or which layers get updated—they all update all the learnable parameters (weights and biases) of the entire network each step! The difference is about how much data is used to compute each update, and thus how often/how “noisily” updates are made.
Imagine you’re learning to shoot basketball hoops:

- Batch gradient descent: you take every shot in the gym, then sit down and adjust your technique once based on all of them.
- Stochastic gradient descent: you tweak your form a little after every single shot.
- Mini-batch gradient descent: you adjust after each small group of shots (say, 10 at a time).
You’re always adjusting your whole technique (not just parts!), but you use different amounts of feedback each time.
In modern deep learning, “mini-batch” is the standard practice.
Method | Data per update | Updates per epoch | Speed | Noise |
---|---|---|---|---|
Batch | All $N$ samples | 1 | Slow | Very low |
Stochastic (SGD) | 1 sample | $N$ | Fast | Very high |
Mini-batch | A few samples (e.g. 32) | $N$ / batch size | Just right | Medium |
TL;DR:
Batch, mini-batch, and stochastic GD differ only in how much data is used to compute each parameter update, not in which parameters are updated. All update the entire network!
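In code, the difference usually comes down to the batch_size you hand to a DataLoader. A minimal sketch (using random stand-in data, just to show the mechanics):

import torch
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(200, 2)
y = torch.randint(0, 2, (200, 1)).float()
dataset = TensorDataset(X, y)

batch_gd   = DataLoader(dataset, batch_size=len(dataset))       # 1 update per epoch, using all data
sgd        = DataLoader(dataset, batch_size=1, shuffle=True)    # 200 updates per epoch, 1 sample each
mini_batch = DataLoader(dataset, batch_size=32, shuffle=True)   # ~7 updates per epoch, 32 samples each

for loader in (batch_gd, sgd, mini_batch):
    n_updates = sum(1 for _ in loader)
    print(f"batch_size={loader.batch_size}: {n_updates} parameter updates per epoch")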
USER:
what does “MLP” mean in the exercises?
ASSISTANT:
Great question! MLP stands for Multi-Layer Perceptron.
Think of an MLP as a “classic” or “plain” neural network: It’s just a bunch of layers—one after another—where each layer is fully connected to the next (“dense” layers), and each uses an activation function like ReLU or sigmoid. When you see “MLP,” read it as a basic, stack-of-layers neural network.
In code (like the exercises in your article), when you see a class MLP, it’s just shorthand for a “standard feedforward neural network with one or more hidden layers.”
Input → Dense Layer (weights & biases) → ReLU → Dense Layer → Output
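If you prefer code to diagrams, that same stack can be written with nn.Sequential (a sketch with arbitrary sizes, equivalent in structure to the MLP class from the article, plus a sigmoid for binary classification):

import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(2, 8),    # input -> hidden (dense layer: weights & biases)
    nn.ReLU(),          # nonlinearity
    nn.Linear(8, 1),    # hidden -> output
    nn.Sigmoid(),       # squashes the output to a probability
)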
Bottom line:
When you see “MLP” in an exercise, it means “the neural network model/class defined above (with input, hidden, and output layers, as shown).” It’s the “vanilla” or “standard” neural net!
USER:
and what does fc1 and fc2 mean? are these properties that a class that extends nn.Module must have? does it call these functions automatically?
ASSISTANT:
Great questions! Let’s break them down:
What do fc1 and fc2 mean?

- fc stands for "fully connected" (also called a "dense" layer).
- The numbers just indicate the order: fc1 is the first fully connected layer (input to hidden), and fc2 is the second fully connected layer (hidden to output, or hidden to hidden if deeper).

So,

self.fc1 = nn.Linear(input_dim, hidden_dim)
self.fc2 = nn.Linear(hidden_dim, output_dim)

means you have two "fully connected" layers, one after the other.

Are these names required by nn.Module?

- No. fc1 and fc2 are just a naming habit; you can name the layers anything you like: layer1, dense, my_hidden, etc.
- PyTorch never calls these attributes (fc1, etc.) directly. What matters is what you do inside forward(self, x):

def forward(self, x):
    h = self.fc1(x)
    out = self.fc2(h)
    return out

- The forward method is automatically called by PyTorch when you do output = model(input).

Name | Meaning | Required? | Called Automatically? |
---|---|---|---|
fc1, fc2 | "Fully connected" layers | No, just a habit | No, you call them inside your forward() |
Layer names | Any (e.g. layer1, dense) | No, any valid name | No, call them in forward() |
forward() | Neural net computation method | Yes (required) | Yes, by PyTorch when you use model(x) |

Bottom line:

- fc1/fc2 are just conventional names for your fully connected layers.
- PyTorch calls your forward() method automatically; inside it, you decide how those layers are used.

USER:
in that case, i don’t understand how the hidden dimensions worked. i thought after each layer, it is multiplied by a nonlinear function. however, in this case, the first “hidden layer” seems to have a “hidden dimension”. what does this mean? is that not the number of layers? is that the dimension of the matrix?
ASSISTANT:
Awesome questions—let’s clarify these ideas:
So, if you see:
self.fc1 = nn.Linear(input_dim, hidden_dim)
- input_dim: Number of input features (e.g., pixels for images, or whatever your input vector size is).
- hidden_dim: Number of neurons in the first hidden layer (how “wide” that layer is).

Each fully connected (linear) layer does:
output = input @ weights + bias
where input is a matrix of size [batch_size, input_dim] and weights is [input_dim, output_dim].
Activation function: After the linear layer, you usually apply a nonlinear “activation function” (like ReLU, Tanh, etc.):
activated = activation_function(output)
So, in forward pass code:
h = self.fc1(x) # Linear transformation: [batch, hidden_dim]
a = self.act(h) # Nonlinear activation - same shape
out = self.fc2(a) # Next linear layer
Diagram:

input_dim --(fc1)--> hidden_dim --(activation, e.g. ReLU)--> hidden_dim --(fc2)--> output_dim
           [fc1 has input_dim x hidden_dim weights]          [fc2 has hidden_dim x output_dim weights]
hidden_dim controls how many features (“neurons”) the hidden layer computes. More hidden units (a wider hidden dimension) let the network represent more complex functions, but make it bigger and potentially slower.
Term | What it means | Controls |
---|---|---|
Layer | One computation step (fc + nonlinearity) | Depth |
Hidden layer | Any layer between input and output | Depth |
Hidden dimension / hidden size | Number of neurons (units) in each layer | Width |
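To make the depth-versus-width distinction concrete, here is a quick sketch (sizes are arbitrary):

import torch.nn as nn

# Wider: same depth (one hidden layer), larger hidden dimension
wider = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))

# Deeper: more hidden layers, each with a modest hidden dimension
deeper = nn.Sequential(
    nn.Linear(2, 8), nn.ReLU(),    # hidden layer 1 (width 8)
    nn.Linear(8, 8), nn.ReLU(),    # hidden layer 2 (width 8)
    nn.Linear(8, 1),               # output layer
)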
Analogy:
Think of hidden layers as stations on an assembly line: the number of stations is the network’s depth, and the number of workers at each station is that layer’s hidden dimension (its width).

Bottom line:
hidden_dim is the width of a single layer (how many neurons it has), not the number of layers. You add depth by stacking more layers, and each linear layer is followed by a nonlinear activation.
USER: