USER:
i need your help to write the next blog post in a series i am producing called “Learn the Training Loop with PyTorch”.
first, let me give you the outline of the series:
Goal: Build foundational intuition by hand-coding a simple training loop from scratch.
1. Introduction: What is a Training Loop?
2. The Simplest Case: Linear Regression
3. Batch vs. Stochastic Gradient Descent
4. Visualizing the Loss Landscape
5. Numerical vs. Analytical Gradients
6. Recap and Key Takeaways
Goal: Extend the basic loop to more powerful models; introduce neural network concepts.
1. From Linear to Nonlinear: Why Neural Networks?
2. Forward and Backward Passes
3. Implementing a Simple Neural Net from Scratch
4. The Role of Activations
5. Mini-batching and Data Pipelines
6. Regularization and Overfitting
7. Recap: Comparing Our Simple Network with Linear Regression
Goal: Connect foundational understanding to the large-scale models used in industry.
1. Optimization Algorithms Beyond SGD
2. Learning Rate Scheduling
3. Weight Initialization
4. Deeper Networks and Backprop Challenges
5. Large-Scale Training: Data Parallelism and Hardware
6. Monitoring and Debugging the Training Loop
7. Modern Regularization and Generalization Techniques
8. The Training Loop in Practice: Case Studies
9. Conclusion: What’s Next After the Training Loop?
great. now, for each blog post, i am hoping to follow the following structure:
that’s it, now can you please write blog post 2.4?
the description is:
4. The Role of Activations
please respond with the full blog post only, and no other information.
ASSISTANT:
Welcome back to “Learn the Training Loop with PyTorch”!
In the last blog post, you built a neural network from scratch and learned how it could learn more complex relationships than linear regression. But you might be wondering: what really makes a neural network “non-linear,” and why is that so important?
Today, we’re zooming in on activation functions—the key ingredient that makes neural networks more expressive and capable than linear models. We’ll intuitively explain what activations are, why they’re needed, look at some of the most common ones, see their math, and finally visualize their impact with code.
Think of a neural network like an assembly line. At each step (layer), worker robots (neurons) add their ingredient (multiply the input by weights and add a bias), then pass the result to the next worker. If every worker only ever follows that same simple recipe (just multiplying and adding), the final product is always the ingredients mixed in the same plain way, no matter how many steps you add.
An activation function is like a taste-tester in between workers. At each step, the taste-tester decides whether and how much of the mixture gets passed on, sometimes amplifying it, squashing it, or cutting it off entirely!
Without an activation function, no matter how many layers you add, the network can only make linear changes to the input—like a single-layer recipe, just longer and more expensive. Activation functions let your network “think outside the line”—mixing, squashing, and transforming signals so it can learn curves, corners, and real-life complexities.
Let’s formalize where activation functions fit into the neural network workflow.
A single neuron computes:

$$z = \mathbf{w}^\top \mathbf{x} + b, \qquad a = \sigma(z)$$

where $\mathbf{x}$ is the input vector, $\mathbf{w}$ the neuron's weights, $b$ its bias, and $\sigma$ the activation function.

A layer of neurons vectorizes this:

$$\mathbf{z} = W\mathbf{x} + \mathbf{b}, \qquad \mathbf{a} = \sigma(\mathbf{z})$$

where $W$ is the weight matrix (one row per neuron), $\mathbf{b}$ the bias vector, and $\sigma$ is applied element-wise.
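To make the shapes concrete, here's a minimal sketch of that layer computation in raw PyTorch (the sizes `d = 3` and `h = 4` are arbitrary choices for illustration):

```python
import torch

d, h = 3, 4                # arbitrary input / output sizes for illustration
x = torch.randn(d)         # one input vector
W = torch.randn(h, d)      # weight matrix: one row of weights per neuron
b = torch.randn(h)         # one bias per neuron

z = W @ x + b              # pre-activations for the whole layer
a = torch.relu(z)          # activation applied element-wise

print(z.shape, a.shape)    # torch.Size([4]) torch.Size([4])
```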
Some common activation functions:

- **ReLU**: $\mathrm{ReLU}(z) = \max(0, z)$
- **Sigmoid**: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
- **Tanh**: $\tanh(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
Why does it matter?
If you “stack” linear layers without a non-linearity between them, the result is still a linear function of the input—no more powerful than a single layer! Activations unlock non-linear transformations, giving networks the capacity to model almost any function.
Let’s break it down:
Linear Layers Without Activation = Linear Model
Suppose you have two linear layers:

$$\mathbf{h} = W_1 \mathbf{x} + \mathbf{b}_1, \qquad \mathbf{y} = W_2 \mathbf{h} + \mathbf{b}_2$$

You can rewrite this as:

$$\mathbf{y} = W \mathbf{x} + \mathbf{b}$$

where $W = W_2 W_1$ and $\mathbf{b} = W_2 \mathbf{b}_1 + \mathbf{b}_2$: still just a single linear transformation! So, no matter how many you stack, without activations they're just one big linear operation.

Introducing Non-Linearity

When you wrap a non-linear activation $\sigma$ between the layers:

$$\mathbf{y} = W_2\, \sigma(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$$

Now, the relationship between $\mathbf{x}$ and $\mathbf{y}$ can curve, bend, or even oscillate, depending on $\sigma$. This makes neural networks universal function approximators.
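Here's a quick sanity check you can run yourself (a minimal sketch with arbitrary layer sizes): two stacked `nn.Linear` layers collapse into one linear map, but adding a ReLU in between breaks that equivalence.

```python
import torch

torch.manual_seed(0)
lin1 = torch.nn.Linear(3, 5)
lin2 = torch.nn.Linear(5, 2)

x = torch.randn(4, 3)  # a small batch of 4 inputs

# Two stacked linear layers (no activation)...
y_stacked = lin2(lin1(x))

# ...equal one linear layer with W = W2 @ W1 and b = W2 @ b1 + b2
W = lin2.weight @ lin1.weight
b = lin2.weight @ lin1.bias + lin2.bias
y_single = x @ W.T + b

print(torch.allclose(y_stacked, y_single, atol=1e-6))  # True

# With a ReLU in between, the mapping is no longer a single linear layer
y_nonlinear = lin2(torch.relu(lin1(x)))
print(torch.allclose(y_stacked, y_nonlinear))  # False (in general)
```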
Practical Impacts

- Without activations, extra layers buy you nothing: the whole network stays linear.
- The activation you pick affects gradient flow, and therefore training speed and stability (sigmoid and tanh can saturate; ReLU is cheap but outputs zero for all negative inputs).

Choosing the Right Activation

- ReLU is the usual default for hidden layers: simple, fast, and effective in practice.
- Tanh or sigmoid are handy when you want bounded outputs (tanh is zero-centered; sigmoid maps to $(0, 1)$, useful for probabilities).
- When in doubt, start with ReLU and experiment.
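In PyTorch, the activation is usually just another module (or functional call) dropped between the linear layers; here's a quick sketch with arbitrary layer sizes:

```python
import torch.nn as nn

# Same architecture, different activation between the two linear layers
relu_net = nn.Sequential(nn.Linear(1, 10), nn.ReLU(), nn.Linear(10, 1))
tanh_net = nn.Sequential(nn.Linear(1, 10), nn.Tanh(), nn.Linear(10, 1))
```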
Let’s visualize activation functions and see their impact on network expressiveness!
Install requirements (if needed):
uv pip install torch matplotlib
```python
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

def plot_activations() -> None:
    z = torch.linspace(-5, 5, steps=200)
    activations = {
        "ReLU": F.relu(z),
        "Sigmoid": torch.sigmoid(z),
        "Tanh": torch.tanh(z),
    }
    plt.figure(figsize=(8, 5))
    for name, activation in activations.items():
        plt.plot(z.numpy(), activation.numpy(), label=name)
    plt.title("Common Activation Functions")
    plt.xlabel("Input z")
    plt.ylabel("Activation(z)")
    plt.legend()
    plt.grid(True)
    plt.show()

plot_activations()
```
Let’s fit a toy dataset with and without non-linearity.
```python
import torch
import matplotlib.pyplot as plt
import torch.nn.functional as F

def make_toy_data(n: int = 100) -> tuple[torch.Tensor, torch.Tensor]:
    # Data: input x, labels y = sin(x)
    x = torch.linspace(-2 * torch.pi, 2 * torch.pi, n).unsqueeze(1)
    y = torch.sin(x)
    return x, y

class SimpleNet(torch.nn.Module):
    def __init__(self, use_activation: bool):
        super().__init__()
        self.fc1 = torch.nn.Linear(1, 10)
        self.use_activation = use_activation
        self.fc2 = torch.nn.Linear(10, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.fc1(x)
        if self.use_activation:
            x = F.relu(x)  # Try also with torch.tanh(x)
        x = self.fc2(x)
        return x

def train_model(use_activation: bool, epochs: int = 1000) -> torch.nn.Module:
    x, y = make_toy_data()
    model = SimpleNet(use_activation)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    for _ in range(epochs):
        y_pred = model(x)
        loss = F.mse_loss(y_pred, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model

def plot_fit_with_and_without_activation() -> None:
    x, y = make_toy_data()
    models = {
        "No Activation (Linear Only)": train_model(use_activation=False, epochs=1000),
        "With ReLU Activation": train_model(use_activation=True, epochs=1000),
    }
    plt.figure(figsize=(8, 5))
    plt.plot(x.numpy(), y.numpy(), label="True sin(x)", color="black", linewidth=2)
    for label, model in models.items():
        y_pred = model(x).detach()
        plt.plot(x.numpy(), y_pred.numpy(), label=label)
    plt.title("Linear vs. Non-linear (ReLU) Neural Network Fit")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.legend()
    plt.grid(True)
    plt.show()

plot_fit_with_and_without_activation()
```
Try replacing `F.relu(x)` with `torch.tanh(x)` in the model above, and see how much better the fit gets for a sinusoidal function!
Modify the given activation plotting code to plot not just each function, but its derivative as well.
Why? The gradient affects learning speed and stability!
```python
def plot_activation_and_derivative() -> None:
    z = torch.linspace(-5, 5, steps=200)
    activations = {
        "ReLU": (F.relu(z), (z > 0).float()),
        "Sigmoid": (torch.sigmoid(z), torch.sigmoid(z) * (1 - torch.sigmoid(z))),
        "Tanh": (torch.tanh(z), 1 - torch.tanh(z) ** 2),
    }
    plt.figure(figsize=(10, 6))
    for idx, (name, (val, grad)) in enumerate(activations.items()):
        plt.subplot(2, 3, idx + 1)
        plt.plot(z.numpy(), val.numpy())
        plt.title(f"{name}")
        plt.subplot(2, 3, idx + 4)
        plt.plot(z.numpy(), grad.numpy())
        plt.title(f"{name} Derivative")
    plt.tight_layout()
    plt.show()

plot_activation_and_derivative()
```
In the toy function-fitting code above, swap the `F.relu(x)` for `torch.tanh(x)` and retrain. Which activation works better for approximating $\sin(x)$?
Answer: Tanh matches the shape much better, because its output range $[-1, 1]$ matches $\sin(x)$'s output range!
Create a custom PyTorch activation function called "Swish," defined as $\mathrm{swish}(z) = z \cdot \sigma(z)$, where $\sigma$ is the sigmoid function.
```python
def swish(z: torch.Tensor) -> torch.Tensor:
    return z * torch.sigmoid(z)

# Visualize Swish
z = torch.linspace(-5, 5, steps=200)
plt.plot(z.numpy(), swish(z).numpy(), label="Swish")
plt.legend()
plt.grid(True)
plt.title("Swish Activation Function")
plt.show()
```
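If you'd like to try Swish inside a network, here's one possible sketch. It reuses the `swish` function above; `SwishNet` is just an illustrative name, not something defined in earlier posts.

```python
import torch

class SwishNet(torch.nn.Module):
    """Tiny two-layer network using the custom swish() defined above."""

    def __init__(self) -> None:
        super().__init__()
        self.fc1 = torch.nn.Linear(1, 10)
        self.fc2 = torch.nn.Linear(10, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(swish(self.fc1(x)))

model = SwishNet()
print(model(torch.randn(4, 1)).shape)  # torch.Size([4, 1])
```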
Next up: Mini-batching and data pipelines—how and why we feed data in small chunks to networks for efficient, robust learning!
USER:
in your mathematical examples, you seem to use W as a matrix and x as a vector. however, in other examples, i have done it the other way around: with X as a matrix and w as a vector. how do we know which approach is useful? does this make a big difference, for example depending on whether there are more weights than data points, or vice versa?
ASSISTANT:
Great question! This is a common source of confusion when reading machine learning and deep learning materials, since different authors use different indexing conventions. Here’s how to make sense of it:
When we process one example at a time, we often write:

$$\mathbf{y} = W\mathbf{x} + \mathbf{b}$$

with $\mathbf{x}$ a single input vector and $W$ a weight matrix (or, for plain linear regression with one output, $\hat{y} = \mathbf{w}^\top \mathbf{x} + b$ with a weight vector $\mathbf{w}$). This matches the typical "linear algebra" convention and is easy for writing math involving one input.

When we process multiple data points at once (a batch), we usually stack the data into a matrix $X$ whose rows are the individual examples:

$$Y = XW^\top + \mathbf{b}$$

So in this notation:

- $X$ has shape $(N, d)$: $N$ examples, each with $d$ features;
- $W$ has shape $(h, d)$: one row of weights per output unit;
- $Y$ has shape $(N, h)$: one row of predictions per example.

This batch notation is standard in PyTorch, TensorFlow, and other frameworks, because it enables efficient computation for lots of data at once.
Can you switch between the two conventions? Yes, with care. The meanings are compatible; they're just different levels of abstraction:
You can always write the "batched" case with matrices/vectors, and recover the "single" case by picking out one row.
The shapes of $X$, $W$, and $\mathbf{w}$ depend on your data: how many examples you have ($N$), how many input features ($d$), and how many outputs ($h$).
It doesn't matter if you have more data points than weights, or vice versa, in terms of which convention you use. What matters:

- that the shapes line up so the matrix multiplications are valid, and
- that you stay consistent about what rows and columns mean (rows = examples is the usual deep learning convention).

For a dense (fully connected) neural network layer in PyTorch, the input is always a matrix of shape (batch size $N$, input features $d$), and the layer's weights are a matrix of shape $(h, d)$ (where $h$ is the output size). The basic operation:

$$Y = XW^\top + \mathbf{b}$$

This is why you often see $X$ as a matrix (inputs), and $W$ as a matrix (weights from each input to each output unit/neuron).
Symbol | Typical shape | Meaning | Usage |
---|---|---|---|
$\mathbf{x}$ | $(d,)$ | Single input vector | Single-input math |
$X$ | $(N, d)$ | Matrix of $N$ inputs, each with $d$ features | Batch processing |
$\mathbf{w}$ | $(d,)$ | Weight vector for a simple linear model | Linear regression |
$W$ | $(h, d)$ | Weight matrix for a multilayer/NN layer | Neural networks |
$b$, $\mathbf{b}$ | scalar or vector | Bias term | Both |
$y$, $Y$ | scalar or $(N, h)$ | Predicted output(s) | Both |
If you ever get confused, write out the shapes for each term and check that the multiplication/addition makes sense! This habit will save you endless debugging time in both math and PyTorch.
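For example, a quick shape check in PyTorch (the sizes below are arbitrary) looks like this:

```python
import torch

N, d, h = 32, 3, 5             # arbitrary: 32 examples, 3 features, 5 outputs
X = torch.randn(N, d)          # batch of inputs: one row per example
layer = torch.nn.Linear(d, h)  # stores its weights as an (h, d) matrix

Y = layer(X)                   # computes X @ W.T + b under the hood

print(layer.weight.shape)      # torch.Size([5, 3])
print(layer.bias.shape)        # torch.Size([5])
print(Y.shape)                 # torch.Size([32, 5])
```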
USER:
ok, you’ve mentioned “dense” neural networks. i have encountered this term before, but i don’t know what this means. this is a bit of a tangent, but can you please briefly explain the difference between “dense” and “sparse” neural networks? and when are sparse networks useful?
ASSISTANT:
Absolutely! This is a great foundational question, and understanding it will help you grasp both deep learning architectures and practical efficiency tricks.
A dense neural network layer (also called a fully connected layer) is one where every neuron in one layer is connected to every neuron in the next.
Mathematically, for input $\mathbf{x} \in \mathbb{R}^d$ (with $d$ features) and output $\mathbf{y} \in \mathbb{R}^h$ (with $h$ features):

$$\mathbf{y} = W\mathbf{x} + \mathbf{b}$$

where $W$ is a weight matrix of shape $(h, d)$, and every entry of $W$ connects one input feature to one output neuron.
In PyTorch:
```python
import torch.nn as nn

d, h = 3, 5  # e.g. 3 input features, 5 output units; any sizes work
layer = nn.Linear(in_features=d, out_features=h)
```
This is a dense (fully connected) linear layer.
A sparse neural network is designed so that most of the possible connections between neurons are simply not there (they’re missing, or, mathematically, set to zero).
Why make it sparse?
- **Efficiency:** fewer connections mean fewer parameters to store and fewer multiplications to perform, which matters a lot for very large models or for running on limited hardware.
- **Overfitting Avoidance:** with fewer free parameters, the model has less capacity to memorize noise in the training data.
- **After Pruning:** a common workflow is to train a dense network, then remove ("prune") the weights that are close to zero, leaving a sparse network that is nearly as accurate but much smaller.
- **Special Structure:** some layers are sparse by design; for example, a convolutional layer connects each output only to a small local neighborhood of inputs.
Type | Connections | When to use |
---|---|---|
Dense | Every neuron to every neuron | Simple problems, default, when interpretability isn’t a concern; fits most use cases in small/medium networks |
Sparse | Only a subset of possible connections | Large-scale models, after pruning, domain-specific structure, efficiency important |
If you’re getting started, think of “dense” as the standard. Once you deal with large models or have special constraints, “sparse” becomes important!
If you want to play with sparsity in PyTorch, check out `torch.nn.utils.prune` for basic pruning tools.
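For instance, here's a minimal sketch (just an illustration, with arbitrary layer sizes and pruning amount) of magnitude pruning with `torch.nn.utils.prune`:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(100, 50)

# Zero out the 80% of weights with the smallest absolute value (L1 magnitude)
prune.l1_unstructured(layer, name="weight", amount=0.8)

# Make the pruning permanent (removes the mask / reparametrization)
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"Fraction of zero weights: {sparsity:.2f}")  # ~0.80
```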
USER:
it’s a bit odd to me that we add these seemingly arbitrary nonlinear functions to the network. is there some underlying conceptual reason for this, or was it just discovered that this was the key to making it so that neural networks can approximate any function?
ASSISTANT:
This is a really insightful question! It does seem a bit “magical” at first—that adding non-linear “squash” functions like sigmoid, tanh, or ReLU between matrix multiplications suddenly gives neural networks their remarkable power.
Let’s address why we need nonlinearity conceptually, and how this idea arose in the theory and practice of neural networks.
Think of building with LEGO blocks: if every piece is a flat plate and you can only stack them straight, anything you build stays flat, no matter how tall it gets. Nonlinear activations are like angled and hinged pieces: they let you build arches, curves, and corners out of the same simple parts.
Without Nonlinearity | With Nonlinearity |
---|---|
Only linear mappings | Capable of complex, flexible mappings |
Cannot model curves, logical “XOR”, etc. | Can model arbitrary continuous functions, logical gates, images, etc. |
More layers = same as one layer | More layers = more power (depth matters) |
Easy to optimize, limited utility | Requires nonlinearity to be trainable, but is vastly more expressive |
Nonlinearity is not arbitrary: it is essential for function approximation.
Without it, neural networks are no more expressive than simple linear regression!
So, this “add a nonlinear activation and magic happens” is not just an accident of history:
It is the mathematically necessary way to turn a “linear stretcher” into a “universal function machine.” The specific choice (sigmoid, tanh, ReLU) is practical, but having nonlinearity is foundational.
USER: