USER:
i need your help to write the next blog post in a series i am producing called “Learn the Training Loop with PyTorch”.
first, let me give you the outline of the series:
Goal: Build foundational intuition by hand-coding a simple training loop from scratch.
1. Introduction: What is a Training Loop?
2. The Simplest Case: Linear Regression
3. Batch vs. Stochastic Gradient Descent
4. Visualizing the Loss Landscape
5. Numerical vs. Analytical Gradients
6. Recap and Key Takeaways
Goal: Extend the basic loop to more powerful models; introduce neural network concepts.
1. From Linear to Nonlinear: Why Neural Networks?
2. Forward and Backward Passes
3. Implementing a Simple Neural Net from Scratch
4. The Role of Activations
5. Mini-batching and Data Pipelines
6. Regularization and Overfitting
7. Recap: Comparing Our Simple Network with Linear Regression
Goal: Connect foundational understanding to the large-scale models used in industry.
1. Optimization Algorithms Beyond SGD
2. Learning Rate Scheduling
3. Weight Initialization
4. Deeper Networks and Backprop Challenges
5. Large-Scale Training: Data Parallelism and Hardware
6. Monitoring and Debugging the Training Loop
7. Modern Regularization and Generalization Techniques
8. The Training Loop in Practice: Case Studies
9. Conclusion: What’s Next After the Training Loop?
great. now, for each blog post, i am hoping to follow the following structure:
that’s it, now can you please write blog post 1.5? the description is:
5. Numerical vs. Analytical Gradients
please respond with the full blog post only, and no other information.
ASSISTANT:
In the previous lessons, we learned what gradients are and why they’re essential for training models using methods like gradient descent. But how are these gradients actually calculated? In this lesson, we’ll demystify two main ways to compute gradients: analytically (using calculus rules) and numerically (using finite differences). Understanding both is extremely important—manual checks with numerical gradients can catch bugs and help you build confidence in your implementation.
Imagine you’re on a hike and want to descend a hill as quickly as possible. The gradient at your feet tells you which direction is the steepest way down. But you don’t have a map!
Analytical is fast and exact (if you do the math right), while numerical is simple and hard to get wrong, but slow and only approximate. Checking analytical gradients with numerical ones is like double-checking your directions on the trail using both a map and a compass.
Suppose you have a function $f(x)$ and you want to know how $f$ changes when $x$ changes. The gradient is simply the derivative: $\frac{df}{dx}$.
Example: For ,
In machine learning, our “function” is usually the loss $L$ as a function of parameters (weights) $w$, and we want $\frac{\partial L}{\partial w}$, the gradient of the loss with respect to the weights.
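As a rough sketch of what “analytical” means for a loss, the snippet below works out the MSE gradient of a one-feature linear model by hand and compares it with what autograd reports. The synthetic data, the seed, and the names `residual`, `grad_w`, and `grad_b` are assumptions made for this illustration only.

```python
import torch

torch.manual_seed(0)

# Synthetic data (an assumption for this sketch): y = 2x + 3 + noise
X = torch.linspace(-1, 1, 20).unsqueeze(1)
y = 2 * X + 3 + 0.1 * torch.randn(20, 1)

w = torch.randn(1, 1, requires_grad=True)
b = torch.randn(1, requires_grad=True)

# Autograd gradient of L = mean((X w + b - y)^2)
loss = ((X @ w + b - y) ** 2).mean()
loss.backward()

# Hand-derived gradient from the chain rule:
#   dL/dw = (2/n) * sum_i (w x_i + b - y_i) * x_i
#   dL/db = (2/n) * sum_i (w x_i + b - y_i)
with torch.no_grad():
    residual = X @ w + b - y
    grad_w = 2 * (X.T @ residual) / X.shape[0]
    grad_b = 2 * residual.mean(dim=0)

w_matches = torch.allclose(grad_w, w.grad, atol=1e-5)
b_matches = torch.allclose(grad_b, b.grad, atol=1e-5)
print(w_matches, b_matches)
```

If the hand-derived formula and autograd disagree, the derivation (or the code for it) is the first place to look.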
Numerical gradients use finite differences: we make a tiny change and see how the output changes.
For a function $f(x)$, the numerical gradient at $x$ is approximated as:
$$\frac{df}{dx} \approx \frac{f(x + h) - f(x - h)}{2h}$$
where $h$ is a very small number (e.g., $10^{-5}$).
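Before moving to tensors, a plain-Python sketch of the central difference may help. The helper name `central_difference` and the test function `square` are hypothetical, chosen only for this illustration.

```python
from typing import Callable


def central_difference(f: Callable[[float], float], x: float, h: float = 1e-5) -> float:
    """Approximate f'(x) by evaluating f slightly to the right and left of x."""
    return (f(x + h) - f(x - h)) / (2 * h)


def square(x: float) -> float:
    return x * x


approx = central_difference(square, 3.0)
exact = 2 * 3.0  # derivative of x^2 is 2x
print(approx, exact)
```

The two printed numbers should agree to many decimal places, since the central difference has no truncation error for a quadratic and only rounding error remains.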
Gradient calculation is at the heart of neural network training. In frameworks like PyTorch, gradients are typically computed automatically using something called autograd (automatic differentiation).
But what if you want to check if those gradients are right? Or you just want to build intuition and test your code? Here’s where numerical gradients are invaluable. They are computed by slightly nudging each parameter, one at a time, and observing how much the loss changes. This is reliable but slow: with central differences, every parameter needs two extra evaluations of the loss.
Let’s recap the two approaches:
Let’s put this into practice with PyTorch!
We will:
uv pip install torch matplotlib
import torch
import matplotlib.pyplot as plt
from typing import Tuple
def simple_linear_regression_data(n: int = 10) -> Tuple[torch.Tensor, torch.Tensor]:
    torch.manual_seed(0)
    # y = 2x + 3 + noise
    X = torch.linspace(-1, 1, n).unsqueeze(1)
    y = 2 * X + 3 + 0.1 * torch.randn(n, 1)
    return X, y

def mse_loss(y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
    return ((y_pred - y_true) ** 2).mean()

def model(X: torch.Tensor, w: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return X @ w + b
def numerical_gradient(
    func,
    params: Tuple[torch.Tensor, ...],
    param_idx: int,
    h: float = 1e-5
) -> torch.Tensor:
    grads = torch.zeros_like(params[param_idx])
    it = torch.nditer(params[param_idx], flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        orig = params[param_idx][idx].item()
        params_plus = [p.clone() for p in params]
        params_minus = [p.clone() for p in params]
        params_plus[param_idx][idx] = orig + h
        params_minus[param_idx][idx] = orig - h
        f_plus = func(*params_plus).item()
        f_minus = func(*params_minus).item()
        grad = (f_plus - f_minus) / (2 * h)
        grads[idx] = grad
        it.iternext()
    return grads
# Generate data
X, y = simple_linear_regression_data(20)
# Initialize parameters with grad enabled
w = torch.randn(1, 1, requires_grad=True)
b = torch.randn(1, requires_grad=True)
def loss_fn(w_: torch.Tensor, b_: torch.Tensor) -> torch.Tensor:
    y_pred = model(X, w_, b_)
    return mse_loss(y_pred, y)
# Compute analytical gradients
loss = loss_fn(w, b)
loss.backward() # This fills in w.grad and b.grad
# Compute numerical gradients
num_grad_w = numerical_gradient(loss_fn, (w, b), param_idx=0)
num_grad_b = numerical_gradient(loss_fn, (w, b), param_idx=1)
print("Analytical grad (w):", w.grad)
print("Numerical grad (w): ", num_grad_w)
print("Analytical grad (b):", b.grad)
print("Numerical grad (b): ", num_grad_b)
The two gradients should be very close. If you see a big difference, there’s likely a bug somewhere, either in your code or in the choice of $h$ (if it’s too big or too small, numerical gradients get inaccurate). Because these tensors are float32, expect the two to agree to only a few significant digits.
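PyTorch also ships a built-in checker, `torch.autograd.gradcheck`, which compares autograd against finite differences in much the same spirit. It expects double-precision inputs, so the sketch below builds float64 copies of the data; the names `X64`, `y64`, `loss_fn64`, `w64`, and `b64` are assumptions for this example rather than part of the post's code.

```python
import torch
from torch.autograd import gradcheck

torch.manual_seed(0)

# float64 data, since gradcheck is designed for double precision
X64 = torch.linspace(-1, 1, 20, dtype=torch.float64).unsqueeze(1)
y64 = 2 * X64 + 3 + 0.1 * torch.randn(20, 1, dtype=torch.float64)


def loss_fn64(w: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Same MSE-on-a-linear-model loss, evaluated in float64
    return ((X64 @ w + b - y64) ** 2).mean()


w64 = torch.randn(1, 1, dtype=torch.float64, requires_grad=True)
b64 = torch.randn(1, dtype=torch.float64, requires_grad=True)

# Returns True when analytical and numerical gradients agree within tolerance;
# by default it raises an error otherwise.
ok = gradcheck(loss_fn64, (w64, b64), eps=1e-6, atol=1e-5)
print(ok)
```

Writing the finite-difference loop by hand is still the better way to build intuition; `gradcheck` is more of a convenience once the idea is clear.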
# Quadratic model: y = w2 * x^2 + w1 * x + b
def quadratic_model(X: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return w2 * X ** 2 + w1 * X + b

# New parameters
w1 = torch.randn(1, requires_grad=True)
w2 = torch.randn(1, requires_grad=True)
b = torch.randn(1, requires_grad=True)

def loss_fn_quad(w1_: torch.Tensor, w2_: torch.Tensor, b_: torch.Tensor) -> torch.Tensor:
    y_pred = quadratic_model(X, w1_, w2_, b_)
    return mse_loss(y_pred, y)
loss_quad = loss_fn_quad(w1, w2, b)
loss_quad.backward()
num_grad_w1 = numerical_gradient(loss_fn_quad, (w1, w2, b), param_idx=0)
num_grad_w2 = numerical_gradient(loss_fn_quad, (w1, w2, b), param_idx=1)
num_grad_b_quad = numerical_gradient(loss_fn_quad, (w1, w2, b), param_idx=2)
print("Analytical grad (w1):", w1.grad)
print("Numerical grad (w1):", num_grad_w1)
print("Analytical grad (w2):", w2.grad)
print("Numerical grad (w2):", num_grad_w2)
print("Analytical grad (b):", b.grad)
print("Numerical grad (b):", num_grad_b_quad)
import numpy as np
b_fixed = b.detach().item()
ws = np.linspace(-4, 4, 100)
losses = []
for w_test in ws:
    w_tensor = torch.tensor([[w_test]])
    l = loss_fn(w_tensor, torch.tensor([b_fixed]))
    losses.append(l.item())
plt.figure()
plt.plot(ws, losses, label='Loss vs w')
plt.xlabel('w')
plt.ylabel('Loss')
plt.title('Loss as a function of w')
plt.legend()
plt.show()
h_list = [1e-3, 1e-5, 1e-7]
for h in h_list:
    num_grad_w = numerical_gradient(loss_fn, (w, b), param_idx=0, h=h)
    print(f"Numerical grad (w) with h={h:.0e}:", num_grad_w)
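To see why both a large and a tiny step size hurt, here is a small plain-Python sketch (float64) that measures the error of the central difference for a function whose derivative is known exactly. The choice of `math.sin` at `x = 1.0` and the list of step sizes are assumptions for illustration.

```python
import math

x = 1.0
exact = math.cos(x)  # d/dx sin(x) = cos(x)

errors = {}
for h in [1e-1, 1e-3, 1e-5, 1e-7, 1e-9, 1e-11]:
    approx = (math.sin(x + h) - math.sin(x - h)) / (2 * h)
    errors[h] = abs(approx - exact)
    print(f"h={h:.0e}  abs error={errors[h]:.3e}")
```

For large `h` the error is dominated by truncation (the curve is not a straight line over the step), and for very small `h` it is dominated by floating-point rounding in the subtraction. In float32, which PyTorch uses by default, rounding error sets in at much larger step sizes than in this float64 example.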
In this post, we explored the two major ways to compute gradients for loss functions: analytical gradients (using calculus or autograd in PyTorch) and numerical gradients (using the finite difference method).
Key Takeaways:
Next up: We’ll summarize Module 1 and set up for nonlinear models and neural networks. See you in the next lesson!
USER:
this line has an error:
it = torch.nditer(params[param_idx], flags=['multi_index'], op_flags=['readwrite'])
it says nditer is not available in torch. can i add this somehow? do i need another dependency?
ASSISTANT:
Great question! Actually, torch.nditer does not exist in PyTorch; nditer is a NumPy utility. In PyTorch, you can simply loop over the tensor indices in plain Python.
Here’s how you can fix the numerical_gradient function:
import numpy as np

def numerical_gradient(
    func,
    params: Tuple[torch.Tensor, ...],
    param_idx: int,
    h: float = 1e-5
) -> torch.Tensor:
    grads = torch.zeros_like(params[param_idx])
    param_shape = params[param_idx].shape
    # Iterate over each index (works for any shape)
    for idx in np.ndindex(*param_shape):
        orig = params[param_idx][idx].item()
        # Make copies for plus/minus h changes
        params_plus = [p.clone() for p in params]
        params_minus = [p.clone() for p in params]
        params_plus[param_idx][idx] = orig + h
        params_minus[param_idx][idx] = orig - h
        f_plus = func(*params_plus).item()
        f_minus = func(*params_minus).item()
        grad = (f_plus - f_minus) / (2 * h)
        grads[idx] = grad
    return grads
Note: ndindex is a NumPy function (np.ndindex); PyTorch does not provide a torch.ndindex. If you would rather not use NumPy here, you can use Python’s itertools.product:
import itertools
def numerical_gradient(
    func,
    params: Tuple[torch.Tensor, ...],
    param_idx: int,
    h: float = 1e-5
) -> torch.Tensor:
    grads = torch.zeros_like(params[param_idx])
    param_shape = params[param_idx].shape
    for idx in itertools.product(*(range(s) for s in param_shape)):
        orig = params[param_idx][idx].item()
        params_plus = [p.clone() for p in params]
        params_minus = [p.clone() for p in params]
        params_plus[param_idx][idx] = orig + h
        params_minus[param_idx][idx] = orig - h
        f_plus = func(*params_plus).item()
        f_minus = func(*params_minus).item()
        grad = (f_plus - f_minus) / (2 * h)
        grads[idx] = grad
    return grads
No extra dependencies needed—just replace the inner loop as above!
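As a quick sanity check that index-by-index looping handles parameters of any shape, here is a self-contained sketch that runs an itertools-based finite-difference loop on a 2×3 tensor. The test function `sum_of_squares` and the float64 input are assumptions for this illustration, and this variant uses `detach().clone()` so the perturbed copies are plain tensors with no autograd history.

```python
import itertools
from typing import Callable, Tuple

import torch


def numerical_gradient(
    func: Callable[..., torch.Tensor],
    params: Tuple[torch.Tensor, ...],
    param_idx: int,
    h: float = 1e-5,
) -> torch.Tensor:
    grads = torch.zeros_like(params[param_idx])
    shape = params[param_idx].shape
    # Visit every index of the chosen parameter, e.g. (0, 0), (0, 1), ...
    for idx in itertools.product(*(range(s) for s in shape)):
        params_plus = [p.detach().clone() for p in params]
        params_minus = [p.detach().clone() for p in params]
        params_plus[param_idx][idx] += h
        params_minus[param_idx][idx] -= h
        f_plus = func(*params_plus).item()
        f_minus = func(*params_minus).item()
        grads[idx] = (f_plus - f_minus) / (2 * h)
    return grads


def sum_of_squares(W: torch.Tensor) -> torch.Tensor:
    # Known gradient: d/dW sum(W^2) = 2W
    return (W ** 2).sum()


W = torch.arange(6, dtype=torch.float64).reshape(2, 3)
num_grad = numerical_gradient(sum_of_squares, (W,), param_idx=0)
matches = torch.allclose(num_grad, 2 * W, atol=1e-6)
print(matches)
```

Because the function is called twice per element, the cost grows linearly with the number of parameters, which is why this approach is used for checking rather than training.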
Summary:
Don’t reach for NumPy-style iterators on the torch namespace (torch.nditer does not exist).
Loop over the indices with itertools.product (or NumPy’s np.ndindex).
Let me know if you want the full blog post updated with this correction!
USER:
in the second example. i’m getting an error. here is my code:
import torch
import matplotlib.pyplot as plt
from typing import Tuple
import itertools
def simple_linear_regression_data(n: int = 10) -> Tuple[torch.Tensor, torch.Tensor]:
    torch.manual_seed(0)
    # y = 2x + 3 + noise
    X = torch.linspace(-1, 1, n).unsqueeze(1)
    y = 2 * X + 3 + 0.1 * torch.randn(n, 1)
    return X, y

def mse_loss(y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
    return ((y_pred - y_true) ** 2).mean()

def model(X: torch.Tensor, w: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return X @ w + b

def numerical_gradient(
    func, params: Tuple[torch.Tensor, ...], param_idx: int, h: float = 1e-5
) -> torch.Tensor:
    grads = torch.zeros_like(params[param_idx])
    param_shape = params[param_idx].shape
    for idx in itertools.product(*(range(s) for s in param_shape)):
        orig = params[param_idx][idx].item()
        params_plus = [p.clone() for p in params]
        params_minus = [p.clone() for p in params]
        params_plus[param_idx][idx] = orig + h
        params_minus[param_idx][idx] = orig - h
        f_plus = func(*params_plus).item()
        f_minus = func(*params_minus).item()
        grad = (f_plus - f_minus) / (2 * h)
        grads[idx] = grad
    return grads
# Generate data
X, y = simple_linear_regression_data(20)
def loss_fn(w_: torch.Tensor, b_: torch.Tensor) -> torch.Tensor:
    y_pred = model(X, w_, b_)
    return mse_loss(y_pred, y)

# Quadratic model: y = w2 * x^2 + w1 * x + b
def quadratic_model(
    X: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor, b: torch.Tensor
) -> torch.Tensor:
    return w2 * X**2 + w1 * X + b
# New parameters
w1 = torch.randn(1, requires_grad=True)
w2 = torch.randn(1, requires_grad=True)
b = torch.randn(1, requires_grad=True)
import numpy as np
b_fixed = b.detach().item()
ws = np.linspace(-4, 4, 100)
losses = []
for w_test in ws:
    w_tensor = torch.tensor([[w_test]])
    l = loss_fn(w_tensor, torch.tensor([b_fixed]))
    losses.append(l.item())
plt.figure()
plt.plot(ws, losses, label='Loss vs w')
plt.xlabel('w')
plt.ylabel('Loss')
plt.title('Loss as a function of w')
plt.legend()
plt.show()
here is the error:
Traceback (most recent call last):
File "/Users/ryan/dev/artintellica/py/blog-0100-training-loop-15/ex_2.py", line 89, in <module>
l = loss_fn(w_tensor, torch.tensor([b_fixed]))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ryan/dev/artintellica/py/blog-0100-training-loop-15/ex_2.py", line 45, in loss_fn
y_pred = model(X, w_, b_)
^^^^^^^^^^^^^^^^
File "/Users/ryan/dev/artintellica/py/blog-0100-training-loop-15/ex_2.py", line 20, in model
return X @ w + b
~~^~~
RuntimeError: expected m1 and m2 to have the same dtype, but got: float != double
how do i fix this?
ASSISTANT:
Great job reproducing and reporting the error!
This is a common issue in PyTorch and NumPy interoperation: tensor dtype (floating point precision) mismatches.
Let’s break down the error:
RuntimeError: expected m1 and m2 to have the same dtype, but got: float != double
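PyTorch tensors default to float32 (sometimes just called float in code and error messages).
When you iterate over ws (created with np.linspace), each w_test is a NumPy float64 scalar (double).
w_tensor = torch.tensor([[w_test]]) therefore creates a tensor with dtype torch.float64 (double). A plain Python float would instead have been converted to the default float32.
X is float32, so you get a type mismatch during the matrix multiplication (@).
Be explicit about your dtypes.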
The simplest fix:
When you create new tensors from Python floats/NumPy numbers, set dtype=torch.float32 (or dtype=X.dtype).
Update the loop as follows:
# This ensures the type matches your model/X tensors (float32)
for w_test in ws:
    w_tensor = torch.tensor([[w_test]], dtype=X.dtype)
    l = loss_fn(w_tensor, torch.tensor([b_fixed], dtype=X.dtype))
    losses.append(l.item())
If you want, you can also ensure X and y are float64 (less common in ML settings).
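To see where the float64 comes from, a small check like the one below can help. It assumes PyTorch's default dtype has not been changed from float32; the variable names are made up for this illustration.

```python
import numpy as np
import torch

ws = np.linspace(-4, 4, 5)  # NumPy array of float64 values

# A NumPy float64 scalar keeps its precision when wrapped in a tensor
from_numpy_scalar = torch.tensor([[ws[0]]])

# A plain Python float is converted to PyTorch's default dtype
from_python_float = torch.tensor([[float(ws[0])]])

# Passing dtype explicitly removes the ambiguity either way
explicit = torch.tensor([[ws[0]]], dtype=torch.float32)

print(from_numpy_scalar.dtype)
print(from_python_float.dtype)
print(explicit.dtype)
```

This is why the bias tensor built from b_fixed (a Python float) did not trigger the error on its own, while the weight tensor built from the NumPy value did.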
Recap of key lines:
w_tensor = torch.tensor([[w_test]], dtype=X.dtype)
l = loss_fn(w_tensor, torch.tensor([b_fixed], dtype=X.dtype))
Now your code should work!
Keep an eye on dtype throughout your code, especially when mixing NumPy, Python numbers, and PyTorch.
Let me know if you’d like a complete and revised code block or further troubleshooting!
USER: