Learn the Training Loop with PyTorch, Part 1.5: Numerical vs. Analytical Gradients

2025-06-16 · Artintellica

USER:

i need your help to write the next blog post in a series i am producing called “Learn the Training Loop with PyTorch”.

first, let me give you the outline of the series:


Module 1: The Elementary Training Loop

Goal: Build foundational intuition by hand-coding a simple training loop from scratch.

1. Introduction: What is a Training Loop?

  • Definition and importance
  • Loss, optimization, and learning in ML

2. The Simplest Case: Linear Regression

  • What is linear regression?
  • The mean squared error loss
  • Deriving the update rules manually

3. Batch vs. Stochastic Gradient Descent

  • Differences and practical implications
  • Hands-on: See both in Python/numpy

4. Visualizing the Loss Landscape

  • Plotting loss as a function of parameters
  • Demo: Watching convergence by hand

5. Numerical vs. Analytical Gradients

  • How are gradients computed?
  • Manual calculation and confirmation with numerical gradient

6. Recap and Key Takeaways


Module 2: The Training Loop for Neural Networks

Goal: Extend the basic loop to more powerful models; introduce neural network concepts.

1. From Linear to Nonlinear: Why Neural Networks?

  • Multi-layer perceptrons (MLPs)
  • Activation functions (ReLU, sigmoid, tanh)

2. Forward and Backward Passes

  • The chain rule of differentiation
  • Backpropagation intuition (no need for full rigor yet)

3. Implementing a Simple Neural Net from Scratch

  • Single hidden layer network in numpy/Python
  • Coding forward, loss, backward, and parameter updates

4. The Role of Activations

  • Intuitive explanation and code examples
  • Effect on learning and expressiveness

5. Mini-batching and Data Pipelines

  • How batching speeds up and stabilizes learning

6. Regularization and Overfitting

  • Techniques: L2, dropout (conceptual intro)

7. Recap: Comparing Our Simple Network with Linear Regression


Module 3: Advanced Training Loops and Modern Tricks

Goal: Connect foundational understanding to the large-scale models used in industry.

1. Optimization Algorithms Beyond SGD

  • Adam, RMSProp, etc.: How they work, why they’re useful

2. Learning Rate Scheduling

  • How/why learning rates are changed during training
  • Demo: Effect of learning rate on convergence and results

3. Weight Initialization

  • Why it matters, common strategies

4. Deeper Networks and Backprop Challenges

  • Vanishing/exploding gradients
  • Solutions: normalization, skip connections

5. Large-Scale Training: Data Parallelism and Hardware

  • How the same training loop runs efficiently on GPUs and across many machines

6. Monitoring and Debugging the Training Loop

  • Visualizing loss/accuracy, diagnosing common problems

7. Modern Regularization and Generalization Techniques

  • Early stopping, batch/layer normalization, data augmentation (conceptual intro)

8. The Training Loop in Practice: Case Studies

  • High-level overview of how OpenAI/Google/Meta might train LLMs: what’s the same, what’s different.
  • What “scaling up” really means: dataset, parameters, hardware, tricks.

9. Conclusion: What’s Next After the Training Loop?

  • Discussion: fine-tuning, transfer learning, and unsupervised/self-supervised methods.

General Tips for Each Module:

  • Start simple, build incrementally: Keep code simple at first, then show how to layer complexity.
  • Accompany every lesson with practical code examples.
  • Focus on intuition: Use diagrams, plots, and hands-on demo to make concepts tangible.
  • Summarize and review regularly: Add checklists or quizzes at the end of each module.

great. now, for each blog post, i am hoping to follow the following structure:

  • the course is called “Learn the Training Loop with PyTorch”
  • always provide a brief introduction to the topic at hand
  • then always provide a very simple “ELI5” explanation of the topic, that theoretically anyone could understand
  • then provide the mathematical foundations. for math, always use dollar delimited LaTeX, and always provide a brief explanation of the math. for instance, $\mathbf{y} = \mathbf{X}\mathbf{w} + \mathbf{b}$ is the equation for linear regression, where $\mathbf{y}$ is the output vector, $\mathbf{X}$ is the input matrix, $\mathbf{w}$ is the weight vector, and $\mathbf{b}$ is the bias vector.
  • for block-level math, always use two dollar signs, like this: $$\mathbf{y} = \mathbf{X}\mathbf{w} + \mathbf{b}$$
  • after the mathematical foundations, such as definitions, then always provide an in-depth explanation, covering the topic in detail, and expanding on the explanation of the math if necessary.
  • for other formatting, use markdown like headings, bullet points, and code blocks to make the content readable. do not use HTML tags.
  • after the math is complete, then provide one or more demonstrations of the topic using PyTorch. if possible, always use PyTorch, not numpy, just to keep it simple and consistent. you can also use matplotlib to visualize the results.
  • for all python code, always use type hints, and, if necessary, describe installation instructions using “uv” (not pip).
  • then, create a series of exercises, somewhere from 1 - 4 total exercises that cover the material. please provide the source code for each exercise to make sure the student can easily compare their solution to the correct answer. do not hide the exercises and do not put HTML in the exercises.
  • finally, provide a summary of the topic, and a list of key takeaways.

that’s it, now can you please write blog post 1.5? the description is:

5. Numerical vs. Analytical Gradients

  • How are gradients computed?
  • Manual calculation and confirmation with numerical gradient

please respond with the full blog post only, and no other information.

ASSISTANT:

Learn the Training Loop with PyTorch

Module 1.5: Numerical vs. Analytical Gradients

Introduction

In the previous lessons, we learned what gradients are and why they’re essential for training models using methods like gradient descent. But how are these gradients actually calculated? In this lesson, we’ll demystify two main ways to compute gradients: analytically (using calculus rules) and numerically (using finite differences). Understanding both is extremely important—manual checks with numerical gradients can catch bugs and help you build confidence in your implementation.


ELI5: What Are Gradients and How Can We Compute Them?

Imagine you’re on a hike and want to descend a hill as quickly as possible. The gradient at your feet tells you which direction is the steepest way down. But you don’t have a map!

  • Analytical gradients: This is like being given an exact formula for the hill so you always know which way is steepest.
  • Numerical gradients: This is like moving a tiny bit in each direction, measuring if you’ve gone up or down, and deducing which way is down.

Analytical is fast and exact (if you do the math right), while numerical is simple and hard to get wrong, but slower and only approximate. Checking analytical gradients with numerical ones is like double-checking your directions on the trail using both a map and a compass.


Mathematical Foundations

Analytical Gradients

Suppose you have a function $f(x)$ and you want to know how $f$ changes when $x$ changes. The gradient is simply the derivative: $\frac{df}{dx}$.

Example: For $f(x) = x^2$,

$$\frac{df}{dx} = 2x$$

In machine learning, our “function” is usually the loss as a function of parameters (weights), and we want $\nabla_{\mathbf{w}} L(\mathbf{w})$, the gradient of the loss with respect to the weights.

  • Analytical gradients are calculated using calculus—chain rule, product rule, etc.
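
As a worked instance of these rules, consider the one-feature linear regression used in this lesson's demonstration, with prediction $w x_i + b$ and mean squared error over $n$ data points. The symbols $x_i$, $y_i$, and $n$ are notation introduced here for illustration. Applying the chain rule to each squared term gives:

$$
L(w, b) = \frac{1}{n} \sum_{i=1}^{n} (w x_i + b - y_i)^2
$$

$$
\frac{\partial L}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} (w x_i + b - y_i)\, x_i,
\qquad
\frac{\partial L}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} (w x_i + b - y_i)
$$

The factor of 2 comes from differentiating the square, and the extra $x_i$ in the first formula is the derivative of $w x_i + b$ with respect to $w$.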

Numerical Gradients

Numerical gradients use finite differences: we make a tiny change and see how the output changes.

For a function $f(x)$, the numerical gradient at $x$ is approximated as:

$$\frac{df}{dx} \approx \frac{f(x + h) - f(x - h)}{2h}$$

where $h$ is a very small number (e.g., $10^{-5}$).

  • This “symmetric difference quotient” gives a good estimate of the slope of $f$ at $x$.
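
To make the formula concrete, here is a minimal plain-Python sketch that applies it to $f(x) = x^2$ and compares the result with the analytical derivative $2x$. The helper names `central_difference` and `square` are made up for this illustration.

```python
from typing import Callable


def central_difference(f: Callable[[float], float], x: float, h: float = 1e-5) -> float:
    """Estimate df/dx at x with the symmetric difference quotient."""
    return (f(x + h) - f(x - h)) / (2 * h)


def square(x: float) -> float:
    return x ** 2


x0 = 3.0
numerical = central_difference(square, x0)
analytical = 2 * x0  # from df/dx = 2x

print(f"numerical:  {numerical:.8f}")
print(f"analytical: {analytical:.8f}")
```

The two printed values should agree to many decimal places; any remaining difference comes from floating-point rounding.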

Why Compare Numerically and Analytically?

  • Numerical gradients give you a way to check if your analytical/automatic gradients (e.g., via PyTorch) are correct.
  • Useful for bug-hunting and for building trust in your implementation.
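
PyTorch also ships a helper for this kind of comparison, `torch.autograd.gradcheck`, which compares autograd's gradients against finite differences. The sketch below is one possible way to use it on a small regression loss. The data, the function name `regression_loss`, and the tolerances are assumptions chosen for illustration, and the inputs are float64 because `gradcheck` expects double precision.

```python
import torch

torch.manual_seed(0)

# Small synthetic data set in float64 (gradcheck expects double precision).
X = torch.linspace(-1, 1, 20, dtype=torch.float64).unsqueeze(1)
y = 2 * X + 3


def regression_loss(w: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Mean squared error of a one-feature linear model."""
    return ((X @ w + b - y) ** 2).mean()


w = torch.randn(1, 1, dtype=torch.float64, requires_grad=True)
b = torch.randn(1, dtype=torch.float64, requires_grad=True)

# Returns True when autograd and finite differences agree within tolerance;
# by default it raises an error describing the mismatch otherwise.
ok = torch.autograd.gradcheck(regression_loss, (w, b), eps=1e-6, atol=1e-5, rtol=1e-3)
print("gradcheck passed:", ok)
```

Writing the check by hand, as this lesson does, is still worthwhile because it shows exactly what such a helper is doing.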

In-Depth Explanation

Gradient calculation is at the heart of neural network training. In frameworks like PyTorch, gradients are typically computed automatically using something called autograd (automatic differentiation).
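
As a tiny illustration of autograd, the following sketch differentiates $f(x) = x^2$ at $x = 3$, where the analytical answer is $2x = 6$. The variable names are arbitrary.

```python
import torch

# requires_grad=True tells autograd to track operations on x.
x = torch.tensor(3.0, requires_grad=True)
f = x ** 2

# backward() computes df/dx and stores it in x.grad.
f.backward()

print(x.grad)  # tensor(6.)
```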

But what if you want to check if those gradients are right? Or you just want to build intuition and test your code? Here’s where numerical gradients are invaluable. They are computed by slightly nudging each parameter, one at a time, and observing how much the loss changes. This is reliable—but slow—since it requires re-evaluating the function many times.

Let’s recap the two approaches:

  • Analytical (Automatic) Gradients: Computed exactly using calculus, typically by backpropagation and the chain rule. Libraries like PyTorch do this for you in a single backward pass.
  • Numerical Gradients: Estimated by re-running the model with slightly perturbed parameters. Each parameter needs two extra loss evaluations, so this is much slower, but the method is simple enough that it is hard to get wrong if you’re careful.

The Recipe:

  1. Pick a tiny value for $h$ (like $10^{-5}$).
  2. For each parameter:
    1. Slightly increase by $h$: $x_i + h$.
    2. Slightly decrease by $h$: $x_i - h$.
    3. Compute loss at both locations.
    4. Use the formula to estimate the derivative.
  3. Compare to the gradient calculated analytically (by PyTorch).

Demonstration: Gradient Checking for Linear Regression in PyTorch

Let’s put this into practice with PyTorch!

Setup

We will:

  • Define a simple 1D linear regression model.
  • Compute the loss and its gradient (analytically, via PyTorch autograd).
  • Compute the loss and gradient numerically for the weights.
  • Compare the two.

Install Dependencies

uv pip install torch matplotlib

Code: Analytical vs. Numerical Gradients

import torch
import matplotlib.pyplot as plt
from typing import Tuple

def simple_linear_regression_data(n: int = 10) -> Tuple[torch.Tensor, torch.Tensor]:
    torch.manual_seed(0)
    # y = 2x + 3 + noise
    X = torch.linspace(-1, 1, n).unsqueeze(1)
    y = 2 * X + 3 + 0.1 * torch.randn(n, 1)
    return X, y

def mse_loss(y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
    return ((y_pred - y_true) ** 2).mean()

def model(X: torch.Tensor, w: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return X @ w + b

def numerical_gradient(
    func, 
    params: Tuple[torch.Tensor, ...], 
    param_idx: int, 
    h: float = 1e-5
) -> torch.Tensor:
    grads = torch.zeros_like(params[param_idx])
    it = torch.nditer(params[param_idx], flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        orig = params[param_idx][idx].item()

        params_plus = [p.clone() for p in params]
        params_minus = [p.clone() for p in params]
        params_plus[param_idx][idx] = orig + h
        params_minus[param_idx][idx] = orig - h
        f_plus = func(*params_plus).item()
        f_minus = func(*params_minus).item()
        grad = (f_plus - f_minus) / (2 * h)
        grads[idx] = grad

        it.iternext()
    return grads

# Generate data
X, y = simple_linear_regression_data(20)
# Initialize parameters with grad enabled
w = torch.randn(1, 1, requires_grad=True)
b = torch.randn(1, requires_grad=True)

def loss_fn(w_: torch.Tensor, b_: torch.Tensor) -> torch.Tensor:
    y_pred = model(X, w_, b_)
    return mse_loss(y_pred, y)

# Compute analytical gradients
loss = loss_fn(w, b)
loss.backward()  # This fills in w.grad and b.grad

# Compute numerical gradients
num_grad_w = numerical_gradient(loss_fn, (w, b), param_idx=0)
num_grad_b = numerical_gradient(loss_fn, (w, b), param_idx=1)

print("Analytical grad (w):", w.grad)
print("Numerical grad (w): ", num_grad_w)
print("Analytical grad (b):", b.grad)
print("Numerical grad (b): ", num_grad_b)

Analysis: Are They Close?

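The two gradients should be close. If you see a big difference, there’s likely a bug somewhere, either in your code or in the choice of $h$ (if it’s too big or too small, numerical gradients get inaccurate). Because the tensors in this demo are float32, rounding error is noticeable at $h = 10^{-5}$, so expect agreement to only a couple of significant digits; float64 tensors give a much tighter match.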
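
One way to turn "close" into a number is a relative error. The sketch below is a self-contained float64 variant of the demo that compares three things: autograd, a hand-derived formula for the MSE gradient, and a central difference. The helpers `central_difference` and `relative_error`, and the choice of float64, are assumptions made for this illustration rather than part of the original demo.

```python
import itertools
from typing import Callable, Sequence

import torch

torch.manual_seed(0)
dtype = torch.float64  # assumption: double precision for a tighter comparison

X = torch.linspace(-1, 1, 20, dtype=dtype).unsqueeze(1)
y = 2 * X + 3 + 0.1 * torch.randn(20, 1, dtype=dtype)
w = torch.randn(1, 1, dtype=dtype, requires_grad=True)
b = torch.randn(1, dtype=dtype, requires_grad=True)


def loss_fn(w_: torch.Tensor, b_: torch.Tensor) -> torch.Tensor:
    return ((X @ w_ + b_ - y) ** 2).mean()


def central_difference(
    func: Callable[..., torch.Tensor],
    params: Sequence[torch.Tensor],
    param_idx: int,
    h: float = 1e-5,
) -> torch.Tensor:
    """Finite-difference gradient of func with respect to params[param_idx]."""
    base = [p.detach().clone() for p in params]
    grads = torch.zeros_like(base[param_idx])
    for idx in itertools.product(*(range(s) for s in base[param_idx].shape)):
        plus = [p.clone() for p in base]
        minus = [p.clone() for p in base]
        plus[param_idx][idx] += h
        minus[param_idx][idx] -= h
        grads[idx] = (func(*plus) - func(*minus)).item() / (2 * h)
    return grads


def relative_error(a: torch.Tensor, n: torch.Tensor) -> float:
    """Largest absolute difference, scaled by the size of the gradients."""
    return ((a - n).abs().max() / (a.abs().max() + n.abs().max() + 1e-12)).item()


# Autograd gradients.
loss_fn(w, b).backward()

# Hand-derived gradients: dL/dw = mean(2 * residual * x), dL/db = mean(2 * residual).
residual = (X @ w + b - y).detach()
manual_w = (2 * residual * X).mean().reshape(1, 1)
manual_b = (2 * residual).mean().reshape(1)

# Numerical gradients.
num_w = central_difference(loss_fn, (w, b), 0)
num_b = central_difference(loss_fn, (w, b), 1)

print("rel. error w (autograd vs numerical):", relative_error(w.grad, num_w))
print("rel. error b (autograd vs numerical):", relative_error(b.grad, num_b))
print("manual matches autograd:",
      torch.allclose(w.grad, manual_w), torch.allclose(b.grad, manual_b))
```

A common rule of thumb is that a relative error far below 1e-4 suggests the gradients agree, though the right threshold depends on the precision and on the function being checked.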


Exercises

1. Change the Model to Quadratic and Compare Gradients

  • Modify the model to $y = w_2 x^2 + w_1 x + b$. Compute and compare gradients for $w_1$, $w_2$, and $b$.

# Quadratic model: y = w2 * x^2 + w1 * x + b

def quadratic_model(X: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return w2 * X ** 2 + w1 * X + b

# New parameters
w1 = torch.randn(1, requires_grad=True)
w2 = torch.randn(1, requires_grad=True)
b = torch.randn(1, requires_grad=True)

def loss_fn_quad(w1_: torch.Tensor, w2_: torch.Tensor, b_: torch.Tensor) -> torch.Tensor:
    y_pred = quadratic_model(X, w1_, w2_, b_)
    return mse_loss(y_pred, y)

loss_quad = loss_fn_quad(w1, w2, b)
loss_quad.backward()

num_grad_w1 = numerical_gradient(loss_fn_quad, (w1, w2, b), param_idx=0)
num_grad_w2 = numerical_gradient(loss_fn_quad, (w1, w2, b), param_idx=1)
num_grad_b_quad = numerical_gradient(loss_fn_quad, (w1, w2, b), param_idx=2)

print("Analytical grad (w1):", w1.grad)
print("Numerical grad (w1):", num_grad_w1)
print("Analytical grad (w2):", w2.grad)
print("Numerical grad (w2):", num_grad_w2)
print("Analytical grad (b):", b.grad)
print("Numerical grad (b):", num_grad_b_quad)

2. Visualize the Loss as a Function of the Weight

  • Pick a fixed $b$, sweep $w$ from -4 to 4, and plot the loss.

import numpy as np

b_fixed = b.detach().item()
ws = np.linspace(-4, 4, 100)
losses = []
for w_test in ws:
    w_tensor = torch.tensor([[w_test]])
    l = loss_fn(w_tensor, torch.tensor([b_fixed]))
    losses.append(l.item())

plt.figure()
plt.plot(ws, losses, label='Loss vs w')
plt.xlabel('w')
plt.ylabel('Loss')
plt.title('Loss as a function of w')
plt.legend()
plt.show()

3. Try Different $h$ Values for Numerical Gradient

  • Try $h = 10^{-3}$ and $h = 10^{-7}$ and observe how the numerical gradient changes. What happens if $h$ is too big or too small?

h_list = [1e-3, 1e-5, 1e-7]
for h in h_list:
    num_grad_w = numerical_gradient(loss_fn, (w, b), param_idx=0, h=h)
    print(f"Numerical grad (w) with h={h:.0e}:", num_grad_w)
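
One caveat for this exercise: the regression loss is quadratic in $w$, and for a quadratic the symmetric difference quotient has no truncation error, so a large $h$ may not visibly hurt here. To see both failure modes, the plain-Python sketch below uses $f(x) = \sin(x)$ instead, whose exact derivative is $\cos(x)$. The function choice, the evaluation point, and the list of $h$ values are assumptions made for illustration.

```python
import math
from typing import Callable, Dict


def central_difference(f: Callable[[float], float], x: float, h: float) -> float:
    return (f(x + h) - f(x - h)) / (2 * h)


x0 = 1.0
exact = math.cos(x0)  # analytical derivative of sin at x0

errors: Dict[float, float] = {}
for h in (1e-1, 1e-3, 1e-5, 1e-8, 1e-13):
    estimate = central_difference(math.sin, x0, h)
    errors[h] = abs(estimate - exact)
    print(f"h={h:.0e}  error={errors[h]:.3e}")
```

With a large $h$ the error is dominated by the curvature of the function (truncation error), and with a very small $h$ it is typically dominated by floating-point rounding, so the best accuracy sits somewhere in between.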

Summary

In this post, we explored the two major ways to compute gradients for loss functions: analytical gradients (using calculus or autograd in PyTorch) and numerical gradients (using the finite difference method).

  • Analytical gradients are fast and exact (as long as your math or library is correct).
  • Numerical gradients are slower, but provide an invaluable way to check your work, especially in debugging or during implementation from scratch.

Key Takeaways:

  • Gradients are central to training—computing them correctly is crucial!
  • Analytical (autograd) and numerical gradients should match closely for well-behaved functions.
  • Numerical gradient checking can catch subtle bugs in your code.
  • For modern machine learning, autograd is what computes gradients during training; numerical gradient checking is a debugging and learning tool, and a valuable one when experimenting.

Next up: We’ll summarize Module 1 and set up for nonlinear models and neural networks. See you in the next lesson!

USER:

this line has an error:

    it = torch.nditer(params[param_idx], flags=['multi_index'], op_flags=['readwrite'])

it says nditer is not available in torch. can i add this somehow? do i need another dependency?

ASSISTANT:

Great question! Actually, torch.nditer does not exist in PyTorch—nditer is a NumPy utility. In PyTorch, you simply need to loop over tensor indices in a regular Pythonic way.

Here’s how you can fix the numerical_gradient function:

import numpy as np

def numerical_gradient(
    func,
    params: Tuple[torch.Tensor, ...],
    param_idx: int,
    h: float = 1e-5
) -> torch.Tensor:
    grads = torch.zeros_like(params[param_idx])
    param_shape = params[param_idx].shape
    # Iterate over each index (works for any shape); np.ndindex is a NumPy function
    for idx in np.ndindex(*param_shape):
        orig = params[param_idx][idx].item()
        # Make copies for plus/minus h changes
        params_plus = [p.clone() for p in params]
        params_minus = [p.clone() for p in params]
        params_plus[param_idx][idx] = orig + h
        params_minus[param_idx][idx] = orig - h
        f_plus = func(*params_plus).item()
        f_minus = func(*params_minus).item()
        grad = (f_plus - f_minus) / (2 * h)
        grads[idx] = grad
    return grads

Note: PyTorch has no ndindex function of its own; np.ndindex comes from NumPy, which you may need to install separately. If you would rather not rely on NumPy, Python’s itertools.product from the standard library does the same job:

import itertools

def numerical_gradient(
    func,
    params: Tuple[torch.Tensor, ...],
    param_idx: int,
    h: float = 1e-5
) -> torch.Tensor:
    grads = torch.zeros_like(params[param_idx])
    param_shape = params[param_idx].shape
    for idx in itertools.product(*(range(s) for s in param_shape)):
        orig = params[param_idx][idx].item()
        params_plus = [p.clone() for p in params]
        params_minus = [p.clone() for p in params]
        params_plus[param_idx][idx] = orig + h
        params_minus[param_idx][idx] = orig - h
        f_plus = func(*params_plus).item()
        f_minus = func(*params_minus).item()
        grad = (f_plus - f_minus) / (2 * h)
        grads[idx] = grad
    return grads

The itertools version needs no extra dependencies. Just replace the loop as shown above!


Summary:

  • You do not need another dependency.
  • Just change the loop code as shown (no torch.nditer).
  • Use itertools.product from the standard library to enumerate the indices (PyTorch has no torch.ndindex).
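
To see what the index loop produces, here is a small plain-Python sketch that enumerates every index of a given shape with `itertools.product`. The helper name `iter_indices` is made up for this illustration.

```python
import itertools
from typing import Iterator, Tuple


def iter_indices(shape: Tuple[int, ...]) -> Iterator[Tuple[int, ...]]:
    """Yield every index tuple of an array with the given shape."""
    return itertools.product(*(range(s) for s in shape))


indices = list(iter_indices((2, 3)))
print(indices)
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```

Each tuple can be used directly to index a tensor, which is how the loop in `numerical_gradient` visits one parameter entry at a time.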

Let me know if you want the full blog post updated with this correction!

USER:

in the second example. i’m getting an error. here is my code:

import torch
import matplotlib.pyplot as plt
from typing import Tuple
import itertools


def simple_linear_regression_data(n: int = 10) -> Tuple[torch.Tensor, torch.Tensor]:
    torch.manual_seed(0)
    # y = 2x + 3 + noise
    X = torch.linspace(-1, 1, n).unsqueeze(1)
    y = 2 * X + 3 + 0.1 * torch.randn(n, 1)
    return X, y


def mse_loss(y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
    return ((y_pred - y_true) ** 2).mean()


def model(X: torch.Tensor, w: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return X @ w + b


def numerical_gradient(
    func, params: Tuple[torch.Tensor, ...], param_idx: int, h: float = 1e-5
) -> torch.Tensor:
    grads = torch.zeros_like(params[param_idx])
    param_shape = params[param_idx].shape
    for idx in itertools.product(*(range(s) for s in param_shape)):
        orig = params[param_idx][idx].item()
        params_plus = [p.clone() for p in params]
        params_minus = [p.clone() for p in params]
        params_plus[param_idx][idx] = orig + h
        params_minus[param_idx][idx] = orig - h
        f_plus = func(*params_plus).item()
        f_minus = func(*params_minus).item()
        grad = (f_plus - f_minus) / (2 * h)
        grads[idx] = grad
    return grads


# Generate data
X, y = simple_linear_regression_data(20)

def loss_fn(w_: torch.Tensor, b_: torch.Tensor) -> torch.Tensor:
    y_pred = model(X, w_, b_)
    return mse_loss(y_pred, y)

# Quadratic model: y = w2 * x^2 + w1 * x + b
def quadratic_model(
    X: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor, b: torch.Tensor
) -> torch.Tensor:
    return w2 * X**2 + w1 * X + b


# New parameters
w1 = torch.randn(1, requires_grad=True)
w2 = torch.randn(1, requires_grad=True)
b = torch.randn(1, requires_grad=True)

import numpy as np

b_fixed = b.detach().item()
ws = np.linspace(-4, 4, 100)
losses = []
for w_test in ws:
    w_tensor = torch.tensor([[w_test]])
    l = loss_fn(w_tensor, torch.tensor([b_fixed]))
    losses.append(l.item())

plt.figure()
plt.plot(ws, losses, label='Loss vs w')
plt.xlabel('w')
plt.ylabel('Loss')
plt.title('Loss as a function of w')
plt.legend()
plt.show()

here is the error:

Traceback (most recent call last):
  File "/Users/ryan/dev/artintellica/py/blog-0100-training-loop-15/ex_2.py", line 89, in <module>
    l = loss_fn(w_tensor, torch.tensor([b_fixed]))
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ryan/dev/artintellica/py/blog-0100-training-loop-15/ex_2.py", line 45, in loss_fn
    y_pred = model(X, w_, b_)
             ^^^^^^^^^^^^^^^^
  File "/Users/ryan/dev/artintellica/py/blog-0100-training-loop-15/ex_2.py", line 20, in model
    return X @ w + b
           ~~^~~
RuntimeError: expected m1 and m2 to have the same dtype, but got: float != double

how do i fix this?

ASSISTANT:

Great job reproducing and reporting the error!
This is a common issue in PyTorch and NumPy interoperation: tensor dtype (floating point precision) mismatches.

Let’s break down the error:

RuntimeError: expected m1 and m2 to have the same dtype, but got: float != double

  • PyTorch’s default floating point type is float32 (sometimes just called float in code/messages).
  • np.linspace returns a float64 array, so each w_test you iterate over is a NumPy float64 scalar (double), not a plain Python float.
  • When you do:
    w_tensor = torch.tensor([[w_test]])
    PyTorch keeps the NumPy dtype, so the resulting tensor has type torch.float64 (double).
  • Your X and y are in float32, so you get a type mismatch during the matrix multiplication (@).

The Solution

Be explicit about your dtypes.
The simplest fix:
When you create new tensors from NumPy values, set dtype=torch.float32 (or dtype=X.dtype). Doing the same for plain Python floats is harmless and keeps the code uniform.

Update the loop as follows:

# This ensures the type matches your model/X tensors (float32)
for w_test in ws:
    w_tensor = torch.tensor([[w_test]], dtype=X.dtype)
    l = loss_fn(w_tensor, torch.tensor([b_fixed], dtype=X.dtype))
    losses.append(l.item())

If you want, you can also ensure X and y are float64 (less common in ML settings).

Recap of key lines:

w_tensor = torch.tensor([[w_test]], dtype=X.dtype)
l = loss_fn(w_tensor, torch.tensor([b_fixed], dtype=X.dtype))

Now your code should work!

Key Takeaway

  • Always ensure your tensors have matching dtype throughout your code, especially when mixing NumPy, Python numbers, and PyTorch.
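
Another way to avoid the mismatch is to stay in PyTorch for the sweep. The sketch below uses `torch.linspace`, which produces float32 values by default, so no dtype conversion is needed. The fixed bias `b_fixed = 3.0` and the variable `best_w` are assumptions made for this self-contained illustration.

```python
from typing import List

import torch

torch.manual_seed(0)
X = torch.linspace(-1, 1, 20).unsqueeze(1)
y = 2 * X + 3 + 0.1 * torch.randn(20, 1)

b_fixed: float = 3.0  # assumed bias value for this sketch

# torch.linspace returns float32 by default, matching X and y.
ws = torch.linspace(-4, 4, 100)
losses: List[float] = []
for w_val in ws:
    w_tensor = w_val.reshape(1, 1)  # shape (1, 1), same dtype as X
    y_pred = X @ w_tensor + b_fixed
    losses.append(((y_pred - y) ** 2).mean().item())

# The lowest loss on the grid should sit near the true slope of 2.
best_w = ws[int(torch.tensor(losses).argmin())].item()
print(f"w with lowest loss on the grid: {best_w:.3f}")
```

The `losses` list can be passed to `plt.plot(ws.tolist(), losses)` exactly as in the exercise.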

Let me know if you’d like a complete and revised code block or further troubleshooting!

Copyright © 2025 Identellica LLC