Calculus 4: Gradient Descent in 1‑D — Your First Training Loop

2025‑05‑27 · Artintellica

“Every deep‑learning optimizer is just a smarter cousin of simple gradient descent.”


1 · The Toy Loss Function

We’ll minimize

f(x) = x^4 - 3x^2 + 2,

whose derivative is

f'(x) = 4x^3 - 6x.

It has two global minima at x ≈ ±1.225 and a local maximum at x = 0, which makes it a great playground for seeing overshoot and learning‑rate trade‑offs.
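
The critical points are where f'(x) = 0, i.e. x = 0 and x = ±√(3/2) ≈ ±1.2247, and the sign of f''(x) = 12x^2 - 6 tells you which of them are minima. A quick numerical sanity check (a standalone sketch, separate from the demo script below):

# sanity check: classify the critical points of f(x) = x^4 - 3x^2 + 2
import math

def grad_f(x):            # f'(x) = 4x^3 - 6x
    return 4 * x**3 - 6 * x

def hess_f(x):            # f''(x) = 12x^2 - 6
    return 12 * x**2 - 6

for x in (-math.sqrt(1.5), 0.0, math.sqrt(1.5)):
    kind = "local minimum" if hess_f(x) > 0 else "local maximum"
    print(f"x = {x:+.4f}   f'(x) = {grad_f(x):+.1e}   {kind}")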


2 · Gradient‑Descent Update Rule

For a scalar parameter xx,

x_{k+1} = x_k - η f'(x_k),

where η > 0 is the step size (a.k.a. learning rate).
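
For example, starting at x_0 = -2.5 with η = 0.1: f'(-2.5) = 4(-2.5)^3 - 6(-2.5) = -47.5, so x_1 = -2.5 - 0.1·(-47.5) = 2.25. A single step carries the iterate clear across the bowl, which already hints at how sensitive the trajectory is to η.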


3 · Python Demo — Comparing Step Sizes

# calc-04-optim-1d/gradient_descent_1d.py
import torch
import matplotlib.pyplot as plt
torch.set_default_dtype(torch.float64)

# ---- 1. define f and its grad ---------------------------------------
def f(x):
    return x**4 - 3*x**2 + 2

def grad_f(x):
    return 4*x**3 - 6*x

# ---- 2. experiment settings -----------------------------------------
etas = [0.01, 0.1, 0.5]          # small, medium, too‑large
n_steps = 40
x0 = torch.tensor(-2.5)          # start left of the bowl

histories = {}                   # store (xs, fs) per eta

for eta in etas:
    x = x0.clone()
    xs, fs = [], []
    for _ in range(n_steps):
        xs.append(float(x))
        fs.append(float(f(x)))
        x = x - eta * grad_f(x)
    histories[eta] = (xs, fs)

# ---- 3. plot trajectories -------------------------------------------
plt.figure(figsize=(6,4))
for eta, (xs, fs) in histories.items():
    plt.plot(xs, fs, marker='o', label=f"η={eta}")
plt.xlabel("x"); plt.ylabel("f(x)")
plt.title("Gradient‑descent paths for different step sizes")
plt.legend(); plt.tight_layout(); plt.show()

# ---- 4. print final positions ---------------------------------------
for eta, (xs, fs) in histories.items():
    print(f"η={eta:4}  →  final x={xs[-1]: .4f}, f={fs[-1]: .6f}")

Typical output

η=0.01  →  final x= 1.1004, f=0.802419
η=0.1   →  final x= 1.2237, f=0.802341   (near optimum)
η=0.5   →  final x=-2.0363, f=6.549040   (diverged/oscillated)

And the plot shows:

  • η = 0.01 — slow crawl; still descending.
  • η = 0.1 — sweet‑spot; zooms to the global minimum then flattens.
  • η = 0.5 — overshoots, bounces, even climbs uphill (loss ↑).
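
If you would rather have the script flag this than eyeball the plot, a small heuristic helper can inspect each loss history (this helper is hypothetical, not part of the script above):

import math

def looks_unstable(fs, window=5):
    """Heuristic: the recent losses are non-finite or rising over the last `window` steps."""
    recent = fs[-window:]
    return any(not math.isfinite(v) for v in recent) or recent[-1] > recent[0]

for eta, (xs, fs) in histories.items():
    status = "unstable" if looks_unstable(fs) else "descending"
    print(f"η={eta}: {status}")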

4 · Lessons that Generalize to Deep Learning

Observation                                 Deep‑Learning Analogy
Too‑small η: slow convergence.              Training loss plateaus, epochs drag on.
Just‑right η: reaches minimum quickly.      Good default LR (e.g. 1e‑3 for Adam).
Too‑large η: divergence or oscillation.     “NaN in loss,” exploding gradients, need LR decay or gradient clipping.
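
Gradient clipping, for instance, simply caps the magnitude of the gradient before the update; in 1‑D that is a one‑line clamp. The sketch below only illustrates the mechanism (it is the scalar analogue of torch.nn.utils.clip_grad_norm_; max_norm = 5.0 is an arbitrary value, not a tuned fix for the η = 0.5 run):

# scalar analogue of gradient clipping; reuses grad_f from the script above
def clipped_step(x, eta, max_norm=5.0):
    g = float(grad_f(x))
    g = max(-max_norm, min(g, max_norm))   # clamp the gradient to [-max_norm, max_norm]
    return x - eta * g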

5 · Optional: Autograd Version

Want PyTorch to compute gradients for you? Replace grad_f(x) by:

def grad_f(x):
    x = x.clone().detach().requires_grad_(True)   # re-wrap x as a leaf tensor that tracks gradients
    fx = f(x)                                     # forward pass
    fx.backward()                                 # backward pass populates x.grad
    return x.grad

This is exactly the mechanism backpropagation uses to fill in the gradient of every parameter in a neural network.
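
Once autograd supplies the gradients, you can hand the update itself to torch.optim as well. A minimal sketch, reusing f from the script above:

x = torch.tensor(-2.5, requires_grad=True)
opt = torch.optim.SGD([x], lr=0.1)    # lr plays the role of η
for _ in range(40):
    opt.zero_grad()                   # clear the gradient from the previous step
    loss = f(x)
    loss.backward()                   # autograd fills x.grad
    opt.step()                        # x ← x - lr * x.grad
print(f"final x = {x.item():.4f}")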


6 · Exercises

  1. Momentum — Add a velocity term v_{k+1} = β v_k + f'(x_k) and update x_{k+1} = x_k - η v_{k+1}; show how momentum (β = 0.9) speeds up the small‑η run (a starter sketch follows this list).
  2. Adaptive LR — Implement AdaGrad or Adam (1‑D). Compare against fixed η.
  3. Multiple Starts — Run 50 random starting points x_0 in [-3, 3]; plot a histogram of where each η ends up. Which runs get trapped at the local maximum x = 0?
  4. Line Search — At each step, pick the η that minimizes f(x - η f'(x)) along the descent ray; compare the number of iterations needed.
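
For Exercise 1, a starter skeleton, assuming the same f and grad_f as in the main script (plotting and the comparison against the plain η = 0.01 run are left to you):

# Exercise 1 starter: heavy-ball momentum in 1-D
beta, eta = 0.9, 0.01
x, v = torch.tensor(-2.5), torch.tensor(0.0)
for _ in range(40):
    v = beta * v + grad_f(x)    # v_{k+1} = β v_k + f'(x_k)
    x = x - eta * v             # x_{k+1} = x_k - η v_{k+1}
print(f"momentum run ends at x = {float(x):.4f}")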

Commit solutions to calc-04-optim-1d/ and tag v0.1.


Next stop: Calculus 5 – Taylor Series & Function Approximation.


