“Every deep‑learning optimizer is just a smarter cousin of simple gradient descent.”
We’ll minimize

$$f(x) = x^4 - 3x^2 + 2,$$

whose derivative is

$$f'(x) = 4x^3 - 6x.$$

It has global minima at $x = \pm\sqrt{3/2} \approx \pm 1.2247$ and a local maximum at $x = 0$. A great playground for seeing overshoot and learning‑rate trade‑offs.
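A quick check of the critical points (set the derivative to zero, then classify with the second derivative):

$$f'(x) = 4x^3 - 6x = 2x\,(2x^2 - 3) = 0 \;\Rightarrow\; x = 0 \ \text{ or }\ x = \pm\sqrt{3/2} \approx \pm 1.2247,$$

$$f''(x) = 12x^2 - 6: \quad f''(0) = -6 < 0 \ (\text{local max}), \qquad f''\!\left(\pm\sqrt{3/2}\right) = 12 > 0 \ \left(\text{minima, with } f = -\tfrac{1}{4}\right).$$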
For a scalar parameter $x$, gradient descent repeats the update

$$x_{t+1} = x_t - \eta\, f'(x_t),$$

where $\eta > 0$ is the step size (a.k.a. learning rate).
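To make the update concrete, here is a single step computed by hand from the starting point $x_0 = -2.5$ used in the experiment below (plain Python, no PyTorch needed):

```python
# One gradient-descent update on f(x) = x**4 - 3*x**2 + 2.
x0 = -2.5
grad = 4 * x0**3 - 6 * x0        # = 4*(-15.625) + 15 = -47.5
x1_small = x0 - 0.01 * grad      # small eta: -2.5 + 0.475 = -2.025
x1_large = x0 - 0.5 * grad       # large eta: -2.5 + 23.75 = 21.25
print(x1_small, x1_large)        # -2.025 21.25
```

Already a hint of trouble: with η=0.5, a single step flings the iterate far past the entire bowl.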
```python
# calc-04-optim-1d/gradient_descent_1d.py
import torch
import matplotlib.pyplot as plt

torch.set_default_dtype(torch.float64)

# ---- 1. define f and its grad ---------------------------------------
def f(x):
    return x**4 - 3*x**2 + 2

def grad_f(x):
    return 4*x**3 - 6*x

# ---- 2. experiment settings -----------------------------------------
etas = [0.01, 0.1, 0.5]      # small, medium, too-large
n_steps = 40
x0 = torch.tensor(-2.5)      # start left of the bowl
histories = {}               # store (xs, fs) per eta

for eta in etas:
    x = x0.clone()
    xs, fs = [], []
    for _ in range(n_steps):
        xs.append(float(x))
        fs.append(float(f(x)))
        x = x - eta * grad_f(x)
    histories[eta] = (xs, fs)

# ---- 3. plot trajectories -------------------------------------------
plt.figure(figsize=(6, 4))
for eta, (xs, fs) in histories.items():
    plt.plot(xs, fs, marker='o', label=f"η={eta}")
plt.xlabel("x"); plt.ylabel("f(x)")
plt.title("Gradient-descent paths for different step sizes")
plt.legend(); plt.tight_layout(); plt.show()

# ---- 4. print final positions ---------------------------------------
for eta, (xs, fs) in histories.items():
    print(f"η={eta:4} → final x={xs[-1]: .4f}, f={fs[-1]: .6f}")
```
Typical output (values approximate):

```
η=0.01 → final x≈ -1.23,    f≈ -0.2499   (still creeping toward the left minimum)
η=0.1  → final x≈ -1.2247,  f≈ -0.2500   (converged to the minimum)
η=0.5  → final x=  nan,     f=  nan      (diverged)
```
And the plot shows:
| Observation | Deep‑Learning Analogy |
|---|---|
| Too‑small η: slow convergence. | Training loss plateaus; epochs drag on. |
| Just‑right η: reaches the minimum quickly. | Good default LR (e.g. 1e‑3 for Adam). |
| Too‑large η: divergence or oscillation. | “NaN in loss,” exploding gradients; need LR decay or gradient clipping. |
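The fixes named in the last row can be tried directly in this 1‑D toy. The clip threshold and decay factor below are arbitrary illustrative choices, not tuned values:

```python
# Taming the divergent eta = 0.5 run with gradient clipping + LR decay.
def grad_f(x):
    return 4 * x**3 - 6 * x

x, eta = -2.5, 0.5
for _ in range(200):
    g = max(-1.0, min(1.0, grad_f(x)))  # clip the gradient to [-1, 1]
    x -= eta * g
    eta *= 0.95                          # exponential learning-rate decay
print(x)  # settles near the left minimum, x ≈ -1.2247
```

Clipping caps the step length so the first update can no longer fling the iterate out of the bowl, and the decaying η shrinks the late oscillations until the iterate settles.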
Want PyTorch to compute gradients for you? Replace the hand-written `grad_f(x)` with:

```python
def grad_f(x):
    x = x.clone().detach().requires_grad_(True)  # fresh leaf tensor that tracks gradients
    fx = f(x)
    fx.backward()    # autograd fills in x.grad
    return x.grad
```
This is exactly what autograd does for every parameter of a neural network during backpropagation.
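Closing the loop on the opening quote: the same minimization, written with `torch.optim.SGD`, is a sketch of what a real training loop looks like (η=0.1 and 40 steps mirror the experiment above):

```python
import torch

def f(x):
    return x**4 - 3*x**2 + 2

x = torch.tensor(-2.5, dtype=torch.float64, requires_grad=True)
opt = torch.optim.SGD([x], lr=0.1)   # plain SGD: x <- x - lr * x.grad

for _ in range(40):
    opt.zero_grad()    # clear the previous step's gradient
    f(x).backward()    # autograd computes df/dx into x.grad
    opt.step()         # apply the update

print(float(x), float(f(x)))  # lands on a minimum: x ≈ -1.2247, f ≈ -0.25
```

Swapping `SGD` for `Adam` or adding a `torch.optim.lr_scheduler` changes only how the step is scaled, not the underlying update rule.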
Commit solutions to `calc-04-optim-1d/` and tag `v0.1`.
Next stop: Calculus 5 – Taylor Series & Function Approximation.