A Deep Dive into Preconditioned Stochastic Gradient Descent (PSGD) and its Evolution
This survey explores the evolution of preconditioned stochastic gradient descent (PSGD) since its introduction in the seminal 2015 paper "Preconditioned Stochastic Gradient Descent" (https://arxiv.org/abs/1512.04202). We analyze its core concepts, diverse applications, and connections to modern research trends as of November 7, 2024, drawing from a comprehensive review of relevant literature.
Enhancing SGD Convergence Through Preconditioning: The Core Idea
Stochastic Gradient Descent (SGD) is a fundamental algorithm in large-scale machine learning, enabling efficient training on massive datasets. However, its convergence can be slow and erratic, especially in the high-dimensional spaces typical of deep learning. The difficulty often stems from the uneven curvature of the loss landscape: SGD takes inefficiently small steps in some directions and overly large steps in others, leading to oscillation and slow convergence. The 2015 PSGD paper (https://arxiv.org/abs/1512.04202) introduced preconditioning as a powerful remedy.
The core idea involves transforming the gradient using a preconditioner matrix (P) before updating model parameters. Standard SGD updates parameters according to:
θₜ₊₁ = θₜ - η∇L(θₜ)
where η is the learning rate and ∇L(θₜ) represents the gradient at iteration t. PSGD modifies this to:
θₜ₊₁ = θₜ - ηP⁻¹∇L(θₜ)
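To make the contrast concrete, here is a minimal NumPy sketch (an illustration, not code from the paper) comparing the two updates on an ill-conditioned quadratic, using the exact Hessian as the preconditioner P:

```python
import numpy as np

# Quadratic loss L(θ) = ½ θᵀHθ with very uneven curvature.
H = np.diag([100.0, 1.0])        # steep along axis 0, flat along axis 1
grad = lambda theta: H @ theta   # ∇L(θ) = Hθ

# Plain SGD: η must stay below 2/100 for stability in the steep
# direction, which leaves the flat direction crawling.
theta_sgd = np.array([1.0, 1.0])
for _ in range(100):
    theta_sgd = theta_sgd - 0.01 * grad(theta_sgd)   # θₜ₊₁ = θₜ - η∇L(θₜ)

# Preconditioned update with P = H: curvature is rescaled uniformly,
# so η = 1 is stable and this Newton-like step converges immediately.
P_inv = np.linalg.inv(H)
theta_psgd = np.array([1.0, 1.0])
theta_psgd = theta_psgd - 1.0 * P_inv @ grad(theta_psgd)  # θₜ₊₁ = θₜ - ηP⁻¹∇L(θₜ)

print(np.linalg.norm(theta_sgd))   # still far from the minimum along the flat axis
print(np.linalg.norm(theta_psgd))  # essentially zero
```

With the exact Hessian this is just Newton's method on a quadratic; PSGD's contribution is estimating such a P cheaply and robustly when only noisy gradients are available.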
This transformation rescales and rotates the gradient, aligning updates with the local curvature of the loss function. Imagine navigating a rugged mountain range: the gradient points downhill, but direct descent can be inefficient and oscillatory. Preconditioning acts like a local map, suggesting a more efficient path down the mountain based on the terrain's curvature, much as Newton's method uses the inverse Hessian to rescale its steps.

By incorporating this local geometric information, PSGD guides the optimization along a more efficient trajectory, thereby accelerating convergence. A key advantage is that PSGD estimates the inverse Hessian directly from noisy stochastic gradients, avoiding the computationally expensive calculation of the full Hessian (https://arxiv.org/abs/1512.04202); this makes it well suited to large-scale applications. Moreover, unlike some quasi-Newton methods that can become unstable on non-convex landscapes, PSGD is robust in both convex and non-convex problems (https://arxiv.org/abs/1512.04202).
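As one way to see how curvature can be recovered from noisy gradients alone, the sketch below probes the loss with small random perturbations δθ, measures the resulting gradient change δg, and fits a diagonal preconditioner from the running moments of these pairs (dᵢ ≈ √(E[δgᵢ²]/E[δθᵢ²]), roughly the diagonal specialization of the paper's secant-pair fitting idea). The toy problem, step sizes, and moving-average constants are illustrative choices, not the paper's algorithm, which fits a full (factorized) preconditioner:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: quadratic loss with Hessian H, observed only through
# noisy stochastic gradients g(θ) = Hθ + noise.
H = np.diag([100.0, 1.0])
def stochastic_grad(theta):
    return H @ theta + 0.1 * rng.standard_normal(2)

theta = np.array([1.0, 1.0])
eta, beta = 0.5, 0.9
m_dg2 = np.zeros(2)    # running moment of δg² (squared gradient change)
m_dth2 = np.zeros(2)   # running moment of δθ² (squared probe size)

for t in range(220):
    g = stochastic_grad(theta)
    # Probe the curvature: perturb θ slightly and observe the gradient change.
    dtheta = 1e-2 * rng.standard_normal(2)
    dg = stochastic_grad(theta + dtheta) - g
    m_dg2 = beta * m_dg2 + (1 - beta) * dg**2
    m_dth2 = beta * m_dth2 + (1 - beta) * dtheta**2
    if t >= 20:  # burn-in so the curvature estimate can settle first
        # Diagonal curvature scale. Gradient noise inflates d where the true
        # curvature is small, which automatically damps steps along noisy
        # directions -- a robustness property the paper emphasizes.
        d = np.sqrt(m_dg2 / m_dth2)
        theta = theta - eta * g / d   # θₜ₊₁ = θₜ - ηP⁻¹∇L(θₜ), P = diag(d)

print(np.linalg.norm(theta))  # near the minimum, without ever forming H
```

Note that the full Hessian is never computed or stored; curvature information enters only through cheap gradient differences, which is what makes the approach viable at scale.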