Algorithmic Lens

A Deep Dive into Preconditioned Stochastic Gradient Descent (PSGD) and its Evolution
Nov 07, 2024

This survey explores the evolution of preconditioned stochastic gradient descent (PSGD) since its introduction in the seminal 2015 paper "Preconditioned Stochastic Gradient Descent" (https://arxiv.org/abs/1512.04202). We analyze its core concepts, diverse applications, and connections to modern research trends as of November 7, 2024, drawing from a comprehensive review of relevant literature.

Enhancing SGD Convergence Through Preconditioning: The Core Idea

Stochastic Gradient Descent (SGD) is a fundamental algorithm in large-scale machine learning, enabling efficient training on massive datasets. However, its convergence can be slow and erratic, especially in the high-dimensional spaces prevalent in deep learning. This often stems from the uneven curvature of the loss landscape: SGD takes inefficiently small steps in some directions and overly large steps in others, leading to oscillations and slow convergence. The 2015 PSGD paper (https://arxiv.org/abs/1512.04202) introduced preconditioning as a powerful solution.

The core idea involves transforming the gradient using a preconditioner matrix (P) before updating model parameters. Standard SGD updates parameters according to:

θₜ₊₁ = θₜ - η∇L(θₜ)

where η is the learning rate and ∇L(θₜ) represents the gradient at iteration t. PSGD modifies this to:

θₜ₊₁ = θₜ - ηP⁻¹∇L(θₜ)
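The effect of the P⁻¹ factor is easiest to see on a toy problem. The sketch below (not from the paper; the quadratic, step sizes, and the choice of P⁻¹ = H⁻¹ are illustrative assumptions) contrasts the two update rules on an ill-conditioned quadratic, where plain SGD is forced to use a tiny learning rate by the high-curvature direction:

```python
import numpy as np

# Toy ill-conditioned quadratic: L(theta) = 0.5 * theta^T H theta.
# H plays the role of the Hessian; condition number 100 (illustrative choice).
H = np.diag([100.0, 1.0])
grad = lambda th: H @ th                 # exact gradient of the quadratic

def final_loss(update, steps=50):
    th = np.array([1.0, 1.0])
    for _ in range(steps):
        th = update(th)
    return 0.5 * th @ H @ th

eta = 0.009                              # plain SGD needs eta < 2/100 to stay stable
sgd_loss = final_loss(lambda th: th - eta * grad(th))

P_inv = np.linalg.inv(H)                 # idealized preconditioner: the inverse Hessian
psgd_loss = final_loss(lambda th: th - 0.5 * P_inv @ grad(th))

print(sgd_loss, psgd_loss)               # preconditioned loss is far smaller
```

With the preconditioner, both directions contract at the same rate, so a single large step size works; without it, the flat direction crawls while the steep one constrains the learning rate.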

This transformation rescales and rotates the gradient, aligning updates more closely with the local curvature of the loss function. Imagine navigating a rugged mountain range: the gradient points downhill, but direct descent can be inefficient and oscillatory. Preconditioning acts like a local map, suggesting a more efficient path down the mountain based on the terrain's curvature, much as Newton's method uses the inverse Hessian for optimal descent. By incorporating local geometric information, PSGD guides the optimization process along a more efficient trajectory, thereby accelerating convergence.

A key advantage is PSGD's ability to estimate the inverse Hessian directly from noisy stochastic gradients, avoiding the computationally expensive calculation of the full Hessian (https://arxiv.org/abs/1512.04202). This makes PSGD highly suitable for large-scale applications. Importantly, unlike some quasi-Newton methods that can be unstable in non-convex landscapes, PSGD remains robust in both convex and non-convex problems (https://arxiv.org/abs/1512.04202).
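To make the "estimate the curvature from noisy gradients" idea concrete, here is a heavily simplified diagonal sketch, not the paper's full algorithm (which fits richer preconditioner structures via a dedicated criterion). It probes the loss with a small random parameter perturbation δθ, measures the resulting gradient change δg on the same "mini-batch" so the noise cancels, and tracks |δθᵢ|/|δgᵢ| as a smoothed per-coordinate inverse-curvature estimate. All names, constants, and the toy quadratic are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
H = np.diag([100.0, 1.0])                # hidden Hessian of a toy quadratic

def pair_of_grads(th, dth):
    """Gradients at th and th + dth with the SAME noise sample,
    mimicking evaluating both points on one mini-batch."""
    noise = 0.1 * rng.standard_normal(2)
    return H @ th + noise, H @ (th + dth) + noise

th = np.array([1.0, 1.0])
p = np.full(2, 0.01)                     # diagonal preconditioner, conservative init
eta, beta, eps = 0.3, 0.9, 1e-12

for _ in range(100):
    dth = 1e-4 * rng.standard_normal(2)  # small random probe of the curvature
    g, g_probe = pair_of_grads(th, dth)
    dg = g_probe - g                     # shared noise cancels, so dg ~ H @ dth
    # Secant-style diagonal fit: p_i tracks |dth_i| / |dg_i| ~ 1 / H_ii (smoothed)
    p = beta * p + (1 - beta) * np.abs(dth) / (np.abs(dg) + eps)
    th = th - eta * p * g                # preconditioned update
```

Note that no Hessian is ever formed: the curvature information enters only through gradient differences, which is the property that makes this style of preconditioning affordable at scale.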

Applications and Impact: Expanding Beyond Deep Learning

© 2025 Lucas Nestler