Algorithmic Lens

Unraveling Grokking

A Deep Dive into Delayed Generalization in Deep Neural Networks

Nov 01, 2024

Introduction

The phenomenon of "grokking," or delayed generalization, in deep neural networks (DNNs) has captivated the machine learning community. Characterized by a substantial lag between achieving near-perfect training accuracy and the emergence of robust generalization, grokking challenges our understanding of DNN training dynamics. The paper "Deep Networks Always Grok and Here is Why" (Humayun et al., 2024, available on arXiv) provides a compelling explanation for this behavior. This survey delves into grokking, its implications, and its connections to broader machine learning research, focusing on Humayun et al. (2024) and other relevant studies published up to November 1, 2024.
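To make the lag between training and test accuracy concrete, the sketch below follows the canonical grokking setup: a small network trained on modular addition with strong weight decay, logging train and test accuracy over a long run. The architecture, modulus, data split, and hyperparameters are illustrative assumptions, not the configuration used by Humayun et al. (2024).

```python
# Minimal sketch of a grokking-style experiment: modular addition (a + b mod p)
# with a small MLP and strong weight decay. All choices below are illustrative.
import torch
import torch.nn as nn

p = 97
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))   # all (a, b) pairs
labels = (pairs[:, 0] + pairs[:, 1]) % p                         # target: (a + b) mod p

perm = torch.randperm(len(pairs))
split = len(pairs) // 2                                          # 50/50 train/test split
train_idx, test_idx = perm[:split], perm[split:]

embed = nn.Embedding(p, 64)                                      # embeddings for a and b
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, p))
opt = torch.optim.AdamW(list(model.parameters()) + list(embed.parameters()),
                        lr=1e-3, weight_decay=1.0)               # strong weight decay encourages grokking

def accuracy(idx):
    with torch.no_grad():
        x = embed(pairs[idx]).flatten(1)                         # concatenate the two embeddings
        return (model(x).argmax(-1) == labels[idx]).float().mean().item()

for step in range(100_000):                                      # grokking can take 1e4 to 1e5+ steps
    x = embed(pairs[train_idx]).flatten(1)
    loss = nn.functional.cross_entropy(model(x), labels[train_idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        # Training accuracy typically saturates long before test accuracy rises.
        print(step, accuracy(train_idx), accuracy(test_idx))
```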

The Grokking Phenomenon: Delayed Generalization and Robustness

First observed in narrow settings, such as transformers trained on algorithmic datasets or DNNs initialized with large-norm parameters, grokking was initially perceived as an anomaly. However, Humayun et al. (2024) demonstrated its prevalence across a wide range of architectures and datasets, including CNNs on CIFAR-10 and ResNets on Imagenette, suggesting a fundamental link to DNN training dynamics. Crucially, they introduced the concept of delayed robustness, observing that increased adversarial robustness (Goodfellow et al., 2014) emerges alongside delayed generalization, which hints at a deeper relationship between these two desirable properties. This suggests that the mechanisms driving grokking may also contribute to enhanced resistance against adversarial attacks, a critical factor for deploying DNNs in real-world applications.
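One simple way to probe delayed robustness is to evaluate checkpoints saved during training against a single-step FGSM attack (Goodfellow et al., 2014) and watch whether robust accuracy rises long after training accuracy has saturated. The sketch below assumes inputs in [0, 1] and an illustrative epsilon; it is not the exact evaluation protocol of Humayun et al. (2024), and the `checkpoints` list in the usage comment is hypothetical.

```python
# Sketch: robust accuracy under a single-step FGSM attack, so robustness can be
# tracked across training checkpoints alongside clean test accuracy.
import torch
import torch.nn.functional as F

def fgsm_accuracy(model, x, y, eps=8 / 255):
    """Accuracy on inputs perturbed by a single signed-gradient (FGSM) step."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Step in the direction that increases the loss, then clamp to the valid pixel range.
    x_adv = (x + eps * x.grad.sign()).clamp(0, 1).detach()
    with torch.no_grad():
        return (model(x_adv).argmax(-1) == y).float().mean().item()

# Usage sketch: evaluate a sequence of saved checkpoints (hypothetical `checkpoints` list)
# to see whether robust accuracy climbs long after clean training accuracy saturates.
# for step, state_dict in checkpoints:
#     model.load_state_dict(state_dict)
#     print(step, fgsm_accuracy(model, x_test, y_test))
```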

The core contribution of Humayun et al. (2024) is concisely captured in their abstract:

Grokking...is a phenomenon where generalization...occurs long after achieving near zero training error. We demonstrate that grokking is...widespread...We introduce the new concept of delayed robustness...We develop an analytical explanation...based on the local complexity of a DNN's input-output mapping.

This highlights their key findings: the ubiquity of grokking, the introduction of delayed robustness, and an analytical explanation based on the local complexity of the network's input-output mapping.
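A rough way to build intuition for local complexity is to count how many distinct ReLU activation patterns, and hence local linear regions, appear among points sampled in a small neighborhood of an input. The sketch below is a simplified proxy for the measure developed by Humayun et al. (2024), assuming an `nn.Sequential` ReLU MLP on flattened inputs; the radius, sample count, and example network are arbitrary illustrative choices.

```python
# Rough proxy for local complexity: count distinct ReLU activation sign patterns
# (i.e., local linear regions) among points sampled in a small ball around x.
import torch
import torch.nn as nn

def local_complexity(model, x, radius=0.05, n_samples=256):
    """Approximate the number of linear regions near x by counting distinct
    ReLU activation patterns over points sampled around x."""
    noise = torch.randn(n_samples, x.numel())
    noise = radius * noise / noise.norm(dim=1, keepdim=True)     # points on a sphere of the given radius
    samples = x.flatten().unsqueeze(0) + noise

    bits, h = [], samples
    with torch.no_grad():
        for layer in model:                                      # assumes an nn.Sequential of Linear/ReLU layers
            h = layer(h)
            if isinstance(layer, nn.ReLU):
                bits.append(h > 0)                               # the set of active units identifies the region
    patterns = torch.cat(bits, dim=1).to(torch.int8)
    return len({tuple(row.tolist()) for row in patterns})

# Example with a small, hypothetical ReLU MLP on 2-D inputs:
mlp = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 2))
print(local_complexity(mlp, torch.randn(2)))
```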

Linear Regions and the Phase Transition
