Abstract
Understanding why deep learning optimization works, not merely that it works, is one of the fundamental open questions in machine learning theory. A promising theoretical approach models stochastic gradient descent (SGD) not as a deterministic algorithm with random perturbations, but as the discretization of a continuous stochastic process. Previous work proposed that SGD can be viewed as a stochastic differential equation driven by fractional Brownian motion (FBM), a process characterized by a single Hurst parameter that governs its memory structure, that is, how strongly past behavior influences the future.
Our investigation revealed that this model is incomplete. When we fit FBM to SGD trajectories, the estimated Hurst parameter is not constant: it changes over the course of training. Early in optimization, the dynamics exhibit one memory regime; later, as the algorithm approaches a solution, the regime shifts. This means that FBM, which assumes a fixed Hurst parameter, is an inadequate model for the full training process.
The natural generalization is multi-fractional Brownian motion (mFBM), in which the Hurst parameter is itself a function of time. Our finding that the Hurst parameter of SGD is time-dependent suggests that mFBM may serve as a more suitable theoretical framework for understanding the dynamics of deep learning optimization. This has implications for algorithm design: if we understand how the memory structure of optimization evolves during training, we can potentially design algorithms that adapt their exploration strategy to match it.
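The shift in memory structure described above can be illustrated with a small numerical sketch. The code below is not the estimation procedure used in this work; it is a minimal toy example that simulates two fractional-Gaussian-noise segments with different Hurst parameters (standing in for "early" and "late" phases of training) and recovers distinct Hurst estimates from each using the classical aggregated-variance estimator. The simulation method (Cholesky factorization of the fGn autocovariance) and the estimator are illustrative choices.

```python
import numpy as np

def fgn(n, hurst, rng):
    # Exact fractional Gaussian noise via Cholesky factorization of the
    # autocovariance gamma(k) = 0.5 * (|k+1|^2H - 2|k|^2H + |k-1|^2H).
    k = np.arange(n)
    gamma = 0.5 * ((k + 1.0) ** (2 * hurst)
                   - 2.0 * k ** (2 * hurst)
                   + np.abs(k - 1.0) ** (2 * hurst))
    cov = gamma[np.abs(k[:, None] - k[None, :])]
    L = np.linalg.cholesky(cov + 1e-10 * np.eye(n))  # jitter for stability
    return L @ rng.standard_normal(n)

def hurst_aggvar(x, block_sizes=(2, 4, 8, 16, 32)):
    # Aggregated-variance estimator: Var(block means) scales as m^(2H - 2),
    # so a log-log regression of variance on block size m recovers H.
    log_m, log_v = [], []
    for m in block_sizes:
        nb = len(x) // m
        means = x[: nb * m].reshape(nb, m).mean(axis=1)
        log_m.append(np.log(m))
        log_v.append(np.log(means.var()))
    slope = np.polyfit(log_m, log_v, 1)[0]
    return 1.0 + slope / 2.0

rng = np.random.default_rng(0)
# A toy "trajectory" whose memory structure shifts mid-way through training:
# persistent increments (H = 0.8) early, anti-persistent (H = 0.3) late.
early = fgn(1024, 0.8, rng)
late = fgn(1024, 0.3, rng)

h_early = hurst_aggvar(early)
h_late = hurst_aggvar(late)
print(f"estimated H early: {h_early:.2f}, late: {h_late:.2f}")
```

Applied in sliding windows along a real SGD trajectory, an estimator of this kind is one way such a time-dependent Hurst parameter could be exposed empirically.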
We are preparing a research proposal to investigate this direction in depth.