Abstract
Understanding why deep learning optimization works, not merely that it works, is one of the fundamental open questions in machine learning theory. A promising theoretical approach models stochastic gradient descent (SGD) not as a deterministic algorithm with random perturbations, but as the discretization of a continuous stochastic process. Previous work proposed that SGD can be viewed as a stochastic differential equation driven by fractional Brownian motion (FBM), a process characterized by a single Hurst parameter that governs its memory structure, that is, how strongly past behavior influences the future.
Our investigation revealed that this model is incomplete. When we fit FBM to SGD trajectories, the estimated Hurst parameter is not constant: it changes over the course of training. Early in optimization, the dynamics exhibit one memory regime; later, as the algorithm approaches a solution, the regime shifts. This means that FBM, which assumes a fixed Hurst parameter, is an inadequate model for the full training process.
The natural generalization is multi-fractional Brownian motion (mFBM), in which the Hurst parameter is itself a function of time. Our finding that the Hurst parameter of SGD is time-dependent suggests that mFBM may serve as a more suitable theoretical framework for understanding the dynamics of deep learning optimization. This has implications for algorithm design: if we understand how the memory structure of optimization evolves during training, we can potentially design algorithms that adapt their exploration strategy to match it.
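The shift in memory structure described above can be illustrated with a small numerical sketch. The code below is not the estimation procedure used in this work; it is a minimal toy example that simulates two fractional-Gaussian-noise segments with different Hurst parameters (standing in for "early" and "late" phases of training) and recovers distinct Hurst estimates from each using the classical aggregated-variance estimator. The simulation method (Cholesky factorization of the fGn autocovariance) and the estimator are illustrative choices.

```python
import numpy as np

def fgn(n, hurst, rng):
    # Exact fractional Gaussian noise via Cholesky factorization of the
    # autocovariance gamma(k) = 0.5 * (|k+1|^2H - 2|k|^2H + |k-1|^2H).
    k = np.arange(n)
    gamma = 0.5 * ((k + 1.0) ** (2 * hurst)
                   - 2.0 * k ** (2 * hurst)
                   + np.abs(k - 1.0) ** (2 * hurst))
    cov = gamma[np.abs(k[:, None] - k[None, :])]
    L = np.linalg.cholesky(cov + 1e-10 * np.eye(n))  # jitter for stability
    return L @ rng.standard_normal(n)

def hurst_aggvar(x, block_sizes=(2, 4, 8, 16, 32)):
    # Aggregated-variance estimator: Var(block means) scales as m^(2H - 2),
    # so a log-log regression of variance on block size m recovers H.
    log_m, log_v = [], []
    for m in block_sizes:
        nb = len(x) // m
        means = x[: nb * m].reshape(nb, m).mean(axis=1)
        log_m.append(np.log(m))
        log_v.append(np.log(means.var()))
    slope = np.polyfit(log_m, log_v, 1)[0]
    return 1.0 + slope / 2.0

rng = np.random.default_rng(0)
# A toy "trajectory" whose memory structure shifts mid-way through training:
# persistent increments (H = 0.8) early, anti-persistent (H = 0.3) late.
early = fgn(1024, 0.8, rng)
late = fgn(1024, 0.3, rng)

h_early = hurst_aggvar(early)
h_late = hurst_aggvar(late)
print(f"estimated H early: {h_early:.2f}, late: {h_late:.2f}")
```

Applied in sliding windows along a real SGD trajectory, an estimator of this kind is one way such a time-dependent Hurst parameter could be exposed empirically.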
We are preparing a research proposal to investigate this direction in depth.