
The Origin of Neural Scaling Laws: From Random Graphs to Natural Language


Maissam Barkeshli (1,2,*), Alberto Alfarano (3,†,*), Andrey Gromov (1)

1 Meta Superintelligence Labs, FAIR; 2 Department of Physics, University of Maryland, College Park and Joint Quantum Institute; 3 Axiom Math
* Equal contribution; † Work done at Meta

arXiv:2601.10684v1 [cs.LG] 15 Jan 2026

Scaling laws have played a major role in the modern AI revolution, providing practitioners with predictive power over how model performance will improve with increasing data, compute, and number of model parameters. This has spurred an intense interest in the origin of neural scaling laws, with a common suggestion being that they arise from power law structure already present in the data. In this paper we study scaling laws for transformers trained to predict random walks (bigrams) on graphs with tunable complexity. We demonstrate that this simplified setting already gives rise to neural scaling laws even in the absence of power law structure in the data correlations. We further consider dialing down the complexity of natural language systematically, by training on sequences sampled from increasingly simplified generative language models, from 4-, 2-, and 1-layer transformer language models down to language bigrams, revealing a monotonic evolution of the scaling exponents. Our results also include scaling laws obtained from training on random walks on random graphs drawn from Erdős-Rényi and scale-free Barabási-Albert ensembles. Finally, we revisit conventional scaling laws for language modeling, demonstrating that several essential results can be reproduced using 2-layer transformers with a context length of 100, provide a critical analysis of various fits used in prior literature, demonstrate an alternative method for obtaining compute-optimal curves as compared with current practice in published literature, and provide preliminary evidence that maximal update parameterization may be more parameter efficient than standard parameterization.

1 Introduction

One of the most important lessons in modern deep learning is the steady improvement in model capabilities as additional compute resources and data are effectively leveraged (Sutton, 2019). This was partially quantified through the characterization of neural scaling laws (Cortes et al., 1993; Hestness et al., 2017; Kaplan et al., 2020; Henighan et al., 2020; Hoffmann et al., 2022), which demonstrate that across many vision and natural language tasks, the test loss decreases predictably as a simple power law over many orders of magnitude in the number of model parameters N, dataset size D, and amount of compute C. The discovery of neural scaling laws has had significant practical impact on language model pretraining. It allows practitioners to determine how to optimally scale the model size and dataset size with compute (Kaplan et al., 2020; Hoffmann et al., 2022; Chowdhery et al., 2022; Grattafiori et al., 2024; Yang et al., 2024, 2025; Liu et al., 2024; Jiang et al., 2024; Tian et al., 2025). It also provides a way to benchmark algorithmic breakthroughs in architectures, optimization, and data.

These empirical results have led to significant theoretical work in trying to understand the origin of neural scaling laws. Specifically, why is there a power law decrease in the test loss over many orders of magnitude in N, D, and C, and what sets the exponents of the power laws?
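For concreteness, the power-law fits referred to above typically take forms along the lines of

\[
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C},
\]

when a single resource is scaled while the others are not the bottleneck, together with the joint parametric form

\[
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
\]

used for compute-optimal analyses, following the notation of Kaplan et al. (2020) and Hoffmann et al. (2022); here N_c, D_c, C_c, E, A, B, and the exponents are constants fitted to the empirical loss curves.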
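The simplified setting studied in this paper replaces natural language with random walks on graphs of tunable structure, as described in the abstract. The sketch below illustrates how such bigram training sequences can be sampled from Erdős-Rényi and scale-free Barabási-Albert graphs; it is a minimal sketch assuming networkx and numpy, and the graph sizes, edge densities, and walk lengths are illustrative placeholders rather than the configurations used in the experiments.

```python
import numpy as np
import networkx as nx

def sample_random_walks(graph, num_walks, walk_length, seed=0):
    """Sample token sequences by walking uniformly at random on `graph`.

    Each node plays the role of a vocabulary token, so the resulting sequences
    have purely bigram (first-order Markov) structure set by the graph's edges.
    """
    rng = np.random.default_rng(seed)
    nodes = list(graph.nodes())
    walks = []
    for _ in range(num_walks):
        current = rng.choice(nodes)
        walk = [int(current)]
        for _ in range(walk_length - 1):
            neighbors = list(graph.neighbors(current))
            if not neighbors:  # restart on a dead end (possible in sparse graphs)
                current = rng.choice(nodes)
            else:
                current = neighbors[rng.integers(len(neighbors))]
            walk.append(int(current))
        walks.append(walk)
    return walks

# Illustrative random-graph ensembles with tunable complexity (placeholder sizes).
er_graph = nx.erdos_renyi_graph(n=512, p=0.02, seed=0)   # Erdős-Rényi
ba_graph = nx.barabasi_albert_graph(n=512, m=4, seed=0)  # scale-free Barabási-Albert

er_walks = sample_random_walks(er_graph, num_walks=1000, walk_length=100)
ba_walks = sample_random_walks(ba_graph, num_walks=1000, walk_length=100)
```

Sequences of this kind then serve as the training corpus for an auto-regressive transformer, with next-node prediction under cross-entropy loss playing the role of the language-modeling objective.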
A clear answer to the question of what sets these exponents may be of significant practical value, since we might then understand the extent to which they can be increased, thus increasing the asymptotic efficiency of deep learning methods.

A popular suggestion has been that the power law scaling in the test loss originates from power laws that are already present in the dataset itself. For example, it is well known that the frequency of words in a corpus of text follows Zipf's law, and many other power laws have also been characterized in natural language corpora (Piantadosi, 2014; Altmann and Gerlach, 2016). Natural images also exhibit power laws in their spectra (Ruderman, 1994; Maloney et al., 2022). Many theoretical works have shown that in linear or kernel regression, power laws in the test loss do in fact originate from power laws in the data (or in features defined in terms of the data) (Bordelon et al., 2020; Bahri et al., 2021; Spigler et al., 2020; Maloney et al., 2022; Lin et al., 2024; Paquette et al., 2024; Bordelon et al., 2024). More generally, if we assume that models need to learn a discrete set of tasks to achieve a particular value of test loss, and that these tasks are distributed with power law weighting, then a power law in the test loss follows (Michaud et al., 2024; Ren et al., 2025).

The above theories, based on linear models with mean square error (MSE) loss, are rather far from the setting of auto-regressive sequence modeling with cross-entropy loss. Consequently, it is not clear to what extent they are representative of the neural scaling laws seen in natural language modeling. A potentially fruitful approach would be to