
DeepSeek mHC: Manifold-Constrained Hyper-Connections


This is actually an engineering paper, taking as its starting point ideas already exposed in the original Hyper-Connections (HC) paper from ByteDance, which is consequently a prerequisite for reading it. So, some initial notes on that paper first.

As a preamble, HC unexpectedly intersects with two big open questions that have been bothering me ever since SYNTH:

1) Reasoning capacities seem to emerge from depth, and so, indirectly, from better layer combinations. This is especially striking for math, and Circuit Transformers already suggest that models perform formal operations at the sub-token level; drafts just wrap this process in another time dimension. But then, how can we build more optimal layer combinations/assignments? This becomes even more critical as we scale depth (or nest it through MoE): interpretability studies have shown that layers are largely redundant.

2) Synthetic data has become the most efficient way to train models, mostly because we delegate "training" to the shape of the data. Paraphrasing is literally a way to extrapolate the memorization process in the transformer world, as we create endless variations of the same knowledge components. If training were really optimized, this should be mostly internalized. So how can we build efficient training?

It's not surprising that hyper-connections are immediately associated with Muon. The general idea is similar: make better training updates. Yet there is a major difference: hyper-connections are a low-level change, transforming a decade-old piece of deep learning infrastructure, the residual function F, and making it trainable.

Current normalization approaches scale well and yet result in "representation collapse", "where hidden features in deeper layers become highly similar, diminishing the contribution of additional layers as their number increases." To address this, hyper-connections introduce entirely new learnable objectives for "depth-connections and width-connections". In theory, "learning the hyper-connection matrix in various forms can create layer arrangements that surpass traditional sequential and parallel configurations, resulting in a soft-mixture or even dynamic arrangement".

The original HC paper does manage to retrain a small Olmo-MoE and demonstrate that it "converges 1.8 times faster and shows an improvement of 6 points on ARC-Challenge compared to the baseline trained with 500B tokens". Layer interpretability suggests that "the baseline tends toward representation collapse", while the HC variant "exhibits significantly lower similarity between features".
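To make the mechanism described above concrete, here is a minimal PyTorch sketch of static hyper-connections: the residual stream is widened to n parallel copies, learnable depth-connections mix the streams into the layer input and write the layer output back to each stream, and a learnable width-connection matrix mixes the streams themselves. All names (HyperConnection, mix_in, out_weights, res_mix) are mine, and the exact parameterization, initialization, and dynamic variants follow the HC paper, not this sketch.

```python
# A minimal sketch of static hyper-connections, assuming an expansion rate
# n_streams and a generic sub-layer F (attention or MLP block).
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    def __init__(self, n_streams: int, dim: int, layer: nn.Module):
        super().__init__()
        self.layer = layer
        # depth-connections: how the streams feed the layer input, and how
        # the layer output is written back to each stream
        self.mix_in = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))
        self.out_weights = nn.Parameter(torch.ones(n_streams))
        # width-connections: learnable mixing among the n residual streams,
        # initialized to the identity so training starts from plain residuals
        self.res_mix = nn.Parameter(torch.eye(n_streams))

    def forward(self, h):  # h: (batch, n_streams, dim)
        x = torch.einsum("n,bnd->bd", self.mix_in, h)       # streams -> layer input
        y = self.layer(x)                                    # y = F(x)
        h = torch.einsum("nm,bmd->bnd", self.res_mix, h)     # width connections
        h = h + self.out_weights[None, :, None] * y[:, None, :]  # depth write-back
        return h

# Usage: wrap each transformer block; expand the embedding to n copies at the
# input and reduce (e.g. average) the streams before the LM head.
block = nn.Sequential(nn.LayerNorm(64), nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))
hc = HyperConnection(n_streams=4, dim=64, layer=block)
h = torch.randn(2, 4, 64)
print(hc(h).shape)  # torch.Size([2, 4, 64])
```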
The DeepSeek paper starts almost in medias res and first underlines a major success of the original HC design: the added mathematical/topological complexity did not result in computational overhead. Yet, does it scale?

Moving beyond small models, there are two major issues: "as the training scale increases, HC introduces potential risks of instability" and "the hardware efficiency concerning memory access costs for the widened residual stream remains unaddressed in the original design". Concretely, naively scaling the HC experiments results in an "unexpected loss surge around the 12k step, which is highly correlated with the instability in the gradient norm".

Consequently, DeepSeek proposes their own variant, Manifold-Constrained Hyper-Connections (mHC). As the name implies, it restricts the learnable objective, preventing deviations from identity mapping, and "effectively constrains the residual connection matrices within the manifold that is constituted by doubly stochastic matrices" (a minimal sketch of this constraint is given after the list below).

The math part (4.1 & 4.2) is very elegant, but clearly not the hardest part. The actual core of the paper is "4.3 efficient training design", where they simply:

1) write three new mHC kernels that "employ mixed-precision strategies to maximize numerical accuracy without compromising speed, and fuse multiple operations with shared memory access into unified compute kernels to reduce memory bandwidth bottlenecks";

2) manage the substantial memory overhead by discarding "the intermediate activations of the mHC kernels" after the forward pass and recomputing them when needed during the backward pass.
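Coming back to the doubly stochastic constraint: here is a small sketch of one standard way to map an unconstrained n x n residual-connection matrix onto (an approximation of) that manifold, via Sinkhorn-style row/column normalization. This is purely illustrative; the exact parameterization and projection used in the mHC paper may differ.

```python
# Sketch: constrain an n x n mixing matrix to be doubly stochastic
# (all rows and columns sum to 1) by Sinkhorn normalization of
# exponentiated logits. Illustrative only, not DeepSeek's exact recipe.
import torch

def doubly_stochastic(logits: torch.Tensor, n_iters: int = 10) -> torch.Tensor:
    m = torch.exp(logits)
    for _ in range(n_iters):
        m = m / m.sum(dim=-1, keepdim=True)  # normalize rows
        m = m / m.sum(dim=-2, keepdim=True)  # normalize columns
    return m

# Initializing the logits near a scaled identity keeps the constrained
# matrix close to the identity mapping, i.e. a plain residual path.
logits = 5.0 * torch.eye(4)
M = doubly_stochastic(logits)
print(M)                    # close to the 4x4 identity
print(M.sum(0), M.sum(1))   # both approximately ones(4)
```

Because the rows and columns of a doubly stochastic matrix each sum to one (and the identity is itself such a matrix), the stream mixing can never amplify or shrink the total residual signal, which is presumably the intuition behind how this constraint tames the gradient-norm instability described above.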