
DeepSeek mHC: Manifold-Constrained Hyper-Connections


This is actually an engineering paper, taking as its starting point ideas already exposed in the original Hyper-Connections (HC) paper from ByteDance, which is consequently a prerequisite for reading it. So, some initial notes on that paper first.

As a preamble, HC unexpectedly intersects with two big open questions that have been bothering me ever since SYNTH:

1) Reasoning capacities seem to emerge from depth, and so, indirectly, from better layer combinations. This is especially striking for math, and Circuit Transformers already suggest that models perform formal operations at the sub-token level; drafts just wrap this process in another time dimension. But then, how can we build more optimal layer combinations/assignments? This becomes even more critical as we scale depth (or nest it through MoE): interpretability studies have shown that layers are largely redundant.

2) Synthetic data has become the most efficient way to train models, mostly because we delegate "training" to the shape of the data. Paraphrasing is literally a way to extrapolate the memorization process in the transformer world, as we create endless variations of the same knowledge components. If training were really optimized, this should be mostly internalized. So how can we build efficient training?

It's not surprising that hyper-connections are immediately associated with Muon. The general idea is similar: make better training updates. Yet there is a major difference: hyper-connections are a low-level change, transforming a decade-old piece of deep learning infrastructure, the residual function F, and making it trainable.

Current normalization approaches scale well and yet result in "representation collapse", "where hidden features in deeper layers become highly similar, diminishing the contribution of additional layers as their number increases." To address this, hyper-connections introduce entirely new learnable objectives for "depth-connections and width-connections". In theory, "learning the hyper-connection matrix in various forms can create layer arrangements that surpass traditional sequential and parallel configurations, resulting in a soft-mixture or even dynamic arrangement".

The original HC paper does manage to retrain a small Olmo-MoE and demonstrate that it "converges 1.8 times faster and shows an improvement of 6 points on ARC-Challenge compared to the baseline trained with 500B tokens". Layer interpretability suggests that "the baseline tends toward representation collapse", while the HC variant "exhibits significantly lower similarity between features".
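To make the mechanism described above concrete, here is a minimal PyTorch sketch of static hyper-connections: the residual stream is widened to n parallel copies, learnable depth-connections mix the streams into the layer input and write the layer output back to each stream, and a learnable width-connection matrix mixes the streams themselves. All names (HyperConnection, mix_in, out_weights, res_mix) are mine, and the exact parameterization, initialization, and dynamic variants follow the HC paper, not this sketch.

```python
# A minimal sketch of static hyper-connections, assuming an expansion rate
# n_streams and a generic sub-layer F (attention or MLP block).
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    def __init__(self, n_streams: int, dim: int, layer: nn.Module):
        super().__init__()
        self.layer = layer
        # depth-connections: how the streams feed the layer input, and how
        # the layer output is written back to each stream
        self.mix_in = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))
        self.out_weights = nn.Parameter(torch.ones(n_streams))
        # width-connections: learnable mixing among the n residual streams,
        # initialized to the identity so training starts from plain residuals
        self.res_mix = nn.Parameter(torch.eye(n_streams))

    def forward(self, h):  # h: (batch, n_streams, dim)
        x = torch.einsum("n,bnd->bd", self.mix_in, h)       # streams -> layer input
        y = self.layer(x)                                    # y = F(x)
        h = torch.einsum("nm,bmd->bnd", self.res_mix, h)     # width connections
        h = h + self.out_weights[None, :, None] * y[:, None, :]  # depth write-back
        return h

# Usage: wrap each transformer block; expand the embedding to n copies at the
# input and reduce (e.g. average) the streams before the LM head.
block = nn.Sequential(nn.LayerNorm(64), nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))
hc = HyperConnection(n_streams=4, dim=64, layer=block)
h = torch.randn(2, 4, 64)
print(hc(h).shape)  # torch.Size([2, 4, 64])
```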
The DeepSeek paper starts almost in medias res and first underlines a major success of the original HC design: the added mathematical/topological complexity did not result in computational overhead. Yet, does it scale?

Moving beyond small models, there are two major issues: "as the training scale increases, HC introduces potential risks of instability" and "the hardware efficiency concerning memory access costs for the widened residual stream remains unaddressed in the original design". Concretely, naively scaling the HC experiments results in an "unexpected loss surge around the 12k step, which is highly correlated with the instability in the gradient norm".

Consequently, DeepSeek proposes their own variant, Manifold-Constrained Hyper-Connections (mHC). As the name implies, it restricts the learnable objective, preventing deviations from identity mapping, and "effectively constrains the residual connection matrices within the manifold that is constituted by doubly stochastic matrices" (a minimal sketch of this constraint is given after the list below).

The math part (4.1 & 4.2) is very elegant, but clearly not the hardest part. The actual core of the paper is "4.3 efficient training design", where they simply:

1) write three new mHC kernels that "employ mixed-precision strategies to maximize numerical accuracy without compromising speed, and fuse multiple operations with shared memory access into unified compute kernels to reduce memory bandwidth bottlenecks";

2) manage the substantial memory overhead by discarding "the intermediate activations of the mHC kernels" after the forward pass and recomputing them when needed during the backward pass.
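Coming back to the doubly stochastic constraint: here is a small sketch of one standard way to map an unconstrained n x n residual-connection matrix onto (an approximation of) that manifold, via Sinkhorn-style row/column normalization. This is purely illustrative; the exact parameterization and projection used in the mHC paper may differ.

```python
# Sketch: constrain an n x n mixing matrix to be doubly stochastic
# (all rows and columns sum to 1) by Sinkhorn normalization of
# exponentiated logits. Illustrative only, not DeepSeek's exact recipe.
import torch

def doubly_stochastic(logits: torch.Tensor, n_iters: int = 10) -> torch.Tensor:
    m = torch.exp(logits)
    for _ in range(n_iters):
        m = m / m.sum(dim=-1, keepdim=True)  # normalize rows
        m = m / m.sum(dim=-2, keepdim=True)  # normalize columns
    return m

# Initializing the logits near a scaled identity keeps the constrained
# matrix close to the identity mapping, i.e. a plain residual path.
logits = 5.0 * torch.eye(4)
M = doubly_stochastic(logits)
print(M)                    # close to the 4x4 identity
print(M.sum(0), M.sum(1))   # both approximately ones(4)
```

Because the rows and columns of a doubly stochastic matrix each sum to one (and the identity is itself such a matrix), the stream mixing can never amplify or shrink the total residual signal, which is presumably the intuition behind how this constraint tames the gradient-norm instability described above.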