
PaTH Attention: Position Encoding via Accumulating Householder Transformations


Songlin Yang¹, Yikang Shen², Kaiyue Wen³, Mayank Mishra², Liliang Ren⁴, Rameswar Panda²

¹Massachusetts Institute of Technology  ²IBM  ³Stanford University  ⁴Microsoft

yangsl66@mit.edu

Abstract

The attention mechanism is a core primitive in modern large language models (LLMs) and AI more broadly. Since attention by itself is permutation-invariant, position encoding is essential for modeling structured domains such as language. Rotary position encoding (RoPE) has emerged as the de facto standard approach for position encoding and is part of many modern LLMs. However, in RoPE the key/query transformation between two elements in a sequence is only a function of their relative position and otherwise independent of the actual input. This limits the expressivity of RoPE-based transformers. This paper describes PaTH, a flexible data-dependent position encoding scheme based on accumulated products of Householder(-like) transformations, where each transformation is data-dependent, i.e., a function of the input. We derive an efficient parallel algorithm for training by exploiting a compact representation of products of Householder matrices, and implement a FlashAttention-style blockwise algorithm. Across both targeted synthetic benchmarks and moderate-scale real-world language modeling experiments, we find that PaTH improves upon RoPE and other recent baselines. Finally, we show that we can convert pretrained RoPE transformers into PaTH with continued pretraining.

1 Introduction

Attention mechanisms form the backbone of transformer architectures that power contemporary AI systems. Attention is inherently permutation-invariant, and thus encoding positional information into attention is important for effective sequence modeling. Since the original sinusoidal embeddings [77], various position encoding schemes have been proposed over the years [16, 63, 28, 25, 45, 58, 72, inter alia]; see Dufter et al. [17] for a comprehensive survey.
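The permutation invariance noted above can be checked directly. The following is a minimal sketch (variable names are illustrative, not from the paper): for a single query, softmax attention output is unchanged when the key/value sequence is permuted, so without position encoding the model cannot see token order.

```python
import numpy as np

def attention(q, K, V):
    """Single-query softmax attention: softmax(K q) V."""
    logits = K @ q                      # (n,)
    w = np.exp(logits - logits.max())   # numerically stable softmax
    w /= w.sum()
    return w @ V                        # (d,)

rng = np.random.default_rng(0)
n, d = 6, 4
q = rng.normal(size=d)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

# Permute keys and values jointly: the output is identical.
perm = rng.permutation(n)
out = attention(q, K, V)
out_perm = attention(q, K[perm], V[perm])
print(np.allclose(out, out_perm))  # True: order carries no information
```

This is exactly the gap that position encoding schemes, including RoPE and PaTH, are designed to close.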
Among these, rotary position embedding (RoPE) [72] has emerged as the de facto standard.

(arXiv:2505.16381v2 [cs.CL] 3 Feb 2026)

RoPE works by transforming the key (k_j) and query (q_i) embeddings through a rotation matrix R whose rotation angle is a function of the difference in positions, resulting in the bilinear form q_i^T R_{i-j} k_j for the attention logits. The rotation matrix R is itself a block-diagonal matrix composed of 2x2 rotation blocks.

(The implementation of the PaTH attention layer is also made available as part of the FLASHLINEARATTENTION library [80, 79]: https://github.com/fla-org/flash-linear-attention)

Given the central role attention layers play in our LLMs, these failure modes highlight the need to design new primitives that can overcome the theoretical and empirical limitations of existing attention layers.

This work develops PaTH, a position encoding scheme based on accumulated Householder transformations, targeting the above problem. In PaTH, the attention logit is still parameterized as a bilinear form q_i^T H_{ij} k_j, but the matrix H_{ij} in R^{d x d} is a product of data-dependent matrices along the path between positions j and i, where each matrix has a Householder-like identity-plus-rank-one structure. Intuitively, this formulation captures the cumulative transformation between positions, enabling PaTH to dynamically adapt to input data and solve certain state-tracking problems. Indeed, we show that a constant-layer PaTH-based transformer can solve an NC^1-complete problem under AC^0 reductions, placing it beyond the TC^0 complexity class (assuming TC^0 != NC^1).

To scale up PaTH attention, we develop a FlashAttention-like algorithm [14] for hardware-efficient parallel training that leverages a compact representation of products of Householder matrices [5, 27]. Empirical results show that PaTH-based models can solve challenging synthetic state-tracking tasks where RoPE-based transformers struggle. On moderate-scale language modeling with 760M-parameter transformers, PaTH outperforms both RoPE and the Forgetting Transformer [39], which modulates attention logits via a data-dependent additive term.
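RoPE's defining relative-position property can be verified numerically. The sketch below (2-D case, one frequency; names are illustrative) shows that rotating queries and keys by position-proportional angles makes the inner product depend only on the position difference, up to the sign convention chosen for the angle.

```python
import numpy as np

def rot(theta):
    """2x2 rotation matrix, one block of RoPE's block-diagonal R."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

theta = 0.1
q = np.array([0.3, -1.2])
k = np.array([0.7, 0.5])

# (R(i*theta) q)^T (R(j*theta) k) = q^T R((j-i)*theta) k:
# the absolute positions i, j cancel; only j - i remains.
i, j = 5, 2
lhs = (rot(i * theta) @ q) @ (rot(j * theta) @ k)
rhs = q @ rot((j - i) * theta) @ k
print(np.allclose(lhs, rhs))  # True
```

The cancellation works because 2-D rotations commute and R(theta)^T = R(-theta); crucially, R here is fixed and input-independent, which is the limitation PaTH removes.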
Combining PaTH with the Forgetting Transformer yields further gains, and the resulting models generalize well beyond the training sequence length. Finally, we show that we can convert pretrained RoPE transformers into PaTH with continued pretraining.

2 PaTH Attention

PaTH employs a dynamic data-dependent transition matrix, in particular identity-plus-rank-one Householder-like transformations, for computing the bilinear attention logits, unlike RoPE, which applies a fixed transformation at each time step.

2.1 Generalizing RoPE with Multiplicative Position Encodings

Traditional additive position encodings, such as sinusoidal embeddings [77] or ALiBi [58], represent positions as vectors or matrices summed directly with token embeddings or attention logits. RoPE instead encodes relative positions multiplicatively, by directly modulating the key/query vectors via position-dependent transformations.

The class of multiplicative position encodings can more generally be defined as

    A_ij ∝ exp( k_j^T ( ∏_{s=j+1}^{i} H_s ) q_i ),

where i and j are the positions of the query and key, and H_s in R^{d x d} is a transition matrix. RoPE is thus a special case of the above with a static transition matrix H.
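The general form above can be sketched as a naive, quadratic-time reference (illustrative only, not the paper's FlashAttention-style kernel; the parameterization of beta_s is an assumption for the demo). Each position s carries a Householder-like transition H_s = I + beta_s * w_s w_s^T with data-dependent w_s, and the logit for a pair (i, j) applies the cumulative product of transitions between them to the query; because each factor is identity-plus-rank-one, each step costs only O(d).

```python
import numpy as np

def path_logit(q_i, k_j, Ws, betas, j, i):
    """k_j^T ( H_{j+1} ... H_i ) q_i, applied right-to-left onto q_i."""
    v = q_i.copy()
    for s in range(i, j, -1):           # s = i, i-1, ..., j+1
        w = Ws[s]
        v = v + betas[s] * w * (w @ v)  # (I + beta_s w w^T) v in O(d)
    return k_j @ v

rng = np.random.default_rng(1)
n, d = 8, 4
Ws = rng.normal(size=(n, d))
Ws /= np.linalg.norm(Ws, axis=1, keepdims=True)  # unit directions
betas = -2.0 * np.ones(n)   # beta = -2 with unit w gives exact Householder reflections
q = rng.normal(size=d)
k = rng.normal(size=d)

print(path_logit(q, k, Ws, betas, j=2, i=6))
```

With a constant H_s this collapses to a RoPE-like static scheme; making w_s (and beta_s) functions of the input at position s is what gives PaTH its data-dependent, state-tracking behavior.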