
PaTH Attention: Position Encoding via Accumulating Householder Transformations


Songlin Yang¹, Yikang Shen², Kaiyue Wen³, Mayank Mishra², Liliang Ren⁴, Rameswar Panda²

¹Massachusetts Institute of Technology  ²IBM  ³Stanford University  ⁴Microsoft

yangsl66@mit.edu

Abstract

The attention mechanism is a core primitive in modern large language models (LLMs) and AI more broadly. Since attention by itself is permutation-invariant, position encoding is essential for modeling structured domains such as language. Rotary position encoding (RoPE) has emerged as the de facto standard approach for position encoding and is part of many modern LLMs. However, in RoPE the key/query transformation between two elements in a sequence is only a function of their relative position and otherwise independent of the actual input. This limits the expressivity of RoPE-based transformers. This paper describes PaTH, a flexible data-dependent position encoding scheme based on accumulated products of Householder(-like) transformations, where each transformation is data-dependent, i.e., a function of the input. We derive an efficient parallel algorithm for training by exploiting a compact representation of products of Householder matrices, and implement a FlashAttention-style blockwise algorithm. Across both targeted synthetic benchmarks and moderate-scale real-world language modeling experiments, we find that PaTH improves upon RoPE and other recent baselines. Finally, we show that we can convert pretrained RoPE transformers into PaTH with continued pretraining.

1 Introduction

Attention mechanisms form the backbone of transformer architectures that power contemporary AI systems. Attention is inherently permutation-invariant, and thus encoding positional information into attention is important for effective sequence modeling. Since the original sinusoidal embeddings [77], various position encoding schemes have been proposed over the years [16, 63, 28, 25, 45, 58, 72, inter alia]; see Dufter et al. [17] for a comprehensive survey.
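The permutation invariance noted above can be checked directly. The following is a minimal sketch (variable names are illustrative, not from the paper): for a single query, softmax attention output is unchanged when the key/value sequence is permuted, so without position encoding the model cannot see token order.

```python
import numpy as np

def attention(q, K, V):
    """Single-query softmax attention: softmax(K q) V."""
    logits = K @ q                      # (n,)
    w = np.exp(logits - logits.max())   # numerically stable softmax
    w /= w.sum()
    return w @ V                        # (d,)

rng = np.random.default_rng(0)
n, d = 6, 4
q = rng.normal(size=d)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

# Permute keys and values jointly: the output is identical.
perm = rng.permutation(n)
out = attention(q, K, V)
out_perm = attention(q, K[perm], V[perm])
print(np.allclose(out, out_perm))  # True: order carries no information
```

This is exactly the gap that position encoding schemes, including RoPE and PaTH, are designed to close.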
Among these, rotary position embedding (RoPE) [72] has emerged as the de facto standard.

(arXiv:2505.16381v2 [cs.CL] 3 Feb 2026)

RoPE works by transforming the key (k_j) and query (q_i) embeddings through a rotation matrix R whose rotation angle is a function of the difference in positions, resulting in the bilinear form q_i^T R_{i-j} k_j for the attention logits. The rotation matrix R is itself a block-diagonal matrix composed of 2x2 rotation blocks.

(The implementation of the PaTH attention layer is also made available as part of the FLASHLINEARATTENTION library [80, 79]: https://github.com/fla-org/flash-linear-attention)

Given the central role attention layers play in our LLMs, these failure modes highlight the need to design new primitives that can overcome the theoretical and empirical limitations of existing attention layers.

This work develops PaTH, a position encoding scheme based on accumulated Householder transformations, targeting the above problem. In PaTH, the attention logit is still parameterized as a bilinear form q_i^T H_{ij} k_j, but the matrix H_{ij} in R^{d x d} is a product of data-dependent matrices along the path between positions j and i, where each matrix has a Householder-like identity-plus-rank-one structure. Intuitively, this formulation captures the cumulative transformation between positions, enabling PaTH to dynamically adapt to input data and solve certain state-tracking problems. Indeed, we show that a constant-layer PaTH-based transformer can solve an NC^1-complete problem under AC^0 reductions, placing it beyond the TC^0 complexity class (assuming TC^0 != NC^1).

To scale up PaTH attention, we develop a FlashAttention-like algorithm [14] for hardware-efficient parallel training that leverages a compact representation of products of Householder matrices [5, 27]. Empirical results show that PaTH-based models can solve challenging synthetic state-tracking tasks where RoPE-based transformers struggle. On moderate-scale language modeling with 760M-parameter transformers, PaTH outperforms both RoPE and the Forgetting Transformer [39], which modulates attention logits via a data-dependent additive term.
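RoPE's defining relative-position property can be verified numerically. The sketch below (2-D case, one frequency; names are illustrative) shows that rotating queries and keys by position-proportional angles makes the inner product depend only on the position difference, up to the sign convention chosen for the angle.

```python
import numpy as np

def rot(theta):
    """2x2 rotation matrix, one block of RoPE's block-diagonal R."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

theta = 0.1
q = np.array([0.3, -1.2])
k = np.array([0.7, 0.5])

# (R(i*theta) q)^T (R(j*theta) k) = q^T R((j-i)*theta) k:
# the absolute positions i, j cancel; only j - i remains.
i, j = 5, 2
lhs = (rot(i * theta) @ q) @ (rot(j * theta) @ k)
rhs = q @ rot((j - i) * theta) @ k
print(np.allclose(lhs, rhs))  # True
```

The cancellation works because 2-D rotations commute and R(theta)^T = R(-theta); crucially, R here is fixed and input-independent, which is the limitation PaTH removes.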
Combining PaTH with the Forgetting Transformer yields further gains, and the resulting models generalize well beyond the training sequence length. Finally, we show that we can convert pretrained RoPE transformers into PaTH with continued pretraining.

2 PaTH Attention

PaTH employs a dynamic data-dependent transition matrix, in particular identity-plus-rank-one Householder-like transformations, for computing the bilinear attention logits, unlike RoPE, which applies a fixed transformation at each time step.

2.1 Generalizing RoPE with Multiplicative Position Encodings

Traditional additive position encodings, such as sinusoidal embeddings [77] or ALiBi [58], represent positions as vectors or matrices summed directly with token embeddings or attention logits. RoPE instead encodes relative positions multiplicatively, by directly modulating the key/query vectors via position-dependent transformations.

The class of multiplicative position encodings can more generally be defined as

    A_ij ∝ exp( k_j^T ( ∏_{s=j+1}^{i} H_s ) q_i ),

where i and j are the positions of the query and key, and H_s in R^{d x d} is a transition matrix. RoPE is thus a special case of the above with a static transition matrix H.
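The general form above can be sketched as a naive, quadratic-time reference (illustrative only, not the paper's FlashAttention-style kernel; the parameterization of beta_s is an assumption for the demo). Each position s carries a Householder-like transition H_s = I + beta_s * w_s w_s^T with data-dependent w_s, and the logit for a pair (i, j) applies the cumulative product of transitions between them to the query; because each factor is identity-plus-rank-one, each step costs only O(d).

```python
import numpy as np

def path_logit(q_i, k_j, Ws, betas, j, i):
    """k_j^T ( H_{j+1} ... H_i ) q_i, applied right-to-left onto q_i."""
    v = q_i.copy()
    for s in range(i, j, -1):           # s = i, i-1, ..., j+1
        w = Ws[s]
        v = v + betas[s] * w * (w @ v)  # (I + beta_s w w^T) v in O(d)
    return k_j @ v

rng = np.random.default_rng(1)
n, d = 8, 4
Ws = rng.normal(size=(n, d))
Ws /= np.linalg.norm(Ws, axis=1, keepdims=True)  # unit directions
betas = -2.0 * np.ones(n)   # beta = -2 with unit w gives exact Householder reflections
q = rng.normal(size=d)
k = rng.normal(size=d)

print(path_logit(q, k, Ws, betas, j=2, i=6))
```

With a constant H_s this collapses to a RoPE-like static scheme; making w_s (and beta_s) functions of the input at position s is what gives PaTH its data-dependent, state-tracking behavior.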