
AWS Trainium3 Deep Dive | A Potential Challenger Approaching

Information Technology 2025-12-05 - Unknown institution xingxing+

Step-Function Software & System Improvements, "Amazon Basics" GB200 NVL36x2, NL72x2/NL32x2 Scale Up Rack Architecture, Optimized Perf per TCO, Trainium4

Trainium3: A New Challenger Approaching!

Hot on the heels of our 10K word deep dive on TPUs, Amazon launched Trainium3 (Trn3) general availability and announced Trainium4 (Trn4) at its annual AWS re:Invent. Amazon has had the longest and broadest history of custom silicon in the datacenter. While they were behind in AI for quite some time, they are rapidly progressing to be competitive. Last year we detailed Amazon's ramp of its Trainium2 (Trn2) accelerators aimed at internal Bedrock workloads and Anthropic's training/inference needs.

Amazon's AI Self Sufficiency | Trainium2 Architecture & Networking
Dylan Patel, Daniel Nishball, and Reyk Knuhtsen

Since then, through our datacenter model and accelerator model, we have detailed the huge ramp that led to our blockbuster call that AWS would accelerate on revenue.

Amazon's AI Resurgence: AWS & Anthropic's Multi-Gigawatt Trainium Expansion

Today, we are publishing our next technical bible on the step-function improvement that is Trainium3: the chip, microarchitecture, system and rack architecture, scale up, profilers, software platform, and datacenter ramps. This is the most detailed piece we've written on an accelerator and its hardware/software. On desktop, there is a table of contents that makes it possible to review specific sections.
Amazon Basics GB200 aka GB200-at-Home

With Trainium3, AWS remains laser-focused on optimizing performance per total cost of ownership (perf per TCO). Their hardware North Star is simple: deliver the fastest time to market at the lowest TCO. Rather than committing to any single architectural design, AWS maximizes operational flexibility. This extends from their work with multiple partners on the custom silicon side, to the management of their own supply chain, to multi-sourcing components from multiple vendors.

On the systems and networking front, AWS is following an "Amazon Basics" approach that optimizes for perf per TCO. Design choices such as whether to use a 12.8T, 25.6T or 51.2T bandwidth scale-out switch, or whether to select liquid or air cooling, are merely a means to an end: providing the best TCO for a given customer and a given datacenter.

For the scale-up network, while Trn2 only supports a 4x4x4 3D torus mesh scale-up topology, Trainium3 adds a unique switched fabric that is somewhat similar to the GB200 NVL36x2 topology with a few key differences. This switched fabric was added because a switched scale-up topology has better absolute performance and perf per TCO for frontier Mixture-of-Experts (MoE) model architectures.
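To make the torus versus switched comparison concrete, here is a minimal sketch that counts link hops between chips on a 4x4x4 3D torus. The function names and the constant `IDEAL_SWITCHED_HOPS` are illustrative, not from the article. The switched case is a deliberately idealized assumption (a single switch tier, so every pair is chip to switch to chip) and does not model the actual Trainium3 or NVL36x2 fabric.

```python
from itertools import product

def torus_hops(a, b, dims):
    """Minimum link hops between chips a and b on a torus (wraparound in each dimension)."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def torus_stats(dims=(4, 4, 4)):
    """Chip count, worst-case and mean hop distance from one chip to all others."""
    chips = list(product(*(range(d) for d in dims)))
    src = chips[0]  # a torus is symmetric, so any source chip gives the same result
    dists = [torus_hops(src, dst, dims) for dst in chips if dst != src]
    return {
        "chips": len(chips),
        "max_hops": max(dists),
        "mean_hops": sum(dists) / len(dists),
    }

# Assumption (not from the article): an idealized single-tier switch,
# so every chip pair is chip -> switch -> chip, i.e. 2 links.
IDEAL_SWITCHED_HOPS = 2

stats = torus_stats()
print(stats, "idealized switched hops:", IDEAL_SWITCHED_HOPS)
```

Under these assumptions the 4x4x4 torus spans 64 chips, with a worst case of 6 link hops and a mean of roughly 3.05, while the idealized switch keeps every pair at a constant 2 links. Hop count is only a rough proxy for latency and bandwidth contention, but it hints at why the all-to-all traffic typical of MoE expert routing tends to favor a switched fabric.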
Even for the switches used in this scale-up architecture, AWS has decided to not decide: they will go with three different scale-up switch solutions over the lifecycle of Trainium3, starting with a 160-lane, 20-port PCIe switch for fast time to market due to the limited availability today of high lane and port count PCIe switches, later switching to 320-lane PCIe switches, and ultimately a larger UALink to pivot towards best performance.

Amazon's Software North Star

On the software front, AWS's North Star expands and opens their software stack to target the masses, moving beyond just optimizing perf per TCO for internal Bedrock workloads (i.e. DeepSeek/Qwen/etc., which run a private fork of vLLM v1) and for Anthropic's training and inference workloads (which run a custom inference engine and all-custom NKI kernels).

In fact, they are conducting a massive, multi-phase shift in software strategy. Phase 1 is releasing and open sourcing a new native PyTorch backend. They will also be open sourcing kernel and communication libraries matmul and ML ops (a