
AWS Trainium3 Deep Dive | A Potential Challenger Is Approaching

Information Technology 2025-12-05 - Unknown institution xingxing+

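Trainium3 is Amazon's new-generation AI accelerator, designed to raise performance and lower total cost of ownership (TCO). Its core strategy is a "step-function" approach: hardware and software are improved over multiple phases to achieve fast time to market and cost efficiency.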

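Hardware:

Process node: moves from N5 to N3P, raising clock frequency and lowering power.
Memory: HBM3E stacks move to 12-high, giving 144 GB per chip and 70% more bandwidth.
Interconnect: adopts PCIe Gen 6; per-chip scale-up bandwidth doubles to 1.2 TB/s.
Network architecture: introduces a unique switched scale-up topology that improves performance on frontier MoE models and supports different generations of scale-up switches, balancing time to market against cost.
Server types: two rack SKUs, NL32x2 and NL72x2, air-cooled and liquid-cooled respectively, to meet different needs.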
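The source article notes that Trn2 supports only a 4x4x4 3D torus for scale-up, while Trainium3 adds a switched fabric. One way to see why this matters for all-to-all heavy MoE traffic is to count hops. The sketch below is illustrative only: the function names are hypothetical, and it models just the torus side, counting link hops between chips under shortest-path routing. An idealized single-tier switch, in which every pair of chips is one switch traversal apart, is assumed here purely as a reference point and is not a description of the actual Trainium3 fabric.

```python
from itertools import product


def ring_distance(a, b, size):
    """Shortest distance between two positions on a ring with wraparound."""
    d = abs(a - b)
    return min(d, size - d)


def torus_hops(src, dst, dims):
    """Link hops between two chips in a torus under shortest-path routing."""
    return sum(ring_distance(s, d, n) for s, d, n in zip(src, dst, dims))


def torus_stats(dims):
    """Return (chip count, worst-case hops, mean hops to another chip)."""
    nodes = list(product(*(range(n) for n in dims)))
    # A torus looks the same from every node, so one source is enough.
    origin = nodes[0]
    hops = [torus_hops(origin, node, dims) for node in nodes if node != origin]
    return len(nodes), max(hops), sum(hops) / len(hops)


if __name__ == "__main__":
    chips, diameter, mean_hops = torus_stats((4, 4, 4))
    print(f"chips={chips} worst-case hops={diameter} mean hops={mean_hops:.2f}")
```

For the 4x4x4 case this gives 64 chips, a worst case of 6 hops, and a mean of roughly 3.05 hops to another chip. Expert-parallel traffic that sends tokens to arbitrary chips pays that multi-hop cost on a torus, which is consistent with the article's claim that a switched topology suits frontier MoE models better.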

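Software:

Open-source strategy: open-sources the PyTorch backend, kernel libraries, and the XLA graph compiler to build an open developer ecosystem.
Performance: supports multiple numeric formats, including BF16, FP16, FP32, and MXFP8/MXFP4, with hardware acceleration for them.
Flexibility: supports multiple parallelism strategies, including expert parallelism, tensor parallelism, and context parallelism, to suit different workloads.
Developer tools: provides the NeuronExplorer profiling tool to help developers optimize model performance.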
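MXFP8 and MXFP4 in the list above are microscaling (MX) formats. In the OCP MX specification, a block of 32 elements shares a single power-of-two scale, and each element is stored in a narrow float type (E2M1 for MXFP4). The sketch below simulates that idea in plain Python so the trade-off is visible. It is an assumption-laden illustration: the function names are hypothetical, rounding ties are resolved toward the smaller magnitude for simplicity, and nothing here reflects how Trainium3 hardware or the Neuron software stack implements these formats.

```python
import math

BLOCK_SIZE = 32  # elements sharing one scale in the MX formats
# Magnitudes representable by an FP4 E2M1 element.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
FP4_EMAX = 2  # exponent of the largest E2M1 value (6.0 = 1.5 * 2**2)


def quantize_block(values):
    """Quantize one block to (shared_exponent, list of E2M1 element values)."""
    if len(values) > BLOCK_SIZE:
        raise ValueError("an MX block holds at most 32 elements")
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return 0, [0.0] * len(values)
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, so floor(log2) = e - 1.
    floor_log2 = math.frexp(max_abs)[1] - 1
    shared_exp = floor_log2 - FP4_EMAX
    scale = 2.0 ** shared_exp
    elements = []
    for v in values:
        scaled = min(abs(v) / scale, FP4_GRID[-1])  # saturate at 6.0
        # Simplification: on a tie, min() keeps the smaller grid value.
        nearest = min(FP4_GRID, key=lambda g: abs(g - scaled))
        elements.append(math.copysign(nearest, v))
    return shared_exp, elements


def dequantize_block(shared_exp, elements):
    """Reconstruct approximate floats from a quantized block."""
    scale = 2.0 ** shared_exp
    return [e * scale for e in elements]


if __name__ == "__main__":
    block = [48.0, 24.0, 8.0, 4.0, -12.0, 0.0]
    exp, elems = quantize_block(block)
    print(exp, elems, dequantize_block(exp, elems))
```

Because the scale is shared, the largest value in a block decides how much resolution is left for the small ones: in the block `[1.0, 0.3]` the value 0.3 comes back as 0.25.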

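Key data:

Trainium3 has 8 NeuronCores per chip; each NeuronCore contains a tensor engine, a vector engine, a scalar engine, and a GpSimd engine.
Trainium3 supports several native PyTorch APIs, including torch.compile, DTensor, and FSDP.
Trainium3 performs well on both inference and training workloads, and supports dynamic MoE grouped GEMM and dedicated cores for collective communication.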
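"Dynamic MoE grouped GEMM" refers to the fact that, in a Mixture-of-Experts layer, the router decides at run time which expert each token goes to, so the per-expert batch sizes vary from step to step. A grouped GEMM then runs one matrix multiply per expert over just that expert's tokens. The following is a minimal pure-Python sketch of the data flow, assuming top-1 routing and tiny nested-list matrices. All names are hypothetical and it says nothing about how the actual Trainium3 kernels are written.

```python
from collections import defaultdict


def matmul(a, b):
    """Plain row-by-column matrix multiply on nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def grouped_gemm(tokens, expert_ids, expert_weights):
    """Apply each token's assigned expert matrix, one GEMM per expert group.

    tokens:         list of token vectors
    expert_ids:     expert index chosen by the router for each token (top-1)
    expert_weights: mapping from expert index to its weight matrix
    """
    # Group token positions by expert; group sizes are only known at run time.
    groups = defaultdict(list)
    for position, expert in enumerate(expert_ids):
        groups[expert].append(position)

    outputs = [None] * len(tokens)
    for expert, positions in groups.items():
        batch = [tokens[p] for p in positions]
        result = matmul(batch, expert_weights[expert])
        # Scatter the rows back to the tokens' original positions.
        for p, row in zip(positions, result):
            outputs[p] = row
    return outputs


if __name__ == "__main__":
    tokens = [[1, 2], [3, 4], [5, 6]]
    expert_ids = [1, 0, 1]
    weights = {0: [[1, 0], [0, 1]], 1: [[2, 0], [0, 2]]}
    print(grouped_gemm(tokens, expert_ids, weights))
```

In this toy run expert 1 receives two tokens and expert 0 receives one; a different routing decision would change those group sizes without changing the code, which is the "dynamic" part.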

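Research conclusions:

Trainium3 is a competitive AI accelerator, with advantages in performance and cost efficiency.
Through open-source software and a flexible hardware strategy, AWS has raised Trainium3 adoption.
Trainium3 will challenge NVIDIA's market position, but NVIDIA will remain in the lead.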

Step-Function Software & System Improvements, "Amazon Basics" GB200 NVL36x2, NL72x2/NL32x2 Scale Up Rack Architecture, Optimized Perf per TCO, Trainium4

Trainium3: A New Challenger Approaching!

Hot on the heels of our 10K word deep dive on TPUs, Amazon launched Trainium3 (Trn3) general availability and announced Trainium4 (Trn4) at its annual AWS re:Invent. Amazon has had the longest and broadest history of custom silicon in the datacenter. While they were behind in AI for quite some time, they are rapidly progressing to be competitive. Last year we detailed Amazon's ramp of its Trainium2 (Trn2) accelerators aimed at internal Bedrock workloads and Anthropic's training/inference needs.

Amazon's AI Self Sufficiency | Trainium2 Architecture & Networking
DYLAN PATEL, DANIEL NISHBALL, AND REYK KNUHTSEN

Since then, through our datacenter model and accelerator model, we have detailed the huge ramp that led to our blockbuster call that AWS would accelerate on revenue.

Amazon's AI Resurgence: AWS & Anthropic's Multi-Gigawatt Trainium Expansion

Today, we are publishing our next technical bible on the step-function improvement that is the Trainium3 chip, microarchitecture, system and rack architecture, scale up, profilers, software platform, and datacenter ramps. This is the most detailed piece we've written on an accelerator and its hardware/software. On desktop there is a table of contents that makes it possible to review specific sections.

Amazon Basics GB200 aka GB200-at-Home

With Trainium3, AWS remains laser-focused on optimizing performance per total cost of ownership (perf per TCO). Their hardware North Star is simple: deliver the fastest time to market at the lowest TCO. Rather than committing to any single architectural design, AWS maximizes operational flexibility. This extends from their work with multiple partners on the custom silicon side to the management of their own supply chain to multi-sourcing multiple component vendors.

On the systems and networking front, AWS is following an "Amazon Basics" approach that optimizes for perf per TCO. Design choices such as whether to use a 12.8T, 25.6T or a 51.2T bandwidth scale-out switch or to select liquid vs air cooling are merely a means to an end: delivering the best TCO for a given customer and a given datacenter.

For the scale-up network, while Trn2 only supports a 4x4x4 3D Torus mesh scale-up topology, Trainium3 adds a unique switched fabric that is somewhat similar to the GB200 NVL36x2 topology with a few key differences. This switched fabric was added because a switched scale-up topology has better absolute performance and perf per TCO for frontier Mixture-of-Experts (MoE) model architectures.

Even for the switches used in this scale-up architecture, AWS has decided to not decide: they will go with three different scale-up switch solutions over the lifecycle of Trainium3, starting with a 160 lane, 20 port PCIe switch for fast time to market due to the limited availability today of high lane & port count PCIe switches, later switching to 320 lane PCIe switches and ultimately a larger UALink to pivot towards best performance.

Amazon's Software North Star

On the software front, AWS's North Star expands and opens their software stack to target the masses, moving beyond just optimizing perf per TCO for internal Bedrock workloads (i.e. DeepSeek/Qwen/etc., which run a private fork of vLLM v1) and for Anthropic's training and inference workloads (which run a custom inference engine and all custom NKI kernels).

In fact, they are conducting a massive, multi-phase shift in software strategy. Phase 1 is releasing and open sourcing a new native PyTorch backend. They will also be open sourcing kernel and communication libraries matmul and ML ops (a
