
GTC 2026 – The Inference Kingdom Expands

2026-04-02 SiAnalysis 何杰斌

At GTC 2026, Nvidia unveiled three entirely new systems: Groq LPX, Vera ETL256, and STX. It also gave updates on the Groq LPU, the Kyber rack architecture, its CPO roadmap, and CMX.

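Groq LPU:

Nvidia paid Groq $20B to license its IP and hire most of its team, a deal that works like an acquisition without legally being one.
The Groq LPU uses a single-level SRAM design that delivers low latency and high bandwidth, at the price of low memory density and high cost.
Groq LPU 3 (LP30) is manufactured on Samsung's SF4X process. Its SerDes issues have been resolved, but the performance gain comes mainly from the process migration.
Groq LPU 4 (LP40) will move to TSMC's N3P process with CoWoS-R packaging and will use Nvidia's own IP.
The LPU is integrated with the GPU through attention and feed-forward network disaggregation (AFD): attention is mapped to the GPU and the FFN is mapped to the LPU.
The LPU can also be used to accelerate speculative decoding and raise throughput.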
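
As a rough illustration of the attention/FFN split described above, the following toy Python sketch separates one decode step into a stateful attention worker that owns the KV cache (standing in for the GPU) and a stateless FFN worker with fixed weights (standing in for the LPU). The class names, the identity Q/K/V projections, and the tiny weight matrices are assumptions made for illustration only; this is not Nvidia's or Groq's implementation.

```python
import math


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


class AttentionWorker:
    """Hypothetical GPU-side worker: stateful, owns the growing KV cache."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, x):
        # Toy single-head attention with identity Q/K/V projections.
        self.keys.append(x)
        self.values.append(x)
        scale = math.sqrt(len(x))
        scores = [sum(q * k for q, k in zip(x, key)) / scale for key in self.keys]
        weights = softmax(scores)
        ctx = [sum(w * v[i] for w, v in zip(weights, self.values))
               for i in range(len(x))]
        return [a + b for a, b in zip(x, ctx)]  # residual connection


class FFNWorker:
    """Hypothetical LPU-side worker: stateless, weights fixed at load time."""

    def __init__(self, w_up, w_down):
        self.w_up = w_up
        self.w_down = w_down

    def step(self, h):
        up = [max(0.0, sum(w * v for w, v in zip(row, h))) for row in self.w_up]
        down = [sum(w * v for w, v in zip(row, up)) for row in self.w_down]
        return [a + b for a, b in zip(h, down)]  # residual connection


def decode_step(x, attn, ffn):
    h = attn.step(x)    # runs where the KV cache lives
    return ffn.step(h)  # only the activation vector crosses devices


attn = AttentionWorker()
ffn = FFNWorker(w_up=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                w_down=[[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]])
for token in ([1.0, 0.0], [0.0, 1.0]):
    out = decode_step(token, attn, ffn)
print("cached tokens:", len(attn.keys), "output:", out)
```

The point of the split is visible in the state: the attention worker's memory grows with every token, while the FFN worker holds nothing but fixed weights, which is the kind of workload a deterministic, SRAM-resident design suits.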

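LPX rack system:

An LPX rack contains 32 1U LPU compute trays and 2 Spectrum-X switches.
Each LPX compute tray has 16 LPUs, 2 Altera FPGAs, 1 Intel Granite Rapids host CPU, and 1 BlueField-4 front-end module.
LPU modules are mounted back-to-back and connected at high speed over PCB traces.
The FPGAs convert the LPU's C2C protocol to Ethernet and connect to the host CPU.
The LPX system uses a Spectrum-X multi-plane topology to provide all-to-all connectivity.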
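
The per-tray counts above imply the rack-level totals. A minimal sketch of that arithmetic, using only the figures quoted in this summary (the dictionary keys and the function name are illustrative):

```python
TRAYS_PER_RACK = 32

# Components per LPX compute tray, as listed in the summary above.
PER_TRAY = {
    "LPU": 16,
    "Altera FPGA": 2,
    "Granite Rapids host CPU": 1,
    "BlueField-4 front-end module": 1,
}


def rack_totals(trays, per_tray):
    """Multiply each per-tray component count by the number of trays."""
    return {part: count * trays for part, count in per_tray.items()}


lpx_totals = rack_totals(TRAYS_PER_RACK, PER_TRAY)
for part, total in lpx_totals.items():
    print(f"{part}: {total} per rack")
```

On these numbers a single LPX rack holds 512 LPUs, fronted by 64 FPGAs and 32 host CPUs.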

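Kyber rack architecture:

Each Kyber compute blade holds 4 Rubin Ultra GPUs and 2 Vera CPUs. A rack has 36 compute blades, for 144 GPUs in total.
Each Kyber switch blade holds 6 NVLink 7 switches, for 72 NVLink 7 switches per rack.
Rubin Ultra NVL144 Kyber does not use CPO for scale-up, but NVLink optics will be introduced soon.
Rubin Ultra NVL576 will use CPO to connect 8 Oberon racks into a two-tier all-to-all network.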

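CPO roadmap:

Nvidia plans to use CPO in the Rubin and Feynman generations.
Rubin NVL72 uses the Oberon NVL72 architecture with an all-copper scale-up network.
Rubin Ultra NVL144 Kyber does not use CPO for scale-up, but NVLink optics will be introduced soon.
Rubin Ultra NVL576 will use CPO to connect 8 Oberon racks into a two-tier all-to-all network.
Feynman NVL1152 will use CPO to connect 8 Kyber racks, though the interconnect inside each rack may remain copper.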
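
The system names line up with the rack counts in this roadmap. The sketch below checks that arithmetic. It assumes 72 GPUs per Oberon rack (inferred from the "Oberon NVL72" name, not stated separately for NVL576) and 144 GPUs per Kyber rack (36 compute blades with 4 GPUs each, from the Kyber section). The `ScaleUpSystem` class and its field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleUpSystem:
    name: str
    rack_type: str
    racks: int
    gpus_per_rack: int
    cross_rack_link: str

    @property
    def gpus(self) -> int:
        return self.racks * self.gpus_per_rack


OBERON_GPUS = 72     # assumption: inferred from the "Oberon NVL72" name
KYBER_GPUS = 36 * 4  # 36 compute blades x 4 GPUs per blade

ROADMAP = [
    ScaleUpSystem("Rubin NVL72", "Oberon", 1, OBERON_GPUS, "none (all copper)"),
    ScaleUpSystem("Rubin Ultra NVL144", "Kyber", 1, KYBER_GPUS, "none (no CPO)"),
    ScaleUpSystem("Rubin Ultra NVL576", "Oberon", 8, OBERON_GPUS, "CPO"),
    ScaleUpSystem("Feynman NVL1152", "Kyber", 8, KYBER_GPUS, "CPO"),
]

for s in ROADMAP:
    print(f"{s.name}: {s.racks} x {s.gpus_per_rack} = {s.gpus} GPUs, "
          f"cross-rack link: {s.cross_rack_link}")
```

Under those assumptions, both multi-rack systems are simply eight single-rack domains joined by optics, which is where CPO enters the scale-up network.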

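Vera ETL256:

A Vera ETL256 rack contains 32 compute trays and 4 1U MGX ETL switch trays, for 256 CPUs in total.
The in-rack network uses a Spectrum-X multi-plane topology to provide all-to-all connectivity.
Vera ETL256 is liquid-cooled and uses copper cabling for its high-speed links.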

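CMX and STX:

CMX is Nvidia's context memory storage platform, aimed at the KV cache bottleneck.
STX is Nvidia's reference storage rack architecture, built around BF-4-based storage.
Together, BlueField-4, CMX, and STX represent Nvidia's effort to standardize the storage layer of cluster design.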
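
To see why the KV cache becomes a storage-tier problem, it helps to size it. The sketch below uses the standard estimate for a transformer with grouped KV heads: one K and one V tensor per layer, each of size kv_heads x head_dim per token. The model shape (64 layers, 8 KV heads, head dimension 128, 2-byte elements) and the 128,000-token context are hypothetical values chosen for illustration, not figures from the source.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Estimate KV cache size: a K and a V tensor per layer, per token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical model shape, for illustration only.
SHAPE = dict(layers=64, kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(seq_len=1, **SHAPE)
long_context = kv_cache_bytes(seq_len=128_000, **SHAPE)

print(f"per token:          {per_token / 2**10:.0f} KiB")
print(f"128k-token session: {long_context / 2**30:.2f} GiB")
```

With this shape, every cached token costs 256 KiB and a single 128k-token session needs about 31 GiB. Many concurrent long sessions would quickly outgrow on-package memory, which is the gap a dedicated context memory tier is meant to fill.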

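Supply chain impact:

Qualcomm's AlphaWave supplies the 112G SerDes used in Groq LPU 3 and LP35.
LPX compute trays require high-spec PCBs; Victory Giant Technology (胜宏科技) and WUS Printed Circuit (沪士电子) are the suppliers.
Amphenol supplies the connectors for the LPX backplane, and FIT Hon Teng (鸿腾精密) has obtained a license from Amphenol to produce the VR NVL72 backplane cable cartridge and Paladin HD connectors.
The Kyber midplane will use the proprietary Voronoi connector specification, with FIT Hon Teng and Amphenol as suppliers.
Nvidia's MBOM design blocks the use of pluggable transceivers, while hyperscalers continue to push for OSFP cages.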

Groq LP30, LPX Rack, Attention FFN Disaggregation, Oberon & Kyber Updates, Nvidia's CPO Roadmap, Vera ETL256, CMX & STX

At GTC 2026, Nvidia delivered an event packed with groundbreaking announcements. Nvidia's pace of innovation shows no sign of slowing: it introduced three entirely new systems this year, Groq LPX, Vera ETL256, and STX. Also announced were updates to Nvidia's Kyber rack architecture, and CPO made its debut for scale-up networking with the unveiling of the Rubin Ultra NVL576 and Feynman NVL1152 multi-rack systems. Early hints about Feynman's architecture were also a key topic. A Jensen callout for InferenceX during the keynote was a highlight.

This is our GTC 2026 recap, and we will address many of the key questions that Nvidia left unanswered. Specifically, we will go through the LPX rack and LP30 chip and explain how attention and feed-forward network disaggregation (AFD) works; give more details on the rack architectures behind NVL144, NVL576, and NVL1152; clarify just how much optics will be inserted; and explain the rationale behind the dense Vera ETL256. The next-generation Kyber rack had some big updates and some hidden details.

Groq

First up is the Groq LPU. One of the most significant recent events in AI infrastructure was Nvidia's "acquisition" of Groq. Strictly speaking, Nvidia paid Groq $20B to license its IP and hire most of the team. This functions almost as an acquisition, though its structure technically falls short of legally being one, thereby simplifying or obviating the need for regulatory approvals. Given Nvidia's market share, a transaction structured as a full acquisition and put to antitrust review would likely not go through. The other benefit is that it avoids a drawn-out closing process: Nvidia got instant access to Groq's IP and people. This is why, less than four months after the deal was announced, Nvidia already has a system concept that is being integrated into the Vera Rubin inference stack.

Let's now go through a refresher on the LPU architecture to see how Groq's LPU complements Nvidia's GPU. For more details, see our original Groq piece. The premise from that piece remains unchanged: the standalone Groq LPU system is not economical for serving tokens at scale, but it can serve tokens very quickly, which can command a large market premium. This is the premise behind how the LPU fits into a disaggregated decode system.

LPU chip

Groq's first and only publicly announced LPU architecture was detailed in their ISCA 2020 paper. Unlike typical hardware architectures that connect many general-purpose cores, Groq reorganized the architecture into groups of single-purpose units connected to other groups with different purposes, and named these groups "slices." Between functional units are streaming registers, scratchpad SRAM for functional units to pass data to each other. Groq opted for a single-level scratchpad SRAM instead of a multi-level memory hierarchy to make hardware execution deterministic.

Concretely, the LPU architecture has VXM slices for vector operations, MEM slices for loading and storing data, SXM slices for tensor shape manipulation, and MXM slices for matrix multiplication. Spatially, the slices are laid out horizontally, allowing data to stream horizontally. Within a slice, instructions are pumped vertically across units. Conceptually, the LPU resembles a systolic array that pumps instructions vertically and data horizontally.

The data flow and instruction flow design requires fine-grained pipelining to achieve high performance. Since the LPU architecture makes computation deterministic,
