您的浏览器禁用了JavaScript(一种计算机语言,用以实现您与网页的交互),请解除该禁用,或者联系我们。 [新华三技术有限公司]:RDMA Telemetry技术白皮书 - 发现报告

RDMA Telemetry技术白皮书

2026-05-12 新华三技术有限公司 李强
报告封面

目录 1概述·······························································································································11.1产生背景······················································································································11.1.1 RDMA技术的发展历程:从InfiniBand到RoCEv2·······················································11.1.2 RDMA与智算中心的融合:构建高性能计算的基石·······················································21.1.3 RDMA网络质量监控的需求与挑战············································································31.1.4 RDMA Telemetry技术的诞生··················································································31.2技术优点······················································································································42RDMA Telemetry技术实现·································································································42.1 I/O质量可视功能···········································································································42.1.1功能简介·············································································································42.1.2系统架构·············································································································42.1.3网络分段测量·······································································································52.1.4测量策略·············································································································52.1.5测量指标·············································································································62.1.6读操作交互流程····································································································72.1.7写操作交互流程····································································································92.2吞吐量可视功能···········································································································112.2.1功能简介···········································································································112.2.2测量策略···········································································································122.2.3测量指标···········································································································122.2.4运行机制···········································································································132.3 RDMA Telemetry可视化·······························································································143典型组网应用·················································································································16 3.1 AI训练存储网络I/O质量监测·························································································163.2 AI训练计算平面梯度同步性能优化··················································································16 1概述 1.1产生背景 1.1.1RDMA技术的发展历程:从InfiniBand到RoCEv2 RDMA(Remote Direct Memory Access,远程直接内存访问)是一种高速网络互联技术,该技术主要设计目的是减少在数据传输过程中收发端的处理延迟以及CPU资源消耗。该技术允许计算机能够直接访问远程计算机的内存,在内存层面完成数据传输而无需本地CPU频繁介入,从而显著提升网络通信性能。 1.InfiniBand时代:高性能网络的起源 RDMA技术最初由IBTA(InfiniBand Trade Association,InfiniBand贸易协会)提出,旨在解决传统TCP/IP协议栈在HPC(High Performance Computing,高性能计算)环境中存在的高延迟和高CPU开销问题。InfiniBand通过专用硬件实现RDMA,具备以下核心特征: •极低延迟:InfiniBand能够提供极低的通信延迟,通常可以控制在1微秒(μs)以内。•高吞吐:它支持非常高的数据传输速率,可以达到40Gbps、56Gbps甚至100Gbps以上的带宽。•无损网络:InfiniBand还采用了基于信用(Credit-Based)的流量控制机制,确保网络传输过程中不会出现数据丢失的情况,实现了所谓的“无损网络”。 然而,InfiniBand依赖专用的交换机和网卡设备,形成了相对封闭的技术生态,导致其在通用数据中心环境中难以大规模部署。 2.RoCEv2的出现:RDMA与以太网的结合 为降低RDMA的部署成本,业界提出了RoCE(RDMA over Converged Ethernet)技术,实现在通用以太网上运行RDMA。RoCE技术有两个主要版本: •RoCEv1:于2010年推出。这个版本是在以太网的第二层(数据链路层)实现的RDMA技术,它依赖于PFC(Priority Flow Control,优先级流量控制)机制来保证网络传输的无损特性。但是,这种设计存在一个潜在的问题,就是可能会导致网络死锁情况的发生。•RoCEv2:于2014年发布。这个版本做了重要改进,将协议提升到了以太网的第三层(网络层),使用UDP/IP协议进行传输。这样的改变使得RoCEv2能够支持跨子网的路由功能。同时,RoCEv2还引入了ECN(Explicit Congestion Notification,显式拥塞通知)等先进机制。正是这些改进使RoCEv2成为了现代数据中心中最主流的RDMA协议。 RoCEv2具有几个关键优势: •经济性:兼容现有以太网设备,不需要专门购买InfiniBand交换机。•普适性:完美契合了云计算、人工智能和大规模数据存储等现代数据中心的核心需求。•高性能:在性能和成本之间取得了很好的平衡,虽然延迟略高于InfiniBand(约5微秒),但远低于传统TCP/IP网络。 1.1.2RDMA与智算中心的融合:构建高性能计算的基石 以人工智能训练为代表的智能计算(智算)飞速发展,其训练任务需调动成千上万的GPU芯片协同工作数周甚至数月,由此催生了面向高性能、低延迟、无损化需求的智算中心。智算中心作为数据中心服务于极致算力需求的专用子系统,其典型架构包含三层: •计算层(GPU服务器集群):由海量GPU/NPU服务器构成,承担核心计算任务。•网络层(高速交换网络):由高性能以太网交换机组成,负责高速互联与数据交换。•存储层(分布式存储系统):由高性能存储服务器构成,提供训练数据与模型检查点的持久化存储。 在智算中心,基于以太网的RDMA技术——RoCEv2,凭借其优异性能与良好兼容性,成为各层间数据通信的核心标准,如图2所示。它主要加速以下两个关键流程: •计算平面通信(GPU间同步):GPU服务器之间通过纯RoCEv2实现微秒级的数据同步(梯度、参数交换),保障万卡集群的扩展效率。 •存储平面访问(数据供给与持久化):GPU服务器与存储服务器之间通过NVMe over Fabricsover RoCEv2(NVMe-oF over RoCEv2)实现高带宽、低延迟的数据读写,确保训练数据持续供给与检查点快速保存。 在智算中心的存储平面,采用了NVMe over FabricsoverRoCEv2技术。 •NVMe(NVM Express)是应用层/命令层协议,定义了一套高效的命令队列、完成机制和数据结构,用于访问非易失性存储器。•RoCEv2是网络传输层协议,在以太网上承载RDMA语义,实现远端内存的直接访问。 RoCEv2是NVMe over Fabrics的“性能加速器”和“理想座驾”。NVMe定义了存储的语言,而RoCEv2提供了在网络上说这种语言的最高效方式。“NVMe SSD + RoCEv2网络”正在成为高性能存储网络的事实标准,它打破了存储与计算之间的网络壁垒,使得远端存储的访问延迟接近本地NVMe SSD,从而真正实现了存算分离架构下的高性能。 RoCEv2为GPU服务器间及GPU服务器与存储服务器间的