Scaling Fiber Networks to Meet Tomorrow’s Data Center Demands

Executive Summary

This document explores the critical considerations linked to data centers optimized for AI workloads. By highlighting the growing computational power required by large language models (LLMs), the paper seeks to inform readers on the infrastructure choices that follow from those demands.

What you will learn:

01 Energy consumption
The energy required by high-performance hardware, combined with the complexity and size of the datasets needed for LLM training, drives power demand far beyond that of traditional data centers.

02 Cooling solutions
AI workloads generate significant heat, necessitating advanced thermal management methods such as direct-to-chip and immersion cooling; traditional air-cooling methods cannot keep pace.

03 Network topologies
Choice of network topology defines a system’s data flow efficiency and readiness for rapid scalability. With the aim of minimizing latency and maximizing bandwidth, operators must weigh each topology’s trade-offs.

04 Physical space requirements
To accommodate very large systems with specialized hardware and cooling systems, AI data center size, both in terms of physical footprint and cubic meters, has grown and continues to grow.

05 Backend network (BENW) and frontend network (FENW)
From sharing model updates during training to low-latency connections between accelerators, discover the essential load balancing and network design roles each network plays.

06 Scalability
Looking to the future, we consider a scalable Clos network supporting hundreds of thousands of GPUs.

This overview of AI data center infrastructure, hardware requirements, and capabilities provides the groundwork for a forthcoming comprehensive exploration of in-depth technical considerations.

Written by
Alan Keizer, Senior Technology Advisor, AFL
Ben Atherton

“The emergence of generative AI, with its exceptionally large models and truly extraordinary computing requirements, has …”
Alan Keizer, Senior Technology Advisor, AFL

The surging demand for artificial intelligence (AI) and machine learning (ML) technologies presents data center operators with unique challenges in terms of increasing, optimizing, and maintaining network efficiency. To keep pace, modern data center architectures must evolve.

The unprecedented computational power and energy resources linked to the rise of large language models (LLMs) cannot be overlooked, requiring a deeper understanding of the infrastructure that supports them.

By closely examining multiple performance-related factors, industry leaders can better equip the data center operators of tomorrow with the necessary tools and wisdom to succeed.

This white paper explores the intricacies of AI data center networking, highlighting the significant differences between traditional infrastructures and data centers optimized for AI workloads.

What’s Different About an AI Data Center Network?

Large Language Models (LLMs) are systems trained on data to recognize patterns, discern sentiment, and generate human-like language in response to prompts. LLM creation follows a two-step process. First, the training phase involves AI models learning from datasets by adjusting parameters to improve accuracy. Next, during the inference phase, trained models apply the knowledge learned from training to new inputs.

LLMs provide the natural language processing capability within the broader AI ecosystem. Training the requisite LLMs for AI data center networks requires immense power and computational resources. For example, today’s leading-edge GPUs, deployed in training clusters comprising over 100,000 GPUs, can each consume 1,200 to 1,500 watts. This results in total data center power in the range of 300 megawatts.

Energy consumption

Energy is power over time, expressed as kilowatt-hours (kWh).
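As a rough illustration of these magnitudes, the sketch below estimates total facility power for the 100,000-GPU cluster described above and converts a training run into kilowatt-hours. Only the per-GPU wattage and cluster size come from the text; the overhead factor and run duration are illustrative assumptions.

```python
# Back-of-the-envelope power and energy estimate for a large GPU training
# cluster. Only the GPU count and per-GPU wattage come from the text;
# the overhead factor and run length are illustrative assumptions.

NUM_GPUS = 100_000          # cluster size, per the text's example
WATTS_PER_GPU = 1_500       # upper end of the cited 1,200-1,500 W range
OVERHEAD_FACTOR = 2.0       # assumed multiplier for CPUs, network, storage,
                            # and cooling (a rough PUE-style overhead)
TRAINING_DAYS = 30          # assumed duration of one training run

gpu_power_mw = NUM_GPUS * WATTS_PER_GPU / 1e6      # megawatts
facility_power_mw = gpu_power_mw * OVERHEAD_FACTOR

# Energy is power over time: kWh = kW * hours
training_energy_kwh = facility_power_mw * 1_000 * TRAINING_DAYS * 24

print(f"GPU power alone:   {gpu_power_mw:,.0f} MW")
print(f"Facility power:    {facility_power_mw:,.0f} MW")
print(f"30-day run energy: {training_energy_kwh:,.0f} kWh")
```

Under these assumptions the GPUs alone draw 150 MW, the facility roughly 300 MW, and a single 30-day run consumes on the order of 216 million kWh, which is why energy consumption is treated as a first-order design constraint.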
Energy consumption is a critical differentiator that sets AI data centers apart from traditional data centers. The combination of high-performance hardware and the computational demands of training and inference drives the need for massive amounts of power. Large Language Models (LLMs), which can have billions or even trillions of parameters, require immense energy resources. As models scale, so do their energy demands.

Training phase

Factors influencing energy consumption during the training phase include hardware efficiency, dataset size, and model complexity. The training phase can be divided into two main components:

Data processing
This involves cleaning and preparing data before training. For example, the Common Crawl dataset, used to train models like GPT-3, comprises 9.5 petabytes of data.

Iterative computation
The power and hardware required during this phase vary based on dataset size and model complexity. As model parameters grow, the demand for computational resources increases, leading to greater energy consumption over time; advanced accelerators require 3 to 10 times the power but deliver hundreds to thousands of times the performance.

Inference phase

Once trained, large models’ energy consumption levels remain high, particularly in scenarios demanding real-time calculations. Inference is generally less computationally intensive than training, but it still consumes substantial amounts of energy, especially under high-frequency request loads. This highlights the ongoing energy demands of AI data centers well after training is complete.
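To make the training-versus-inference comparison concrete, the sketch below weighs a one-off training run against a year of continuous serving. Every figure here (training cluster size, run length, per-request energy, request rate) is a hypothetical assumption chosen only to show how sustained high-frequency traffic accumulates energy comparable to training.

```python
# Illustrative comparison of training energy (one-off) versus inference
# energy (continuous). All numbers are hypothetical assumptions.

# --- Training: a single run on a large cluster ---
train_gpus = 10_000         # assumed training cluster size
watts_per_gpu = 1_200       # within the 1,200-1,500 W range cited earlier
train_days = 30             # assumed run length
train_kwh = train_gpus * watts_per_gpu / 1_000 * train_days * 24

# --- Inference: steady request traffic, 24/7 ---
joules_per_request = 3_000  # assumed energy per request (~0.83 Wh)
requests_per_second = 5_000 # assumed aggregate request rate
seconds_per_year = 365 * 24 * 3_600
infer_kwh_per_year = (joules_per_request * requests_per_second
                      * seconds_per_year) / 3.6e6  # joules -> kWh

print(f"One training run:      {train_kwh:,.0f} kWh")
print(f"One year of inference: {infer_kwh_per_year:,.0f} kWh")
```

Under these assumptions, a year of serving at a steady 5,000 requests per second consumes roughly fifteen times the energy of the training run itself: per-request cost is modest, but high-frequency traffic keeps aggregate demand high.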