Scaling Fiber Networks to Meet Tomorrow’s Data Center Demands

Executive Summary

This document explores the critical considerations linked to data centers optimized for AI workloads. By highlighting the growing computational power required by large language models (LLMs), the paper seeks to inform readers on the infrastructure choices that follow from those demands.

What you will learn:

01 Energy consumption
The energy required by high-performance hardware, combined with the complexity and size of the datasets needed for LLM training, drives power demand far beyond that of traditional data centers.

02 Cooling solutions
AI workloads generate significant heat, necessitating advanced thermal management methods such as direct-to-chip and immersion cooling; traditional air-cooling methods cannot keep pace.

03 Network topologies
Choice of network topology defines a system’s data flow efficiency and readiness for rapid scalability. With the aim of minimizing latency and maximizing bandwidth, operators must weigh each topology’s trade-offs.

04 Physical space requirements
To accommodate very large systems with specialized hardware and cooling systems, AI data center size, both in terms of physical footprint and cubic meters, has grown and continues to grow.

05 Backend network (BENW) and frontend network (FENW)
From sharing model updates during training to low-latency connections between accelerators, discover the essential load balancing and network design roles each network plays.

06 Scalability
Looking to the future, we consider a scalable Clos network supporting hundreds of thousands of GPUs.

This overview of AI data center infrastructure, hardware requirements, and capabilities provides the groundwork for a forthcoming comprehensive exploration of in-depth technical considerations.

Written by
Alan Keizer, Senior Technology Advisor, AFL
Ben Atherton

“The emergence of generative AI, with its exceptionally large models and truly extraordinary computing requirements, has …”
Alan Keizer, Senior Technology Advisor, AFL

The surging demand for artificial intelligence (AI) and machine learning (ML) technologies presents data center operators with unique challenges in terms of increasing, optimizing, and maintaining network efficiency. To keep pace, modern data center architectures must evolve.

The unprecedented computational power and energy resources linked to the rise of large language models (LLMs) cannot be overlooked, requiring a deeper understanding of the infrastructure that supports them.

By closely examining multiple performance-related factors, industry leaders can better equip the data center operators of tomorrow with the necessary tools and wisdom to succeed.

This white paper explores the intricacies of AI data center networking, highlighting the significant differences between traditional infrastructures and data centers optimized for AI workloads.

What’s Different About an AI Data Center Network?

Large Language Models (LLMs) are systems trained on data to recognize patterns, discern sentiment, and generate human-like language in response to prompts. LLM creation follows a two-step process. First, the training phase involves AI models learning from datasets by adjusting parameters to improve accuracy. Next, during the inference phase, trained models apply the knowledge learned from training to new inputs.

LLMs provide the natural language processing capability within the broader AI ecosystem. Training the requisite LLMs for AI data center networks requires immense power and computational resources. For example, today’s leading-edge GPUs, deployed in training clusters comprising over 100,000 GPUs, can each consume 1,200 to 1,500 watts. This results in total data center power in the range of 300 megawatts.

Energy consumption

Energy is power over time, expressed as kilowatt-hours (kWh).
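As a rough illustration of these magnitudes, the sketch below estimates total facility power for the 100,000-GPU cluster described above and converts a training run into kilowatt-hours. Only the per-GPU wattage and cluster size come from the text; the overhead factor and run duration are illustrative assumptions.

```python
# Back-of-the-envelope power and energy estimate for a large GPU training
# cluster. Only the GPU count and per-GPU wattage come from the text;
# the overhead factor and run length are illustrative assumptions.

NUM_GPUS = 100_000          # cluster size, per the text's example
WATTS_PER_GPU = 1_500       # upper end of the cited 1,200-1,500 W range
OVERHEAD_FACTOR = 2.0       # assumed multiplier for CPUs, network, storage,
                            # and cooling (a rough PUE-style overhead)
TRAINING_DAYS = 30          # assumed duration of one training run

gpu_power_mw = NUM_GPUS * WATTS_PER_GPU / 1e6      # megawatts
facility_power_mw = gpu_power_mw * OVERHEAD_FACTOR

# Energy is power over time: kWh = kW * hours
training_energy_kwh = facility_power_mw * 1_000 * TRAINING_DAYS * 24

print(f"GPU power alone:   {gpu_power_mw:,.0f} MW")
print(f"Facility power:    {facility_power_mw:,.0f} MW")
print(f"30-day run energy: {training_energy_kwh:,.0f} kWh")
```

Under these assumptions the GPUs alone draw 150 MW, the facility roughly 300 MW, and a single 30-day run consumes on the order of 216 million kWh, which is why energy consumption is treated as a first-order design constraint.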
Energy consumption is a critical differentiator that sets AI data centers apart from traditional data centers. The combination of high-performance hardware and the computational demands of training and inference drives the need for massive amounts of power. Large Language Models (LLMs), which can have billions or even trillions of parameters, require immense energy resources. As models scale, so do their energy demands.

Training phase

Factors influencing energy consumption during the training phase include hardware efficiency, dataset size, and model complexity. The training phase can be divided into two main components:

Data processing
This involves cleaning and preparing data before training. For example, the Common Crawl dataset, used to train models like GPT-3, comprises 9.5 petabytes of data.

Iterative computation
The power and hardware required during this phase vary based on dataset size and model complexity. As model parameters grow, the demand for computational resources increases, leading to greater energy consumption over time; advanced accelerators require 3 to 10 times the power but deliver hundreds to thousands of times the performance.

Inference phase

Once trained, large models’ energy consumption levels remain high, particularly in scenarios demanding real-time calculations. Inference is generally less computationally intensive than training, but it still consumes substantial amounts of energy, especially under high-frequency request loads. This highlights the ongoing energy demands of AI data centers well after training is complete.
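To make the training-versus-inference comparison concrete, the sketch below weighs a one-off training run against a year of continuous serving. Every figure here (training cluster size, run length, per-request energy, request rate) is a hypothetical assumption chosen only to show how sustained high-frequency traffic accumulates energy comparable to training.

```python
# Illustrative comparison of training energy (one-off) versus inference
# energy (continuous). All numbers are hypothetical assumptions.

# --- Training: a single run on a large cluster ---
train_gpus = 10_000         # assumed training cluster size
watts_per_gpu = 1_200       # within the 1,200-1,500 W range cited earlier
train_days = 30             # assumed run length
train_kwh = train_gpus * watts_per_gpu / 1_000 * train_days * 24

# --- Inference: steady request traffic, 24/7 ---
joules_per_request = 3_000  # assumed energy per request (~0.83 Wh)
requests_per_second = 5_000 # assumed aggregate request rate
seconds_per_year = 365 * 24 * 3_600
infer_kwh_per_year = (joules_per_request * requests_per_second
                      * seconds_per_year) / 3.6e6  # joules -> kWh

print(f"One training run:      {train_kwh:,.0f} kWh")
print(f"One year of inference: {infer_kwh_per_year:,.0f} kWh")
```

Under these assumptions, a year of serving at a steady 5,000 requests per second consumes roughly fifteen times the energy of the training run itself: per-request cost is modest, but high-frequency traffic keeps aggregate demand high.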