Agenda
• RecSys-Examples Introduction
• HSTU Training Practice in RecSys-Examples
• HSTU Inference Practice in RecSys-Examples

Recommender Trends and Challenges -- Towards a Generative, Sequential Model: HSTU
1. The Transformer and its variants have been applied to RecSys experimentally[1] since its advent; HSTU is the SOTA implementation.
2. Framework migration from TF to PyTorch in RecSys -- inspired by the great success of LLMs.
3. Along with the model and framework migration come hardware and software challenges:
   1. Billions to trillions of embedding cardinality -- demanding both memory capacity and dynamic/hash-based indexing. -- TorchRec::dynamicemb
   2. Growing math compute -- the scaling law now also applies in the RecSys domain, but there is no efficient parallelism implementation for the customized attention used in RecSys. -- Megatron-Core / KVCache
Fig. Total compute used to train deep learning models. Source: Actions Speak Louder than Words, https://arxiv.org/pdf/2402.17152
Fig. DLRM-DCNv2 vs. HSTU training breakdown (DynamicEmb CPU offloading). DLRM-DCNv2 results are obtained from MLPerf; HSTU is measured with our examples (8 HSTU layers).[2]
[1] For instance SASRec and BERT4Rec over the years.
[2] On DGX-H100.

RecSys-Examples Repo Introduction
https://github.com/NVIDIA/recsys-examples
• A reference design & example collection demonstrating best practices for large-scale recommender systems, especially generative recommenders (GR); example models for training & inference in PyTorch.
• In terms of training, we present that:
  • TorchRec::dynamicemb empowers us to process colossal embeddings across any number of workers.
  • Optimized (HSTU attention) kernels are integrated into our examples, and significant performance improvement is observed.
  • Megatron-Core is utilized to enable various parallelism paradigms and other benefits such as the distributed optimizer.
• In terms of inference:
  • A KV cache for the history sequence is used to reduce calculation complexity.
  • Paged HSTU attention kernels are supported with the adoption of a paged GPU KV cache.
  • Overlapping data transfer with computation gives further performance improvement.
Fig 1. Software stack diagram (CUDA kernels, FBGEMM_GPU, PyTorch ops and modules) -- all open-sourced.

HSTU Training Practice -- Sparse: TorchRec
• TorchRec is a PyTorch domain library specializing in distributed embedding operators in RecSys.
• The essential module is ShardedEmbeddingCollection (EC) or ShardedEmbeddingBagCollection (EBC), composed of the submodules below:
  1. Input data distribution submodule
  2. Lookup submodule
  3. Output data distribution submodule
• TorchRec provides the following advantages:
  • Table sharding and the corresponding data communications.[1]
  • Embedding table grouping -- multiple tables are grouped into one single contiguous storage.
  • Backward and optimizer step fusion.
  • Row-wise momentum optimizer.
  • …
• However, TorchRec cannot fit all cases, especially the scenario of generative sequential models.
[1] TorchRec assumes inputs are all data-parallel.
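To make the EC/EBC flow above concrete, here is a minimal, illustrative sketch (not taken from recsys-examples) that shards an EmbeddingBagCollection with TorchRec's DistributedModelParallel and runs one lookup; the table names, sizes, and single-process gloo bootstrap are assumptions for demonstration only.

```python
# Illustrative sketch (not from recsys-examples): shard an
# EmbeddingBagCollection with TorchRec and run one lookup.
import os
import torch
import torch.distributed as dist
from torchrec import EmbeddingBagCollection, EmbeddingBagConfig, KeyedJaggedTensor
from torchrec.distributed.model_parallel import DistributedModelParallel

# Single-process bootstrap for illustration; normally launched via torchrun.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# Two toy tables; real deployments can reach billions of rows per table.
ebc = EmbeddingBagCollection(
    tables=[
        EmbeddingBagConfig(name="t_user", embedding_dim=64,
                           num_embeddings=1_000_000, feature_names=["user_id"]),
        EmbeddingBagConfig(name="t_item", embedding_dim=64,
                           num_embeddings=10_000_000, feature_names=["item_id"]),
    ],
    device=torch.device("meta"),  # materialized only after sharding
)

# DistributedModelParallel swaps in ShardedEmbeddingBagCollection, which wires
# up the input-dist -> lookup -> output-dist submodules described above.
model = DistributedModelParallel(ebc, device=torch.device("cpu"))

# Sparse inputs are KeyedJaggedTensors: here a batch of 2, one id per feature.
kjt = KeyedJaggedTensor(
    keys=["user_id", "item_id"],
    values=torch.tensor([101, 202, 7, 9]),
    lengths=torch.tensor([1, 1, 1, 1]),
)
pooled = model(kjt).wait()        # KeyedTensor of pooled embeddings
print(pooled["user_id"].shape)    # torch.Size([2, 64])
```

Under the data-parallel-input assumption this flow is sufficient; the generative sequential scenario is where the DynamicEmb extensions described next come in.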
HSTU Training Practice -- Sparse: DynamicEmb (Dynamic Embedding)
• DynamicEmb is a Python package that supports tensor-parallel dynamic embedding. The functionality is provided through PyTorch/TorchRec modules -- easy to integrate into your framework.
• DynamicEmb is a patch to the TorchRec native static embedding rather than a replacement.
• Now open-sourced as a corelib under NVIDIA recsys-examples.
• The main functionalities are compared in the table below:

Feature             | TorchRec static embedding | DynamicEmb
Hash                | No                        | Yes
Capacity expansion  | No (pre-allocated)        | Yes
Embedding eviction  | No                        | Yes (with various strategies)
Incremental dump    | No                        | Yes (with various strategies)
CPU offloading      | Yes                       | Yes
Embedding sharding  | CW, RW, TW, etc.[1]       | RW[2]

[1] CW, RW, TW: column-wise, row-wise, table-wise.
[2] DynamicEmb shards in a round-robin fashion based on the raw key.
Table: TorchRec existing (static) embedding vs. ours (DynamicEmb).

HSTU Training Practice -- Sparse: Inside DynamicEmb
• DynamicEmb embraces highly optimized embedding kernels[1] from HugeCTR.
• The underlying supporting data structure is HierarchicalKV::HashTable, developed by NVIDIA:
  • Configurable GPU HBM size for values -- flexible for scaling the dense network.
  • Unified addressing across all storage -- the storages are mutually exclusive, i.e. one value can only reside in one location.
  • All keys are stored on HBM -- ensures higher throughput of key processing (e.g. key unique/dedup).
• HierarchicalKV has gained widespread recognition in the industry, e.g.:
  • HugeCTR::SOK[2], a TF plugin for embedding operators built upon HierarchicalKV.
  • DeepRec has integrated SOK.
  • …
[1] Including index calculation, embedding backward, and gradient reduction.
[2] See the HugeCTR SOK technical blog.

HSTU Training Practice -- Dense: Optimized HSTU Kernel
• We have integrated the highly optimized CUTLASS-based HSTU kernel into our examples and gained pronounced benefits.
• Separate Ampere and Hopper implementations to best exploit the hardware.
• Customized attention mask and RAB support.
• Now open-sourced as a corelib under NVIDIA recsys-examples.
• Is being ported into the FBGEMM_GPU project.

HSTU Training Practice -- Dense: Kernel Fusion
1. It removes intermediate tensor IO (one write and one read).
2. It removes CPU kernel launch overhead.
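As a rough, generic illustration of these two gains (plain PyTorch with torch.compile, not the hand-written CUDA fusion used by the HSTU kernels), the eager function below launches three kernels and materializes two intermediate tensors, while the compiled variant is fused into roughly one kernel:

```python
# Generic fusion illustration (not the repo's CUDA kernels): the eager chain
# issues three kernels and writes/reads two intermediates in global memory;
# torch.compile (Inductor) fuses the pointwise chain, removing that IO and
# most of the CPU launch overhead.
import torch
import torch.nn.functional as F

def gated_norm_eager(x: torch.Tensor, u: torch.Tensor) -> torch.Tensor:
    y = F.layer_norm(x, x.shape[-1:])  # kernel 1, writes intermediate y
    g = F.silu(u)                      # kernel 2, writes intermediate g
    return y * g                       # kernel 3, reads y and g again

gated_norm_fused = torch.compile(gated_norm_eager)

if torch.cuda.is_available():
    x = torch.randn(4096, 512, device="cuda")
    u = torch.randn_like(x)
    # Same numerics (within tolerance), fewer kernels and less HBM traffic.
    torch.testing.assert_close(
        gated_norm_eager(x, u), gated_norm_fused(x, u), rtol=1e-3, atol=1e-3
    )
```

The fused HSTU kernels described on this slide apply the same principle, implemented as hand-written CUDA rather than compiler-generated code.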