Agenda
• RecSys-Examples Introduction
• HSTU Training Practice in RecSys-Examples
• HSTU Inference Practice in RecSys-Examples

Recommender Trends and Challenges -- Towards a Generative, Sequential Model: HSTU
1. The Transformer and its variants have been applied to RecSys experimentally[1] since its advent; HSTU is the SOTA implementation.
2. Framework migration from TF to PyTorch in RecSys -- inspired by the great success of LLMs.
3. Along with the model and framework migration come hardware and software challenges:
   1. Billions to trillions of embedding cardinality -- demanding both memory capacity and dynamic/hash-based indexing. -- TorchRec::dynamicemb
   2. Growing math compute -- the scaling law now also applies in the RecSys domain, but there is no efficient parallelism implementation for the customized attention used in RecSys. -- Megatron-Core / KVCache
Fig. Total compute used to train deep learning models. Source: Actions Speak Louder than Words, https://arxiv.org/pdf/2402.17152
Fig. DLRM-DCNv2 vs. HSTU training breakdown (DynamicEmb CPU offloading). DLRM-DCNv2 results are obtained from MLPerf; HSTU is measured with our examples (8 HSTU layers).[2]
[1] For instance SASRec and BERT4Rec over the years.
[2] On DGX-H100.

RecSys-Examples Repo Introduction
https://github.com/NVIDIA/recsys-examples
• A reference design & example collection demonstrating best practices for large-scale recommender systems, especially generative recommenders (GR); example models for training & inference in PyTorch.
• In terms of training, we present that:
  • TorchRec::dynamicemb empowers us to process colossal embeddings across any number of workers.
  • Optimized (HSTU attention) kernels are integrated into our examples, and significant performance improvement is observed.
  • Megatron-Core is utilized to enable various parallelism paradigms and other benefits such as the distributed optimizer.
• In terms of inference:
  • A KV cache for the history sequence is used to reduce calculation complexity.
  • Paged HSTU attention kernels are supported with the adoption of a paged GPU KV cache.
  • Overlapping data transfer with computation gives further performance improvement.
Fig 1. Software stack diagram (CUDA kernels, FBGEMM_GPU, PyTorch ops and modules) -- all open-sourced.

HSTU Training Practice -- Sparse: TorchRec
• TorchRec is a PyTorch domain library specializing in distributed embedding operators in RecSys.
• The essential module is ShardedEmbeddingCollection (EC) or ShardedEmbeddingBagCollection (EBC), composed of the submodules below:
  1. Input data distribution submodule
  2. Lookup submodule
  3. Output data distribution submodule
• TorchRec provides the following advantages:
  • Table sharding and the corresponding data communications.[1]
  • Embedding table grouping -- multiple tables are grouped into one single contiguous storage.
  • Backward and optimizer step fusion.
  • Row-wise momentum optimizer.
  • …
• However, TorchRec cannot fit all cases, especially the scenario of generative sequential models.
[1] TorchRec assumes inputs are all data-parallel.
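To make the EC/EBC flow above concrete, here is a minimal, illustrative sketch (not taken from recsys-examples) that shards an EmbeddingBagCollection with TorchRec's DistributedModelParallel and runs one lookup; the table names, sizes, and single-process gloo bootstrap are assumptions for demonstration only.

```python
# Illustrative sketch (not from recsys-examples): shard an
# EmbeddingBagCollection with TorchRec and run one lookup.
import os
import torch
import torch.distributed as dist
from torchrec import EmbeddingBagCollection, EmbeddingBagConfig, KeyedJaggedTensor
from torchrec.distributed.model_parallel import DistributedModelParallel

# Single-process bootstrap for illustration; normally launched via torchrun.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# Two toy tables; real deployments can reach billions of rows per table.
ebc = EmbeddingBagCollection(
    tables=[
        EmbeddingBagConfig(name="t_user", embedding_dim=64,
                           num_embeddings=1_000_000, feature_names=["user_id"]),
        EmbeddingBagConfig(name="t_item", embedding_dim=64,
                           num_embeddings=10_000_000, feature_names=["item_id"]),
    ],
    device=torch.device("meta"),  # materialized only after sharding
)

# DistributedModelParallel swaps in ShardedEmbeddingBagCollection, which wires
# up the input-dist -> lookup -> output-dist submodules described above.
model = DistributedModelParallel(ebc, device=torch.device("cpu"))

# Sparse inputs are KeyedJaggedTensors: here a batch of 2, one id per feature.
kjt = KeyedJaggedTensor(
    keys=["user_id", "item_id"],
    values=torch.tensor([101, 202, 7, 9]),
    lengths=torch.tensor([1, 1, 1, 1]),
)
pooled = model(kjt).wait()        # KeyedTensor of pooled embeddings
print(pooled["user_id"].shape)    # torch.Size([2, 64])
```

Under the data-parallel-input assumption this flow is sufficient; the generative sequential scenario is where the DynamicEmb extensions described next come in.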
HSTU Training Practice -- Sparse: DynamicEmb (Dynamic Embedding)
• DynamicEmb is a Python package that supports tensor-parallel dynamic embedding. The functionality is provided through PyTorch/TorchRec modules -- easy to integrate into your framework.
• DynamicEmb is a patch to the TorchRec native static embedding rather than a replacement.
• Now open-sourced as a corelib under NVIDIA recsys-examples.
• The main functionalities are compared in the table below:

Feature             | TorchRec static embedding | DynamicEmb
Hash                | No                        | Yes
Capacity expansion  | No (pre-allocated)        | Yes
Embedding eviction  | No                        | Yes (with various strategies)
Incremental dump    | No                        | Yes (with various strategies)
CPU offloading      | Yes                       | Yes
Embedding sharding  | CW, RW, TW, etc.[1]       | RW[2]

[1] CW, RW, TW: column-wise, row-wise, table-wise.
[2] DynamicEmb shards in a round-robin fashion based on the raw key.
Table: TorchRec existing (static) embedding vs. ours (DynamicEmb).

HSTU Training Practice -- Sparse: Inside DynamicEmb
• DynamicEmb embraces highly optimized embedding kernels[1] from HugeCTR.
• The underlying supporting data structure is HierarchicalKV::HashTable, developed by NVIDIA:
  • Configurable GPU HBM size for values -- flexible for scaling the dense network.
  • Unified addressing across all storage -- the storages are mutually exclusive, i.e. one value can only reside in one location.
  • All keys are stored on HBM -- ensures higher throughput of key processing (e.g. key unique/dedup).
• HierarchicalKV has gained widespread recognition in the industry, e.g.:
  • HugeCTR::SOK[2], a TF plugin for embedding operators built upon HierarchicalKV.
  • DeepRec has integrated SOK.
  • …
[1] Including index calculation, embedding backward, and gradient reduction.
[2] See the HugeCTR SOK technical blog.

HSTU Training Practice -- Dense: Optimized HSTU Kernel
• We have integrated the highly optimized CUTLASS-based HSTU kernel into our examples and gained pronounced benefits.
• Separate Ampere and Hopper implementations to best exploit the hardware.
• Customized attention mask and RAB support.
• Now open-sourced as a corelib under NVIDIA recsys-examples.
• Is being ported into the FBGEMM_GPU project.

HSTU Training Practice -- Dense: Kernel Fusion
1. It removes intermediate tensor IO (one write and one read).
2. It removes CPU kernel launch overhead.
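As a rough, generic illustration of these two gains (plain PyTorch with torch.compile, not the hand-written CUDA fusion used by the HSTU kernels), the eager function below launches three kernels and materializes two intermediate tensors, while the compiled variant is fused into roughly one kernel:

```python
# Generic fusion illustration (not the repo's CUDA kernels): the eager chain
# issues three kernels and writes/reads two intermediates in global memory;
# torch.compile (Inductor) fuses the pointwise chain, removing that IO and
# most of the CPU launch overhead.
import torch
import torch.nn.functional as F

def gated_norm_eager(x: torch.Tensor, u: torch.Tensor) -> torch.Tensor:
    y = F.layer_norm(x, x.shape[-1:])  # kernel 1, writes intermediate y
    g = F.silu(u)                      # kernel 2, writes intermediate g
    return y * g                       # kernel 3, reads y and g again

gated_norm_fused = torch.compile(gated_norm_eager)

if torch.cuda.is_available():
    x = torch.randn(4096, 512, device="cuda")
    u = torch.randn_like(x)
    # Same numerics (within tolerance), fewer kernels and less HBM traffic.
    torch.testing.assert_close(
        gated_norm_eager(x, u), gated_norm_fused(x, u), rtol=1e-3, atol=1e-3
    )
```

The fused HSTU kernels described on this slide apply the same principle, implemented as hand-written CUDA rather than compiler-generated code.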