
GPU Programming and Optimization – Best Practices Sharing

2023-10-24, NVIDIA

AI summary

Keynote talks

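GPU Programming and Optimization – Best Practices Sharing: Liu Bing (刘冰) and Zheng Peng (郑鹏) covered the fundamentals of CUDA optimization, including coalesced global memory access, shared memory bank conflicts, and instruction-level and thread-level parallelism (ILP and TLP), followed by a case study on why to fuse multi-head attention (MHA), with FMHA as the example.
Full-Lifecycle Large Language Model Development in NVIDIA NeMo, with LLaMa2 as the Example: Yao Xin (姚鑫) and Yan Zijie (颜子杰) introduced the features and structure of the NeMo framework and walked through full-lifecycle development of a LLaMa2 model with it, covering pre-training, fine-tuning (SFT and PEFT), RLHF, and TRT-LLM.
TensorRT Hackathon 2023 Summary: Ji Guang (季光) and Chen Yu (陈庾) reviewed the TensorRT Hackathon, including the format, progress, and results of the preliminary and second rounds, and summarized the event's characteristics and lessons learned.
Accelerating LLaMA Inference with Quantization in TensorRT-LLM: Chen Yu (陈庾) showed how to build and run LLaMA models in TensorRT-LLM and how to quantize them, covering weight-only quantization, the INT8 KV cache, and INT8 SmoothQuant, and presented the accuracy and performance of the quantized models.
Acceleration Strategies and Practice for Vector Databases: Wang Yong (王雍) and Zhang Jingrong (张静蓉) explained acceleration strategies for vector databases, including approximate nearest neighbor (ANN) search algorithms, GPU-accelerated indexing, and practical vector database cases, and introduced tools such as RAPIDS RAFT and CAGRA.
Latest Optimization Strategies and Practice for Recommender Systems, with HPS as the Example: Wei Yingcan (魏英灿) and Wang Zehuan (王泽寰) introduced the architecture and features of HPS (Hierarchical Parameter Server) and detailed how its GPU embedding inference cache (EIC) is implemented, comparing device-lock, host-lock, and lock-free designs, then presented HPS performance evaluation results on different NVIDIA GPUs.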

Breakout discussions and Q&A

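GPU experts: Liu Bing (刘冰) / Zheng Peng (郑鹏) / Yu Fan (郁凡) / Wang Meng (王猛)
Large language model training: Yan Zijie (颜子杰) / Tao Li (陶砺) / Yao Xin (姚鑫)
TRT LLM and diffusion models: Ji Guang (季光) / Xue Boyang (薛博阳) / Chen Yu (陈庾) / Fang Jie (方杰)
Vector databases: Wang Yong (王雍) / Zhang Jingrong (张静蓉) / Dong Jianbing (董建兵)
Recommender system training and inference: Wei Yingcan (魏英灿) / Wang Zehuan (王泽寰) / Zhang Yaobin (张耀斌) / Sun Kai (孙凯)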

Welcome address: Li Xipeng (李曦鹏), General Manager, NVIDIA Developer and Technology, Asia-Pacific

GPU Programming and Optimization – Best Practices Sharing
Petrick Liu (刘冰), Devtech | Perkz Zheng (郑鹏), Devtech

Agenda

CUDA Optimization Fundamentals
• Understand what Global Memory Coalesced Access is
• Understand what a Shared Memory Bank Conflict is
• What are ILP and TLP
Case Study
• Why fuse the MHA
• FMHA as example

GPU Architecture

GPU: a massive throughput machine. Keep the throughput at its maximum.
• Full GH100 with 144 SMs
• H100 SXM5 DRAM bandwidth: 3352 GB/s
• FP32 non-Tensor: 66.9 TFLOPS
• FP16 dense Tensor: 984.9 TFLOPS
• FP8 dense Tensor: 1978.9 TFLOPS

Understanding Global Memory Coalesced Access: Typical Example

• Global memory loads and stores by the threads of a warp are coalesced by the device into as few transactions as possible.
• The access unit is 32 bytes (also called a sector), along the path DRAM -> L2 -> L1.
• Threads in a warp access adjacent float values. 32 threads access 32 x 4 bytes = 128 bytes = 4 x 32 B = 4 sectors (shown in red on the slide).
• float val = ((float*)src)[threadIdx.x];
• Example: T0~T7 => Sector 0, T8~T15 => Sector 1, T16~T23 => Sector 2, T24~T31 => Sector 3

Understanding Global Memory Coalesced Access: Misaligned Example

• Threads in a warp access adjacent float values, but with an offset, such as 5 elements.
• 32 threads access 32 x 4 bytes = 128 bytes = 4 x 32 B = 4 sectors (ideal), but the warp actually touches 5 sectors (shown in red on the slide).
• float val = ((float*)src)[threadIdx.x + offset];
• Example: T0~T2 => Sector 0, T3~T10 => Sector 1, T11~T18 => Sector 2, T19~T26 => Sector 3, T27~T31 => Sector 4

Understanding Global Memory Coalesced Access: Stride Access Example

• Stride of 2: one warp needs 128 bytes = 4 x 32 B = 4 sectors (ideal), but it actually touches 8 sectors.
• Example: T0~T3 => Sector 0, T4~T7 => Sector 1, T8~T11 => Sector 2, T12~T15 => Sector 3, T16~T19 => Sector 4, T20~T23 => Sector 5, T24~T27 => Sector 6, T28~T31 => Sector 7
• Stride >= 32 B: one warp needs 128 bytes = 4 x 32 B = 4 sectors (ideal), but it actually touches 32 sectors.
• Example: T0 => Sector 0, T1 => Sector 1, T2 => Sector 2, T3 => Sector 3, ..., T30 => Sector 30, T31 => Sector 31

Understanding Global Memory Coalesced Access: Stride Access vs Coalesced Access Example

• Assume 1024 threads in each block; each block copies 4096 elements.
• Test with the L1 cache enabled and disabled, using -Xptxas -dlcm=ca or -Xptxas -dlcm=cg (ca means cache all, including L1; cg means cache global, excluding L1).
• On A100-40GB, 400 * 4096 floats in total.
• L1 cache enabled: [results chart not reproduced]
• L1 cache disabled: [results chart not reproduced]
• Conclusion: try your best to coalesce every global memory access.

Understanding Global Memory Coalesced Access: Coalesced Access with vec type Example

• CUDA provides built-in vector data types such as float4, float2, int4, and int2. They can be used when the alignment requirements are met.
• On A100-40GB, 400 * 8192 floats in total, L1 enabled: [results chart not reproduced]
• Conclusion: try to use vec types to access memory when the alignment requirements are met.

Understanding Shared Memory Bank Conflict: Official Shared Memory Access Example

• Shared memory has 32 banks, organized such that successive 32-bit words map to successive banks.
• Each …
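The sector counts quoted in the coalesced, misaligned, and strided global memory examples can be checked with a little arithmetic. The sketch below is host-side C++ (it should also compile as CUDA C++) and is only an illustration: `sectors_touched` is a hypothetical helper, not a CUDA API, and it assumes 32 threads per warp, 32-byte sectors, 4-byte floats, and a base pointer aligned to a sector boundary.

```cpp
#include <cstddef>

// Assumptions for this sketch: 32 threads per warp, 32-byte sectors,
// 4-byte float elements, and a base pointer aligned to a sector boundary.
constexpr std::size_t kWarpSize = 32;
constexpr std::size_t kSectorBytes = 32;
constexpr std::size_t kElemBytes = 4;

// Hypothetical helper (not a CUDA API): counts the distinct sectors touched
// when thread t of one warp loads the float at element (offset + t * stride).
// Addresses never decrease as t grows, so counting sector changes is enough.
constexpr std::size_t sectors_touched(std::size_t offset, std::size_t stride) {
    std::size_t count = 1;
    std::size_t prev = offset * kElemBytes / kSectorBytes;
    for (std::size_t t = 1; t < kWarpSize; ++t) {
        const std::size_t sector =
            (offset + t * stride) * kElemBytes / kSectorBytes;
        if (sector != prev) {
            ++count;
            prev = sector;
        }
    }
    return count;
}

// Compile-time checks against the numbers quoted on the slides.
static_assert(sectors_touched(0, 1) == 4, "coalesced: 128 B in 4 sectors");
static_assert(sectors_touched(5, 1) == 5, "offset of 5 floats: 5 sectors");
static_assert(sectors_touched(0, 2) == 8, "stride of 2 floats: 8 sectors");
static_assert(sectors_touched(0, 8) == 32, "stride of 32 B: 32 sectors");
```

The function is `constexpr` so that the expected values are verified by the compiler itself; calling `sectors_touched(offset, stride)` at run time with other offsets or strides works the same way.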
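The shared memory bank mapping described above (32 banks, successive 32-bit words in successive banks) can be sketched the same way. As general CUDA background that the excerpt itself does not reach, accesses by one warp to different words in the same bank are serialized, which is what a bank conflict means. The host-side C++ sketch below is an illustration only: `bank_of_word` and `max_threads_per_bank` are hypothetical helpers, and the warp size of 32 and the strided access pattern are assumptions.

```cpp
#include <cstddef>

constexpr std::size_t kNumBanks = 32;        // from the slide text
constexpr std::size_t kThreadsPerWarp = 32;  // assumption: one warp

// Successive 32-bit words map to successive banks.
constexpr std::size_t bank_of_word(std::size_t word_index) {
    return word_index % kNumBanks;
}

// Hypothetical helper: thread t accesses 32-bit word t * stride, with
// stride >= 1 so that every thread touches a different word. Returns the
// largest number of threads that land in the same bank; 1 means no conflict.
constexpr std::size_t max_threads_per_bank(std::size_t stride) {
    std::size_t hits[kNumBanks] = {};
    std::size_t worst = 0;
    for (std::size_t t = 0; t < kThreadsPerWarp; ++t) {
        const std::size_t bank = bank_of_word(t * stride);
        ++hits[bank];
        if (hits[bank] > worst) {
            worst = hits[bank];
        }
    }
    return worst;
}

static_assert(max_threads_per_bank(1) == 1, "unit stride: one thread per bank");
static_assert(max_threads_per_bank(2) == 2, "stride 2: two threads per bank");
static_assert(max_threads_per_bank(32) == 32, "stride 32: all in bank 0");
static_assert(max_threads_per_bank(33) == 1, "stride 33: one thread per bank");
```

The stride 33 case shows why padding a 32-word row by one extra word is a common way to avoid conflicts for column-wise access.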
