AI智能总结
欢迎致辞李曦鹏NVIDIA开发与技术部亚太区总经理 GPU编程和优化–最佳实践分享刘冰&郑鹏 GPU编程和优化–最佳实践分享PetrickLiu刘冰, Devtech |PerkzZheng郑鹏, Devtech CUDA Optimization FundamentalsUnderstand what isGlobalMemoryCoalesced AccessUnderstand what is Shared Memory Bank ConflictWhat are ILP and TLPCase StudyWhyfusethe MHAFMHAas exampleAgenda CUDA Optimization FundamentalsUnderstand what isGlobalMemoryCoalesced AccessUnderstand what is Shared Memory Bank ConflictWhat are ILP and TLPCase StudyWhyfusethe MHAFMHAas exampleAgenda GPUArchitectureGPU: MassiveThroughputMachine, Keep the Throughput Maximumfull GH100 with 144 SMsH100 SXM5:3352GB/sFP32 non-Tensor:66.9TFLOPSFP16 dense-Tensor:984.9TFLOPSFP8dense-Tensor:1978.9 TFLOPS DRAM: Understand what is Global Memory Coalesced AccessTypical Example•Global memory loads and stores bythreads of A Warpare coalesced by the device intoas few as possible•Access unit is32-byte(Also calledSector) Dram->L2->L1•Threads in a warp access adjacent float value. 32 threads access 32x4 Bytes = 128 Bytes = 4 x 32B = 4 Sectors(Show inRed)•floatval= (float*)src[threadIdx.x];=> Sector 0=> Sector 1=> Sector 2=>Secror3 transactions.•Example:•T0~T7•T8~T15•T16~T23•T24~T31 11 Understand what is Global Memory Coalesced AccessMisaligned Example•Global memory loads and stores bythreads of A Warpare coalesced by the device intoas few as possible•Access unit is32-byte(Also calledSector)Dram->L2->L1•Threads in a warp access adjacent float value, but with an offset, like 5.•32 threads access 32x4 Bytes = 128 Bytes = 4 x 32B = 4 Sectors(Ideal) But will access 5 Sectors(Actual)(Show inRed)•floatval= (float*)src[threadIdx.x+ offset];=> Sector 0=> Sector 1=> Sector 2=> Sector 3=> Sector 4 transactions.•Example:•T0~T2•T3~T10•T11~T19•T20~T27•T28~T31 Understand what is Global Memory Coalesced AccessStride Access Example•Global memory loads and stores bythreads of A Warpare coalesced by the device intoas few as possible•Access unit is32-byte(Also calledSector)Dram->L2->L1•Example:Stride of 2:=> Sector 0=> Sector 1=> Sector 2=> Sector 3=> Sector 4=> Sector 5=> Sector 6=> Sector 71 warp access 128 Bytes = 4 x 32B = 4 Sectors(Ideal). But it will access 8 Sectors(Actual) transactions.•T0~T4•T5~T7•T7~T11•T12~T15•T16~T19•T20~T23•T24~T27•T28~T31 Understand what is Global Memory Coalesced AccessStride Access Example•Global memory loads and stores bythreads of A Warpare coalesced by the device intoas few as possible•Access unit is32-byte(Also calledSector)Dram->L2->L1•Example:Stride >= 32B:•T0=> Sector 0•T1=> Sector 1•T2=> Sector 2•T3=> Sector 3•T30 => Sector 30•T31 => Sector 311 warp access 128 Bytes4 x 32B = 4 Sectors(Ideal).But it will access32Sectors(Actual) transactions.•….. Understand what is Global Memory Coalesced AccessStride Access vs Coalesced Access Example•Global memory loads and stores bythreads of A Warpare coalesced by the device intoas few as possible•Access unit is32-byte(Also calledSector)Dram->L2->L1•Assume 1024 threads in each block, each block copy 4096 elements.•Test withL1 cache enable& disable, by–Xptxas–dlcm=ca or–Xptxas–dlcm=cg•(ca is for cache all, including L1; cg is for cache global, excluding L1) transactions. 15 Understand what is Global Memory Coalesced AccessStride Access vs Coalesced Access Example•Global memory loads and stores bythreads of A Warpare coalesced by the device intoas few as possible•Access unit is32-byte(Also calledSector)Dram->L2->L1•On A100-40GB,total 400 * 4096 float•L1 Cache enabled:•L1 Cache disabled:•Conclusion:•Try your best to coalesceevery global memory access. transactions. Understand what is Global Memory Coalesced AccessStride Access vs Coalesced Access Example•Global memory loads and stores bythreads of A Warpare coalesced by the device intoas few as possible•Access unit is32-byte(Also calledSector)Dram->L2->L1•CUDA provide built-in vector data type, such as float4, float2, int4 ,int2, etc. Can be used when the aligments meets therequirements.•On A100-40GB,total 400 * 4096 float•L1 Cache enabled:•Conclusion:•Try your best to coalesceevery global memory access. transactions. 17 Understand what is Global Memory Coalesced AccessCoalesced Access with vec type Example•Global memory loads and stores bythreads of A Warpare coalesced by the device intoas few as possible•Access unit is32-byte(Also calledSector)Dram->L2->L1•CUDA provide built-in vector data type, such as float4, float2, int4 ,int2, etc. Can be used when the alignments meets therequirements.•On A100-40GB,total 400 * 8192 float ,L1 enabled:•Conclusion:Try to use vec type to access memory when the aligments requirements are met. transactions. 18 CUDA Optimization FundamentalsUnderstand what isGlobalMemoryCoalesced AccessUnderstand what is Shared Memory Bank ConflictWhat are ILP and TLPCase StudyWhyfusethe MHAFMHAas exampleAgenda Understand what is Shared Memory Coalesced AccessOfficial Shared Memory Access Example•Shared memory has32 banksthat are organized such that successive 32-bit words map to successive banks.•Each