
Triton Inference Engine Session: Triton, a Multi-Framework AI Model Deployment Service, and Its Application in Ant Group's Prediction Engine (Part 1)

Information Technology | 2022-07-06 | NVIDIA | 罗***

He Chengjie (何成杰), Senior Software Architect, NVIDIA (TensorRT & Triton)

AGENDA
1. From TensorRT to Triton: inference deployment of AI models
2. Multi-framework support in Triton
3. The TensorRT inference acceleration library and the Triton inference server
4. Overview of Ant Group's prediction engine
5. Ant Group's innovations on Triton and applications in key scenarios
6. The future of Triton at Ant Group

TENSORRT AGENDA
• Overview
• Highlights
• Workflow
• New Features
• Summary

AI INFERENCE NEEDS TO RUN EVERYWHERE
Training produces a DNN model; inference then has to run that model everywhere, from the data center to embedded and automotive platforms.

NVIDIA TENSORRT: FROM EVERY FRAMEWORK, OPTIMIZED FOR EACH TARGET PLATFORM
TensorRT takes models from every framework and optimizes them for each target platform: Tesla A100, Tesla V100, Tesla T4, Jetson Xavier, Jetson AGX Xavier, DRIVE, and NVIDIA DLA.

TENSORRT POWERS AI INFERENCE (developer.nvidia.com/tensorrt)
Adopted across software, research / higher education, IT services, automotive, internet / telecom, healthcare & life sciences, hardware / semiconductor, manufacturing, cloud services, public sector, financial services, consulting services, and energy / oil & gas.
• 2M+ downloads
• 300,000 developers
• 16,000 companies
• 300 inference services worldwide
• Data center, edge and embedded
• FP32, TF32, FP16, INT8

HIGHLIGHTS

NVIDIA TENSORRT: SDK FOR HIGH-PERFORMANCE DEEP LEARNING INFERENCE (developer.nvidia.com/tensorrt)
Optimize and deploy neural networks in production environments:
• Maximize throughput for latency-critical apps with the compiler and runtime
• Deploy responsive and memory-efficient apps with INT8 & FP16 optimizations
• Optimize every network, including CNNs, RNNs and Transformers
• Accelerate every framework with ONNX support and native TensorRT integrations
• Run multiple models on a node with the containerized inference server
Workflow: a trained neural network is compiled by the TensorRT Optimizer into a TensorRT Runtime Engine, which is then deployed to embedded (Jetson), automotive (DRIVE) or data center (Tesla) targets. (A code sketch of this build step follows the precision-calibration section below.)

TENSORRT OPTIMIZATIONS
• Layer & tensor fusion
• Kernel auto-tuning
• Dynamic tensor memory
• Weights & activation precision calibration
➢ Optimizations are completely automatic
➢ Performed with a single function call

LAYER & TENSOR FUSION
[Figure: an un-optimized Inception-style subgraph (1x1 / 3x3 / 5x5 convolutions, each with bias and ReLU, plus max pool and concat) is collapsed by TensorRT into fused 1x1 / 3x3 / 5x5 CBR blocks through vertical fusion, horizontal fusion and merging.]
Supported layer fusions:
• Vertical fusion
• Horizontal fusion
• Layer elimination
• Layer merge

Network        Layers before    Layers after
VGG19          43               27
Inception V3   309              113
ResNet-152     670              159

Example fusion patterns: Convolution and ReLU activation; FullyConnected and ReLU activation; Scale and activation; Convolution and element-wise sum; Shuffle and Reduce; Shuffle and Shuffle; Scale (add 0, multiply by 1); Convolution and Scale; Reduce.
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#enable-fusion

KERNEL AUTO-TUNING
Kernel selection depends on multiple factors:
• Target platform
• Batch size
• Input dimensions
• Filter dimensions
• Tensor layout
For each combination TensorRT chooses:
• The implementation of the specific algorithm
• The kernels
• The tensor layouts
Hundreds of specialized kernels, optimized for every GPU platform (Tesla V100, Jetson TX2, DRIVE PX2).

FP16, INT8 PRECISION CALIBRATION

Precision    Dynamic range                    Calibration
FP32         -3.4x10^38 ~ +3.4x10^38          Training precision, no calibration required
FP16         -65504 ~ +65504                  No calibration required
INT8         -128 ~ +127                      Requires calibration

Precision calibration for INT8 inference:
➢ Minimizes information loss between FP32 and INT8 inference on a calibration dataset
➢ Completely automatic
Leverage reduced-precision capabilities:
➢ FP16 (Tesla V100): 125 TFLOPS FP16 vs. 15.7 TFLOPS FP32
➢ INT8 (T4, 70 W): 130 INT8 TOPS vs. 8.1 TFLOPS FP32
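The build step described above (parse a trained model, let the optimizer apply fusion, kernel auto-tuning and precision selection, then serialize a runtime engine) can be sketched with the TensorRT Python API. This is a minimal sketch under stated assumptions, not the speaker's implementation: it assumes a TensorRT 8.x installation, a placeholder ONNX file named model.onnx, and a hypothetical calibrator class for the optional INT8 path.

```python
# Minimal sketch (assumes TensorRT 8.x Python API; "model.onnx" is a placeholder).
# The builder applies layer/tensor fusion, kernel auto-tuning and precision
# selection automatically when the serialized engine is built.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)              # reduced-precision inference
# For INT8, a calibration dataset would also be attached, e.g.:
# config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = MyEntropyCalibrator(...)  # hypothetical calibrator class

engine_bytes = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine_bytes)                          # deployable runtime engine
```

The call to build_serialized_network corresponds to the "TensorRT Optimizer" box in the workflow; the resulting plan file is what the TensorRT runtime (or a serving layer such as Triton's TensorRT backend) later loads on the target platform.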
TENSORRT PERFORMANCE (developer.nvidia.com/tensorrt)
• 40x faster CNNs on V100 vs. CPU-only, under 7 ms latency (ResNet50)
• 140x faster language-translation RNNs on V100 vs. CPU-only inference (OpenNMT)

[Figure: ResNet50 inference throughput (images/sec) and latency. CPU-Only: 140 images/sec at 14 ms; V100 + TensorFlow: 305 images/sec at 6.67 ms; V100 + TensorRT: 5,700 images/sec at 6.83 ms.]
ResNet50 benchmark configuration:
• V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB, E5-2690 v4 @ 2.60 GHz (3.5 GHz Turbo, Broadwell), HT on
• V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60 GHz (3.5 GHz Turbo, Broadwell), HT on
• CPU-Only: Intel Xeon-D 1587 (Broadwell-E) with Intel DL SDK; score doubled to reflect Intel's stated claim of a 2x performance improvement on Skylake with AVX-512

[Figure: OpenNMT 692M inference throughput (sentences/sec) and latency. CPU-Only + Torch: 4 sentences/sec at 280 ms; V100 + Torch: 25 sentences/sec at 153 ms; V100 + TensorRT: 550 sentences/sec at 117 ms.]
OpenNMT benchmark configuration:
• V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60 GHz (3.5 GHz Turbo, Broadwell), HT on
• V100 + Torch: Torch (FP32), batch size 4, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60 GHz (3.5 GHz Turbo, Broadwell), HT on
• CPU-Only: Torch (FP32), batch size 1, Intel E5-2690 v4 @ 2.60 GHz (3.5 GHz Turbo, Broadwell), HT on

BERT-LARGE INFERENCE IN 4.1 ms (developer.nvidia.com/tensorrt)
TensorRT breaks the 10 ms barrier for BERT-Large, making real-time natural language understanding possible; BERT sample code is available in the TensorRT repo.
• BERT-Base: 1.6 ms
• BERT-Large: 4.1 ms
• CPU baseline: Intel Gold 6240, 18 threads, S/W: OpenVINO 2020.2
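Results like these depend strongly on batch size, precision and hardware, so it is useful to be able to re-measure latency on one's own engine. The following is a minimal timing sketch, not the methodology behind the figures above: it assumes TensorRT 8.x, pycuda for device memory, a prebuilt placeholder engine file named model.plan, static input shapes, and FP32 bindings.

```python
# Minimal latency-measurement sketch (assumptions: TensorRT 8.x, pycuda,
# a prebuilt "model.plan" with static shapes and FP32 bindings).
import time
import numpy as np
import pycuda.autoinit          # creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("model.plan", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate one device buffer per binding (FP32 and static shapes assumed).
buffers = []
for i in range(engine.num_bindings):
    size = trt.volume(engine.get_binding_shape(i)) * np.dtype(np.float32).itemsize
    buffers.append(cuda.mem_alloc(size))
bindings = [int(b) for b in buffers]

# Warm up, then time repeated synchronous executions.
for _ in range(10):
    context.execute_v2(bindings)
iters = 100
start = time.perf_counter()
for _ in range(iters):
    context.execute_v2(bindings)
print(f"avg latency: {(time.perf_counter() - start) / iters * 1000:.2f} ms")
```

Throughput follows from the measured latency and the batch size baked into the engine, which is why the published charts report both columns together.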