
Triton Inference Engine Session: Triton, a Multi-Framework AI Model Deployment Service, and Its Application in Ant Group's Prediction Engine (Part 1)

Information Technology | 2022-07-06 | NVIDIA | 罗***

He Chengjie (何成杰), Senior Software Architect, NVIDIA (TensorRT & Triton)

AGENDA
1. From TensorRT to Triton: inference deployment of AI models
2. Multi-framework support in Triton
3. The TensorRT inference acceleration library and the Triton inference server
4. Overview of Ant Group's prediction engine
5. Ant Group's innovations on Triton and applications in key scenarios
6. The future of Triton at Ant Group

TENSORRT AGENDA
• Overview
• Highlights
• Workflow
• New Features
• Summary

AI INFERENCE NEEDS TO RUN EVERYWHERE
Training produces a DNN model; inference then has to run that model everywhere, from the data center to embedded and automotive platforms.

NVIDIA TENSORRT: FROM EVERY FRAMEWORK, OPTIMIZED FOR EACH TARGET PLATFORM
TensorRT takes models from every framework and optimizes them for each target platform: Tesla A100, Tesla V100, Tesla T4, Jetson Xavier, Jetson AGX Xavier, DRIVE, and NVIDIA DLA.

TENSORRT POWERS AI INFERENCE (developer.nvidia.com/tensorrt)
Adopted across software, research / higher education, IT services, automotive, internet / telecom, healthcare & life sciences, hardware / semiconductor, manufacturing, cloud services, public sector, financial services, consulting services, and energy / oil & gas.
• 2M+ downloads
• 300,000 developers
• 16,000 companies
• 300 inference services worldwide
• Data center, edge and embedded
• FP32, TF32, FP16, INT8

HIGHLIGHTS

NVIDIA TENSORRT: SDK FOR HIGH-PERFORMANCE DEEP LEARNING INFERENCE (developer.nvidia.com/tensorrt)
Optimize and deploy neural networks in production environments:
• Maximize throughput for latency-critical apps with the compiler and runtime
• Deploy responsive and memory-efficient apps with INT8 & FP16 optimizations
• Optimize every network, including CNNs, RNNs and Transformers
• Accelerate every framework with ONNX support and native TensorRT integrations
• Run multiple models on a node with the containerized inference server
Workflow: a trained neural network is compiled by the TensorRT Optimizer into a TensorRT Runtime Engine, which is then deployed to embedded (Jetson), automotive (DRIVE) or data center (Tesla) targets. (A code sketch of this build step follows the precision-calibration section below.)

TENSORRT OPTIMIZATIONS
• Layer & tensor fusion
• Kernel auto-tuning
• Dynamic tensor memory
• Weights & activation precision calibration
➢ Optimizations are completely automatic
➢ Performed with a single function call

LAYER & TENSOR FUSION
[Figure: an un-optimized Inception-style subgraph (1x1 / 3x3 / 5x5 convolutions, each with bias and ReLU, plus max pool and concat) is collapsed by TensorRT into fused 1x1 / 3x3 / 5x5 CBR blocks through vertical fusion, horizontal fusion and merging.]
Supported layer fusions:
• Vertical fusion
• Horizontal fusion
• Layer elimination
• Layer merge

Network        Layers before    Layers after
VGG19          43               27
Inception V3   309              113
ResNet-152     670              159

Example fusion patterns: Convolution and ReLU activation; FullyConnected and ReLU activation; Scale and activation; Convolution and element-wise sum; Shuffle and Reduce; Shuffle and Shuffle; Scale (add 0, multiply by 1); Convolution and Scale; Reduce.
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#enable-fusion

KERNEL AUTO-TUNING
Kernel selection depends on multiple factors:
• Target platform
• Batch size
• Input dimensions
• Filter dimensions
• Tensor layout
For each combination TensorRT chooses:
• The implementation of the specific algorithm
• The kernels
• The tensor layouts
Hundreds of specialized kernels, optimized for every GPU platform (Tesla V100, Jetson TX2, DRIVE PX2).

FP16, INT8 PRECISION CALIBRATION

Precision    Dynamic range                    Calibration
FP32         -3.4x10^38 ~ +3.4x10^38          Training precision, no calibration required
FP16         -65504 ~ +65504                  No calibration required
INT8         -128 ~ +127                      Requires calibration

Precision calibration for INT8 inference:
➢ Minimizes information loss between FP32 and INT8 inference on a calibration dataset
➢ Completely automatic
Leverage reduced-precision capabilities:
➢ FP16 (Tesla V100): 125 TFLOPS FP16 vs. 15.7 TFLOPS FP32
➢ INT8 (T4, 70 W): 130 INT8 TOPS vs. 8.1 TFLOPS FP32
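The build step described above (parse a trained model, let the optimizer apply fusion, kernel auto-tuning and precision selection, then serialize a runtime engine) can be sketched with the TensorRT Python API. This is a minimal sketch under stated assumptions, not the speaker's implementation: it assumes a TensorRT 8.x installation, a placeholder ONNX file named model.onnx, and a hypothetical calibrator class for the optional INT8 path.

```python
# Minimal sketch (assumes TensorRT 8.x Python API; "model.onnx" is a placeholder).
# The builder applies layer/tensor fusion, kernel auto-tuning and precision
# selection automatically when the serialized engine is built.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)              # reduced-precision inference
# For INT8, a calibration dataset would also be attached, e.g.:
# config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = MyEntropyCalibrator(...)  # hypothetical calibrator class

engine_bytes = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine_bytes)                          # deployable runtime engine
```

The call to build_serialized_network corresponds to the "TensorRT Optimizer" box in the workflow; the resulting plan file is what the TensorRT runtime (or a serving layer such as Triton's TensorRT backend) later loads on the target platform.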
TENSORRT PERFORMANCE (developer.nvidia.com/tensorrt)
• 40x faster CNNs on V100 vs. CPU-only, under 7 ms latency (ResNet50)
• 140x faster language-translation RNNs on V100 vs. CPU-only inference (OpenNMT)

[Figure: ResNet50 inference throughput (images/sec) and latency. CPU-Only: 140 images/sec at 14 ms; V100 + TensorFlow: 305 images/sec at 6.67 ms; V100 + TensorRT: 5,700 images/sec at 6.83 ms.]
ResNet50 benchmark configuration:
• V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB, E5-2690 v4 @ 2.60 GHz (3.5 GHz Turbo, Broadwell), HT on
• V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60 GHz (3.5 GHz Turbo, Broadwell), HT on
• CPU-Only: Intel Xeon-D 1587 (Broadwell-E) with Intel DL SDK; score doubled to reflect Intel's stated claim of a 2x performance improvement on Skylake with AVX-512

[Figure: OpenNMT 692M inference throughput (sentences/sec) and latency. CPU-Only + Torch: 4 sentences/sec at 280 ms; V100 + Torch: 25 sentences/sec at 153 ms; V100 + TensorRT: 550 sentences/sec at 117 ms.]
OpenNMT benchmark configuration:
• V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60 GHz (3.5 GHz Turbo, Broadwell), HT on
• V100 + Torch: Torch (FP32), batch size 4, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60 GHz (3.5 GHz Turbo, Broadwell), HT on
• CPU-Only: Torch (FP32), batch size 1, Intel E5-2690 v4 @ 2.60 GHz (3.5 GHz Turbo, Broadwell), HT on

BERT-LARGE INFERENCE IN 4.1 ms (developer.nvidia.com/tensorrt)
TensorRT breaks the 10 ms barrier for BERT-Large, making real-time natural language understanding possible; BERT sample code is available in the TensorRT repo.
• BERT-Base: 1.6 ms
• BERT-Large: 4.1 ms
• CPU baseline: Intel Gold 6240, 18 threads, S/W: OpenVINO 2020.2
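Results like these depend strongly on batch size, precision and hardware, so it is useful to be able to re-measure latency on one's own engine. The following is a minimal timing sketch, not the methodology behind the figures above: it assumes TensorRT 8.x, pycuda for device memory, a prebuilt placeholder engine file named model.plan, static input shapes, and FP32 bindings.

```python
# Minimal latency-measurement sketch (assumptions: TensorRT 8.x, pycuda,
# a prebuilt "model.plan" with static shapes and FP32 bindings).
import time
import numpy as np
import pycuda.autoinit          # creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("model.plan", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate one device buffer per binding (FP32 and static shapes assumed).
buffers = []
for i in range(engine.num_bindings):
    size = trt.volume(engine.get_binding_shape(i)) * np.dtype(np.float32).itemsize
    buffers.append(cuda.mem_alloc(size))
bindings = [int(b) for b in buffers]

# Warm up, then time repeated synchronous executions.
for _ in range(10):
    context.execute_v2(bindings)
iters = 100
start = time.perf_counter()
for _ in range(iters):
    context.execute_v2(bindings)
print(f"avg latency: {(time.perf_counter() - start) / iters * 1000:.2f} ms")
```

Throughput follows from the measured latency and the batch size baked into the engine, which is why the published charts report both columns together.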