12.3 Using the trtexec Tool

This section introduces the trtexec tool, which can export an ONNX model to a TRT engine, analyze execution time, and inspect model optimizations.

trtexec

trtexec is the official command-line tool, used mainly for three purposes:

  • Generating serialized engines: build a TensorRT engine from an ONNX file and serialize it to a plan/engine file
  • Inspecting model files: view the layer-by-layer structure of an ONNX file or a plan file
  • Performance testing: benchmark a TensorRT engine on random or user-supplied inputs

trtexec accepts a large number of flags, which broadly fall into a build phase and a run (inference) phase.

Common build-phase flags

--onnx=<model>: path to the ONNX file

--minShapes=<shapes>, --optShapes=<shapes>, and --maxShapes=<shapes>: for ONNX models, specify the dynamic shape range (e.g., the batch-size range).

--memPoolSize=<pool_spec>: maximum amount of memory the optimization process may use

--saveEngine=<file>: output path for the serialized engine file

--fp16, --int8, --noTF32, and --best: specify the numeric precision

--verbose: print detailed information; off by default.

--skipInference: build and save the engine file without running inference.

--timingCacheFile=<file>: file to load/save the timing cache, which stores the kernel timings measured during tactic selection so that later builds can reuse them and finish faster.

--dumpLayerInfo, --exportLayerInfo=<file>: print and save detailed per-layer information

For more advanced usage, see the official documentation: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec
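Putting several of these build flags together, a typical engine-build command looks like the following (a minimal sketch; the file names are placeholders, not files from this tutorial):

trtexec --onnx=model.onnx --saveEngine=model_fp16.engine --fp16 --memPoolSize=workspace:2048 --timingCacheFile=timing.cache --skipInference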


Common run-phase flags

--loadEngine=<file>: the engine file to load

--shapes=<shapes>: the input tensor shapes to run inference with

--loadInputs=<specs>: load input values from files; by default random inputs are generated

--warmUp=<duration in ms>: minimum duration of the warm-up phase, in ms

--duration=<duration in seconds>: minimum duration of the measurement phase, in seconds

--iterations=<N>: minimum number of iterations in the measurement phase

--useCudaGraph: capture and launch the inference with a CUDA graph

--noDataTransfers: disable data transfers between host and device

--dumpProfile, --exportProfile=<file>: print and save per-layer performance information

--dumpLayerInfo, --exportLayerInfo=<file>: print and save detailed per-layer information

For more advanced usage, see the official documentation: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec
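Likewise, a typical performance run against an existing engine might combine these flags as follows (a sketch; the engine name is a placeholder):

trtexec --loadEngine=model.engine --warmUp=500 --duration=10 --iterations=100 --dumpProfile --exportProfile=profile.json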


Case 0: fixed batch size

Build an engine with a fixed batch size. Note that the batch size must match the ONNX model, so it needs to be set correctly when the ONNX file is generated.

trtexec --onnx=resnet50_bs_1.onnx --saveEngine=resnet50_bs_1.engine


Case 1: dynamic batch size

resnet50_bs_dynamic.onnx can be generated following Chapter 11, or downloaded from the Baidu Netdisk share (extraction code: whai).

trtexec --onnx=resnet50_bs_dynamic.onnx --saveEngine=resnet50_bs_dynamic_1-32-64.engine --timingCacheFile=dynamic-1-32-64.cache --minShapes=input:1x3x224x224 --maxShapes=input:64x3x224x224 --optShapes=input:16x3x224x224


As the table below shows, at fp32 a larger batch size brings only a modest throughput gain, so once latency is weighed in, batch size 8 is a reasonable trade-off (see the command sketch after the table).

FP32 (min-opt-max shapes)   1-8-64   1-16-64   1-32-64   1-64-64
Throughput (FPS)            952      1060      1126      1139
Latency (ms)                8.3      14.9      28.2      112
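With a dynamic engine, the runtime batch size is picked with --shapes; for instance, to measure the engine from Case 1 at the chosen batch size of 8 (a sketch reusing the file name above):

trtexec --loadEngine=resnet50_bs_dynamic_1-32-64.engine --shapes=input:8x3x224x224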

Case 2: comparing fp32, fp16, and int8 performance

Run run.bat/run.sh from the companion code and check the logs to observe how throughput and latency change; a sketch of the underlying build commands appears after the figure below.

As the figure below shows, on the throughput side:

  • fp16 is roughly 2-3x faster than fp32, and int8 roughly 2x faster than fp16
  • at the same precision, throughput grows with batch size, but the growth flattens beyond 32; int8 has the most headroom as batch size increases

On the latency side:

  • latency grows roughly linearly with batch size
  • latency roughly halves at each step from fp32 to fp16 to int8

(Figure: throughput and latency across batch sizes for fp32, fp16, and int8; from 《AI人工智慧 PyTorch自學》 Section 12.3)
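The script itself is not reproduced here, but the comparison boils down to building one engine per precision from the same ONNX file and reading off each run's performance summary. A minimal sketch, assuming the dynamic model from Case 1 (note that --int8 without a calibration cache makes trtexec fall back to dummy scales, which is fine for speed measurement but not for accuracy):

trtexec --onnx=resnet50_bs_dynamic.onnx --saveEngine=resnet50_fp16.engine --fp16 --minShapes=input:1x3x224x224 --optShapes=input:32x3x224x224 --maxShapes=input:64x3x224x224

trtexec --onnx=resnet50_bs_dynamic.onnx --saveEngine=resnet50_int8.engine --int8 --minShapes=input:1x3x224x224 --optShapes=input:32x3x224x224 --maxShapes=input:64x3x224x224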

Case 3: inspecting layer details

With the --dumpLayerInfo and --exportLayerInfo flags, trtexec outputs per-layer details, the layer fusions performed, and the names of the input/output tensors (bindings).

trtexec --onnx=resnet50_bs_dynamic.onnx --saveEngine=demo.engine --skipInference --dumpLayerInfo --exportLayerInfo="exportLayerInfo.log"


The exportLayerInfo.log file contains information like the following, mainly including:

  • The network layers and their fusions, e.g. "Reformatting CopyNode for Input Tensor 0 to Conv_0 + Relu_1":
    • "Reformatting CopyNode" means TensorRT copies input tensor 0 into the fused Conv_0 + Relu_1 layer. The "Reformatting" refers to TensorRT fusing layers while optimizing the network, to cut memory copies and improve compute efficiency; the "CopyNode" is an inserted copy layer that moves the input data into the newly fused layer.
    • This kind of fusion reduces memory accesses and streamlines the data flow, improving inference performance.
  • Bindings: the names of the input and output tensors. These were set when the ONNX model was exported and are also used in the downstream Python inference code (see the one-liner after the excerpt below).

{"Layers": ["Reformatting CopyNode for Input Tensor 0 to Conv_0 + Relu_1"

,"Conv_0 + Relu_1"

,"MaxPool_2"

,"Conv_3 + Relu_4"

,"Conv_5 + Relu_6"

...

,"Reformatting CopyNode for Input Tensor 0 to Gemm_121"

,"Gemm_121"

,"reshape_after_Gemm_121"

],

"Bindings": ["input"

,"output"

]}
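Since the exported file is JSON, the binding names can also be pulled out programmatically, e.g. with a Python one-liner from the shell (a sketch; assumes the file is valid JSON as shown above):

python -c "import json; print(json.load(open('exportLayerInfo.log'))['Bindings'])"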


Case 4: what the verbose log contains

With the verbose switch on, trtexec prints detailed output covering six main areas:

  • Model import: model format and name
  • Build options: which optimization flags were set, e.g. --fp16
  • Device information: details of the current GPU device
  • Graph optimization details: a full account of layer fusions and the optimized graph
  • Layer implementation selection (thousands of lines): the process of trying candidate kernels for each layer and picking the fastest
  • Timing statistics: statistics of inference time, including data copies and compute

trtexec --onnx=resnet50_bs_dynamic.onnx --saveEngine=demo.engine --verbose > verbose.log


Running the command above produces the log file; its main sections are described below.

Model Options: the imported model

[08/20/2023-11:59:45] [I] === Model Options ===

[08/20/2023-11:59:45] [I] Format: ONNX

[08/20/2023-11:59:45] [I] Model: resnet50_bs_dynamic.onnx

[08/20/2023-11:59:45] [I] Output:


Build Options: the settings used to build the TRT engine

[08/20/2023-11:59:45] [I] === Build Options ===

[08/20/2023-11:59:45] [I] Max batch: explicit batch

[08/20/2023-11:59:45] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default

[08/20/2023-11:59:45] [I] minTiming: 1

[08/20/2023-11:59:45] [I] avgTiming: 8

[08/20/2023-11:59:45] [I] Precision: FP32

...


Inference Options: the inference settings

[08/20/2023-11:59:45] [I] === Inference Options===

[08/20/2023-11:59:45] [I] Batch: Explicit

[08/20/2023-11:59:45] [I] Input inference shapes: model

[08/20/2023-11:59:45] [I] Iterations: 10

[08/20/2023-11:59:45] [I] Duration: 3s (+ 200ms warm up)

[08/20/2023-11:59:45] [I] Sleep time: 0ms

[08/20/2023-11:59:45] [I] Idle time: 0ms

[08/20/2023-11:59:45] [I] Inference Streams: 1


Reporting Options: the log output settings

[08/20/2023-11:59:45] [I] === Reporting Options ===

[08/20/2023-11:59:45] [I] Verbose: Enabled

[08/20/2023-11:59:45] [I] Averages: 10 inferences

[08/20/2023-11:59:45] [I] Percentiles: 90,95,99

[08/20/2023-11:59:45] [I] Dump refittable layers:Disabled

[08/20/2023-11:59:45] [I] Dump output: Disabled

[08/20/2023-11:59:45] [I] Profile: Disabled

[08/20/2023-11:59:45] [I] Export timing to JSON file:

[08/20/2023-11:59:45] [I] Export output to JSON file:

[08/20/2023-11:59:45] [I] Export profile to JSON file:


Device Information

[08/20/2023-11:59:46] [I] === Device Information ===

[08/20/2023-11:59:46] [I] Selected Device: NVIDIA GeForce RTX 3060 Laptop GPU

[08/20/2023-11:59:46] [I] Compute Capability: 8.6

[08/20/2023-11:59:46] [I] SMs: 30

[08/20/2023-11:59:46] [I] Device Global Memory: 6143 MiB

[08/20/2023-11:59:46] [I] Shared Memory per SM: 100 KiB

[08/20/2023-11:59:46] [I] Memory Bus Width: 192 bits (ECC disabled)

[08/20/2023-11:59:46] [I] Application Compute Clock Rate: 1.702 GHz

[08/20/2023-11:59:46] [I] Application Memory Clock Rate: 7.001 GHz


[03/28/2024-15:01:18] [I] === Device Information ===

[03/28/2024-15:01:20] [I] Available Devices:

[03/28/2024-15:01:20] [I]   Device 0: "NVIDIA GeForce RTX 4060 Laptop GPU

[03/28/2024-15:01:20] [I] Selected Device: NVIDIA GeForce RTX 4060 Laptop GPU

[03/28/2024-15:01:20] [I] Selected Device ID: 0

[03/28/2024-15:01:20] [I] Compute Capability: 8.9

[03/28/2024-15:01:20] [I] SMs: 24

[03/28/2024-15:01:20] [I] Device Global Memory: 8187 MiB

[03/28/2024-15:01:20] [I] Shared Memory per SM: 100 KiB

[03/28/2024-15:01:20] [I] Memory Bus Width: 128 bits (ECC disabled)

[03/28/2024-15:01:20] [I] Application Compute Clock Rate: 1.89 GHz

[03/28/2024-15:01:20] [I] Application Memory Clock Rate: 8.001 GHz


For comparison, here is the device information of an RTX 4060. Its SM count is lower than the 3060's, which comes down to how the vendor segments its product line: although it is a 4060, its compute performance does not beat the 3060, because it has fewer of the cores that matter, 24 SMs versus the 3060's 30. "SMs" stands for Streaming Multiprocessors, the basic units that execute CUDA cores; the more SMs, the more compute power.

For the RTX 4060 Laptop GPU, the official specs list 3072 CUDA cores across 24 SMs, i.e., 128 CUDA cores per SM.

For the RTX 3060 Laptop GPU, the official specs list 3840 CUDA cores across 30 SMs, again 128 CUDA cores per SM.

The 4060 not only has fewer SMs but also a narrower memory bus (128 bits vs. 192 bits); its only advantage is memory capacity (8 GB vs. 6 GB).

ONNX model loading and network creation

Parsing the model takes 0.14 seconds and yields 126 layers in total, which TensorRT then optimizes.

[08/20/2023-11:59:52] [I] [TRT] ----------------------------------------------------------------

[08/20/2023-11:59:52] [I] [TRT] Input filename:   resnet50_bs_dynamic.onnx

[08/20/2023-11:59:52] [I] [TRT] ONNX IR version:  0.0.7

[08/20/2023-11:59:52] [I] [TRT] Opset version:    13

[08/20/2023-11:59:52] [I] [TRT] Producer name:    pytorch

[08/20/2023-11:59:52] [I] [TRT] Producer version: 1.12.0

[08/20/2023-11:59:52] [I] [TRT] Domain:          

[08/20/2023-11:59:52] [I] [TRT] Model version:    0

[08/20/2023-11:59:52] [I] [TRT] Doc string:      

[08/20/2023-11:59:52] [I] [TRT] ----------------------------------------------------------------

 

 

 

[08/20/2023-11:59:52] [V] [TRT] Plugin creator already registered - ::BatchedNMSDynamic_TRT version 1

[08/20/2023-11:59:52] [V] [TRT] Plugin creator already registered - ::BatchedNMS_TRT version 1

[08/20/2023-11:59:52] [V] [TRT] Plugin creator already registered - ::BatchTilePlugin_TRT version 1

...

[08/20/2023-11:59:52] [V] [TRT] Adding network input: input with dtype: float32, dimensions: (-1, 3, 224, 224)

[08/20/2023-11:59:52] [V] [TRT] Registering tensor: input for ONNX tensor: input

 

[08/20/2023-11:59:52] [V] [TRT] Importing initializer: fc.weight

[08/20/2023-11:59:52] [V] [TRT] Importing initializer: fc.bias

[08/20/2023-11:59:52] [V] [TRT] Importing initializer: onnx::Conv_497

[08/20/2023-11:59:52] [V] [TRT] Importing initializer: onnx::Conv_498

[08/20/2023-11:59:52] [V] [TRT] Importing initializer: onnx::Conv_500

...

[08/20/2023-11:59:52] [V] [TRT] Searching for input: onnx::Conv_497

[08/20/2023-11:59:52] [V] [TRT] Searching for input: onnx::Conv_498

[08/20/2023-11:59:52] [V] [TRT] Conv_0 [Conv] inputs: [input -> (-1, 3, 224, 224)[FLOAT]], [onnx::Conv_497 -> (64, 3, 7, 7)[FLOAT]], [onnx::Conv_498 -> (64)[FLOAT]],

[08/20/2023-11:59:52] [V] [TRT] Convolution input dimensions: (-1, 3, 224, 224)

[08/20/2023-11:59:52] [V] [TRT] Registering layer: Conv_0 for ONNX node: Conv_0

...

[08/20/2023-11:59:52] [V] [TRT] Marking output_1 as output: output

[08/20/2023-11:59:52] [I] Finished parsing network model. Parse time: 0.141545

[08/20/2023-11:59:52] [V] [TRT] After dead-layer removal: 126 layers

[08/20/2023-11:59:52] [V] [TRT] Graph construction completed in 0.0015515 seconds.


Graph optimization

Optimizing the compute graph speeds up inference; in this case, the model goes from 126 layers down to 57.

[08/20/2023-11:59:52] [I] [TRT] Graph optimization time: 0.0150853 seconds.

Graph optimization relies heavily on layer fusion. The idea is to merge related computation across layers wherever possible, avoiding unnecessary intermediate tensors, reducing memory reads and writes, and lowering compute cost, which ultimately improves inference efficiency.

Common fusion passes include the following (see the grep sketch after this list):

  • ConstShuffleFusion: used on the fc layer's bias; folds a constant shuffle into the bias data to remove redundant computation.
  • ShuffleShuffleFusion: used in flatten layers; reduces the number of shuffle operations.
  • ConvReshapeBiasAddFusion: fuses the reshape of a conv layer's output with the following bias add, reducing runtime.
  • ConvReluFusion: fuses a conv layer with the following ReLU activation, saving a separate ReLU pass.
  • ConvEltwiseSumFusion: fuses a conv layer with an element-wise add, avoiding duplicated computation.
  • ReduceToPoolingFusion: rewrites a reduce layer as pooling to cut compute cost.
  • ConcatReluFusion: fuses concat and relu layers, reducing ReLU passes.
  • BiasSoftmaxFusion: fuses a bias layer with softmax, removing redundant computation.
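Which fusion passes actually fired can be listed directly from the verbose log, for example (assuming the verbose.log produced by the command in this case); lines like the excerpt below will match:

grep -i "fusion" verbose.log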

[08/20/2023-11:59:52] [V] [TRT] Running: ConstShuffleFusion on fc.bias

[08/20/2023-11:59:52] [V] [TRT] ConstShuffleFusion: Fusing fc.bias with (Unnamed Layer* 129) [Shuffle]

[08/20/2023-11:59:52] [V] [TRT] After Myelin optimization: 125 layers

...

[08/20/2023-11:59:52] [V] [TRT] After dupe layer removal: 57 layers

[08/20/2023-11:59:52] [V] [TRT] After final dead-layer removal: 57 layers

[08/20/2023-11:59:52] [V] [TRT] After tensor merging: 57 layers

[08/20/2023-11:59:52] [V] [TRT] After vertical fusions: 57 layers

[08/20/2023-11:59:52] [V] [TRT] After dupe layer removal: 57 layers

[08/20/2023-11:59:52] [V] [TRT] After final dead-layer removal: 57 layers

[08/20/2023-11:59:52] [V] [TRT] After tensor merging: 57 layers

[08/20/2023-11:59:52] [V] [TRT] After slice removal: 57 layers

[08/20/2023-11:59:52] [V] [TRT] After concat removal: 57 layers

[08/20/2023-11:59:52] [V] [TRT] Trying to split Reshape and strided tensor

[08/20/2023-11:59:52] [I] [TRT] Graph optimization time: 0.0150853 seconds.


Selecting each layer's implementation

A network layer can be implemented in many ways: different backend libraries, different algorithms, different strategies. TensorRT runs all of the candidate implementations and keeps the fastest one.

In this process, the runner and the tactic are the key components TensorRT uses to implement a layer.

  • A runner represents an algorithm or code path that implements a layer. For example, a convolution layer can be implemented via cuDNN, cuBLAS, or TensorRT's own cask kernels. The runner encapsulates the concrete algorithm.
  • A tactic is a concrete implementation variant. Each runner can have multiple tactics, one per optimization approach. For example, the cask convolution runner can have tensor-core-based tactics, tactics with various tile sizes, and so on. A tactic is the code that applies a particular optimization to the layer.

By combining different runners and tactics, TensorRT obtains many implementations of a layer, then uses its auto-tuner to benchmark the combinations and select the best one.

For example, for a convolution layer:

  • the runner can be cuDNN, cuBLAS, or cask convolution;
  • under cask convolution there can be tensor-core-based tactics, tactics with 32x32 or 64x64 tile sizes, and so on.

In the end, a combination such as cask convolution + 64x64 tile size might be chosen as the best implementation.

In this log, the first runner profiles Conv_0 + Relu_1; the tactic finally selected is Tactic Name 0x9cb304e2edbc1221, at 0.0402 ms.

 

[08/20/2023-11:59:52] [V] [TRT] =============== Computing costs for

[08/20/2023-11:59:52] [V] [TRT] *************** Autotuning format combination: Float(150528,50176,224,1) -> Float(802816,12544,112,1) ***************

[08/20/2023-11:59:52] [V] [TRT] --------------- Timing Runner: Conv_0 + Relu_1 (CaskConvolution[0x80000009])

[08/20/2023-11:59:52] [V] [TRT] Tactic Name: sm50_xmma_conv_fprop_fused_conv_act_fp32_NCHW_fp32_NCHW_KCRS_fp32_fp32_fp32_Accfloat_1_1_cC1_dC1_srcVec1_fltVec1_1_TP3_TQ4_C1_R7_S7_U2_V2 Tactic: 0x0a617a3531b5b6dc Time: 0.102619

 

[08/20/2023-11:59:52] [V] [TRT] Tactic Name: sm50_xmma_conv_fprop_fused_conv_act_fp32_NCHW_fp32_NCHW_KCRS_fp32_fp32_fp32_Accfloat_1_1_cC1_dC1_srcVec1_fltVec2_2_TP3_TQ4_C1_R7_S7_U2_V2 Tactic: 0x520e893be7313ed2 Time: 0.0999131

...

[08/20/2023-11:59:52] [V] [TRT] Conv_0 + Relu_1 (CaskConvolution[0x80000009]) profiling completed in 0.0647644 seconds. Fastest Tactic: 0x9cb304e2edbc1221 Time: 0.0402286


In the end TensorRT runs this cost computation for all 57 layers, obtaining the best implementation for each.


Besides the network layers themselves, reformat layers are also needed. A reformat layer changes a tensor's layout, rearranging the previous layer's output into the format the next layer requires, so that the tensors between the two layers are compatible and the layers can be fused.

For example, Conv_0 + Relu_1 needs a [50176,1:4,224,1]-format tensor as input, while the input layer outputs the [150528,50176,224,1] format, so a reformat layer is inserted between the input layer and Conv_0 to rearrange the tensor into the format the conv layer needs.

In total, 25 reformat layers are added, bringing the model to 82 layers.

...

[08/20/2023-12:00:07] [V] [TRT] Adding reformat layer: Reformatted Input Tensor 0 to Gemm_121 (onnx::Flatten_493) from Float(512,1:4,512,512) to Float(2048,1,1,1)

[08/20/2023-12:00:07] [V] [TRT] Formats and tactics selection completed in 15.1664 seconds.

[08/20/2023-12:00:07] [V] [TRT] After reformat layers: 82 layers

[08/20/2023-12:00:07] [V] [TRT] Total number of blocks in pre-optimized block assignment: 82

[08/20/2023-12:00:07] [I] [TRT] Detected 1 inputs and 1 output network tensors.


Memory usage

This section of the log reports per-layer memory usage plus the totals; in this case, the engine's GPU footprint is 107 MiB.

...

[08/20/2023-12:00:07] [V] [TRT] Layer: Conv_116 + Add_117 + Relu_118 Host Persistent: 7200 Device Persistent: 0 Scratch Memory: 0

[08/20/2023-12:00:07] [V] [TRT] Layer: GlobalAveragePool_119 Host Persistent: 4176 Device Persistent: 0 Scratch Memory: 0

[08/20/2023-12:00:07] [V] [TRT] Layer: Gemm_121 Host Persistent: 6944 Device Persistent: 0 Scratch Memory: 0

[08/20/2023-12:00:07] [V] [TRT] Skipped printing memory information for 26 layers with 0 memory size i.e. Host Persistent + Device Persistent + Scratch Memory == 0.

[08/20/2023-12:00:07] [I] [TRT] Total Host Persistent Memory: 331696

[08/20/2023-12:00:07] [I] [TRT] Total Device Persistent Memory: 22016

[08/20/2023-12:00:07] [I] [TRT] Total Scratch Memory: 4608

[08/20/2023-12:00:07] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 18 MiB, GPU 107 MiB


Engine build results

The layers of the finished engine and the kernels selected for them are printed.

The engine build took 15.3 seconds.

 

[08/20/2023-12:00:07] [V] [TRT] Engine generation completed in 15.3167 seconds.

[08/20/2023-12:00:07] [V] [TRT] Deleting timing cache: 214 entries, served 528 hits since creation.

[08/20/2023-12:00:07] [V] [TRT] Engine Layer Information:

Layer(Reformat): Reformatting CopyNode for Input Tensor 0 to Conv_0 + Relu_1, Tactic: 0x00000000000003e8, input (Float[1,3,224,224]) -> Reformatted Input Tensor 0 to Conv_0 + Relu_1 (Float[1,3:4,224,224])

Layer(CaskConvolution): Conv_0 + Relu_1, Tactic: 0x9cb304e2edbc1221, Reformatted Input Tensor 0 to Conv_0 + Relu_1 (Float[1,3:4,224,224]) -> onnx::MaxPool_323 (Float[1,64:4,112,112])


Inference timing statistics

trtexec then runs the inference benchmark and reports the following metrics, each with its statistics:

  • Throughput: inference throughput in queries per second (QPS); multiply by the batch size to get the actual number of images per second.
  • Latency: statistics for the latency of a single inference: min, max, mean, median, and the 90%/95%/99% percentiles.
  • Enqueue Time: statistics for the host-side latency of enqueuing a query (not the data transfer itself).
  • H2D Latency: statistics for the host-to-device transfer latency of the input tensors.
  • GPU Compute Time: statistics for the kernel execution time on the GPU.
  • D2H Latency: statistics for the device-to-host transfer latency of the outputs.
  • Total Host Walltime: total wall-clock time of the benchmark, covering transfers and compute.
  • Total GPU Compute Time: total GPU compute time summed over all queries.

[08/20/2023-12:00:11] [I] === Performance summary ===

[08/20/2023-12:00:11] [I] Throughput: 502.107 qps

[08/20/2023-12:00:11] [I] Latency: min = 1.88583 ms, max = 2.96844 ms, mean = 1.93245 ms, median = 1.91833 ms, percentile(90%) = 1.9592 ms, percentile(95%) = 1.98364 ms, percentile(99%) = 2.34845 ms

[08/20/2023-12:00:11] [I] Enqueue Time: min = 0.312988 ms, max = 1.77197 ms, mean = 0.46439 ms, median = 0.390869 ms, percentile(90%) = 0.748291 ms, percentile(95%) = 0.836853 ms, percentile(99%) = 1.10229 ms

[08/20/2023-12:00:11] [I] H2D Latency: min = 0.0714111 ms, max = 0.225464 ms, mean = 0.0769845 ms, median = 0.0737305 ms, percentile(90%) = 0.088623 ms, percentile(95%) = 0.0947266 ms, percentile(99%) = 0.112671 ms

[08/20/2023-12:00:11] [I] GPU Compute Time: min = 1.80939 ms, max = 2.86005 ms, mean = 1.8518 ms, median = 1.84009 ms, percentile(90%) = 1.87183 ms, percentile(95%) = 1.89734 ms, percentile(99%) = 2.22314 ms

[08/20/2023-12:00:11] [I] D2H Latency: min = 0.00317383 ms, max = 0.0220947 ms, mean = 0.00366304 ms, median = 0.00341797 ms, percentile(90%) = 0.00390625 ms, percentile(95%) = 0.00402832 ms, percentile(99%) = 0.00488281 ms

[08/20/2023-12:00:11] [I] Total Host Walltime: 3.00334 s

[08/20/2023-12:00:11] [I] Total GPU Compute Time: 2.79252 s

[08/20/2023-12:00:11] [I] Explanations of the performance metrics are printed in the verbose logs.

[08/20/2023-12:00:11] [V]

[08/20/2023-12:00:11] [V] === Explanations of the performance metrics ===

[08/20/2023-12:00:11] [V] Total Host Walltime: the host walltime from when the first query (after warmups) is enqueued to when the last query is completed.

[08/20/2023-12:00:11] [V] GPU Compute Time: the GPU latency to execute the kernels for a query.

[08/20/2023-12:00:11] [V] Total GPU Compute Time: the summation of the GPU Compute Time of all the queries. If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overheads or data transfers.

[08/20/2023-12:00:11] [V] Throughput: the observed throughput computed by dividing the number of queries by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overheads or data transfers.

[08/20/2023-12:00:11] [V] Enqueue Time: the host latency to enqueue a query. If this is longer than GPU Compute Time, the GPU may be under-utilized.

[08/20/2023-12:00:11] [V] H2D Latency: the latency for host-to-device data transfers for input tensors of a single query.

[08/20/2023-12:00:11] [V] D2H Latency: the latency for device-to-host data transfers for output tensors of a single query.

[08/20/2023-12:00:11] [V] Latency: the summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency to infer a single query.

[08/20/2023-12:00:11] [I]


Case 5: running inference with a TRT engine

Running inference on the TRT engine lets you inspect layer information and per-layer timing.

trtexec --loadEngine=resnet50_bs_128_fp32.engine --batch=128 --useCudaGraph --dumpProfile --dumpLayerInfo > inference.log


As the profile shows, the convolution layers account for most of the time.

[08/20/2023-17:51:29] [I] === Profile (32 iterations ) ===

[08/20/2023-17:51:29] [I]                                                                    Layer   Time (ms)   Avg. Time (ms)   Median Time (ms)   Time %

[08/20/2023-17:51:29] [I]              Reformatting CopyNode for Input Tensor 0 to Conv_0 + Relu_1       18.53           0.5790             0.5765      0.6

[08/20/2023-17:51:29] [I]                                                          Conv_0 + Relu_1      116.65           3.6453             3.6336      3.6

[08/20/2023-17:51:29] [I]                                                                MaxPool_2       51.21           1.6004             1.6005      1.6


Summary

This section covered the basic usage of trtexec: converting an ONNX model to a TRT engine, configuring dynamic batch sizes, and selecting half-precision quantization.

For more complete usage, see trtexec --help and the official documentation.

 
