12.4 TensorRT Utility Tools
Introduction
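Production deployment is a complex task with many stages, so it calls for good tools for inspection and analysis. NVIDIA provides a set of tools for profiling, debugging, and optimizing the deployment pipeline. This section introduces two of them: Nsight Systems and Polygraphy.
Nsight Systems profiles CPU and GPU activity and helps locate an application's performance bottlenecks.
Polygraphy runs and debugs deep learning models across several frameworks, and is used to track down problems that arise when a model is converted from one format to another.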
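Nsight Systems
NVIDIA Nsight Systems is a system-wide profiling tool. It reports performance metrics such as CPU and GPU utilization, memory usage, and data throughput, which makes it possible to find where an application's bottlenecks are. See the user documentation for details.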
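Installation
Open the official website, then select and download the installer that matches your operating system and version.
- NsightSystems-2023.3.1.92-3314722.msi: double-click to install and accept the defaults throughout.
- Add the command-line tool directory to the PATH environment variable: C:\Program Files\NVIDIA Corporation\Nsight Systems 2023.3.1\target-windows-x64
- Add the GUI directory to the PATH environment variable: C:\Program Files\NVIDIA Corporation\Nsight Systems 2023.3.1\host-windows-x64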
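Running
nsys ships with both a command-line tool and a UI. The UI is used for the demonstration here.
- The command-line tool is C:\Program Files\NVIDIA Corporation\Nsight Systems 2023.3.1\target-windows-x64\nsys.exe
- The UI is C:\Program Files\NVIDIA Corporation\Nsight Systems 2023.3.1\host-windows-x64\nsys-ui.exe
The workflow is to launch the target program from nsys, which then monitors its performance automatically.
Step 1: Start nsys. Type nsys-ui in cmd, or double-click nsys-ui.exe in the installation directory.
Step 2: Create a project and configure the program to run. Here we run this chapter's companion script 01_trt_resnet50_cuda.py. The exact steps are shown in the figure below.
Step 3: View the statistics.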
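The three UI steps above also have a command-line counterpart. The following is a minimal sketch, not the author's workflow: it only assembles the argument lists for `nsys profile` and `nsys stats`. The report name `resnet50_profile` and the helper names are assumptions made for illustration, and the trace selection (`cuda,nvtx`) is just one reasonable choice.

```python
import shlex
import subprocess


def build_nsys_commands(script, report="resnet50_profile"):
    """Return (profile_cmd, stats_cmd) as argument lists.

    `report` is a hypothetical output name; nsys appends the .nsys-rep suffix.
    """
    profile_cmd = [
        "nsys", "profile",
        "--trace=cuda,nvtx",       # collect CUDA API/kernel activity and NVTX ranges
        "--force-overwrite=true",  # replace an existing report with the same name
        "-o", report,
        "python", script,          # the program is launched *by* nsys
    ]
    stats_cmd = ["nsys", "stats", f"{report}.nsys-rep"]
    return profile_cmd, stats_cmd


def run_commands(commands):
    """Execute the commands in order; requires nsys on PATH."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


profile_cmd, stats_cmd = build_nsys_commands("01_trt_resnet50_cuda.py")
print(shlex.join(profile_cmd))
print(shlex.join(stats_cmd))
```

Calling `run_commands([profile_cmd, stats_cmd])` would perform the profiling run and then print summary tables. The resulting .nsys-rep file can also be opened in nsys-ui.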
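Nsight Systems is a powerful tool. For how to use it effectively, and how to analyze timings at a finer granularity and closer to the hardware, consult the official documentation as your needs require.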
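One common way to get finer-grained timings is to mark stages of your own code with NVTX ranges, which Nsight Systems displays on its timeline. The sketch below assumes the optional `nvtx` Python package (`pip install nvtx`) and falls back to a no-op when it is missing. The stage names and the `time.sleep` placeholder are hypothetical stand-ins for the real preprocessing, inference, and postprocessing code.

```python
import contextlib
import time

try:
    import nvtx  # optional dependency; assumed to be installed separately
except ImportError:
    nvtx = None


def profiled_range(name):
    """Return an NVTX range if the nvtx package is available, else a no-op context."""
    if nvtx is None:
        return contextlib.nullcontext()
    return nvtx.annotate(name)


def run_pipeline():
    """Run three placeholder stages, each wrapped in its own named range."""
    timings = {}
    for stage in ("preprocess", "inference", "postprocess"):
        with profiled_range(stage):
            start = time.perf_counter()
            time.sleep(0.001)  # placeholder for the real work of this stage
            timings[stage] = time.perf_counter() - start
    return timings


print(run_pipeline())
```

When the script is launched under nsys with NVTX tracing enabled, each stage should appear as a named range, which makes it easier to attribute GPU activity to a specific part of the program.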
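Polygraphy
Polygraphy is an important debugging tool in the TensorRT ecosystem. It can:
- run inference with multiple backends, including TensorRT, ONNX Runtime, and TensorFlow;
- compare layer-by-layer results across backends;
- build a TensorRT engine from a model file and serialize it as a .plan file;
- display layer-by-layer information about a model;
- modify ONNX models, for example by extracting subgraphs or simplifying the graph;
- diagnose why an ONNX-to-TensorRT conversion fails, splitting the original graph into subgraphs that can and cannot be converted to TensorRT and saving them separately;
- isolate faulty TensorRT tactics.
The most commonly used features are:
- checking the correctness and accuracy of results computed by TensorRT
- finding the layers that compute wrong results or lose precision
- performing simple graph optimizations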
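As a rough orientation, the three common features above map onto Polygraphy command lines like the ones assembled below. This is a sketch under assumptions: `model.onnx` and `folded.onnx` are hypothetical file names, and constant folding is used as one example of a "simple graph optimization". The script only prints the commands and does not run them.

```python
import shlex

MODEL = "model.onnx"  # hypothetical input model

COMMON_TASKS = {
    # Compare final outputs of ONNX Runtime and TensorRT.
    "check TensorRT accuracy": [
        "polygraphy", "run", MODEL, "--onnxrt", "--trt",
    ],
    # Mark every layer as an output so mismatches can be traced to a layer.
    "find the diverging layer": [
        "polygraphy", "run", MODEL, "--onnxrt", "--trt",
        "--onnx-outputs", "mark", "all",
        "--trt-outputs", "mark", "all",
    ],
    # Fold constants and write the simplified graph to a new file.
    "simple graph optimization": [
        "polygraphy", "surgeon", "sanitize", MODEL,
        "--fold-constants", "-o", "folded.onnx",
    ],
}

for task, argv in COMMON_TASKS.items():
    print(f"{task}:\n  {shlex.join(argv)}")
```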
Installation
pip install nvidia-pyindex
pip install polygraphy
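Verification
Polygraphy is installed into a Python virtual environment, so activate that environment first and then run polygraphy -h.
The help output below lists eight subtools, {run,convert,inspect,check,surgeon,template,debug,data}. See the documentation for what each one does.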
(pt112) C:\Users\yts32>polygraphy -h
usage: polygraphy [-h] [-v] {run,convert,inspect,check,surgeon,template,debug,data} ...
Polygraphy: A Deep Learning Debugging Toolkit
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
Tools:
{run,convert,inspect,check,surgeon,template,debug,data}
run Run inference and compare results across backends.
convert Convert models to other formats.
inspect View information about various types of files.
check Check and validate various aspects of a model
surgeon Modify ONNX models.
template [EXPERIMENTAL] Generate template files.
debug [EXPERIMENTAL] Debug a wide variety of model issues.
data Manipulate input and output data generated by other Polygraphy subtools.
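The same check can be made from inside Python, which is handy when it is unclear which environment is active. This is a small sketch: the helper name `polygraphy_status` is made up for illustration, and it assumes the package exposes a `__version__` attribute (it falls back to `None` if not).

```python
import importlib.util
import shutil


def polygraphy_status():
    """Report whether Polygraphy is importable and where its CLI entry point lives."""
    if importlib.util.find_spec("polygraphy") is None:
        return {"installed": False, "version": None, "cli": None}
    import polygraphy
    return {
        "installed": True,
        "version": getattr(polygraphy, "__version__", None),
        "cli": shutil.which("polygraphy"),  # None if the script is not on PATH
    }


print(polygraphy_status())
```

If `installed` is False, the interpreter running the script is most likely not the one from the environment where pip install was executed.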
Example 1: Running ONNX and TensorRT models
polygraphy run resnet50_bs_1.onnx --onnxrt
polygraphy run resnet50_bs_1.engine --trt --input-shapes "input:[1,3,224,224]" --verbose
The following log output shows that inference ran successfully with both backends:
......
| Completed 1 iteration(s) in 1958 ms | Average inference time: 1958 ms.
......
| Completed 1 iteration(s) in 120.1 ms | Average inference time: 120.1 ms.
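The two timing lines above can be pulled out of a saved log and compared with a few lines of Python. The sketch below parses exactly the two lines shown. The regular expression is written against this log format and is an assumption about its stability across Polygraphy versions. Each figure comes from a single iteration, so it probably includes one-time start-up cost, and the ratio is only a rough indication.

```python
import re

LOG = """\
| Completed 1 iteration(s) in 1958 ms | Average inference time: 1958 ms.
| Completed 1 iteration(s) in 120.1 ms | Average inference time: 120.1 ms.
"""

PATTERN = re.compile(
    r"Completed (\d+) iteration\(s\) in ([\d.]+) ms \| "
    r"Average inference time: ([\d.]+) ms"
)


def average_times_ms(log_text):
    """Return the average inference time (ms) from each matching log line."""
    return [float(match.group(3)) for match in PATTERN.finditer(log_text)]


onnxrt_ms, trt_ms = average_times_ms(LOG)
ratio = onnxrt_ms / trt_ms
print(f"ONNX Runtime: {onnxrt_ms} ms, TensorRT: {trt_ms} ms, ratio: {ratio:.1f}x")
```

For a fairer comparison, both runs would need warm-up and multiple iterations before the averages are compared.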
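Example 2: Comparing ONNX and TensorRT outputs (commonly used)
Polygraphy can also do the job of trtexec: it can build a TensorRT engine from an ONNX model, and it can compare the results of the two backends layer by layer.
Here atol is the absolute tolerance and rtol is the relative tolerance.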
polygraphy run resnet50_bs_1.onnx --onnxrt --trt ^
--save-engine=resnet50_bs_1_fp32_polygraphy.engine ^
--onnx-outputs mark all --trt-outputs mark all ^
--input-shapes "input:[1,3,224,224]" ^
--atol 1e-3 --rtol 1e-3 --verbose > onnx-trt-compare.log
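A similar comparison can be driven from Polygraphy's Python API. The sketch below is modeled on the API usage in Polygraphy's own examples, and it is an outline rather than a drop-in equivalent of the command above: it compares only the model's final outputs (it does not mark all layers as outputs), and the function name `compare_onnx_and_trt` is made up for illustration. It needs polygraphy, onnxruntime, and tensorrt installed, plus a GPU.

```python
def compare_onnx_and_trt(onnx_path, engine_path, atol=1e-3, rtol=1e-3):
    """Build a TensorRT engine from an ONNX file, run both backends, compare outputs."""
    # Imported lazily so this file can be loaded without Polygraphy installed.
    from polygraphy.backend.onnxrt import OnnxrtRunner, SessionFromOnnx
    from polygraphy.backend.trt import (
        EngineFromNetwork,
        NetworkFromOnnxPath,
        SaveEngine,
        TrtRunner,
    )
    from polygraphy.comparator import Comparator, CompareFunc

    # Loaders are lazy: the engine is built (and saved) when the runner activates.
    build_engine = SaveEngine(
        EngineFromNetwork(NetworkFromOnnxPath(onnx_path)), path=engine_path
    )
    runners = [
        OnnxrtRunner(SessionFromOnnx(onnx_path)),
        TrtRunner(build_engine),
    ]

    # With no data loader given, Polygraphy generates synthetic input data.
    run_results = Comparator.run(runners)
    accuracy = Comparator.compare_accuracy(
        run_results, compare_func=CompareFunc.simple(atol=atol, rtol=rtol)
    )
    return bool(accuracy)
```

A call such as `compare_onnx_and_trt("resnet50_bs_1.onnx", "resnet50_bs_1_fp32_polygraphy.engine")` would return True when all compared outputs are within tolerance.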
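The output log looks like this:
For each layer, the log prints a histogram of the ONNX Runtime output, a histogram of the TensorRT output, a histogram of the absolute difference, and a histogram of the relative difference.
The final summary reports whether every marked output is within the atol and rtol tolerances, along with the pass rate over iterations. In this example, Pass Rate: 100.0%.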
[I] Comparing Output: 'input.4' (dtype=float32, shape=(1, 64, 112, 112)) with 'input.4' (dtype=float32, shape=(1, 64, 112, 112))
[I] Tolerance: [abs=0.001, rel=0.001] | Checking elemwise error
[I] onnxrt-runner-N0-08/20/23-23:08:36: input.4 | Stats: mean=0.20451, std-dev=0.31748, var=0.10079, median=0.19194, min=-1.3369 at (0, 33, 13, 104), max=2.0691 at (0, 33, 64, 77), avg-magnitude=0.29563
[V] ---- Histogram ----
Bin Range | Num Elems | Visualization
(-1.34 , -0.996) | 29 |
(-0.996, -0.656) | 11765 | #
(-0.656, -0.315) | 31237 | ###
(-0.315, 0.0255) | 140832 | #############
(0.0255, 0.366 ) | 411249 | ########################################
(0.366 , 0.707 ) | 166138 | ################
(0.707 , 1.05 ) | 40956 | ###
(1.05 , 1.39 ) | 551 |
(1.39 , 1.73 ) | 54 |
(1.73 , 2.07 ) | 5 |
[I] trt-runner-N0-08/20/23-23:08:36: input.4 | Stats: mean=0.20451, std-dev=0.31748, var=0.10079, median=0.19194, min=-1.3369 at (0, 33, 13, 104), max=2.0691 at (0, 33, 64, 77), avg-magnitude=0.29563
[V] ---- Histogram ----
Bin Range | Num Elems | Visualization
(-1.34 , -0.996) | 29 |
(-0.996, -0.656) | 11765 | #
(-0.656, -0.315) | 31237 | ###
(-0.315, 0.0255) | 140832 | #############
(0.0255, 0.366 ) | 411249 | ########################################
(0.366 , 0.707 ) | 166138 | ################
(0.707 , 1.05 ) | 40956 | ###
(1.05 , 1.39 ) | 551 |
(1.39 , 1.73 ) | 54 |
(1.73 , 2.07 ) | 5 |
[I] Error Metrics: input.4
[I] Minimum Required Tolerance: elemwise error | [abs=8.3447e-07] OR [rel=0.037037] (requirements may be lower if both abs/rel tolerances are set)
[I] Absolute Difference | Stats: mean=2.6075e-08, std-dev=3.2558e-08, var=1.06e-15, median=1.4901e-08, min=0 at (0, 0, 0, 3), max=8.3447e-07 at (0, 33, 86, 43), avg-magnitude=2.6075e-08
[V] ---- Histogram ----
Bin Range | Num Elems | Visualization
(0 , 8.34e-08) | 757931 | ########################################
(8.34e-08, 1.67e-07) | 40058 | ##
(1.67e-07, 2.5e-07 ) | 4334 |
(2.5e-07 , 3.34e-07) | 249 |
(3.34e-07, 4.17e-07) | 181 |
(4.17e-07, 5.01e-07) | 53 |
(5.01e-07, 5.84e-07) | 0 |
(5.84e-07, 6.68e-07) | 8 |
(6.68e-07, 7.51e-07) | 1 |
(7.51e-07, 8.34e-07) | 1 |
[I] Relative Difference | Stats: mean=6.039e-07, std-dev=5.4597e-05, var=2.9809e-09, median=8.7838e-08, min=0 at (0, 0, 0, 3), max=0.037037 at (0, 4, 15, 12), avg-magnitude=6.039e-07
[V] ---- Histogram ----
Bin Range | Num Elems | Visualization
(0 , 0.0037 ) | 802806 | ########################################
(0.0037 , 0.00741) | 7 |
(0.00741, 0.0111 ) | 1 |
(0.0111 , 0.0148 ) | 0 |
(0.0148 , 0.0185 ) | 0 |
(0.0185 , 0.0222 ) | 0 |
(0.0222 , 0.0259 ) | 1 |
(0.0259 , 0.0296 ) | 0 |
(0.0296 , 0.0333 ) | 0 |
(0.0333 , 0.037 ) | 1 |
[I] PASSED | Output: 'input.4' | Difference is within tolerance (rel=0.001, abs=0.001)
[I] PASSED | All outputs matched | Outputs: ['input.4', 'onnx::MaxPool_323', 'input.8', 'input.16', 'onnx::Conv_327', 'input.24', 'onnx::Conv_330', 'onnx::Add_505', 'onnx::Add_508', 'onnx::Relu_335', 'input.36', 'input.44', 'onnx::Conv_339', 'input.52', 'onnx::Conv_342', 'onnx::Add_517', 'onnx::Relu_345', 'input.60', 'input.68', 'onnx::Conv_349', 'input.76', 'onnx::Conv_352', 'onnx::Add_526', 'onnx::Relu_355', 'input.84', 'input.92', 'onnx::Conv_359', 'input.100', 'onnx::Conv_362', 'onnx::Add_535', 'onnx::Add_538', 'onnx::Relu_367', 'input.112', 'input.120', 'onnx::Conv_371', 'input.128', 'onnx::Conv_374', 'onnx::Add_547', 'onnx::Relu_377', 'input.136', 'input.144', 'onnx::Conv_381', 'input.152', 'onnx::Conv_384', 'onnx::Add_556', 'onnx::Relu_387', 'input.160', 'input.168', 'onnx::Conv_391', 'input.176', 'onnx::Conv_394', 'onnx::Add_565', 'onnx::Relu_397', 'input.184', 'input.192', 'onnx::Conv_401', 'input.200', 'onnx::Conv_404', 'onnx::Add_574', 'onnx::Add_577', 'onnx::Relu_409', 'input.212', 'input.220', 'onnx::Conv_413', 'input.228', 'onnx::Conv_416', 'onnx::Add_586', 'onnx::Relu_419', 'input.236', 'input.244', 'onnx::Conv_423', 'input.252', 'onnx::Conv_426', 'onnx::Add_595', 'onnx::Relu_429', 'input.260', 'input.268', 'onnx::Conv_433', 'input.276', 'onnx::Conv_436', 'onnx::Add_604', 'onnx::Relu_439', 'input.284', 'input.292', 'onnx::Conv_443', 'input.300', 'onnx::Conv_446', 'onnx::Add_613', 'onnx::Relu_449', 'input.308', 'input.316', 'onnx::Conv_453', 'input.324', 'onnx::Conv_456', 'onnx::Add_622', 'onnx::Relu_459', 'input.332', 'input.340', 'onnx::Conv_463', 'input.348', 'onnx::Conv_466', 'onnx::Add_631', 'onnx::Add_634', 'onnx::Relu_471', 'input.360', 'input.368', 'onnx::Conv_475', 'input.376', 'onnx::Conv_478', 'onnx::Add_643', 'onnx::Relu_481', 'input.384', 'input.392', 'onnx::Conv_485', 'input.400', 'onnx::Conv_488', 'onnx::Add_652', 'onnx::Relu_491', 'input.408', 'onnx::Flatten_493', 'onnx::Gemm_494', 'output']
[I] Accuracy Summary | onnxrt-runner-N0-08/20/23-23:08:36 vs. trt-runner-N0-08/20/23-23:08:36 | Passed: 1/1 iterations | Pass Rate: 100.0%
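One detail in the log above is worth a closer look: for 'input.4' the maximum relative difference is 0.037037, well above rtol=0.001, yet the output still passes. The "Minimum Required Tolerance" line hints at why, since it reads abs OR rel. The sketch below illustrates one reading of that rule, namely that an element fails only when it exceeds both tolerances. This is an assumption made for illustration and not Polygraphy's actual implementation. The sample values are made up.

```python
import math


def elementwise_pass(reference, candidate, atol, rtol):
    """Return True unless some element exceeds BOTH the absolute and relative tolerance."""
    for ref, cand in zip(reference, candidate):
        abs_diff = abs(cand - ref)
        if abs_diff == 0:
            continue  # identical values always pass
        # Relative difference is taken against the reference value (an assumption).
        rel_diff = abs_diff / abs(ref) if ref != 0 else math.inf
        if abs_diff > atol and rel_diff > rtol:
            return False
    return True


# Tiny values: the absolute difference is about 1e-06, the relative one about 4.5%.
reference = [2.2e-05, 0.5, -1.0]
candidate = [2.3e-05, 0.5, -1.0]

print(elementwise_pass(reference, candidate, atol=1e-3, rtol=1e-3))  # True
print(elementwise_pass(reference, candidate, atol=0.0, rtol=1e-3))   # False
```

Under this reading, large relative errors on values close to zero are absorbed by atol, which matches the PASSED result in the log.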
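For more usage examples, the cookbook on GitHub is recommended reading.
Summary
This section introduced Nsight Systems and Polygraphy. There is much more to explore across the full model deployment workflow, and the tools directory of the TensorRT GitHub repository is a good place to look.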