Series complete <<AI Artificial Intelligence PyTorch Self-Study>> 12.9 TRT Python Engineering - HCHUNGW's Blog

12.9 TRT Python Engineering

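Preface

In production applications, the TRT engine needs to be wrapped in a class that internally handles engine initialization, allocation and release of inference memory, and data transfer between host and device, while exposing an inference interface to the outside.

Only then can the inference functionality be embedded cleanly into the code of the overall production pipeline and called by upstream and downstream components.

Drawing on several code examples, this section summarizes how to write a TRT model class based on context.execute_async_v3.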

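Two inference approaches: context

When TRT inference in Python was introduced earlier, inference was executed with execute_async_v3, whereas many tutorials and examples use execute_v2. The two calls take different arguments, so code written for one cannot be reused directly with the other.

To understand the difference between execute_v2 and execute_async_v3, the following will be covered:

The differences between the two
An analysis of how execute_v2 works, based on the YOLOv5 code and the official Nvidia code

Differences between the two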

def execute_v2(self, bindings, p_int=None): # real signature unknown; restored from __doc__
    """
    execute_v2(self: tensorrt.tensorrt.IExecutionContext, bindings: List[int]) -> bool

    Synchronously execute inference on a batch.

    This method requires a array of input and output buffers. The mapping from tensor names to indices can be queried using :func:`ICudaEngine.get_binding_index()` .

    This method only works for execution contexts built from networks with no implicit batch dimension.

    :arg bindings: A list of integers representing input and output buffer addresses for the network.

    :returns: True if execution succeeded.
    """

def execute_async_v3(self, stream_handle): # real signature unknown; restored from __doc__
    """
    execute_async_v3(self: tensorrt.tensorrt.IExecutionContext, stream_handle: int) -> bool

    Asynchronously execute inference.

    Modifying or releasing memory that has been registered for the tensors before stream synchronization or the event passed to :func:`set_input_consumed_event` has been triggered results in undefined behavior.

    Input tensors can be released after the :func:`set_input_consumed_event` whereas output tensors require stream synchronization.

    :arg stream_handle: The cuda stream on which the inference kernels will be enqueued.
    """

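From the docstring, execute_v2 takes device addresses, which means the data must already have been moved to the GPU. Calling execute_v2 only reads that GPU memory and performs the computation.

From the docstring, execute_async_v3 takes only a CUDA stream handle, and the inference kernels are enqueued on that stream. The handle is an integer identifying the CUDA stream chosen by the caller, not a stream that TRT picks by default (the finer points of stream semantics require deeper knowledge of CUDA programming). Because no buffer addresses are passed in the call itself, the context must be told in advance where to read its input and write its output, hence the extra call context.set_tensor_address(l_tensor_name[0], d_input), where d_input is a device memory address.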
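The two calling conventions described above can be put side by side in a minimal sketch. The three methods called on the context (execute_v2, set_tensor_address, execute_async_v3) are the ones named in this section; RecordingContext, the tensor names, and the addresses are hypothetical stand-ins so that the sketch runs without TensorRT or a GPU.

```python
def run_with_execute_v2(context, device_ptrs):
    """Synchronous style: every buffer address is passed in the call itself."""
    return context.execute_v2(list(device_ptrs))


def run_with_execute_async_v3(context, named_ptrs, stream_handle):
    """Asynchronous style: register each tensor address, then enqueue on a stream."""
    for name, ptr in named_ptrs.items():
        context.set_tensor_address(name, ptr)
    return context.execute_async_v3(stream_handle)


class RecordingContext:
    """Hypothetical stand-in for the execution context; it only records calls."""

    def __init__(self):
        self.calls = []

    def execute_v2(self, bindings):
        self.calls.append(("execute_v2", bindings))
        return True

    def set_tensor_address(self, name, ptr):
        self.calls.append(("set_tensor_address", name, ptr))
        return True

    def execute_async_v3(self, stream_handle):
        self.calls.append(("execute_async_v3", stream_handle))
        return True


ctx = RecordingContext()
# Made-up integers play the role of device addresses and of the stream handle.
run_with_execute_v2(ctx, [0x1000, 0x2000])
run_with_execute_async_v3(ctx, {"input": 0x1000, "output": 0x2000}, stream_handle=0x3000)
for call in ctx.calls:
    print(call)
```

With a real context, the asynchronous path also needs a stream synchronization before the output buffers are read, as the docstring above states.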

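YOLOv5 TRT inference

For execute_async_v3, see the workflow walkthrough in Section 2 of this chapter.

To understand the steps required for inference with execute_v2, the official YOLOv5 code is analyzed here.

YOLOv5 supports inference with multiple backends, and the linked page gives the detailed steps, so they are not repeated here.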

# 1. Export the TRT model
python export.py --weights best.pt --data data/mydrone.yaml --include engine --device 0

# 2. Run inference
python detect.py --weights best.engine --data data/mydrone.yaml --source G:\虎門大橋車流\DJI_0049.MP4

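The core code is in detect.py and common.py, which implement engine inference based on TRT. The focus here is on two parts: model initialization and model inference.

Core code for model initialization:

The elif engine branch of DetectMultiBackend in yolov5-master/models/common.py:

Input and output data are managed through the bindings dictionary, which records each entry's 'name', 'dtype', 'shape', 'data', and 'ptr'.
The address ptr is an address on the device (GPU). The data is manipulated with the torch library, which also moves it to the GPU. This differs from the hand-written inference shown earlier.
execute_v2 needs exactly these GPU addresses, so the address of the data on the GPU can be obtained with torch's to(device) method followed by data_ptr().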

# Excerpt: w (engine file path) and device come from the enclosing method.
from collections import OrderedDict, namedtuple

import numpy as np
import tensorrt as trt
import torch

Binding = namedtuple('Binding', ('name', 'dtype', 'shape', 'data', 'ptr'))
logger = trt.Logger(trt.Logger.INFO)
with open(w, 'rb') as f, trt.Runtime(logger) as runtime:
    model = runtime.deserialize_cuda_engine(f.read())
context = model.create_execution_context()
bindings = OrderedDict()
output_names = []
for i in range(model.num_bindings):
    name = model.get_binding_name(i)
    dtype = trt.nptype(model.get_binding_dtype(i))
    if model.binding_is_input(i):
        pass  # omitted
    else:
        output_names.append(name)
    shape = tuple(context.get_binding_shape(i))
    # Allocate a tensor of the binding's shape on the GPU and record its address
    im = torch.from_numpy(np.empty(shape, dtype=dtype)).to(device)
    bindings[name] = Binding(name, dtype, shape, im, int(im.data_ptr()))
binding_addrs = OrderedDict((n, d.ptr) for n, d in bindings.items())
batch_size = bindings['images'].shape[0]  # if dynamic, this is instead max batch size

Core code for model inference

The elif self.engine branch of the forward function of DetectMultiBackend in yolov5-master/models/common.py:

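# Set the GPU address of the input data; im is already a tensor on the GPU
self.binding_addrs['images'] = int(im.data_ptr())
# Pass the GPU addresses of all bindings and run inference
self.context.execute_v2(list(self.binding_addrs.values()))
# After inference, read the results from the output tensors (already on the GPU)
y = [self.bindings[x].data for x in sorted(self.output_names)]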

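Summary: YOLOv5's TRT inference relies on the torch library to link data on the CPU with data on the GPU.

This approach still depends on PyTorch, so it is worth examining how to run inference with execute_v2 when GPU memory is managed manually. The difficulty lies in obtaining the addresses of GPU memory.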
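The bookkeeping in the YOLOv5 code above can be sketched without torch or TensorRT. The Binding tuple and the 'images' input name come from the excerpt; the output name "output0", the shapes, and the addresses are made-up placeholders, and addresses_for is a hypothetical helper. The point it illustrates is that an OrderedDict keeps binding order, so swapping in a new input address still produces the address list in the order execute_v2 expects.

```python
from collections import OrderedDict, namedtuple

Binding = namedtuple("Binding", ("name", "dtype", "shape", "data", "ptr"))

# Made-up integers stand in for int(tensor.data_ptr()) on real GPU tensors.
bindings = OrderedDict()
bindings["images"] = Binding("images", "float32", (1, 3, 640, 640), None, 0xA000)
bindings["output0"] = Binding("output0", "float32", (1, 25200, 85), None, 0xB000)

# name -> device address, in binding order
binding_addrs = OrderedDict((name, b.ptr) for name, b in bindings.items())


def addresses_for(binding_addrs, input_name, input_ptr):
    """Point the input binding at a new buffer and return the ordered address list."""
    binding_addrs[input_name] = input_ptr  # updating an existing key keeps its position
    return list(binding_addrs.values())


addrs = addresses_for(binding_addrs, "images", 0xC000)
print([hex(a) for a in addrs])  # ['0xc000', '0xb000']
```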

Nvidia official sample code

The code is at: https://github.com/NVIDIA/TensorRT/blob/78245b0ac2af9a208ed02e5257bfd3ee7ae8a88d/samples/python/detectron2/infer.py

Only how the data passed to execute_v2 is obtained is examined here, so the code is read backwards.

1. In the infer function:

self.context.execute_v2(self.allocations)

2. Look at how self.allocations is defined in __init__:

shape = self.engine.get_binding_shape(i)
size = np.dtype(trt.nptype(dtype)).itemsize  # bytes per element
for s in shape:
    size *= s  # multiply by every dimension to get the total bytes
allocation = common.cuda_call(cudart.cudaMalloc(size))  # device address
self.allocations.append(allocation)

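This shows that the addresses passed to execute_v2 are obtained through cudart.cudaMalloc(size), the same way GPU addresses were obtained in Section 2.

size is the product of the shape's dimensions and the size of a single element.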
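The size calculation can be checked with a small standalone sketch. The ITEMSIZE table and the buffer_nbytes helper are hypothetical names introduced here for illustration; in the sample above the element size comes from np.dtype(...).itemsize instead. The guard against negative dimensions is an added assumption: a dynamic shape reports -1 for unresolved dimensions, and multiplying through would give a meaningless size.

```python
import math

# Bytes per element for a few common dtypes (illustrative table, not from TensorRT).
ITEMSIZE = {"float32": 4, "float16": 2, "int32": 4, "int8": 1}


def buffer_nbytes(shape, dtype_name):
    """Return the number of bytes a tensor of this shape and dtype occupies."""
    if any(dim < 0 for dim in shape):
        raise ValueError("dynamic dimension: resolve the shape before allocating")
    return math.prod(shape) * ITEMSIZE[dtype_name]


print(buffer_nbytes((1, 3, 224, 224), "float32"))  # 602112
print(buffer_nbytes((1, 1000), "float32"))  # 4000
```

The returned value is what would be handed to cudaMalloc for the corresponding binding.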

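The analysis of these two open-source codebases shows that execute_v2 still requires GPU memory to be managed manually, so the inference class below is still written on top of execute_async_v3.

There are two reasons:

First, Section 2 of this chapter already used the execute_async_v3 workflow to explain how data is transferred between host and device.

Second, the official Nvidia tutorial, based on TRT 8.5, gives a good way of managing host and device data, with code that is concise and easy to follow.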

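Object-oriented design of the inference class

In an engineering setting, the algorithm module exposes an inference function for the main pipeline to call. The module is therefore wrapped in a class, with the related attributes and methods implemented inside it for easier management.

Inside the inference class, a context must be built before inference can run. All the steps needed to build the context are implemented in the __init__ function, which also configures model-related parameters such as class names, thresholds, and mean/std.

Based on the characteristics of the ResNet50 model, the UML class diagram of TensorRTInfer is designed as follows: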

[Figure: UML class diagram of TensorRTInfer]

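The class mainly exposes the inference interface inference, and around the inference process it implements preprocessing, TRT model inference, postprocessing, visualization, and TRT model initialization.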
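The overall shape of such a class can be sketched as below. This is not the accompanying code: TensorRTInferSketch, fake_engine, the class names, and the parameter values are all hypothetical, the engine is replaced by a plain callable so the sketch runs without TensorRT, and visualization is left out. It only illustrates how a single public inference method chains preprocessing, engine execution, and postprocessing.

```python
import math


class TensorRTInferSketch:
    """Hypothetical skeleton; the real class would hold a TRT engine and context."""

    def __init__(self, run_engine, class_names, mean, std, threshold=0.5):
        # A real __init__ would deserialize the engine, create the context,
        # and allocate host/device buffers. Here the engine is a callable.
        self.run_engine = run_engine
        self.class_names = class_names
        self.mean = mean
        self.std = std
        self.threshold = threshold

    def preprocess(self, pixels):
        """Normalize a flat list of values with mean/std."""
        return [(p - self.mean) / self.std for p in pixels]

    def postprocess(self, logits):
        """Softmax, then (class name, probability), or None below the threshold."""
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] < self.threshold:
            return None
        return self.class_names[best], probs[best]

    def inference(self, pixels):
        """Public entry point: preprocess -> engine -> postprocess."""
        return self.postprocess(self.run_engine(self.preprocess(pixels)))


def fake_engine(values):
    """Stand-in model: two logits derived from the input sum."""
    total = sum(values)
    return [total, -total]


model = TensorRTInferSketch(fake_engine, ["cat", "dog"], mean=0.5, std=0.5)
result = model.inference([1.0, 1.0])
print(result[0], round(result[1], 3))  # cat 0.982
```

Keeping the engine behind one attribute means the caller only ever sees inference, which is the property the design above is after.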

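The complete accompanying code is linked here in the original post. The inference class implemented there already works as a standalone unit, but there is still room for optimization, including:

Batched inference, for higher throughput
Batch-wise preprocessing, to improve preprocessing efficiency
Running preprocessing on the GPU, to balance CPU and GPU load
A distributed queue mechanism that decouples task submission from task processing (see the celery library)
For a client/server architecture, the algorithm needs to be wrapped in a web service. Common choices are flask, fastapi, and django: flask or fastapi for ordinary projects, django for complex ones.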
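As an illustration of the first item in the list above, grouping inputs into batches can be sketched with a small generator. The batched helper and the frame names are hypothetical and not part of the accompanying code.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


frames = [f"frame_{i}" for i in range(5)]
batches = list(batched(frames, 2))
print(batches)  # [['frame_0', 'frame_1'], ['frame_2', 'frame_3'], ['frame_4']]
```

The last batch may be shorter than batch_size, so an engine built with a fixed batch dimension would need that batch padded before inference.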

More TRT model class references:

TRT official - simpledemo

TRT official - detectron2-infer

YOLOv6 official

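Common engine methods/attributes

The TRT model class above uses a number of engine methods and attributes. A brief summary:

engine.num_io_tensors: the number of input and output tensors
engine.get_tensor_name(index): returns the name of the tensor at the given index; this is the name set when exporting to ONNX
engine.get_tensor_mode(tensor_name): returns whether the named tensor is an input or an output.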
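The three members listed above are typically used together in one loop. The sketch below assumes only those three members; FakeEngine, the tensor names, and the string mode values are hypothetical stand-ins so the loop runs without TensorRT. With a real engine, input_mode would be trt.TensorIOMode.INPUT.

```python
def split_io_tensor_names(engine, input_mode):
    """Walk all I/O tensors and separate input names from output names."""
    inputs, outputs = [], []
    for index in range(engine.num_io_tensors):
        name = engine.get_tensor_name(index)
        if engine.get_tensor_mode(name) == input_mode:
            inputs.append(name)
        else:
            outputs.append(name)
    return inputs, outputs


class FakeEngine:
    """Hypothetical stand-in exposing only the three members listed above."""

    def __init__(self, tensors):
        self._tensors = tensors  # list of (name, mode) pairs
        self.num_io_tensors = len(tensors)

    def get_tensor_name(self, index):
        return self._tensors[index][0]

    def get_tensor_mode(self, name):
        return dict(self._tensors)[name]


engine = FakeEngine([("input", "INPUT"), ("output", "OUTPUT")])
names = split_io_tensor_names(engine, input_mode="INPUT")
print(names)  # (['input'], ['output'])
```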

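Summary

This section covered how to write TRT inference code in Python, including a comparison of execute_v2 and execute_async_v3, and the design and implementation of an inference class.

This concludes the study and use of TensorRT for now. With the content of this chapter, a PyTorch model can be converted to a TRT model and run efficiently in Python, and the common TensorRT tools should be familiar.

For typical scenarios, the TensorRT tools and quantization material in this chapter are sufficient, but edge and high-performance deployments require deeper knowledge of model acceleration.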

HCHUNGW
