猪老大要进步!

TensorRT: Using It on Windows

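The previous post covered how to set up TensorRT on Windows 10. This post records how to convert the TensorFlow model (.tf) of a very simple super-resolution network, SRCNN, into a TensorRT engine file, and then how to run inference with TensorRT.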

Model format conversion: .tf -> .onnx

  1. Install tf2onnx and onnxruntime

pip install onnxruntime
pip install git+https://github.com/onnx/tensorflow-onnx

  2. Conversion command

python -m tf2onnx.convert --saved-model ./checkpoints/yolov4.tf --output model.onnx --opset 11 --verbose

The ONNX model is generated successfully:

(base) C:\Users\11197\Desktop\vitsr\models>python -m tf2onnx.convert --saved-model vitsr_4x.tf --output model.onnx --opset 11 --verbose
2022-05-31 11:44:25.907286: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
C:\Users\11197\Miniconda3\lib\runpy.py:127: RuntimeWarning: 'tf2onnx.convert' found in sys.modules after import of package 'tf2onnx', but prior to execution of 'tf2onnx.convert'; this may result in unpredictable behaviour
warn(RuntimeWarning(msg))
2022-05-31 11:44:27,590 - WARNING - tf2onnx: ***IMPORTANT*** Installed protobuf is not cpp accelerated. Conversion will be extremely slow. See https://github.com/onnx/tensorflow-onnx/issues/1557
2022-05-31 11:44:27.592219: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library nvcuda.dll
2022-05-31 11:44:27.605153: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 3070 Laptop GPU computeCapability: 8.6
coreClock: 1.56GHz coreCount: 40 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2022-05-31 11:44:27.605279: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
2022-05-31 11:44:27.612433: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublas64_11.dll
2022-05-31 11:44:27.612553: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublasLt64_11.dll
2022-05-31 11:44:27.615466: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cufft64_10.dll
2022-05-31 11:44:27.616751: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library curand64_10.dll
2022-05-31 11:44:27.619042: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cusolver64_11.dll
2022-05-31 11:44:27.621767: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cusparse64_11.dll
2022-05-31 11:44:27.622415: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudnn64_8.dll
2022-05-31 11:44:27.622605: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2022-05-31 11:44:27.623070: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-05-31 11:44:27.623904: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 3070 Laptop GPU computeCapability: 8.6
coreClock: 1.56GHz coreCount: 40 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2022-05-31 11:44:27.624021: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2022-05-31 11:44:27.951984: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-05-31 11:44:27.952142: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2022-05-31 11:44:27.952264: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2022-05-31 11:44:27.952483: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5484 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3070 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6)
2022-05-31 11:44:27,953 - WARNING - tf2onnx.tf_loader: '--tag' not specified for saved_model. Using --tag serve
2022-05-31 11:44:36,348 - INFO - tf2onnx.tf_loader: Signatures found in model: [serving_default].
2022-05-31 11:44:36,348 - WARNING - tf2onnx.tf_loader: '--signature_def' not specified, using first signature: serving_default
2022-05-31 11:44:36,348 - INFO - tf2onnx.tf_loader: Output names: ['output_1']
2022-05-31 11:44:36.633737: I tensorflow/core/grappler/devices.cc:69] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 1
2022-05-31 11:44:36.633977: I tensorflow/core/grappler/clusters/single_machine.cc:357] Starting new session
2022-05-31 11:44:36.635395: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 3070 Laptop GPU computeCapability: 8.6
coreClock: 1.56GHz coreCount: 40 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2022-05-31 11:44:36.635531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2022-05-31 11:44:36.635679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-05-31 11:44:36.635805: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2022-05-31 11:44:36.635919: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2022-05-31 11:44:36.636120: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5484 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3070 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6)
2022-05-31 11:44:36.699775: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:1144] Optimization results for grappler item: graph_to_optimize
function_optimizer: Graph size after: 702 nodes (567), 1002 edges (867), time = 11.066ms.
function_optimizer: function_optimizer did nothing. time = 0.291ms.

2022-05-31 11:44:37.342929: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 3070 Laptop GPU computeCapability: 8.6
coreClock: 1.56GHz coreCount: 40 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2022-05-31 11:44:37.343116: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2022-05-31 11:44:37.343250: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-05-31 11:44:37.343378: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2022-05-31 11:44:37.343482: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2022-05-31 11:44:37.343648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5484 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3070 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6)
WARNING:tensorflow:From C:\Users\11197\Miniconda3\lib\site-packages\tf2onnx\tf_loader.py:711: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.graph_util.extract_sub_graph`
2022-05-31 11:44:37,499 - WARNING - tensorflow: From C:\Users\11197\Miniconda3\lib\site-packages\tf2onnx\tf_loader.py:711: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.graph_util.extract_sub_graph`
2022-05-31 11:44:37.851693: I tensorflow/core/grappler/devices.cc:69] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 1
2022-05-31 11:44:37.851885: I tensorflow/core/grappler/clusters/single_machine.cc:357] Starting new session
2022-05-31 11:44:37.852918: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 3070 Laptop GPU computeCapability: 8.6
coreClock: 1.56GHz coreCount: 40 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2022-05-31 11:44:37.853025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2022-05-31 11:44:37.853114: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-05-31 11:44:37.853190: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2022-05-31 11:44:37.853276: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2022-05-31 11:44:37.853433: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5484 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3070 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6)
2022-05-31 11:44:37.999328: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:1144] Optimization results for grappler item: graph_to_optimize
constant_folding: Graph size after: 348 nodes (-354), 538 edges (-464), time = 22.997ms.
function_optimizer: function_optimizer did nothing. time = 0.407ms.
constant_folding: Graph size after: 348 nodes (0), 538 edges (0), time = 4.128ms.
function_optimizer: function_optimizer did nothing. time = 0.246ms.

2022-05-31 11:44:39,646 - INFO - tf2onnx: inputs: ['input_1:0']
2022-05-31 11:44:39,646 - INFO - tf2onnx: outputs: ['Identity:0']
2022-05-31 11:44:39,894 - INFO - tf2onnx.tfonnx: Using tensorflow=2.5.0, onnx=1.11.0, tf2onnx=1.10.0/16eb4b
2022-05-31 11:44:39,895 - INFO - tf2onnx.tfonnx: Using opset <onnx, 11>
2022-05-31 11:44:48,170 - INFO - tf2onnx.tf_utils: Computed 0 values for constant folding
2022-05-31 11:44:54,755 - VERBOSE - tf2onnx.tfonnx: Mapping TF node to ONNX node(s)
2022-05-31 11:44:54,810 - VERBOSE - tf2onnx.tfonnx: Summay Stats:
tensorflow ops: Counter({'Const': 199, 'Mul': 26, 'AddV2': 25, 'Conv3D': 22, 'BiasAdd': 22, 'Relu': 22, 'ConcatV2': 7, 'Squeeze': 6, 'Identity': 5, 'StridedSlice': 4, 'DepthToSpace': 3, 'Split': 2, 'Softmax': 2, 'Placeholder': 1, 'NoOp': 1, 'ResizeBilinear': 1, 'Pad': 1})
tensorflow attr: Counter({'dtype': 200, 'value': 199, 'data_format': 47, 'dilations': 22, 'padding': 22, 'strides': 22, 'N': 7, 'Tidx': 7, 'squeeze_dims': 6, 'begin_mask': 4, 'ellipsis_mask': 4, 'end_mask': 4, 'new_axis_mask': 4, 'shrink_axis_mask': 4, 'block_size': 3, 'num_split': 2, 'shape': 1, 'align_corners': 1, 'half_pixel_centers': 1})
onnx mapped: Counter({'Const': 111, 'Mul': 26, 'AddV2': 25, 'Conv3D': 22, 'BiasAdd': 22, 'Relu': 22, 'ConcatV2': 7, 'Squeeze': 6, 'Identity': 4, 'StridedSlice': 4, 'DepthToSpace': 3, 'Split': 2, 'Softmax': 2, 'Placeholder': 1, 'ResizeBilinear': 1, 'Pad': 1})
onnx unmapped: Counter()
2022-05-31 11:44:54,811 - INFO - tf2onnx.optimizer: Optimizing ONNX model
2022-05-31 11:44:54,811 - VERBOSE - tf2onnx.optimizer: Apply optimize_transpose
2022-05-31 11:44:54,969 - VERBOSE - tf2onnx.optimizer.TransposeOptimizer: Add -22 (47->25), Const +23 (128->151), Identity -3 (5->2), Reshape +45 (0->45), Transpose -44 (52->8)
2022-05-31 11:44:54,970 - VERBOSE - tf2onnx.optimizer: Apply remove_redundant_upsample
2022-05-31 11:44:54,991 - VERBOSE - tf2onnx.optimizer.UpsampleOptimizer: no change
2022-05-31 11:44:54,992 - VERBOSE - tf2onnx.optimizer: Apply fold_constants
2022-05-31 11:44:55,022 - VERBOSE - tf2onnx.optimizer.ConstFoldOptimizer: Cast -1 (1->0), Const -43 (151->108), Reshape -43 (45->2), Transpose -1 (8->7)
2022-05-31 11:44:55,023 - VERBOSE - tf2onnx.optimizer: Apply const_dequantize_optimizer
2022-05-31 11:44:55,039 - VERBOSE - tf2onnx.optimizer.ConstDequantizeOptimizer: no change
2022-05-31 11:44:55,039 - VERBOSE - tf2onnx.optimizer: Apply loop_optimizer
2022-05-31 11:44:55,056 - VERBOSE - tf2onnx.optimizer.LoopOptimizer: no change
2022-05-31 11:44:55,056 - VERBOSE - tf2onnx.optimizer: Apply merge_duplication
2022-05-31 11:44:55,075 - VERBOSE - tf2onnx.optimizer.MergeDuplicatedNodesOptimizer: Const -3 (108->105)
2022-05-31 11:44:55,076 - VERBOSE - tf2onnx.optimizer: Apply reshape_optimizer
2022-05-31 11:44:55,092 - VERBOSE - tf2onnx.optimizer.ReshapeOptimizer: no change
2022-05-31 11:44:55,093 - VERBOSE - tf2onnx.optimizer: Apply global_pool_optimizer
2022-05-31 11:44:55,109 - VERBOSE - tf2onnx.optimizer.GlobalPoolOptimizer: no change
2022-05-31 11:44:55,109 - VERBOSE - tf2onnx.optimizer: Apply q_dq_optimizer
2022-05-31 11:44:55,127 - VERBOSE - tf2onnx.optimizer.QDQOptimizer: no change
2022-05-31 11:44:55,127 - VERBOSE - tf2onnx.optimizer: Apply remove_identity
2022-05-31 11:44:55,143 - VERBOSE - tf2onnx.optimizer.IdentityOptimizer: Identity -2 (2->0)
2022-05-31 11:44:55,143 - VERBOSE - tf2onnx.optimizer: Apply remove_back_to_back
2022-05-31 11:44:55,159 - VERBOSE - tf2onnx.optimizer.BackToBackOptimizer: no change
2022-05-31 11:44:55,159 - VERBOSE - tf2onnx.optimizer: Apply einsum_optimizer
2022-05-31 11:44:55,176 - VERBOSE - tf2onnx.optimizer.EinsumOptimizer: no change
2022-05-31 11:44:55,176 - VERBOSE - tf2onnx.optimizer: Apply optimize_transpose
2022-05-31 11:44:55,196 - VERBOSE - tf2onnx.optimizer.TransposeOptimizer: no change
2022-05-31 11:44:55,197 - VERBOSE - tf2onnx.optimizer: Apply remove_redundant_upsample
2022-05-31 11:44:55,213 - VERBOSE - tf2onnx.optimizer.UpsampleOptimizer: no change
2022-05-31 11:44:55,214 - VERBOSE - tf2onnx.optimizer: Apply fold_constants
2022-05-31 11:44:55,756 - VERBOSE - tf2onnx.optimizer.ConstFoldOptimizer: no change
2022-05-31 11:44:55,757 - VERBOSE - tf2onnx.optimizer: Apply const_dequantize_optimizer
2022-05-31 11:44:55,773 - VERBOSE - tf2onnx.optimizer.ConstDequantizeOptimizer: no change
2022-05-31 11:44:55,773 - VERBOSE - tf2onnx.optimizer: Apply loop_optimizer
2022-05-31 11:44:55,789 - VERBOSE - tf2onnx.optimizer.LoopOptimizer: no change
2022-05-31 11:44:55,790 - VERBOSE - tf2onnx.optimizer: Apply merge_duplication
2022-05-31 11:44:55,807 - VERBOSE - tf2onnx.optimizer.MergeDuplicatedNodesOptimizer: no change
2022-05-31 11:44:55,807 - VERBOSE - tf2onnx.optimizer: Apply reshape_optimizer
2022-05-31 11:44:55,824 - VERBOSE - tf2onnx.optimizer.ReshapeOptimizer: no change
2022-05-31 11:44:55,824 - VERBOSE - tf2onnx.optimizer: Apply global_pool_optimizer
2022-05-31 11:44:55,841 - VERBOSE - tf2onnx.optimizer.GlobalPoolOptimizer: no change
2022-05-31 11:44:55,842 - VERBOSE - tf2onnx.optimizer: Apply q_dq_optimizer
2022-05-31 11:44:55,858 - VERBOSE - tf2onnx.optimizer.QDQOptimizer: no change
2022-05-31 11:44:55,858 - VERBOSE - tf2onnx.optimizer: Apply remove_identity
2022-05-31 11:44:55,875 - VERBOSE - tf2onnx.optimizer.IdentityOptimizer: no change
2022-05-31 11:44:55,875 - VERBOSE - tf2onnx.optimizer: Apply remove_back_to_back
2022-05-31 11:44:55,892 - VERBOSE - tf2onnx.optimizer.BackToBackOptimizer: no change
2022-05-31 11:44:55,892 - VERBOSE - tf2onnx.optimizer: Apply einsum_optimizer
2022-05-31 11:44:55,909 - VERBOSE - tf2onnx.optimizer.EinsumOptimizer: no change
2022-05-31 11:44:55,911 - INFO - tf2onnx.optimizer: After optimization: Add -22 (47->25), Cast -1 (1->0), Const -23 (128->105), Identity -5 (5->0), Reshape +2 (0->2), Transpose -45 (52->7)
2022-05-31 11:44:55,935 - INFO - tf2onnx:
2022-05-31 11:44:55,935 - INFO - tf2onnx: Successfully converted TensorFlow model vitsr_4x.tf to ONNX
2022-05-31 11:44:55,935 - INFO - tf2onnx: Model inputs: ['input_1']
2022-05-31 11:44:55,935 - INFO - tf2onnx: Model outputs: ['output_1']
2022-05-31 11:44:55,935 - INFO - tf2onnx: ONNX model is saved at model.onnx
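
Before handing model.onnx to TensorRT, it can help to run it once with onnxruntime, which was installed earlier but is not otherwise used in this post. The sketch below is not part of the original workflow: `concrete_shape`, `run_once`, the default size of 1 for unknown dimensions, and the float32 input type are all assumptions made for illustration. Dimensions that the exporter could not fix are reported as strings or None, which is consistent with the "dynamic or shape inputs" error that TensorRT reports later.

```python
def concrete_shape(shape, default=1):
    """Replace dynamic dimensions (None or symbolic strings) with a fixed size."""
    return tuple(d if isinstance(d, int) and d > 0 else default for d in shape)


def run_once(model_path, shape_overrides=None):
    """Run the ONNX model on random data and return {output_name: shape}.

    numpy and onnxruntime are imported lazily so that concrete_shape
    can be used without them. Inputs are assumed to be float32.
    """
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    feeds = {}
    for inp in session.get_inputs():
        shape = (shape_overrides or {}).get(inp.name, concrete_shape(inp.shape))
        print(inp.name, inp.shape, "->", shape)
        feeds[inp.name] = np.random.rand(*shape).astype(np.float32)
    outputs = session.run(None, feeds)
    return {o.name: out.shape for o, out in zip(session.get_outputs(), outputs)}


# A shape as the exporter might report it, with an unknown batch dimension.
print(concrete_shape(["unk__0", 75, 75, 3]))  # (1, 75, 75, 3)
```

Calling `run_once("model.onnx", {"input_1": (1, 75, 75, 3)})` on the converted model should then print each input's declared shape and return the output shape, which makes it easy to confirm the names and sizes before building the engine.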

Generating the engine file

Chapter 4 of the Developer Guide introduces the Python API and gives some basic usage:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
success = parser.parse_from_file("models/model.onnx")
for idx in range(parser.num_errors):
    print(parser.get_error(idx))
if not success:
    pass  # Error handling code here
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 20)  # 1 MiB
serialized_engine = builder.build_serialized_network(network, config)
with open("sample.engine", "wb") as f:
    f.write(serialized_engine)

Running this program fails with the following error:

(base) C:\Users\11197\Desktop\vitsr>python quantization.py
[05/31/2022-12:11:18] [TRT] [W] onnx2trt_utils.cpp:365: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[05/31/2022-12:11:18] [TRT] [E] 4: [network.cpp::nvinfer1::Network::validate::3011] Error Code 4: Internal Error (Network has dynamic or shape inputs, but no optimization profile has been defined.)
[05/31/2022-12:11:18] [TRT] [E] 2: [builder.cpp::nvinfer1::builder::Builder::buildSerializedNetwork::619] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
Traceback (most recent call last):
File "C:\Users\11197\Desktop\vitsr\quantization.py", line 21, in <module>
f.write(serialized_engine)
TypeError: a bytes-like object is required, not 'NoneType'
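
The TypeError at the end of this log is a follow-on failure: the build failed, so build_serialized_network returned None, and that None was then passed to f.write. A minimal sketch of a guard is shown below so that the script stops with a clearer message; `save_engine` is a hypothetical helper, not a TensorRT API.

```python
def save_engine(serialized_engine, path):
    """Write a serialized engine to disk, failing loudly if the build failed.

    `serialized_engine` is whatever builder.build_serialized_network returned:
    a bytes-like object on success, or None on failure.
    """
    if serialized_engine is None:
        raise RuntimeError(
            "Engine build failed; check the TensorRT error messages logged above."
        )
    with open(path, "wb") as f:
        f.write(serialized_engine)
    return path


# Demonstration with a stand-in value instead of a real TensorRT build.
try:
    save_engine(None, "sample.engine")
except RuntimeError as err:
    print("caught:", err)
```

With this guard in place, the real cause (the missing optimization profile) is the last error the reader sees, rather than an unrelated-looking TypeError.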

Following the error message, I added some configuration based on Section 8.2, Optimization Profiles, of the Developer Guide:

profile = builder.create_optimization_profile()
profile.set_shape("input_1", (1, 75, 75, 3), (1, 75, 75, 3), (1, 75, 75, 3))
profile.set_shape("output_1", (1, 300, 300, 3), (1, 300, 300, 3), (1, 300, 300, 3))
config.add_optimization_profile(profile)
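
For context, the sketch below shows where these profile lines sit relative to the earlier build script: after the builder config is created and before build_serialized_network is called. It uses only the TensorRT 8 Python calls that already appear in this post. It registers a range only for the network inputs, on the assumption that output shapes follow from the inputs. `fixed_range` and `build_engine` are illustrative names, not part of TensorRT.

```python
def fixed_range(shape):
    """Return (min, opt, max) for an input whose shape never changes."""
    shape = tuple(int(d) for d in shape)
    return shape, shape, shape


def build_engine(onnx_path, engine_path, input_shapes):
    """Parse an ONNX file and build an engine with one optimization profile.

    `input_shapes` maps input tensor names to fixed shapes,
    e.g. {"input_1": (1, 75, 75, 3)}.
    """
    import tensorrt as trt  # imported lazily; requires a TensorRT install

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, logger)
    if not parser.parse_from_file(onnx_path):
        errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
        raise RuntimeError("ONNX parse failed:\n" + "\n".join(errors))

    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 20)  # 1 MiB

    # The profile must be added to the config before the engine is built.
    profile = builder.create_optimization_profile()
    for name, shape in input_shapes.items():
        profile.set_shape(name, *fixed_range(shape))
    config.add_optimization_profile(profile)

    serialized_engine = builder.build_serialized_network(network, config)
    if serialized_engine is None:
        raise RuntimeError("Engine build failed; see the TensorRT log.")
    with open(engine_path, "wb") as f:
        f.write(serialized_engine)


print(fixed_range((1, 75, 75, 3)))
```

Because min, opt, and max are identical here, the engine is tuned for exactly one input size. Widening the range would let one engine accept several frame sizes, possibly at some cost in performance.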

With that change, the engine was generated successfully:

(base) C:\Users\11197\Desktop\srcnn>python serialize.py
[06/01/2022-11:54:18] [TRT] [W] onnx2trt_utils.cpp:365: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[06/01/2022-11:54:19] [TRT] [W] TensorRT was linked against cuBLAS/cuBLAS LT 11.8.0 but loaded cuBLAS/cuBLAS LT 11.5.1
[06/01/2022-11:54:19] [TRT] [W] TensorRT was linked against cuDNN 8.3.2 but loaded cuDNN 8.2.1

Inference

I was not sure how to write this part myself, so I adapted the code from a Zhihu article: https://zhuanlan.zhihu.com/p/347172593

Note that the inputs and outputs of the .engine file are as follows:

input_1 16875 <class 'numpy.float32'>
output_1 270000 <class 'numpy.float32'>

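The fields are name, size, and dtype. The input image has to be flattened before it is copied into the input buffer, and the output buffer has to be reshaped back into an image.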
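
The two sizes are simply the element counts of the tensors: 1 x 75 x 75 x 3 = 16875 for the input and 1 x 300 x 300 x 3 = 270000 for the output. The sketch below uses a random array in place of a real frame and a zero-filled array in place of the engine output (both assumptions for illustration) to show the flatten and reshape steps with NumPy.

```python
import numpy as np

# Stand-in for a 75x75 RGB frame with a leading batch dimension.
image = np.random.randint(0, 256, size=(1, 75, 75, 3), dtype=np.uint8)

# Flatten to the 1-D layout of the input buffer and convert to float32.
flat_input = image.astype(np.float32).ravel()
print(flat_input.size)  # 16875

# Stand-in for the flat float32 output buffer filled by the engine.
flat_output = np.zeros(1 * 300 * 300 * 3, dtype=np.float32)
print(flat_output.size)  # 270000

# Reshape back to height x width x channels to recover the image.
output_image = flat_output.reshape(300, 300, 3)
print(output_image.shape)  # (300, 300, 3)
```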

Core code:

import time

import numpy as np
import pycuda.autoinit  # imported for its side effect of initializing CUDA
import pycuda.driver as cuda
import tensorrt as trt
from PIL import Image

cfx = cuda.Device(0).make_context()
stream = cuda.Stream()
TRT_LOGGER = trt.Logger(trt.Logger.INFO)
runtime = trt.Runtime(TRT_LOGGER)

engine_file_path = "sample.engine"
with open(engine_file_path, "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

host_inputs = []
cuda_inputs = []
host_outputs = []
cuda_outputs = []
bindings = []

for binding in engine:
    size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
    dtype = trt.nptype(engine.get_binding_dtype(binding))
    print(binding, size, dtype)

    # Allocate host and device buffers.
    host_mem = cuda.pagelocked_empty(size, dtype)  # host (page-locked)
    cuda_mem = cuda.mem_alloc(host_mem.nbytes)  # device
    # Record the device buffer address in the bindings list.
    bindings.append(int(cuda_mem))
    # Sort the buffers into inputs and outputs.
    if engine.binding_is_input(binding):
        host_inputs.append(host_mem)  # CPU
        cuda_inputs.append(cuda_mem)  # GPU
    else:
        host_outputs.append(host_mem)
        cuda_outputs.append(cuda_mem)

for i in range(701, 761):
    image = np.array(Image.open("./data/40_10_test/LR/Frame0%d.png" % i))[np.newaxis, ...]

    t1 = time.time()
    # Copy the flattened input image into the host buffer.
    np.copyto(host_inputs[0], image.flatten())
    # Transfer the input data to the GPU.
    cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)
    # Run inference.
    context.execute_async(bindings=bindings, stream_handle=stream.handle)
    # Transfer the result back to the CPU.
    cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)
    # Synchronize the stream.
    stream.synchronize()
    # Fetch the result (batch_size = 1) and reshape it into an image.
    output = host_outputs[0].reshape(300, 300, 3)
    t2 = time.time()

    print("Inference time: %.2f ms" % (1000 * t2 - 1000 * t1))

cfx.pop()
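
The loop above computes output for each frame but does not save it. If the results need to be written out as images, one possible post-processing step is sketched below. It assumes the network produces pixel values on a 0 to 255 scale, which the original post does not state, and `to_uint8_image` is a hypothetical helper name.

```python
import numpy as np


def to_uint8_image(output):
    """Round to the nearest integer, clip to [0, 255], and convert to uint8."""
    return np.clip(np.rint(output), 0, 255).astype(np.uint8)


# Small demonstration with values below, inside, and above the valid range.
demo = np.array([[[-3.2, 127.6, 300.0]]], dtype=np.float32)
print(to_uint8_image(demo).tolist())  # [[[0, 128, 255]]]
```

Inside the loop, the converted array could then be passed to `Image.fromarray(...)` from PIL and saved under a file name of your choice.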

Command-line output:

(base) C:\Users\11197\Desktop\srcnn>python inference.py
[06/01/2022-11:59:25] [TRT] [I] [MemUsageChange] Init CUDA: CPU +395, GPU +0, now: CPU 6761, GPU 1332 (MiB)
[06/01/2022-11:59:25] [TRT] [I] Loaded engine size: 0 MiB
[06/01/2022-11:59:25] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1, now: CPU 0, GPU 1 (MiB)
[06/01/2022-11:59:25] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +34, now: CPU 0, GPU 35 (MiB)
input_1 16875 <class 'numpy.float32'>
output_1 270000 <class 'numpy.float32'>
Inference time: 1.00 ms
Inference time: 1.00 ms
Inference time: 1.00 ms
Inference time: 1.00 ms
Inference time: 1.03 ms
...

Inference with TensorFlow-GPU used to take about 50 ms per frame; with TensorRT it now takes only about 1 ms, a speedup of roughly 50x!
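
Each timing above comes from a single call measured with time.time(), whose resolution can be coarse on some systems, so it may be worth averaging over many runs before quoting a speedup. The sketch below is one way to do that. `benchmark` is a hypothetical helper, and the dummy workload stands in for the copy, execute, and synchronize steps of one inference.

```python
import time


def benchmark(fn, warmup=10, runs=100):
    """Return the mean wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):  # warm-up calls are not timed
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / runs


# Dummy workload; replace it with a function that runs one full inference
# and ends with stream.synchronize() so that the GPU work is included.
mean_ms = benchmark(lambda: sum(range(10_000)))
print("Mean time: %.3f ms" % mean_ms)
```

Measuring the TensorFlow baseline with the same helper would put both numbers on an equal footing.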

References

  1. Convert a TensorFlow model to ONNX: https://docs.microsoft.com/zh-cn/windows/ai/windows-ml/tutorials/tensorflow-convert-model
  2. https://github.com/NVIDIA/TensorRT/issues/301
  3. https://zhuanlan.zhihu.com/p/347172593