Boosting PyTorch Inference on CPU: From Post-Training Quantization to Multithreading | by Leonie Monigatti | Jun, 2023

For an in-depth explanation of post-training quantization and a comparison of ONNX Runtime and OpenVINO, I recommend this article:

This section will specifically look at two popular frameworks for post-training quantization:

ONNX Runtime

One popular approach to speeding up inference on CPU was to convert the final models to the ONNX (Open Neural Network Exchange) format [2, 7, 9, 10, 14, 15].

The relevant steps to quantize and accelerate inference on CPU with ONNX Runtime are shown below:

Preparation: Install ONNX Runtime

pip install onnxruntime

Step 1: Convert PyTorch Model to ONNX

import torch
import torchvision

# Define your model here
model = ...

# Train model here

# Define dummy_input on the same device as the model
dummy_input = torch.randn(1, N_CHANNELS, IMG_WIDTH, IMG_HEIGHT, device="cuda")

# Export PyTorch model to ONNX format
torch.onnx.export(model, dummy_input, "model.onnx", input_names=["input"])

Step 2: Make predictions with an ONNX Runtime session

import onnxruntime as rt

# Define X_test as a float32 NumPy array with shape (BATCH_SIZE, N_CHANNELS, IMG_WIDTH, IMG_HEIGHT)
X_test = ...

# Define ONNX Runtime session
sess = rt.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Make prediction
y_pred = sess.run(None, {'input': X_test})[0]


OpenVINO

An equally popular approach to speeding up inference on CPU was to use OpenVINO (Open Visual Inference and Neural network Optimization) [5, 6, 12], as shown in this Kaggle Notebook:

The relevant steps to quantize and accelerate a Deep Learning model with OpenVINO are shown below:

Preparation: Install OpenVINO

!pip install openvino-dev[onnx]

Step 1: Convert PyTorch Model to ONNX (see Step 1 of ONNX Runtime)

Step 2: Convert ONNX Model to OpenVINO

mo --input_model model.onnx

This will output an XML file and a BIN file, of which we will be using the XML file in the next step.

Step 3: Read and compile the model with OpenVINO

Note that this step loads and compiles the converted model for the CPU; full INT8 post-training quantization would additionally require a calibration step with OpenVINO's POT/NNCF tooling.

import openvino.runtime as ov

core = ov.Core()
openvino_model = core.read_model(model='model.xml')
compiled_model = core.compile_model(openvino_model, device_name="CPU")

Step 4: Make predictions with an OpenVINO inference request

# Define X_test as a float32 NumPy array with shape (BATCH_SIZE, N_CHANNELS, IMG_WIDTH, IMG_HEIGHT)
X_test = ...

# Create inference request
infer_request = compiled_model.create_infer_request()

# Make prediction
y_pred = infer_request.infer(inputs=[X_test])[compiled_model.output(0)]

Comparison: ONNX vs. OpenVINO vs. Alternatives

Both ONNX Runtime and OpenVINO are frameworks optimized for deploying models on CPUs. The inference times of a neural network quantized with ONNX Runtime and with OpenVINO are said to be comparable [12].

Some competitors used PyTorch JIT [3] or TorchScript [1] as alternatives to speed up inference on CPU. However, other competitors shared that ONNX was considerably faster than TorchScript [10].
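For readers who want to try the TorchScript route, here is a minimal tracing sketch. The `TinyNet` module and its shapes are made-up placeholders, not code from any competitor's solution:

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Stand-in for a real model."""
    def forward(self, x):
        return torch.relu(x) * 2.0

model = TinyNet().eval()
example = torch.randn(1, 4)

# Trace the eager model into a TorchScript graph, which can run
# without the Python interpreter overhead (and be saved/loaded)
scripted = torch.jit.trace(model, example)

with torch.no_grad():
    eager_out = model(example)
    scripted_out = scripted(example)
```

Tracing records the operations executed for the example input, so models with data-dependent control flow would need `torch.jit.script` instead.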

Multithreading

Another popular approach to speeding up inference on CPU was to use multithreading with a ThreadPoolExecutor [2, 3, 9, 15] in addition to post-training quantization, as shown in this Kaggle Notebook:

This enabled competitors to run multiple inferences at the same time.

In the following example of ThreadPoolExecutor from the competition, we have a list of audio files to infer.

audios = ['audio_1.ogg',
          # ...
         ]

Next, you need to define an inference function that takes an audio file as input and returns the predictions.

def predict(audio_path):
    # Define any preprocessing of the audio file here

    # Make predictions with the model here

    return predictions

With the list of audio files (e.g., audios) and the inference function (e.g., predict()), you can now use ThreadPoolExecutor to run multiple inferences at the same time (in parallel) as opposed to sequentially, which can noticeably reduce the overall inference time.

import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    dicts = list(executor.map(predict, audios))
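Putting the pieces together, here is a self-contained sketch in which a dummy `predict()` simulates model inference with a short sleep. The file names and timings are illustrative; for real speedups the inference call must release the GIL, which NumPy, ONNX Runtime, and OpenVINO do:

```python
import concurrent.futures
import time

def predict(audio_path):
    # Stand-in for preprocessing + model inference (simulated with a sleep)
    time.sleep(0.2)
    return {"file": audio_path, "n_chars": len(audio_path)}

audios = [f"audio_{i}.ogg" for i in range(8)]

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # executor.map preserves the input order of `audios` in its results
    dicts = list(executor.map(predict, audios))
elapsed = time.perf_counter() - start
# With 4 workers, 8 x 0.2 s of "inference" takes roughly 0.4 s instead of 1.6 s
```

Because `executor.map` preserves input order, the results line up with the submission rows even though the calls finish out of order.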

There are many more lessons to be learned from reviewing the learning resources Kagglers have created during the course of the “BirdCLEF 2023” competition. There are also many different solutions for this type of problem statement.

In this article, we focused on the general approach that was popular among many competitors:

  • Model Selection: Select the model size according to the best trade-off between performance and inference time. Also, leverage bigger and smaller models in your ensemble.
  • Post-Training Quantization: Post-training quantization can lead to faster inference times because the datatypes of the model weights and activations are optimized for the hardware. However, it can cause a slight loss of model performance.
  • Multithreading: Run multiple inferences in parallel instead of sequentially. This reduces the overall inference time.
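To make the quantization bullet concrete, here is a small sketch of 8-bit affine quantization. The min/max calibration scheme shown is a common textbook choice and an illustrative assumption, not necessarily what ONNX Runtime or OpenVINO use internally:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=256).astype(np.float32)  # stand-in for a weight tensor

# Map the float range [w.min(), w.max()] onto the 256 levels of uint8
scale = float(w.max() - w.min()) / 255.0
zero_point = int(round(-float(w.min()) / scale))

# Quantize: float32 -> uint8 (4x smaller, integer arithmetic at inference)
q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)

# Dequantize to inspect the information lost by rounding
w_hat = (q.astype(np.float32) - zero_point) * scale
max_err = float(np.abs(w - w_hat).max())
# Rounding error is bounded by half a quantization step (scale / 2)
```

The bounded rounding error is why post-training quantization usually costs only a small amount of model performance while shrinking the weights fourfold.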

If you are interested in how to approach audio classification with Deep Learning, which was the main aspect of this competition, check out the write-up of the BirdCLEF 2022 competition:
