Solving Bottlenecks on the Data Input Pipeline with PyTorch Profiler and TensorBoard | by Chaim Rand | Aug, 2023

PyTorch Model Performance Analysis and Optimization — Part 4

Photo by Alexander Grey on Unsplash

This is the fourth post in our series of posts on the topic of performance analysis and optimization of GPU-based PyTorch workloads. Our focus in this post will be on the training data input pipeline. In a typical training application, the host’s CPUs load, pre-process, and collate data before feeding it into the GPU for training. Bottlenecks in the input pipeline occur when the host is not able to keep up with the speed of the GPU. This results in the GPU — the most expensive resource in the training setup — remaining idle for periods of time while it waits for data input from the overly tasked host. In previous posts (e.g., here) we have discussed input pipeline bottlenecks in detail and reviewed different ways of addressing them, such as:

  1. Choosing a training instance with a CPU to GPU compute ratio that is more suited to your workload (e.g., see our previous post on tips for choosing the best instance type for your ML workload),
  2. Improving the workload balance between the CPU and GPU by moving some of the CPU pre-processing activity to the GPU, and
  3. Offloading some of the CPU computation to auxiliary CPU-worker devices (e.g., see here).

Of course, the first step to addressing a performance bottleneck in the data-input pipeline is to identify and understand it. In this post we will demonstrate how this can be done using PyTorch Profiler and its associated TensorBoard plugin.

As in our previous posts, we will define a toy PyTorch model and then iteratively profile its performance, identify bottlenecks, and attempt to fix them. We will run our experiments on an Amazon EC2 g5.2xlarge instance (containing an NVIDIA A10G GPU and 8 vCPUs) and using the official AWS PyTorch 2.0 Docker image. Keep in mind that some of the behaviors we describe may vary between versions of PyTorch.

Many thanks to Yitzhak Levi for his contributions to this post.

In the following blocks we introduce the toy example we will use for our demonstration. We start by defining a simple image classification model. The input to the model is a batch of 256x256 YUV images and the output is its associated batch of semantic class predictions.

from math import log2
import torch
import torch.nn as nn
import torch.nn.functional as F

img_size = 256
num_classes = 10
hidden_size = 30

# toy CNN classification model
class Net(nn.Module):
def __init__(self, img_size=img_size, num_classes=num_classes):
self.conv_in = nn.Conv2d(3, hidden_size, 3, padding='same')
num_hidden = int(log2(img_size))
hidden = []
for i in range(num_hidden):
hidden.append(nn.Conv2d(hidden_size, hidden_size, 3, padding='same'))
self.hidden = nn.Sequential(*hidden)
self.conv_out = nn.Conv2d(hidden_size, num_classes, 3, padding='same')

def forward(self, x):
x = F.relu(self.conv_in(x))
x = self.hidden(x)
x = self.conv_out(x)
x = torch.flatten(x, 1)
return x

The code block below contains our dataset definition. Our dataset contains ten thousand jpeg-image file-paths and their associated (randomly generated) semantic labels. To simplify our demonstration, we will assume that all of the jpeg-file paths point to the same image — the picture of the colorful “bottlenecks” at the top of this post.

import numpy as np
from PIL import Image
from import VisionDataset
input_img_size = [533, 800]
class FakeDataset(VisionDataset):
def __init__(self, transform):
super().__init__(root=None, transform=transform)
size = 10000
self.img_files = [f'0.jpg' for i in range(size)]
self.targets = np.random.randint(low=0,high=num_classes,

def __getitem__(self, index):
img_file, target = self.img_files[index], self.targets[index]
with torch.profiler.record_function('PIL open'):
img =
if self.transform is not None:
img = self.transform(img)
return img, target

def __len__(self):
return len(self.img_files)

Note that we have wrapped the file reader with a torch.profiler.record_function context manager.

Our input data pipeline includes the following transformations on the image:

  1. PILToTensor converts the PIL image to a PyTorch Tensor.
  2. RandomCrop returns a 256x256 crop at a random offset in the image.
  3. RandomMask is a custom transform that creates a random 256x256 boolean mask and applies it to the image. The transform includes a four-neighbor dilation operation on the mask.
  4. ConvertColor is a custom transformation that converts the image format from RGB to YUV.
  5. Scale is a custom transformation that scales the pixels to the range [0,1].
class RandomMask(torch.nn.Module):
def __init__(self, ratio=0.25):

def dilate_mask(self, mask):
# perform 4 neighbor dilation on mask
with torch.profiler.record_function('dilation'):
from scipy.signal import convolve2d
dilated = convolve2d(mask, [[0, 1, 0],
[1, 1, 1],
[0, 1, 0]], mode='same').astype(bool)
return dilated

def forward(self, img):
with torch.profiler.record_function('random'):
mask = np.random.uniform(size=(img_size, img_size)) < self.ratio
dilated_mask = torch.unsqueeze(torch.tensor(self.dilate_mask(mask)),0)
dilated_mask = dilated_mask.expand(3,-1,-1)
img[dilated_mask] = 0.
return img

def __repr__(self):
return f"{self.__class__.__name__}(ratio={self.ratio})"

class ConvertColor(torch.nn.Module):
def __init__(self):
[[0.299, 0.587, 0.114],
[-0.16874, -0.33126, 0.5],
[0.5, -0.41869, -0.08131]]

def forward(self, img):
img =
img = torch.matmul(self.A,img.view([3,-1])).view(img.shape)
img = img + self.b[:,None,None]
return img

def __repr__(self):
return f"{self.__class__.__name__}()"

class Scale(object):
def __call__(self, img):

def __repr__(self):
return f"{self.__class__.__name__}()"

We chain the transformations using the Compose class which we have modified slightly to include a torch.profiler.record_function context manager around each of the transform invocations.

import torchvision.transforms as T
class CustomCompose(T.Compose):
def __call__(self, img):
for t in self.transforms:
with torch.profiler.record_function(t.__class__.__name__):
img = t(img)
return img

transform = CustomCompose(

In the code block below we define the dataset and dataloader. We configure the DataLoader to use a custom collate function in which we wrap the default collate function with a torch.profiler.record_function context manager.

train_set = FakeDataset(transform=transform)

def custom_collate(batch):
from import default_collate
with torch.profiler.record_function('collate'):
batch = default_collate(batch)
image, label = batch
return image, label

train_loader =, batch_size=256,
num_workers=4, pin_memory=True)

Lastly, we define the model, loss function, optimizer, and the training loop, which we wrap with a profiler context manager.

from statistics import mean, variance
from time import time

device = torch.device("cuda:0")
model = Net().cuda(device)
criterion = nn.CrossEntropyLoss().cuda(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

t0 = time()
times = []

with torch.profiler.profile(
schedule=torch.profiler.schedule(wait=10, warmup=2, active=10, repeat=1),
) as prof:
for step, data in enumerate(train_loader):
with torch.profiler.record_function('h2d copy'):
inputs, labels = data[0].to(device=device, non_blocking=True),
data[1].to(device=device, non_blocking=True)
if step >= 40:
outputs = model(inputs)
loss = criterion(outputs, labels)
t0 = time()

print(f'average time: {mean(times[1:])}, variance: {variance(times[1:])}')

In the following sections we will use PyTorch Profiler and its associated TensorBoard plugin in order to assess the performance of our model. Our focus will be on the Trace View of the profiler report. Please see the first post in our series for a demonstration of how to use the other sections of the report.

The average step time reported by the script we defined is 1.3 seconds and the average GPU utilization is a very low 18.21%. In the image below we capture the performance results as displayed in the TensorBoard plugin Trace View:

Trace View of Baseline Model (Captured by Author)

We can see that every fourth training step includes a long (~5.5 second) period of data-loading during which the GPU is completely idle. The reason that this occurs on every fourth step is directly related to the number of DataLoader workers we chose — four. Every fourth step we find all of the workers busy producing the samples for the next batch while the GPU waits. This is a clear indication of a bottleneck in the data input pipeline. The question is how do we analyze it? Complicating matters is the fact that the many record_function markers that we inserted into the code are nowhere to be found in the profile trace.

The use of multiple workers in the DataLoader is critical for optimizing performance. Unfortunately, it also makes the profiling process more difficult. Although there exist profilers that support multi-process analysis (e.g., check out VizTracer), the approach we will take in this post is to run, analyze, and optimize our model in single-process mode (i.e., with zero DataLoader workers) and then apply the optimizations to the multi-worker mode. Admittedly, optimizing the speed of a standalone function does not guarantee that multiple (parallel) invocations of the same function will also benefit. However, as we will see in this post, this strategy will enable us to identify and address some core issues that we were not able to identify otherwise, and, at least with regards to the issues discussed here, we will find a strong correlation between the performance impacts on the two modes. But just before we apply this strategy, let us tune our choice of the number of workers.

Determining the optimal number of threads or processes in a multi-process/multi-threaded application, such as ours, can be tricky. On the one hand, if we choose a number that is too low, we might end up under-utilizing the CPU resources. On the other hand, if we go too high, we run the risk of thrashing, an undesired situation in which the operating system spends most of its time managing the multiple threading/processing rather than running our code. In the case of a PyTorch training workload, it is recommended to test out different choices for the DataLoader num_workers setting. A good place to start is to set the number based on the number of CPUs on the host, (e.g., num_workers:=num_cpus/num_gpus). In our case the Amazon EC2 g5.2xlarge has eight vCPUs and, in fact, increasing the number of DataLoader workers to eight results in a slightly better average step time of 1.17 seconds (an 11% boost).

Importantly, look out for other, less obvious, configuration settings that might impact the number of threads or processes being used by the data-input pipeline. For example, opencv-python, a library commonly used for image pre-processing in computer vision workloads, includes the cv2.setNumThreads(int) function for controlling the number of threads.

In the image below we capture a portion of the Trace View when running the script with num_workers set to zero.

Trace View of Baseline Model in Single-process Mode (Captured by Author)

Running the script in this manner enables us to see the record_function labels we set and to identify the RandomMask transform, or more specifically our dilation function, as the most time-consuming operation in the retrieval of each individual sample.

Our current implementation of the dilation function uses a 2D convolution, typically implemented using matrix multiplication and not known to run especially fast on CPU. One option would be to run the dilation on the GPU (as described in this post). However, the overhead involved in the host-device transaction would likely outweigh the potential performance gains from this type of solution, not to mention that we prefer not to increase the load on the GPU.

In the code block below we propose an alternative, more CPU-friendly, implementation of the dilation function that uses boolean operations instead of a convolution:

    def dilate_mask(self, mask):
# perform 4 neighbor dilation on mask
with torch.profiler.record_function('dilation'):
padded = np.pad(mask, [(1,1),(1,1)])
dilated = padded[0:-2,1:-1] | padded[1:-1,1:-1] | padded[2:,1:-1] | padded[1:-1,0:-2]| padded[1:-1,2:]
return dilated

Following this modification, our step time drops to 0.78 seconds, which amounts to an additional 50% improvement. The update single-process Trace-View is displayed below:

Trace View Following Dilation Optimization in Single-process Mode (Captured by Author)

We can see that the dilation operation has shrunk significantly and that the most time-consuming operation is now the PILToTensor transform.

A closer look at the PILToTensor function (see here) reveals three underlying operations:

  1. Loading the PIL image — due to lazy loading property of, the image is loaded here.
  2. The PIL image is converted to a numpy array.
  3. The numpy array is converted to a PyTorch Tensor.

Although the image loading takes the majority of time, we note the extreme wastefulness of applying the subsequent operations to the full-size image only to crop it immediately afterwards. This leads us to our next optimization.

Fortunately, the RandomCrop transformation can be applied directly to the PIL image enabling us to apply the image-size reduction as the first operation on our pipeline:

transform = CustomCompose(

Following this optimization our step time drops to 0.72 seconds, an additional 8% optimization. The Trace View capture below shows that the RandomCrop transformation is now the dominant operation:

Trace View Following Transformation Reordering in Single-process Mode (Captured by Author)

In reality, as before, it is actually the PIL image (lazy) loading that causes the bottleneck, not the random crop.

Ideally, we would be able to optimize this further by limiting the read operation to only the crop in which we are interested. Unfortunately, as of the time of this writing, torchvision does not support this option. In a future post we will demonstrate how we can overcome this shortcoming by implementing our own custom decode_and_crop PyTorch operator.

In our current implementation, each of the image transformations are applied on each image individually. However, some transformations may run more optimally when applied on the entire batch at once. In the code block below we modify our pipeline so that the ColorTransformation and Scale transforms are applied on image batches within our custom collate function:

def batch_transform(img):
img =
A = torch.tensor(
[[0.299, 0.587, 0.114],
[-0.16874, -0.33126, 0.5],
[0.5, -0.41869, -0.08131]]
b = torch.tensor([0., 128., 128.])

A = torch.broadcast_to(A, ([img.shape[0],3,3]))
t_img = torch.bmm(A,img.view(img.shape[0],3,-1))
t_img = t_img + b[None,:, None]
return t_img.view(img.shape)/255

def custom_collate(batch):
from import default_collate
with torch.profiler.record_function('collate'):
batch = default_collate(batch)

image, label = batch
with torch.profiler.record_function('batch_transform'):
image = batch_transform(image)
return image, label

The result of this change is actually a slight increase in the step time, to 0.75 seconds. Although unhelpful in the case of our toy model, the ability to apply certain operations as batch transforms rather than per-sample transforms carries the potential to optimize certain workloads.

The successive optimizations we have applied in this post resulted in an 80% improvement in runtime performance. However, although less severe, there still remains a bottleneck in the input pipeline and the GPU remains highly under-utilized (~30%). Please revisit our previous posts (e.g., here) for additional methods of addressing such issues.

In this post we have focused on performance issues in the training-data input pipeline. As in our previous posts in this series we have chosen PyTorch Profiler and its associated TensorBoard plugin as our weapons of choice and demonstrated their use in accelerating the speed of training. In particular, we showed how running the DataLoader with zero workers increases our ability to identify, analyze, and optimize bottlenecks on the data-input pipeline.

As in our previous posts, we emphasize that the path to successful optimization will vary greatly based on the details of the training project, including the model architecture and training environment. In practice, reaching your goals may be more difficult than in the example we presented here. Some of the techniques we described may have little impact on your performance or might even make it worse. We also note that the precise optimizations that we chose, and the order in which we chose to apply them, was somewhat arbitrary. You are highly encouraged to develop your own tools and techniques for reaching your optimization goals based on the specific details of your project.

Source link

Leave a Comment