PyTorch Model Performance Analysis and Optimization — Part 2 | by Chaim Rand | Jun, 2023


How to Identify and Reduce CPU Computation In Your Training Step with PyTorch Profiler and TensorBoard

Photo by Denise Chan on Unsplash

This is the second part of a series of posts on the topic of analyzing and optimizing a PyTorch model running on a GPU. In our first post we demonstrated the process — and the significant potential — of iteratively analyzing and optimizing a PyTorch model using PyTorch Profiler and TensorBoard. In this post we will focus on a specific type of performance issue that is particularly prevalent in PyTorch due to its use of eager execution: The dependency on the CPU for portions of the model execution. Identifying the presence and source of these kinds of issues can be quite difficult and often requires the use of a dedicated performance analyzer. In this post we will share some tips for identifying such performance issues when using PyTorch Profiler and the PyTorch Profiler TensorBoard plugin.

The Pros and Cons of Eager Execution

One of the main appeals of PyTorch is its eager execution mode. In eager mode, each PyTorch operation that forms the model is executed independently as soon as it is reached. This is in contrast to graph mode, in which the entire model is pre-compiled into a single graph in a manner that is optimal for running on the GPU and then executed as a whole. Usually, this pre-compilation results in better performance (e.g., see here). In eager mode, the programming context returns to the application following each operation, thus allowing us to access and evaluate arbitrary tensors. This makes it easier to build, analyze, and debug ML models. On the other hand, it also makes our model more susceptible to (sometimes accidental) insertion of suboptimal blocks of code. As we will demonstrate, knowing how to identify and fix such blocks of code can have a significant impact on the speed of your model.
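For instance, here is a minimal sketch (not part of the post's example) of the kind of interactive access eager execution enables: each operation runs as soon as it is reached, so any intermediate tensor can be inspected directly from Python.

import torch

x = torch.randn(4, 3)
h = torch.relu(x)       # executed immediately
print(h.mean())         # any intermediate result can be evaluated on the spot
y = h.sum(dim=-1)       # each subsequent op also runs eagerly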

In the following blocks we introduce the toy example we will use for our demonstration. The code is very loosely based on the example from our previous post and the loss function defined in this PyTorch tutorial.

We start by defining a simple classification model. Its architecture is not significant for this post.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim
import torch.profiler
import torch.utils.data
import torchvision.models
import torchvision.transforms as T
from torchvision.datasets.vision import VisionDataset
import numpy as np
from PIL import Image

# sample model
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 8, 3, padding=1)
        self.conv2 = nn.Conv2d(8, 12, 3, padding=1)
        self.conv3 = nn.Conv2d(12, 16, 3, padding=1)
        self.conv4 = nn.Conv2d(16, 20, 3, padding=1)
        self.conv5 = nn.Conv2d(20, 24, 3, padding=1)
        self.conv6 = nn.Conv2d(24, 28, 3, padding=1)
        self.conv7 = nn.Conv2d(28, 32, 3, padding=1)
        self.conv8 = nn.Conv2d(32, 10, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        x = self.pool(F.relu(self.conv4(x)))
        x = self.pool(F.relu(self.conv5(x)))
        x = self.pool(F.relu(self.conv6(x)))
        x = self.pool(F.relu(self.conv7(x)))
        x = self.pool(F.relu(self.conv8(x)))
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        return x

Next, we define a pretty standard cross-entropy loss function. This loss function will be the main focus of our discussion.

def log_softmax(x):
    return x - x.exp().sum(-1).log().unsqueeze(-1)


def weighted_nll(pred, target, weight):
    assert target.max() < 10
    nll = -pred[range(target.shape[0]), target]
    nll = nll * weight[target]
    nll = nll / weight[target].sum()
    sum_nll = nll.sum()
    return sum_nll


# custom loss definition
class CrossEntropyLoss(nn.Module):
    def forward(self, input, target):
        pred = log_softmax(input)
        loss = weighted_nll(pred, target, torch.Tensor([0.1] * 10).cuda())
        return loss

Finally, we define the dataset and the training loop:

# dataset with random images that mimics the properties of CIFAR10
class FakeCIFAR(VisionDataset):
    def __init__(self, transform):
        super().__init__(root=None, transform=transform)
        self.data = np.random.randint(low=0, high=256, size=(10000, 32, 32, 3), dtype=np.uint8)
        self.targets = np.random.randint(low=0, high=10, size=(10000,), dtype=np.uint8).tolist()

    def __getitem__(self, index):
        img, target = self.data[index], self.targets[index]
        img = Image.fromarray(img)
        if self.transform is not None:
            img = self.transform(img)
        return img, target

    def __len__(self) -> int:
        return len(self.data)


transform = T.Compose(
    [T.Resize(256),
     T.PILToTensor()])

train_set = FakeCIFAR(transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=1024,
                                           shuffle=True, num_workers=8, pin_memory=True)

device = torch.device("cuda:0")
model = Net().cuda(device)
criterion = CrossEntropyLoss().cuda(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model.train()

# training loop wrapped with profiler object
with torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=1, warmup=4, active=3, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/example'),
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    for step, data in enumerate(train_loader):
        inputs = data[0].to(device=device, non_blocking=True)
        labels = data[1].to(device=device, non_blocking=True)
        inputs = (inputs.to(torch.float32) / 255. - 0.5) / 0.5
        if step >= (1 + 4 + 3) * 1:
            break
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
        prof.step()

An experienced PyTorch developer may have already noticed that our example contains a number of inefficient lines of code in the loss function. At the same time, there is nothing obviously wrong with it and these types of inefficiencies are not uncommon. If you would like to test your PyTorch proficiency, see if you can find three issues with our implementation of the cross-entropy loss before reading on. In the next sections we will assume that we were not able to find these issues on our own and show how we can use PyTorch Profiler and its associated TensorBoard plugin to identify them.

As in our previous post, we will iteratively run an experiment, identify performance issues, and attempt to fix them. We will run our experiments on an Amazon EC2 g5.2xlarge instance (containing an NVIDIA A10G GPU and 8 vCPUs) and using the official AWS PyTorch 2.0 Docker image. Our choice of training environment was somewhat arbitrary and should not be viewed as an endorsement for any of its components.

Initial Performance Results

In the image below we show the Overview tab of the performance report of the script above.

Performance Overview of Baseline Model (Captured by Author)

As we can see, our GPU utilization is at a relatively high 92.04% and our step time is 216 milliseconds. (As in our previous post, the Overview in torch-tb-profiler version 0.4.1 sums the step time of all three training steps.) From this report alone you may not think that there was anything wrong with our model. However, the Trace View of the performance report tells a completely different story:

Trace View of Baseline Model (Captured by Author)

As highlighted above, the forward pass of our cross-entropy loss alone takes up 211 of the 216 milliseconds of the training step! This is a clear indication that something is wrong. Our loss function contains a small number of calculations compared to the model and should certainly not account for 98% of the step time. Taking a closer look at the call stack, we can see a few function calls that strengthen our suspicions, including “to”, “copy_”, and “cudaStreamSynchronize”. This combination usually indicates that data is being copied from the CPU into the GPU — not something we want to be happening in the middle of our loss calculation. In this case, our performance issue also aligns with a brief dip in the GPU utilization, as highlighted in the image. However, this is not always the case. Often, dips in the GPU utilization will not be aligned with the performance issue or they may not be seen at all.

We now know that we have a performance issue in our loss function and that it is likely to be related to copying tensors from the host to the GPU. However, this might not be enough to identify the precise line of code that is causing the issue. To facilitate our search we will wrap each line of code with a labeled torch.profiler.record_function context manager and rerun the profiling analysis.

# custom loss definition
class CrossEntropyLoss(nn.Module):
    def forward(self, input, target):
        with torch.profiler.record_function('log_softmax'):
            pred = log_softmax(input)
        with torch.profiler.record_function('define_weights'):
            weights = torch.Tensor([0.1] * 10).cuda()
        with torch.profiler.record_function('weighted_nll'):
            loss = weighted_nll(pred, target, weights)
        return loss

The addition of the labels helps us identify the weight definition, or more accurately, the copying of the weights into the GPU, as the problematic line of code.

Performance Issue of Weights Definition as Seen in Trace View (Captured by Author)

Optimization #1: Remove redundant host-to-GPU copies from the training step

Once we have identified our first issue, fixing it is rather trivial. In the code block below, we copy our weight vector to the GPU a single time in the loss init function:

class CrossEntropyLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.Tensor([0.1] * 10).cuda()

    def forward(self, input, target):
        with torch.profiler.record_function('log_softmax'):
            pred = log_softmax(input)
        with torch.profiler.record_function('weighted_nll'):
            loss = weighted_nll(pred, target, self.weight)
        return loss

The image below shows the results of the performance analysis following this fix:

Performance Overview Following Optimization #1 (Captured by Author)

Disappointingly, our first optimization had a very marginal impact on the step time. If we look at the Trace View report, we can see that we have a new severe performance issue that we need to address.

Trace View Following Optimization #1 (Captured by Author)

Our new report indicates an issue coming from our weighted_nll function. As before, we used torch.profiler.record_function to identify the problematic line of code. In this case it is the assert call.

def weighted_nll(pred, target, weight):
    with torch.profiler.record_function('assert'):
        assert target.max() < 10
    with torch.profiler.record_function('range'):
        r = range(target.shape[0])
    with torch.profiler.record_function('index'):
        nll = -pred[r, target]
    with torch.profiler.record_function('nll_calc'):
        nll = nll * weight[target]
        nll = nll / weight[target].sum()
        sum_nll = nll.sum()
    return sum_nll

Note that this issue existed in the base experiment as well, but was hidden by our previous performance issue. It is not uncommon, in the course of performance optimization, for severe issues that were previously hidden by other issues to surface in this manner.

A closer analysis of the call stack shows calls to “item”, “_local_scalar_dense”, and “cudaMemcpyAsync”. This is often an indication that data is being copied from the GPU to the host. Indeed, our assert call, which is performed on the CPU, requires access to the target tensor residing on the GPU, thus invoking the highly inefficient data copy.
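To make this concrete, here is a small, hypothetical illustration (not part of the post's code) of operations that silently pull data from the GPU back to the host; each one introduces a synchronization point:

# Each of the following lines forces a GPU-to-host copy (assumes a CUDA device)
t = torch.randint(0, 10, (1024,), device="cuda")

assert t.max() < 10   # evaluating the condition in Python copies the result to the host
v = t.sum().item()    # item() explicitly copies the scalar back to the host
n = int(t[0])         # converting a GPU scalar to a Python int does the same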

Optimization #2: Remove redundant GPU-to-host copies from the training step

While verifying the validity of the input labels may be warranted, it should be done in a way that does not impact our training performance so negatively. In our case, fixing the issue is a simple matter of moving the assert to the data input pipeline, before the labels are copied to the GPU, as sketched below.
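The block below is a minimal, hypothetical sketch of one possible placement (it is not taken from the original training script): the check runs in the training loop while the labels are still on the CPU, so no GPU synchronization is required.

# Sketch of the fix: validate the labels while they are still on the CPU,
# before they are copied to the GPU
for step, data in enumerate(train_loader):
    labels_cpu = data[1]
    assert labels_cpu.max() < 10              # runs entirely on the host
    inputs = data[0].to(device=device, non_blocking=True)
    labels = labels_cpu.to(device=device, non_blocking=True)
    # ... rest of the training step as before ...

Following the removal of the assert from the loss function, our performance remains mostly unchanged: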

Performance Overview Following Optimization #2 (Captured by Author)

Important Note: Although our goal is usually to attempt to reduce copies between the host and the GPU in the forward pass, there are times when this is either not possible (e.g., if we require a kernel that is not supported by the GPU) or undesirable (e.g., if running a particular kernel on the CPU will increase performance).

Analyzing the Trace View introduces us to our next performance issue:

Trace View Following Optimization #2 (Captured by Author)

Once again, we see that our previous optimization has uncovered a new severe performance issue, this time when indexing our pred tensor. The indexes are defined by r and target. While the target tensor already resides on the GPU, r, which was created on the previous line with Python's range, lives on the CPU. This, once again, triggers an inefficient host-to-GPU data copy.

Optimization #3: Replace range with torch.arange

Python's range function is evaluated on the CPU, and using its output (or any Python list) to index a GPU tensor forces a host-to-GPU copy of the indexes. The presence of any such CPU-side sequence in your training step should be a red flag. In the code block below, we replace the use of range with torch.arange and configure it to create the output tensor directly on the GPU:

def weighted_nll(pred, target, weight):
    with torch.profiler.record_function('range'):
        r = torch.arange(target.shape[0], device="cuda:0")
    with torch.profiler.record_function('index'):
        nll = -pred[r, target]
    with torch.profiler.record_function('nll_calc'):
        nll = nll * weight[target]
        nll = nll / weight[target].sum()
        sum_nll = nll.sum()
    return sum_nll

The results of this optimization are shown below:

Performance Overview Following Optimization #3 (Captured by Author)

Now we’re talking!! Our step time has dropped down to 5.8 milliseconds, a performance increase of a whopping 3700%.

The updated Trace View shows that the loss function has dropped to a very reasonable 0.5 milliseconds.

Trace View Following Optimization #3 (Captured by Author)

But there is still room for improvement. Let’s take a closer look at the Trace View of the weighted_nll function which takes up the majority of the loss calculation.

Trace View of weighted_nll Function (Captured by Author)

We can see from the trace that the function is formed from multiple small blocks, each of which is ultimately mapped to an individual CUDA kernel that is loaded onto the GPU via a cudaLaunchKernel call. Ideally, we would like to reduce the total number of GPU kernels so as to reduce the amount of interaction between the CPU and GPU. One way to do this is to prefer, whenever possible, higher-level PyTorch operators, such as torch.nn.NLLLoss. Such functions are presumed to "fuse" together underlying operations, thus requiring a lower number of overall kernels.

Optimization #4: Replace custom NLL with torch.nn.NLLLoss

The code block below contains our updated loss definition, which now uses torch.nn.NLLLoss.

class CrossEntropyLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.Tensor([0.1] * 10).cuda()

    def forward(self, input, target):
        pred = log_softmax(input)
        nll = torch.nn.NLLLoss(self.weight)
        loss = nll(pred, target)
        return loss

Here we have taken the liberty of introducing another common error which we will proceed to demonstrate.

Using the higher-level function further reduces our step time to 5.3 milliseconds (down from 5.8).

Performance Overview Following Optimization #4 (Captured by Author)

However, if we take a closer look at the Trace View, we can see that a significant portion of the loss function is now spent on initializing the torch.nn.NLLLoss object!

Trace View Following Optimization #4 (Captured by Author)

Looking back at our loss function, we can see that we are initializing a new NLLLoss object in each iteration of the training step. Naturally, object initialization occurs on the CPU, and although (in our case) it is relatively fast, it is something we would like to avoid doing during our training step.

Optimization #5: Refrain from initializing objects in the train step

In the code block below we have modified our loss implementation so that a single instance of torch.nn.NLLLoss is created in the init function.

class CrossEntropyLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.Tensor([0.1] * 10).cuda()
        self.nll = torch.nn.NLLLoss(self.weight)

    def forward(self, input, target):
        pred = log_softmax(input)
        loss = self.nll(pred, target)
        return loss

The results show yet a further improvement in the step time which now stands at 5.2 milliseconds.

Optimization #6: Use the built-in torch.nn.CrossEntropyLoss

PyTorch includes a built-in torch.nn.CrossEntropyLoss, which we now evaluate and compare with our custom loss implementation.

criterion = torch.nn.CrossEntropyLoss().cuda(device)
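Note that if class weights are required, as in our custom implementation, the built-in loss accepts them directly; a minimal sketch, assuming the same uniform weights as before:

# Sketch: passing class weights to the built-in loss
weights = torch.tensor([0.1] * 10, device=device)
criterion = torch.nn.CrossEntropyLoss(weight=weights)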

The resultant step time is a new low of 5 milliseconds for an overall performance boost of 4200% (compared to the 216 milliseconds we started with).

The performance improvement of the forward pass of the loss calculation is even more dramatic: From a starting point of 211 milliseconds, we have dropped all the way down to 79 microseconds(!!), as seen below:

Optimization #7: Compile loss function

For our final optimization attempt, we will configure the loss function to run in graph mode using the torch.compile API. As we discussed at length in this post and demonstrated in the prequel to this post, torch.compile will use techniques such as kernel fusion and out-of-order execution to map the loss function into low-level compute kernels in a manner that is optimal for the underlying training accelerator.

criterion = torch.compile(torch.nn.CrossEntropyLoss().cuda(device))

The image below shows the Trace View result of this experiment.

The first thing we can see is the appearance of terms containing "OptimizedModule" and "dynamo", which are indicative of the use of torch.compile. We can also see that, in practice, model compilation did not reduce the number of kernels loaded by the loss function, which means that it did not identify any opportunities for additional kernel fusion. In fact, in our case, the loss compilation actually caused the time of the forward pass of the loss function to increase from 79 to 154 microseconds. It appears that the CrossEntropyLoss is not meaty enough to benefit from this optimization.

You may be wondering why we can't just apply torch compilation to our initial loss function and rely on it to compile our code in an optimal manner. This could save all the hassle of the step-by-step optimization we described above. The problem with this approach is that although PyTorch 2.0 compilation (as of the time of this writing) does indeed optimize certain types of GPU-to-CPU crossovers, some types will crash the graph compilation, and others will result in the creation of multiple small graphs rather than a single large one. This last category results in graph breaks, which essentially limit the torch.compile feature's ability to boost performance. (One way to surface and address such graph breaks is to call torch.compile with the fullgraph flag set to True, which raises an error at the first graph break instead of silently splitting the graph.) See our previous post for more details on using this option.
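A minimal sketch of using this flag on our loss function (the setup is otherwise the same as above); with fullgraph=True, compilation fails loudly at the first graph break rather than falling back to eager execution:

# Sketch: fail loudly on graph breaks instead of silently splitting the graph
criterion = torch.compile(
    torch.nn.CrossEntropyLoss().cuda(device),
    fullgraph=True,
)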

In the table below we summarize the results of the experiments we have run:

Optimization Experiments Results (By Author)

Our successive optimizations have led to a mind-blowing 4143% performance boost!! Recall that we started with a pretty innocent looking loss function. Without an in-depth analysis of our application’s behavior, we may have never known that there was anything wrong and would have continued on with our lives while paying 41 times(!!) more than we needed to.

You may have noticed that the GPU utilization dropped significantly in our final trials. This indicates major potential for further performance optimization. Although our demonstration has neared its end, our work is not done. See our previous post for some ideas on how to proceed from here.

Let’s summarize some of the things we have learned. We divide the summary into two parts. In the first, we describe some coding habits that may impact training performance. In the second, we recommend some tips for performance profiling. Note that these conclusions are based on the example that we have shared in this post and may not apply to your own use case. Machine learning models vary greatly in property and behavior. Therefore, you are strongly advised to assess these conclusions based on the details of your own project.

Coding Tips

The way in which you implement the forward pass of your model can have a significant impact on its performance. Here we list just a few recommendations based on the example that we covered in this post.

  1. Avoid initializing constant tensors in the forward pass. Do it in the constructor instead.
  2. Avoid using asserts on tensors residing on the GPU in the forward pass. Either move them to the data input pipeline or check whether PyTorch has built-in methods for performing the data verification that you need.
  3. Avoid the use of lists. Check if using torch.arange to create a tensor directly on the device can be a better alternative.
  4. Use PyTorch operators such as torch.nn.NLLLoss and torch.nn.CrossEntropyLoss rather than creating your own loss implementations.
  5. Avoid initializing objects in the forward pass. Do it in the constructor instead.
  6. Consider using torch.compile when relevant.

Performance Analysis Tips

As we demonstrated, the Trace View of the Tensorboard PyTorch Profiler plugin was critical in identifying the performance issues in our model. Below we summarize some of the primary takeaways from our example:

  1. High GPU utilization is NOT necessarily a sign that your code is running optimally.
  2. Look out for portions of the code that take longer than expected.
  3. Use torch.profiler.record_function to pinpoint performance issues.
  4. Dips in GPU utilization are not necessarily aligned with the source of the performance issue.
  5. Look out for unintended data copies from the host to the GPU. These are typically identified by calls to “to”, “copy_”, and “cudaStreamSynchronize”, which you can search for in the Trace View.
  6. Look out for unintended data copies from the GPU to the host. These are typically identified by calls to “item”, and “cudaStreamSynchronize”, which you can search for in the Trace View.

In this post we have focused on performance issues in training applications resulting from redundant interaction between the CPU and GPU during the forward pass of the training step. We demonstrated how performance analyzers such as PyTorch Profiler and its associated TensorBoard plugin can be used to identify such issues and facilitate significant performance improvement.

As in our previous post, we emphasize that the path to successful optimization will vary greatly based on the details of the training project, including the model architecture and training environment. In practice, reaching your goals may be more difficult than in the example we presented here. Some of the techniques we described may have little impact on your performance or might even make it worse. We also note that the precise optimizations that we chose, and the order in which we chose to apply them, was somewhat arbitrary. You are highly encouraged to develop your own tools and techniques for reaching your optimization goals based on the specific details of your project.


