YOLO-NAS: How to Achieve the Best Performance on Object Detection Tasks | by Thomas A Dorfer | May, 2023


A foundational model generated through neural architecture search, innovative quantization blocks, and a robust pre-training paradigm

An image of a busy city street at night, with persons and cars being detected by YOLO-NAS.
Photo by Anubhav Saxena on Unsplash. Processed with YOLO-NAS-L by the author.

In the domain of object detection, YOLO (You Only Look Once) has become a household name. Since the release of the first model in 2015, the YOLO family has been growing steadily, with each new model outperforming its predecessor in mean average precision (mAP) and inference latency.

Two weeks ago, the YOLO family welcomed yet another member: YOLO-NAS, a novel and foundational model developed by the deep learning company Deci.

In this article, we’ll explore its advantages over previous YOLO models and demonstrate how it can be used for your own object detection tasks.

YOLO-NAS: What’s New?

While previous YOLO models led the field of object detection in innovation and performance, they did have some limitations. One of the main issues was the lack of proper support for quantization, a technique that reduces the model’s memory and computation requirements. Another issue was the poor trade-off between accuracy and latency, whereby an improvement in one often resulted in a considerable decline in the other.

By leveraging a concept called neural architecture search (NAS), researchers at Deci addressed these limitations head-on. Essentially, NAS can be thought of as an automated makeover for a deep learning model’s architecture.

Traditionally, neural network architectures were manually designed by human experts based on their experience and intuition. However, this process, which involves the exploration of vast design spaces of possible architectures, has always been very time-consuming and cumbersome.

NAS, on the other hand, automatically re-designs the model’s architecture in order to boost its performance when it comes to things like speed, memory usage, and throughput. It typically involves a search space that defines the set of possible architectural choices, such as the number of layers, layer types, kernel sizes, and connectivity patterns. The search algorithm then assesses different architectures by training and evaluating them on a given task and dataset. Based on these evaluations, the algorithm iteratively explores and refines the architecture space, ultimately returning the one that yields the best performance.
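The search loop described above can be sketched in a few lines of Python. This is only a toy under stated assumptions: the search space, the evaluate() scoring function, and all names are made up for illustration, and plain random search stands in for the iterative refinement a real NAS system would perform. In practice, evaluate() would train each candidate and measure its accuracy and latency on the target hardware.

```python
import random

# Hypothetical search space: each key is one architectural choice.
SEARCH_SPACE = {
    "num_layers": [2, 4, 6, 8],
    "kernel_size": [1, 3, 5],
    "width": [16, 32, 64],
}


def sample_architecture(rng):
    """Draw one candidate architecture from the search space."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}


def evaluate(arch):
    """Stand-in for 'train and evaluate' (an assumption, not a real benchmark).

    Returns a made-up score that rewards capacity (accuracy proxy)
    and penalizes compute cost (latency proxy).
    """
    accuracy_proxy = 1.0 - 1.0 / (arch["num_layers"] * arch["width"]) ** 0.5
    latency_proxy = arch["num_layers"] * arch["width"] * arch["kernel_size"] ** 2
    return accuracy_proxy - 1e-4 * latency_proxy


def random_search(num_trials=50, seed=0):
    """Sample candidates, score each one, and keep the best seen so far."""
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(num_trials):
        arch = sample_architecture(rng)
        score = evaluate(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score


best_arch, best_score = random_search()
print(best_arch, round(best_score, 4))
```

Smarter search strategies (evolutionary, gradient-based, or predictor-guided) replace the random sampling step, but the overall structure of sample, evaluate, and keep the best stays the same.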

Photo by Google DeepMind on Unsplash

In order to perform NAS, Deci leveraged its proprietary AutoNAC technology, which is an optimization engine that redesigns a model’s architecture to squeeze out maximum inference performance for a specific piece of hardware while at the same time preserving accuracy.

Aside from NAS, another major improvement of this new YOLO member involves the usage of quantization. Quantization, in this context, refers to the conversion of the neural network’s weights, biases, and activations from floating point values to integer values (INT-8), thus making the model more efficient.
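To make the float-to-integer conversion concrete, here is a minimal sketch of symmetric, per-tensor INT-8 quantization in plain Python. The function names and the sample weights are hypothetical, and this is one common scheme rather than necessarily the one used in YOLO-NAS. It shows how a single scale factor maps floats to integers in [-127, 127], and how much precision is lost when mapping back.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to the range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.82, -1.27, 0.003, 0.45, -0.66]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("quantized:", q)
print("scale:", scale)
print("max round-trip error:", max_error)
```

The round-trip error is bounded by half the scale, so tensors with a few large outliers lose more precision on their small values. That is one reason why blocks designed to produce quantization-friendly weight and activation distributions matter.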

This effort is two-fold: (1) The model uses quantization-friendly blocks that combine the advantages of re-parameterization and INT-8 quantization. These blocks use a methodology proposed by Chu et al. (2022), which redesigns the blocks so that the weight and activation distributions they generate are advantageous for quantization. (2) The authors utilize a hybrid quantization method, which selectively quantizes specific layers of the model, thus minimizing information loss and striking a balance between latency and accuracy.

The results of this novel methodology speak for themselves. As can be seen in the graph below, the quantized, medium-sized model, YOLO-NAS-INT8-M, demonstrates a 50% improvement in inference latency, while at the same time sporting a 1 mAP increase in accuracy compared to the latest state-of-the-art model.

A graph showing mean average precision over latency for various YOLO models.
Source: Deci-AI. License: Apache License 2.0

At the time of writing, three models of YOLO-NAS have been released: small, medium, and large, each with a quantized INT-8 counterpart.

A table showing mean average precision and latency for all three YOLO-NAS models as well as their quantized versions.
Source: Deci-AI. License: Apache License 2.0

Not surprisingly, the quantized versions experience a slight drop in precision. However, thanks to the quantization-friendly blocks and selective quantization, this drop remains relatively small. The upside considerably outweighs the downside, with significant improvements in inference latency.

YOLO-NAS also comes pre-trained on the COCO, Objects365, and Roboflow 100 datasets, which makes it extremely suitable for downstream object detection tasks.

The pre-training regimen leveraged a concept known as knowledge distillation, which allows a student model to learn from the predictions of a teacher model rather than relying solely on external, labeled data. In this paradigm, the teacher model generates predictions on the training data, which then serve as guidance (or soft targets) for the student model. The student model is trained using both the original labeled data and the soft targets generated by the teacher model. It essentially tries to mimic the teacher model’s predictions while also adjusting its parameters to match the original labeled data. Overall, this approach allows the model to generalize better, reduce overfitting, and achieve higher accuracy, especially when labeled data is not abundantly available.
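The combination of labeled data and soft targets can be illustrated with the classic distillation loss for classification. This is a sketch under assumptions: the weighting factor alpha, the temperature, and the example logits are hypothetical, and the article does not state which exact formulation was used for YOLO-NAS.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, true_class,
                      alpha=0.5, temperature=2.0):
    """Weighted sum of a hard-label loss and a soft-target loss."""
    # Hard loss: cross-entropy between the student and the ground-truth label.
    student_probs = softmax(student_logits)
    hard_loss = -math.log(student_probs[true_class])

    # Soft loss: KL divergence from the teacher's softened distribution
    # to the student's softened distribution.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft_loss = sum(
        t * math.log(t / s)
        for t, s in zip(teacher_soft, student_soft)
        if t > 0
    )

    # The temperature**2 factor keeps the soft term's gradient scale comparable.
    return alpha * hard_loss + (1 - alpha) * temperature ** 2 * soft_loss


teacher = [4.0, 1.0, 0.5]
agreeing_student = [3.5, 1.0, 0.5]
disagreeing_student = [0.5, 1.0, 3.5]
print(distillation_loss(agreeing_student, teacher, true_class=0))
print(distillation_loss(disagreeing_student, teacher, true_class=0))
```

A student whose outputs agree with both the teacher and the label gets a much lower loss than one that disagrees, which is the signal that pulls the student toward the teacher’s behavior.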

The training process was further enhanced through the incorporation of distribution focal loss (DFL). DFL is a loss function that extends the concept of focal loss, which addresses the issue of class imbalance by assigning higher weights to hard-to-classify samples. In the context of object detection, DFL treats box regression as a classification task: it discretizes each bounding box coordinate into a limited set of options and predicts a probability distribution over them. The final prediction is then obtained as the weighted sum of these options. Because the model learns a distribution rather than a single value, it can express uncertainty about ambiguous object boundaries, which helps it localize objects more accurately.
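The idea of regressing a box coordinate through a discrete distribution can be sketched as follows. The code assumes the commonly used DFL formulation with integer-spaced bins, where the loss pushes probability mass onto the two bins that bracket the continuous target. The number of bins and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_offset(logits):
    """Final prediction: probability-weighted sum over the bin values 0..n-1."""
    probs = softmax(logits)
    return sum(i * p for i, p in enumerate(probs))


def distribution_focal_loss(logits, target):
    """DFL for one coordinate; target is a continuous value in [0, n-1]."""
    n = len(logits)
    if not 0 <= target <= n - 1:
        raise ValueError("target must lie within the bin range")
    probs = softmax(logits)
    # The two neighbouring bins that bracket the target.
    left = min(int(math.floor(target)), n - 2)
    right = left + 1
    # The closer the target is to a bin, the larger that bin's weight.
    weight_left = right - target
    weight_right = target - left
    return -(weight_left * math.log(probs[left])
             + weight_right * math.log(probs[right]))


# Eight bins; the true offset 2.3 lies between bins 2 and 3.
near_target = [0.0, 0.0, 4.0, 4.0, 0.0, 0.0, 0.0, 0.0]
far_from_target = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0, 0.0]
print(distribution_focal_loss(near_target, 2.3), expected_offset(near_target))
print(distribution_focal_loss(far_from_target, 2.3), expected_offset(far_from_target))
```

A distribution concentrated around the target yields both a lower loss and an expected offset closer to the true value, while a flat distribution signals that the model is unsure where the box edge lies.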

Finally, YOLO-NAS has been made available under an open source license with pre-trained weights available for research use on Deci’s PyTorch-based computer vision library called SuperGradients.

How To Use YOLO-NAS

In order to use YOLO-NAS for inference, we need to install the super_gradients package first:

pip install super-gradients

To set ourselves up for the inference task, let’s take a sample image that we’re going to call image.jpg:

An image showing two colleagues in an office high-fiving.
Photo by krakenimages on Unsplash

In order to perform inference, we can use the following code snippet:

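import torch
from super_gradients.training import models

# Use the first GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

# Load the small YOLO-NAS model with weights pre-trained on COCO.
yolo_nas_s = models.get("yolo_nas_s", pretrained_weights="coco").to(device)

# Run inference and save the annotated image.
out = yolo_nas_s.predict("image.jpg")
out.save("image_yolo.jpg")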

First, we need to import torch, as well as models from the super_gradients library. Then, we declare a variable device, which is set to the first GPU if one is available and to the CPU otherwise.

Next, we specify that we’d like to use the small version of the model, YOLO-NAS-S, with weights pre-trained on the COCO dataset. Finally, we run the prediction on image.jpg and save the output image with the detected objects as image_yolo.jpg.

Our output image looks as follows:

An image showing two colleagues in an office high-fiving, with objects being detected by YOLO-NAS.
Photo by krakenimages on Unsplash. Processed with YOLO-NAS-S by the author.

We can see various objects being detected spanning a wide range of confidence levels. The model is mostly confident for objects in focus, such as the two persons, cups, and the laptop. However, we also observe some misclassifications, probably due to the objects being out of focus. This includes a penholder being misclassified as a potted plant, and a pen being misclassified as a toothbrush. Amazingly, we can also see that the model accurately detects objects that are only partially visible, such as the chairs that the persons are sitting on, where only the backrest is visible.

Finally, it is worth mentioning that object detection can be performed in exactly the same way with videos by simply changing the input parameter of the predict() call to the corresponding video file.

Conclusion

The YOLO family has grown by yet another member, YOLO-NAS, which is proudly outperforming its older siblings like YOLOv6, YOLOv7, and YOLOv8.

Through an innovative combination of neural architecture search, quantization support, and a robust pre-training procedure that includes knowledge-distillation and distribution focal loss, YOLO-NAS achieves remarkable trade-offs between precision and inference latency.

Considering the rapid pace at which the landscape of computer vision and object detection keeps evolving, it is highly likely that another YOLO model will soon come to see the light of day.


