## A high-level overview of the latest convolutional kernel structures in Deformable Convolutional Networks (DCN), DCNv2, and DCNv3

As the remarkable success of OpenAI’s ChatGPT has sparked the boom of large language models, many people foresee the next breakthrough in large image models. In this domain, vision models can be prompted to analyze and even generate images and videos in a similar manner to how we currently prompt ChatGPT.

The latest deep learning approaches for large image models have branched into two main directions: those based on convolutional neural networks (CNNs) and those based on transformers. This article will focus on the CNN side and provide a high-level overview of those improved CNN kernel structures.

Traditionally, CNN kernels have been applied to fixed locations in each layer, resulting in all activation units having the same receptive field.

As in the figure below, to perform convolution on an input feature map **x**, the value at each output location *p₀* is calculated as an element-wise multiplication and summation between the kernel weights **w** and a sliding window on **x**. The sliding window is defined by a grid *R*, which is also the receptive field for *p₀*. The size of *R* remains the same across all locations within the same layer of **y**.

Each output value is calculated as follows:

y(p₀) = Σ_{pₙ ∈ R} w(pₙ) · x(p₀ + pₙ)

where *pₙ* enumerates the locations in the sliding window (grid *R*).
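The fixed-grid convolution above can be sketched in a few lines. This is a minimal single-channel sketch with stride 1 and no padding; the function name is illustrative, not from any of the papers:

```python
import numpy as np

def conv2d_fixed_grid(x, w):
    """Plain convolution: for every output location p0, multiply the
    kernel weights w with the sliding window of x and sum.
    The grid R (the receptive field) is identical for every p0."""
    kh, kw = w.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    y = np.zeros((out_h, out_w))
    for i in range(out_h):                  # p0 enumerates output locations
        for j in range(out_w):
            window = x[i:i + kh, j:j + kw]  # sliding window defined by R
            y[i, j] = np.sum(w * window)    # element-wise multiply + sum
    return y
```

Every output unit sees a window of identical shape, which is exactly the rigidity DCN sets out to relax.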

The RoI (region of interest) pooling operation, too, operates on bins of a fixed size in each layer. For the (*i, j*)-th bin containing *nᵢⱼ* pixels, its pooling outcome is computed as:

y(i, j) = Σ_{p ∈ bin(i, j)} x(p₀ + p) / nᵢⱼ

Again, the shape and size of the bins are the same across all locations in each layer.
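Average RoI pooling over a fixed k × k grid of bins can be sketched as follows (illustrative names; single channel, integral bin boundaries for simplicity):

```python
import numpy as np

def roi_average_pool(x, roi, k):
    """Average RoI pooling sketch: split the region of interest into a
    fixed k x k grid of bins and average the n_ij pixels in each bin.
    roi = (top, left, height, width) with integral coordinates."""
    top, left, h, w = roi
    y = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            # bin boundaries: every bin has (roughly) the same shape/size
            r0, r1 = top + i * h // k, top + (i + 1) * h // k
            c0, c1 = left + j * w // k, left + (j + 1) * w // k
            y[i, j] = x[r0:r1, c0:c1].mean()  # sum over bin / n_ij
    return y
```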

Both operations thus become particularly problematic for high-level layers that encode semantics, where objects appear at varying scales yet every unit keeps the same fixed receptive field.

DCN proposes deformable convolution and deformable pooling that are more flexible to model those geometric structures. Both operate on the 2D spatial domain, i.e., the operation remains the same across the channel dimension.

**Deformable convolution**

Given input feature map **x**, for each location *p₀* in the output feature map **y**, DCN adds 2D offsets Δ*pₙ* when enumerating each location *pₙ* in the regular grid *R*:

y(p₀) = Σ_{pₙ ∈ R} w(pₙ) · x(p₀ + pₙ + Δpₙ)

These offsets are learned from preceding feature maps, obtained via an additional conv layer over the feature map. As these offsets are typically fractional, they are implemented via bilinear interpolation.
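That interpolation step can be sketched as follows (single-channel; the helper name is illustrative, and sampling locations are assumed to lie within the feature map):

```python
import numpy as np

def bilinear_sample(x, p):
    """Sample feature map x at a fractional location p = (py, px) via
    bilinear interpolation, as deformable convolution does for the
    fractional positions p0 + pn + Δpn. Assumes p is in-range;
    the upper neighbours are clipped to the feature-map border."""
    py, px = p
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    y1, x1 = min(y0 + 1, x.shape[0] - 1), min(x0 + 1, x.shape[1] - 1)
    wy, wx = py - y0, px - x0   # fractional parts weight the 4 neighbours
    return ((1 - wy) * (1 - wx) * x[y0, x0] + (1 - wy) * wx * x[y0, x1]
            + wy * (1 - wx) * x[y1, x0] + wy * wx * x[y1, x1])
```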

**Deformable RoI pooling**

Similar to the convolution operation, pooling offsets Δ*pᵢⱼ* are added to the original binning positions:

y(i, j) = Σ_{p ∈ bin(i, j)} x(p₀ + p + Δpᵢⱼ) / nᵢⱼ
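A minimal sketch of the shifted binning, assuming integer offsets for brevity (the actual operation predicts fractional offsets and resolves them with bilinear interpolation; names are illustrative):

```python
import numpy as np

def deformable_roi_pool(x, roi, bin_offsets, k):
    """Deformable RoI pooling sketch: each (i, j)-th bin of the k x k
    grid is shifted by a learned offset Δp_ij before averaging.
    roi = (top, left, height, width); bin_offsets has shape (k, k, 2)
    and is assumed to keep every bin inside the feature map."""
    top, left, h, w = roi
    y = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            dy, dx = bin_offsets[i, j]          # learned shift for this bin
            r0 = top + i * h // k + dy
            r1 = top + (i + 1) * h // k + dy
            c0 = left + j * w // k + dx
            c1 = left + (j + 1) * w // k + dx
            y[i, j] = x[r0:r1, c0:c1].mean()    # average over shifted bin
    return y
```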

As in the figure below, these offsets are learned through a fully connected (FC) layer after the original pooling result.

**Deformable Position-Sensitive (PS) RoI pooling**

When applying deformable operations to PS RoI pooling (Dai et al., 2016), as illustrated in the figure below, offsets are applied to each score map instead of the input feature map. These offsets are learned through a conv layer instead of an FC layer.

*Position-Sensitive RoI pooling (Dai et al., 2016): Traditional RoI pooling loses information about which object part each region represents. PS RoI pooling retains this information by converting the input feature maps into k² score maps per object class, where each score map represents a specific spatial part. So for C object classes, there are k²(C + 1) score maps in total.*

Although DCN allows for more flexible modelling of the receptive field, it assumes that pixels within each receptive field contribute equally to the response, which is often not the case. To better understand the contribution behaviour, the DCNv2 authors use three methods to visualize the spatial support:

- Effective receptive fields: gradient of the node response with respect to intensity perturbations of each image pixel
- Effective sampling/bin locations: gradient of the network node with respect to the sampling/bin locations
- Error-bounded saliency regions: progressively masking the parts of the image to find the smallest image region that produces the same response as the entire image

To assign a learnable feature amplitude to each location within the receptive field, DCNv2 introduces modulated deformable modules:

y(p₀) = Σ_{pₙ ∈ R} w(pₙ) · x(p₀ + pₙ + Δpₙ) · Δmₙ

For each location *p₀*, the offset Δ*pₙ* and its amplitude Δ*mₙ* are learned through separate conv layers applied to the same input feature map.
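The modulated sampling at a single output location can be sketched as below, assuming the offsets and amplitudes have already been predicted (in the real layer they come from the separate conv branches; names are illustrative):

```python
import numpy as np

def _bilinear(x, py, px):
    # bilinear interpolation for fractional sampling locations
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    y1, x1 = min(y0 + 1, x.shape[0] - 1), min(x0 + 1, x.shape[1] - 1)
    wy, wx = py - y0, px - x0
    return ((1 - wy) * (1 - wx) * x[y0, x0] + (1 - wy) * wx * x[y0, x1]
            + wy * (1 - wx) * x[y1, x0] + wy * wx * x[y1, x1])

def modulated_deform_response(x, w, offsets, amps, p0):
    """DCNv2 response at one output location p0:
    y(p0) = sum_n w(pn) * x(p0 + pn + Δpn) * Δmn.
    offsets (n, 2) and amps (n,) stand in for the outputs of the two
    extra conv layers that would predict them."""
    k = w.shape[0]
    y, n = 0.0, 0
    for i in range(k):
        for j in range(k):
            pn = (i - k // 2, j - k // 2)   # location pn in grid R
            dy, dx = offsets[n]             # learned 2D offset Δpn
            y += w[i, j] * amps[n] * _bilinear(
                x, p0[0] + pn[0] + dy, p0[1] + pn[1] + dx)
            n += 1
    return y
```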

DCNv2 revises deformable RoI pooling similarly by adding a learnable amplitude Δ*mᵢⱼ* for each (*i, j*)-th bin.

DCNv2 also expands the use of deformable conv layers, replacing the regular conv layers in the conv3 to conv5 stages of ResNet-50.

To reduce the parameter size and memory complexity of DCNv2, DCNv3 makes the following adjustments to the kernel structure.

1. Inspired by depthwise separable convolution (Chollet, 2017)

*Depthwise separable convolution decouples traditional convolution into: 1. depth-wise convolution: each channel of the input feature is convolved separately with a filter; 2. point-wise convolution: a 1×1 convolution applied across channels.*

The authors propose letting the feature amplitude *m* act as the depth-wise part, and the projection weight *w*, shared among all locations in the grid, act as the point-wise part.
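The decoupling can be illustrated with a plain depthwise separable convolution sketch (illustrative names; stride 1, no padding):

```python
import numpy as np

def depthwise_separable_conv(x, depth_filters, point_weights):
    """Depthwise separable convolution sketch:
    1) depthwise: each input channel is convolved with its own filter;
    2) pointwise: a 1x1 convolution mixes channels at every location.
    x: (C, H, W); depth_filters: (C, k, k); point_weights: (C_out, C)."""
    c, h, w_ = x.shape
    k = depth_filters.shape[1]
    oh, ow = h - k + 1, w_ - k + 1
    depth_out = np.zeros((c, oh, ow))
    for ch in range(c):                 # per-channel spatial convolution
        for i in range(oh):
            for j in range(ow):
                depth_out[ch, i, j] = np.sum(
                    x[ch, i:i + k, j:j + k] * depth_filters[ch])
    # pointwise 1x1: mix channels, shared across all spatial locations
    return np.einsum('oc,chw->ohw', point_weights, depth_out)
```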

2. Inspired by group convolution (Krizhevsky, Sutskever and Hinton, 2012)

*Group convolution: Split input channels and output channels into groups and apply separate convolution to each group.*

DCNv3 (Wang et al., 2023) proposes splitting the convolution into *G* groups, each with its own offsets Δ*p_gn* and feature amplitudes Δ*m_gn*.

DCNv3 is hence formulated as:

y(p₀) = Σ_{g=1..G} Σ_{pₙ ∈ R} w_g · Δm_gn · x_g(p₀ + pₙ + Δp_gn)

where *G* is the total number of convolution groups, *w_g* is shared across all locations in the grid (location-irrelevant), and Δ*m_gn* is normalized by a softmax function so that its sum over grid *R* is 1.
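Putting the grouped formulation together, a minimal sketch at a single output location (single-channel group maps; the offsets and amplitude logits stand in for the conv layer that would predict them, and names are illustrative):

```python
import numpy as np

def _bilinear(x, py, px):
    # bilinear interpolation for fractional sampling locations
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    y1, x1 = min(y0 + 1, x.shape[0] - 1), min(x0 + 1, x.shape[1] - 1)
    wy, wx = py - y0, px - x0
    return ((1 - wy) * (1 - wx) * x[y0, x0] + (1 - wy) * wx * x[y0, x1]
            + wy * (1 - wx) * x[y1, x0] + wy * wx * x[y1, x1])

def dcnv3_response(xg, wg, offsets, amp_logits, p0):
    """DCNv3 response at output location p0:
    y(p0) = sum_g sum_n wg[g] * Δm_gn * xg[g](p0 + pn + Δp_gn),
    where each group's amplitudes are softmax-normalized so they sum
    to 1 over the grid R. xg: list of G group feature maps;
    offsets: (G, 9, 2); amp_logits: (G, 9) for a 3x3 grid."""
    grid = [(i - 1, j - 1) for i in range(3) for j in range(3)]  # 3x3 R
    y = 0.0
    for g in range(len(xg)):
        e = np.exp(amp_logits[g] - amp_logits[g].max())
        m = e / e.sum()                 # softmax: Δm_gn sums to 1 over R
        for n, (pi, pj) in enumerate(grid):
            dy, dx = offsets[g][n]
            y += wg[g] * m[n] * _bilinear(
                xg[g], p0[0] + pi + dy, p0[1] + pj + dx)
    return y
```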

So far, DCNv3-based InternImage has demonstrated superior performance on multiple downstream tasks such as detection and segmentation, as shown in the table below, as well as on the Papers with Code leaderboards. Refer to the original paper for more detailed comparisons.

In this article, we have reviewed kernel structures for regular convolutional networks, along with their latest improvements, including deformable convolutional networks (DCN) and two newer versions: DCNv2 and DCNv3. We discussed the limitations of traditional structures and highlighted the advancements in innovation built upon previous versions. For a deeper understanding of these models, please refer to the papers in the References section.

Special thanks to Kenneth Leung, who inspired me to create this piece and shared amazing ideas. A huge thank you to Kenneth, Melissa Han, and Annie Liao, who contributed to improving this piece. Your insightful suggestions and constructive feedback have significantly impacted the quality and depth of the content.

Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H. and Wei, Y. (2017). *Deformable Convolutional Networks*. [online] Available at: https://arxiv.org/pdf/1703.06211v3.pdf.

Zhu, X., Hu, H., Lin, S. and Dai, J. (2018). *Deformable ConvNets v2: More Deformable, Better Results*. [online] Available at: https://arxiv.org/pdf/1811.11168.pdf.

Wang, W., Dai, J., Chen, Z., Huang, Z., Li, Z., Zhu, X., Hu, X., Lu, T., Lu, L., Li, H., Wang, X. and Qiao, Y. (2023). *InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions*. [online] Available at: https://arxiv.org/pdf/2211.05778.pdf [Accessed 31 Jul. 2023].

Chollet, F. (2017). *Xception: Deep Learning with Depthwise Separable Convolutions*. [online] Available at: https://arxiv.org/pdf/1610.02357.pdf.

Krizhevsky, A., Sutskever, I. and Hinton, G.E. (2012). ImageNet classification with deep convolutional neural networks. *Communications of the ACM*, 60(6), pp.84–90. doi:https://doi.org/10.1145/3065386.

Dai, J., Li, Y., He, K. and Sun, J. (2016). *R-FCN: Object Detection via Region-based Fully Convolutional Networks*. [online] Available at: https://arxiv.org/pdf/1605.06409v2.pdf.