The concept of Depthwise Separable Convolutions (DSC) was first proposed by Laurent Sifre in his PhD thesis, *Rigid-Motion Scattering For Image Classification*. Since then, DSCs have been used successfully in various popular deep convolutional networks such as Xception and MobileNet.

The main difference between a regular convolution and a DSC is that a DSC is composed of two convolutions, as described below:

- A **depthwise grouped convolution**, where the number of input channels *m* is equal to the number of output channels, so that each output channel is affected by only a single input channel. In PyTorch this is called a “grouped” convolution; you can read more about grouped convolutions in the PyTorch documentation.
- A **pointwise convolution** (filter size = 1), which operates like a regular convolution: each of the *n* filters operates on all *m* input channels to produce a single output value.
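To make the grouped-convolution behavior concrete, here is a minimal sketch (the shapes are arbitrary choices, not values from this article): with `groups` equal to the number of input channels, each output channel is a function of exactly one input channel.

```python
import torch
import torch.nn as nn

# A depthwise (grouped) convolution: groups == in_channels, so each output
# channel depends on exactly one input channel.
m = 4
depthwise = nn.Conv2d(m, m, kernel_size=3, padding=1, groups=m, bias=False)

x = torch.randn(1, m, 8, 8)
y = depthwise(x)
print(y.shape)  # torch.Size([1, 4, 8, 8])

# Zeroing one input channel affects only the corresponding output channel.
x2 = x.clone()
x2[:, 0] = 0
y2 = depthwise(x2)
print(torch.equal(y[:, 1:], y2[:, 1:]))  # True: channels 1..3 are untouched
```

Note that output channel 0 of `y2` is exactly zero, since with `bias=False` it is a function of the zeroed input channel alone.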

Let’s repeat the exercise we did for regular convolutions and compute the number of trainable parameters and computations for DSCs.

**Evaluation Of Trainable Parameters:** The “grouped” convolution has *m* filters, each with *dₖ x dₖ* learnable parameters, producing *m* output channels. This results in a total of *m x dₖ x dₖ* learnable parameters. The pointwise convolution has *n* filters of size *m x 1 x 1*, which adds up to *n x m x 1 x 1* learnable parameters. Let’s look at the PyTorch code below to validate our understanding.

```python
class DepthwiseSeparableConv(nn.Sequential):
    def __init__(self, chin, chout, dk):
        super().__init__(
            # Depthwise convolution
            nn.Conv2d(chin, chin, kernel_size=dk, stride=1,
                      padding=(dk - 1) // 2,  # "same" padding for odd kernel sizes
                      bias=False, groups=chin),
            # Pointwise convolution
            nn.Conv2d(chin, chout, kernel_size=1, bias=False),
        )

conv2 = DepthwiseSeparableConv(chin=m, chout=n, dk=dk)
print(f"Expected number of parameters: {m * dk * dk + m * 1 * 1 * n}")
print(f"Actual number of parameters: {num_parameters(conv2)}")
```

Which will print:

```
Expected number of parameters: 656
Actual number of parameters: 656
```

We can see that the DSC version has roughly *7x* fewer parameters. Next, let’s turn our attention to the computational cost of a DSC layer.
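As a quick sanity check on the *7x* figure, we can compare the two parameter formulas directly. The values *m = 16*, *n = 32*, *dₖ = 3* are an assumption here, chosen to be consistent with the 656 printed above:

```python
m, n, dk = 16, 32, 3  # assumed values, consistent with the 656 printed above

regular_params = n * m * dk * dk   # a regular conv: 32 * 16 * 9 = 4608
dsc_params = m * dk * dk + n * m   # depthwise + pointwise: 144 + 512 = 656
print(regular_params / dsc_params)  # ~7.02, i.e. roughly 7x fewer parameters
```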

**Evaluation Of Computational Cost:** Let’s assume our input has spatial dimensions *m x h x w*. In the grouped convolution segment of DSC, we have **m** filters, each of size *dₖ x dₖ*. A filter is applied to its corresponding input channel, resulting in a segment cost of *m x dₖ x dₖ x h x w*. For the pointwise convolution, we apply **n** filters of size *m x 1 x 1* to produce **n** output channels. This results in a segment cost of *n x m x 1 x 1 x h x w*. We need to add up the costs of the grouped and pointwise operations to compute the total cost. Let’s go ahead and validate this using the torchinfo PyTorch package.

```python
print(f"Expected total multiplies: {m * dk * dk * h * w + m * 1 * 1 * h * w * n}")
s2 = summary(conv2, input_size=(1, m, h, w))
print(f"Actual multiplies: {s2.total_mult_adds}")
print(s2)
```

Which will print:

```
Expected total multiplies: 10747904
Actual multiplies: 10747904
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
DepthwiseSeparableConv                   [1, 32, 128, 128]         --
├─Conv2d: 1-1                            [1, 16, 128, 128]         144
├─Conv2d: 1-2                            [1, 32, 128, 128]         512
==========================================================================================
Total params: 656
Trainable params: 656
Non-trainable params: 0
Total mult-adds (M): 10.75
==========================================================================================
Input size (MB): 1.05
Forward/backward pass size (MB): 6.29
Params size (MB): 0.00
Estimated Total Size (MB): 7.34
==========================================================================================
```

Let’s compare the sizes and costs of both the convolutions for a few examples to gain some intuition.

## Size and Cost comparison for regular and depthwise separable convolutions

To compare the size and cost of regular and depthwise separable convolution, we will assume an input size of *128 x 128* to the network, a kernel size of *3 x 3*, and a network that progressively halves the spatial dimensions and doubles the number of channel dimensions. We assume a single 2d-conv layer at every step, but in practice, there could be more.

You can see that, on average, both the size and the computational cost of a DSC are about 11% to 12% of those of regular convolutions for the configuration mentioned above.
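That percentage can also be derived analytically: per layer, both the parameter ratio and the multiply ratio of DSC to regular convolution reduce to *1/n + 1/dₖ²*, which approaches *1/dₖ² ≈ 11.1%* for *dₖ = 3* as the channel count grows. A short sketch (the starting channel width of 16 is an assumption, not a value from the article):

```python
dk = 3

# Channels double at each stage while spatial dims halve; the DSC/regular
# ratio is independent of the spatial size, so only chout and dk matter.
ratios = []
for chin in (16, 32, 64, 128, 256):
    chout = 2 * chin
    regular = chout * chin * dk * dk      # regular conv params (and cost, per pixel)
    dsc = chin * dk * dk + chout * chin   # depthwise + pointwise
    ratios.append(dsc / regular)          # equals 1/chout + 1/dk**2
    print(f"{chin:>3} -> {chout:<3}: DSC/regular = {dsc / regular:.1%}")
```

The per-layer ratios shrink from about 14% toward 11% as the channel count doubles, consistent with the 11%–12% average quoted above.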

Now that we have developed a good understanding of the types of convolutions and their relative costs, you must be wondering if there’s any downside of using DSCs. Everything we’ve seen so far seems to suggest that they are better in every way! Well, we haven’t yet considered an important aspect which is the impact they have on the accuracy of our model. Let’s dive into it via an experiment below.