Machine learning optimization of vector quantization methods used in end-to-end training of neural networks
This post is a short explanation of my paper  published at ICASSP 2023 conference. For more details, please look at the paper under this link.
Vector quantization (VQ) is a data compression technique similar to k-means algorithm which can model any data distribution. Vector quantization has been used in a wide range of applications for speech, image, and video data, such as image generation , speech and audio coding , voice conversion [4,5], music generation , and text-to-speech synthesis [7,8]. The figure below shows how vector quantization (VQ) works. For VQ process, we require a codebook which includes a number of codewords. Applying VQ on a data point (gray dots) means to map it to the closest codeword (blue dots), i.e. replace the value of data point with the closest codeword value. Each voronoi cell (black lines) contains one codeword such that all data points located in that cell will be mapped to that codeword, since it is the closest codeword to data points located in that voronoi cell.
In other words, vector quantization maps the input vector x to the closest codeword within the codebook (CB) using the following formula:
The computational complexity of VQ increases exponentially with the increase in the codebook size (increase in VQ bitrate). Hence, this plain form of VQ is applicable only for limited bitrates (limited codebook sizes). To solve this challenge and apply VQ for higher bitrates and higher dimensional data, we use some variants of VQ such as Residual VQ, Additive VQ, and Product VQ. These methods considers more than one codebook to apply VQ on the data. We will explain these three VQ methods in the following.
Residual VQ quantizes the input vector x by applying M consecutive VQ modules on it. According to the following figure, suppose M=3. We apply the first VQ module on input vector x using the first codebook (CB¹). Then, after finding the closest codeword form first codebook, we calculate the remainder (R1). Afterwards, we pass R1 as input to the next VQ module using the second codebook (CB²). This process will continue for M stages where we would find three closest codeword coming from separate codebooks. At the end, we quantize the input vector x as a summation of M closest codewords.
In a similar way as Residual VQ, Additive VQ quantizes the input vector x by applying M consecutive VQ modules. However, Additive VQ adopts the complex beam searching algorithm to find the closest codewords for the quantization process (you can find the details of beam searching algorithm in this paper ). According to the following figure, we suppose M=3. In Additive VQ, first we search for the closest codeword from the union of all three codebooks (here CB¹, CB², CB³). Then, suppose we find the best codeword from CB². After that, we calculate the residual (R1) and pass it as input to the next VQ module. Since the first codeword is selected from CB², now we search for the closest codeword from the union of CB¹ and CB³. After calculating the residual R2, we pass it as input to the last VQ module, where we do the search using the last codebook (in this case CB¹) which is not yet contributed to the quantization process. At the end, we quantize the input vector x as a summation of M closest codewords.
Product VQ splits the input vector x of dimension D to M independent subspaces of dimension D/M. Then it applies M independent VQ modules to the existing subspaces. At the end, Product VQ quantizes the input vector x as a concatenation of M closest codewords (one per each codebook). The figure below shows the Product VQ when M=3.
Vector quantization (VQ) training means to optimize the codebook(s) such that they model the data distribution in a way that the error of quantization (such as mean squared error) between data points and codebook elements is minimized. To optimize the codebooks for these three above-mentioned variants of VQ (Residual VQ, Additive VQ, and Product VQ) there are different approaches which we will mention in the following.
1. K-means Algorithm (traditional approach):
Based on the literature review, in most of the papers, codebooks for these three VQ methods have been optimized by k-means algorithm.
2. Stochastic Optimization (machine learning algorithms):
Machine learning optimization algorithms are based on gradient calculation. Therefore, it is impossible to optimize vector quantization methods using machine learning optimization, since the argmin function in vector quantization function (first equation above) is not differentiable. In other words, we cannot pass the gradients over vector quantization function in backpropagation. Here we have mentioned two solutions to solve this problem.
2.1. Straight Through Estimator (STE)
STE  solves the problem by simply copying the gradients intactly over VQ module in backpropagation. Hence, it does not consider the influence of vector quantization and leads to a mismatch between the gradients and true behavior of the VQ function.
2.2. Noise Substitution in Vector Quantization (NSVQ):
The NSVQ technique  is our recently proposed method, in which the vector quantization error is simulated by adding noise to the input vector, such that the simulated noise would gain the shape of original VQ error distribution (you can read shortly about NSVQ in this post).
NSVQ technique  has some advantages over STE method  which are listed in the following. 1) NSVQ yields more accurate gradients for VQ function. 2) NSVQ achieves faster convergence for VQ training (codebook optimization). 3) NSVQ does not need any additional hyper-parameter tuning for VQ training (does not require additional loss term for VQ training to be added to the global optimization loss function).
In our paper, we have used our recently proposed NSVQ technique  to optimize three above-mentioned variants of VQ by machine learning optimization. To evaluate the performance of these three VQ methods and study the trade-offs between accuracy, bitrate, and complexity of them, we conducted four different scenarios for experiments. We will explain all these scenarios of experiments in the following.
1. Approximate Nearest Neighbor (ANN) Search
In this experiment, we modeled the distribution of SIFT1M dataset  (128-D image descriptors) by training three VQ methods on its learning set. The SIFT1M image descriptors dataset  includes 10⁶ base vectors, 10⁵ learning vectors, and 10⁴ query vectors for testing purposes. The ground truth contains the set of actual nearest neighbors, from the base vectors to the query vectors. In the ANN search, we first compress the base vectors using the corresponding learnt codebooks trained on the learning set. Then, for each query vector, we find the approximate nearest neighbors from the compressed base vectors by performing an exhaustive search. To assess the quality of data compression, we calculate the recall metric at different values for parameter T, which shows whether the actual nearest neighbor (from groundtruth) exists in the first T computed nearest neighbors. The figure below illustrates the comparison of three variants of VQ optimized by our proposed NSVQ technique with the baseline methods under recall metric. In general, all three machine learning-based optimized VQ methods achieve comparable (even slightly better in case of RVQ) recall values to the baselines.
2. Image Compression using VQ-VAE
In this experiment, we trained a vector quantized variational autoencoder (VQ-VAE) on the training set of CIFAR10 dataset to compress it. To apply the vector quantization in the bottleneck of VQ-VAE, we used each of these three VQ methods. After training, we reconstructed the test images of CIFAR10 using the trained encoder, decoder, and learnt codebooks for each VQ method. To evaluate the quality of reconstructed images, we employ Peak Signal to Noise Ratio (Peak SNR) metric. In addition, we computed the complexity of each VQ method using Weighted Million Operations Per Second (WMOPS) metric, which is under ITU-T standard. The following figure shows the results of this experiment.
According to the complexity figure (in the right), we found that for the same use of computational resources (left vertical red line) and a higher bitrate, Product VQ performs better than Residual VQ. In addition, for the same use of computational resources (right vertical red line) and a higher bitrate, Residual VQ performs better than Additive VQ. Therefore, depending on how much computational resources are available, we can conclude which is the best VQ method to use.
3. Speech Coding
In this experiment, we model the spectral envelope of speech signals by three VQ methods using the speech codec presented in . To evaluate the quality of decoded speech signals, we used perceptual evaluation of speech quality (PESQ) and perceptually weighted signal to noise ratio (pSNR) as objective metrics. The following figure shows the performance of all three VQ methods under PESQ and pSNR criteria. According to the results, we observe that Additive VQ gains higher mean and lower variance than both Residual VQ and Product VQ in both metrics.
4. Toy Examples
In this experiment, we intend to compare the performance of three VQ methods with respect to the correlation in the data. Hence, we prepared two correlated and uncorrelated datasets of dimension 64. Then, we compressed these datasets using these three VQ methods. To evaluate the performance, we computed the mean squared error (MSE) between each dataset and its quantized version. The following figure shows the results for this experiment.
In correlated dataset, since Residual VQ and Additive VQ take the correlation among all data dimensions into account, they have much lower quantization error than Product VQ as expected. On the other hand, Product VQ has better performance than Additive VQ and Residual VQ for uncorrelated data, since there is no correlation among data dimensions and that is exactly what Product VQ presumes.
Using variants of Vector Quantization (VQ) such as Residual VQ, Additive VQ, and Product VQ allows to apply VQ for high bitrates and high dimensional data. These VQ methods have been optimized by classic expectation maximization and k-menas algorithm so far. In this paper, we optimize these VQ methods by machine learning optimization using our recently proposed Noise Substitution in Vector Quantization (NSVQ)  technique. In addition, NSVQ allows end-to-end optimization of VQ methods in Neural Netwroks. We also study the trade-offs between bitrate, accuracy, and complexity of these three VQ methods. Hence, our open-source implementation  helps to make the best choice of VQ method for a particular use-case.
We provide the PyTorch implementation of these VQ methods in the following webpage.