The method of teaching a model to perform this denoising process may seem counter-intuitive at first. The model learns to denoise a signal by doing the exact opposite: adding noise to a clean signal over and over again until only noise remains. The idea is that if the model can learn to predict the noise added to a signal at each step, then it can also predict the noise to remove at each step of the reverse process. The critical element that makes this possible is that the noise being added/removed must be drawn from a defined probability distribution (typically Gaussian) so that the noising/denoising steps are predictable and repeatable.
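To make this concrete, here is a minimal numpy sketch of the forward (noising) process. It uses the closed-form expression for jumping straight to step t that DDPM-style models rely on; the linear beta schedule and the function name `forward_noise` are illustrative choices, not anything specific to a particular implementation.

```python
import numpy as np

def forward_noise(x0, t, alpha_bar):
    """Jump straight to noising step t using the closed form of q(x_t | x_0).

    x0        : clean signal, shape (n_samples,)
    t         : integer timestep index
    alpha_bar : cumulative product of (1 - beta) over the schedule
    """
    eps = np.random.randn(*x0.shape)  # Gaussian noise to be added
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps  # eps is what the model learns to predict

# A common linear beta schedule (values here are illustrative)
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

x0 = np.sin(np.linspace(0, 8 * np.pi, 1024))  # toy "clean" signal
xt, eps = forward_noise(x0, t=999, alpha_bar=alpha_bar)
# By the final step, alpha_bar is near zero, so x_t is almost pure noise
```

Because the noise at each step is Gaussian with known variance, a model trained to predict `eps` from `xt` can be run in reverse, subtracting predicted noise step by step to recover a clean signal.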
There is far more detail that goes into this process, but this should provide a sound conceptual understanding of what is happening under the hood. If you are interested in learning more about diffusion models (mathematical formulations, scheduling, latent space, etc.), I recommend reading this blog post by AssemblyAI and these papers (DDPM, Improving DDPM, DDIM, Stable Diffusion).
Understanding Audio for Machine Learning
My interest in diffusion stems from the potential that it has shown with generative audio. Traditionally, to train ML algorithms, audio was converted into a spectrogram, which is basically a heatmap of sound energy over time. This was because a spectrogram representation was similar to an image, which computers are exceptional at working with, and it was a significant reduction in data size compared to a raw waveform.
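A spectrogram is built by slicing the waveform into short overlapping windows and taking the magnitude of each window's Fourier transform. The numpy sketch below (function name and window/hop sizes are my own illustrative choices) shows both the image-like result and the data reduction relative to the raw waveform:

```python
import numpy as np

def spectrogram(x, n_fft=512, hop=256):
    """Magnitude spectrogram: window the waveform, FFT each frame,
    and stack the magnitudes into a frequency-by-time heatmap."""
    window = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * window
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)).T  # (freq, time)

sr = 44100
t = np.arange(sr) / sr               # 1 second of audio
x = np.sin(2 * np.pi * 440 * t)      # a 440 Hz tone
S = spectrogram(x)
# S has n_fft // 2 + 1 = 257 frequency bins; note that np.abs
# discards the phase of each complex FFT value
```

The energy of the 440 Hz tone shows up as a single bright horizontal band, which is exactly the kind of 2D structure image-oriented models handle well.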
However, this transformation comes with some tradeoffs, including a reduction in resolution and a loss of phase information. The phase of an audio signal represents the position of multiple waveforms relative to one another. This can be demonstrated by the difference between a sine and a cosine function. They carry exactly the same amplitude information; the only difference is a 90° (π/2 radian) phase shift between the two. For a more in-depth explanation of phase, check out this video by Akash Murthy.
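The sine/cosine relationship is easy to verify numerically; a few lines of numpy confirm that a cosine is just a sine shifted by π/2, with identical energy:

```python
import numpy as np

t = np.linspace(0, 1, 1000, endpoint=False)  # 1 s at 1000 samples/s
f = 5  # Hz, so the window holds exactly 5 full periods

sine = np.sin(2 * np.pi * f * t)
cosine = np.cos(2 * np.pi * f * t)

# A cosine is a sine advanced by 90 degrees (pi/2 radians)
shifted = np.sin(2 * np.pi * f * t + np.pi / 2)
assert np.allclose(cosine, shifted)

# Both signals carry identical energy, differing only in phase
assert np.isclose(np.sum(sine ** 2), np.sum(cosine ** 2))
```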
Phase is a perpetually challenging concept to grasp, even for those who work in audio, but it plays a critical role in creating the timbral qualities of sound. Suffice it to say that it should not be discarded so easily. Phase information can also technically be represented in spectrogram form (the complex portion of the transform), just like magnitude. However, the result is noisy and visually appears random, making it challenging for a model to learn any useful information from it. Because of this drawback, there has been recent interest in refraining from transforming audio into spectrograms and rather leaving it as a raw waveform to train models. While this brings its own set of challenges, both the amplitude and phase information are contained within the single signal of a waveform, providing a model with a more holistic picture of sound to learn from.
This is a key piece of my interest in waveform diffusion, which has shown promise in yielding high-quality results for generative audio. Waveforms, however, are very dense signals, requiring a significant amount of data to represent the range of frequencies humans can hear. For example, the music industry standard sampling rate is 44.1kHz, meaning 44,100 samples are required to represent just 1 second of mono audio. Now double that for stereo playback. Because of this, most waveform diffusion models (that don’t leverage latent diffusion or other compression methods) require high GPU capacity (usually 16GB+ of VRAM) to store all of the information while being trained.
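The sample counts add up quickly; a quick back-of-the-envelope calculation (the helper function is just for illustration) shows why raw waveforms are so memory-hungry:

```python
# Raw waveform data requirements at the music-industry standard rate
SAMPLE_RATE = 44100  # samples per second (44.1 kHz)

def n_samples(seconds, channels=1, sample_rate=SAMPLE_RATE):
    """Total number of samples needed for a clip of the given length."""
    return int(seconds * sample_rate * channels)

print(n_samples(1))                # 44100  (1 s, mono)
print(n_samples(1, channels=2))    # 88200  (1 s, stereo)
print(n_samples(180, channels=2))  # 15876000 samples for a 3-minute stereo song
```

At 32-bit floats, that 3-minute stereo song is over 60 MB of raw input before any model activations are even computed.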
Many people do not have access to high-powered, high-capacity GPUs, or do not want to pay the fee to rent cloud GPUs for personal projects. Finding myself in this position, but still wanting to explore waveform diffusion models, I decided to develop a waveform diffusion system that could run on my meager local hardware.
I was equipped with an HP Spectre laptop from 2017 with an 8th Gen i7 processor and GeForce MX150 graphics card with 2GB VRAM — not what you would call a powerhouse for training ML models. My goal was to be able to create a model that could train and produce high-quality (44.1kHz) stereo outputs on this system.
The base model architecture consists of a U-Net with attention blocks, which is standard for modern diffusion models. A U-Net is a neural network that was originally developed for image (2D) segmentation but has been adapted here to audio (1D) for waveform diffusion. The U-Net architecture gets its name from its U-shaped design.
A U-Net is very similar to an autoencoder, consisting of an encoder and a decoder, but it also contains skip connections at each level of the network. These skip connections are direct connections between corresponding layers of the encoder and decoder, facilitating the transfer of fine-grained details from the encoder to the decoder. The encoder is responsible for capturing the important features of the input signal, while the decoder is responsible for generating the new audio sample. The encoder gradually reduces the resolution of the input audio, extracting features at different levels of abstraction. The decoder then takes these features and upsamples them, gradually increasing the resolution to generate the final audio sample.
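The encode/decode/skip pattern can be sketched in a few lines. This toy version uses average pooling and nearest-neighbour upsampling in place of learned convolutions (so it only demonstrates the data flow, not actual learning), but the structure mirrors a real 1D U-Net:

```python
import numpy as np

def down(x):
    """Encoder step: halve the resolution (simple average pooling here)."""
    return x.reshape(-1, 2).mean(axis=1)

def up(x):
    """Decoder step: double the resolution (nearest-neighbour upsampling)."""
    return np.repeat(x, 2)

def toy_unet(x, depth=3):
    """Shape-only sketch of a 1D U-Net: descend, then ascend,
    merging each encoder activation back in via a skip connection."""
    skips = []
    for _ in range(depth):        # encoder: downsample, remember features
        skips.append(x)
        x = down(x)
    for _ in range(depth):        # decoder: upsample, merge skip features
        x = up(x) + skips.pop()   # skip connection restores fine detail
    return x

x = np.random.randn(1024)
y = toy_unet(x)
assert y.shape == x.shape  # output resolution matches the input
```

In a real model each `down`/`up` step is a stack of convolutions with learned weights, and the skips are typically concatenated rather than added, but the U shape is the same.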
This U-Net also contains self-attention blocks at its lower levels, which help maintain the temporal consistency of the output. It is critical that the audio be downsampled sufficiently to keep sampling efficient during the diffusion process and to avoid overloading the attention blocks. The model leverages V-Diffusion, a diffusion technique inspired by DDIM sampling.
To avoid running out of GPU VRAM, the audio clips that the base model was trained on needed to be short. Because of this, I decided to train on one-shot drum samples due to their inherently short context lengths. After many iterations, the base model length was set to 32,768 samples @ 44.1kHz in stereo, which works out to approximately 0.75 seconds. This may seem particularly short, but it is plenty of time for most drum samples.
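The arithmetic behind that context length (a power of two, which keeps repeated halving in the U-Net clean):

```python
SAMPLE_RATE = 44100
context_length = 32768  # samples per training example, 2 ** 15
duration = context_length / SAMPLE_RATE
print(f"{duration:.3f} s")  # 0.743 s, i.e. roughly 0.75 seconds
```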
To downsample the audio enough for the attention blocks, several pre-processing transforms were attempted. The hope was that if the audio data could be downsampled without losing significant information prior to training the model, then the number of nodes (neurons) and layers could be maximized without increasing the GPU memory load.
The first transform attempted was a version of “patching”. Originally proposed for images, this process was adapted to audio for our purposes. The input audio sample is grouped by sequential time steps into chunks that are then transposed into channels. This process could then be reversed at the output of the U-Net to un-chunk the audio back to its full length. The un-chunking process created aliasing issues, however, resulting in undesirable high frequency artifacts in the generated audio.
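The patching operation itself is just a reshape; a numpy sketch (with illustrative function names and a patch size of 4) shows how time steps are folded into channels and perfectly recovered on the way back out — the aliasing arises from how the model processes the chunked signal, not from the reshape itself:

```python
import numpy as np

def patch(x, p):
    """Group p sequential samples into the channel dimension:
    (channels, length) -> (channels * p, length // p)."""
    c, n = x.shape
    return x.reshape(c, n // p, p).transpose(0, 2, 1).reshape(c * p, n // p)

def unpatch(x, p):
    """Invert patch(): fold the channel groups back into time."""
    cp, m = x.shape
    c = cp // p
    return x.reshape(c, p, m).transpose(0, 2, 1).reshape(c, m * p)

audio = np.random.randn(2, 32768)               # stereo training clip
chunked = patch(audio, 4)
assert chunked.shape == (8, 8192)               # 4x shorter, 4x more channels
assert np.allclose(unpatch(chunked, 4), audio)  # reshape round-trips exactly
```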
The second transform attempted, proposed by Schneider, is a “Learned Transform”, which consists of single convolutional blocks with large kernel sizes and strides at the start and end of the U-Net. Multiple kernel sizes and strides were attempted (16, 32, 64), coupled with accompanying model variations to appropriately downsample the audio. Again, however, this resulted in aliasing in the generated audio, though not as prevalent as with the patching transform.
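A strided convolution downsamples by its stride factor: each output sample is a weighted sum over `kernel_size` inputs, and the window hops `stride` samples at a time. This minimal numpy version (random weights standing in for learned ones, no padding) shows the length reduction:

```python
import numpy as np

def strided_conv1d(x, kernel, stride):
    """Single-channel 1D convolution with a large kernel and stride:
    slide the kernel along x, hopping `stride` samples per output."""
    k = len(kernel)
    out = [x[i:i + k] @ kernel for i in range(0, len(x) - k + 1, stride)]
    return np.array(out)

x = np.random.randn(32768)
kernel = np.random.randn(64) * 0.1  # stand-in for learned weights
y = strided_conv1d(x, kernel, stride=64)
# Length drops by roughly the stride factor: 32768 -> 512 outputs here
assert len(y) == (len(x) - len(kernel)) // 64 + 1
```

The matching transposed convolution at the output of the U-Net upsamples back to full length, and it is this upsampling stage where aliasing artifacts tend to appear.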
Because of this, I decided that the model architecture would need to be adjusted to accommodate the raw audio with no pre-processing transforms to produce sufficient quality outputs.
This required extending the number of layers within the U-Net to avoid downsampling too quickly and losing important features along the way. After multiple iterations, the best architecture resulted in downsampling by only 2 at each layer. While this required a reduction in the number of nodes per layer, it ultimately produced the best results. Detailed information about the exact number of U-Net levels, layers, nodes, attention features, etc. can be found in the configuration file in the tiny-audio-diffusion repository on GitHub.
I trained 4 separate unconditional models to produce kicks, snare drums, hi-hats, and percussion (all drum sounds). The datasets used for training were small collections of free one-shot samples that I had gathered for my music production workflows (all open-source). Larger, more varied datasets would improve the quality and diversity of each model’s generated outputs. The models were trained for varying numbers of steps and epochs depending on the size of each dataset.
Overall, the quality of the output is quite high in spite of the models’ reduced size. However, there is still some slight high-frequency “hiss” remaining, likely due to the limited size of the model. This can be seen in the small amount of noise remaining in the waveforms below. Most generated samples are crisp, maintaining transients and broadband timbral characteristics. Sometimes the models add extra noise toward the end of the sample, likely a cost of the limited number of layers and nodes in the model.
Listen to some output samples from the models here. Example outputs from each model are shown below.