The Fourier Transform
The Fourier Transform is a mathematical tool that helps us deconstruct a complex signal into a series of simpler sine and cosine waves, each characterized by a specific frequency and amplitude. Here is a simplified version of the Fourier Transform:
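In its standard continuous form (written here for reference; x(t) is the time-domain signal and f the frequency), the transform reads:

```latex
\hat{x}(f) = \int_{-\infty}^{+\infty} x(t)\, e^{-2\pi i f t}\, dt
```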
In short, the Fourier Transform tells us that any time series can be decomposed into a continuous sum (the integral in the formula) of elementary sines and cosines with different amplitudes. This is exactly what we are looking for, because sound and its frequency spectrum are very closely related.
This method allows us to “break down” the signal and identify each of its frequency components, providing a more thorough understanding of the overall sound.
The Fourier Transform for a mixture of two cosine signals
To make this clearer, let’s consider our previous example, where we combined two periodic signals with different frequencies. The output, as we observed, was a signal with a modulated amplitude.
The idea behind the Fourier Transform is simply to identify the different components of our combined signal, which can then be represented on a diagram showing the amplitude and frequency of each primary cosine identified.
The resultant diagram showing the primary components of a complex signal with their frequencies and amplitudes is called a spectral diagram, and it contains the “features” composing our signal.
What this diagram tells us is simply that our made-up periodic signal is composed of two “primary” cosine signals:
- One cosine at 2250Hz with an amplitude of 0.5
- One cosine at 5500Hz with an amplitude of 1
The diagram is sufficient on its own to completely describe our main periodic signal: we have jumped from the “time domain” to the “frequency domain”.
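To make this jump to the frequency domain concrete, here is a small sketch using NumPy (the 44.1 kHz sampling rate and 0.1 s frame are made-up choices, picked so both tones land exactly on an FFT bin) that rebuilds the two-cosine signal and reads both amplitudes straight off its spectrum:

```python
import numpy as np

# Made-up sampling setup: 44.1 kHz, 4410 samples (0.1 s),
# so each FFT bin is exactly 10 Hz wide and both tones land on a bin.
fs = 44100
n = 4410
t = np.arange(n) / fs

# The two "primary" cosines from the spectral diagram above.
signal = 0.5 * np.cos(2 * np.pi * 2250 * t) + 1.0 * np.cos(2 * np.pi * 5500 * t)

# Real FFT, scaled by 2/n so bin magnitudes read directly as amplitudes.
spectrum = np.abs(np.fft.rfft(signal)) * 2 / n
freqs = np.arange(len(spectrum)) * fs / n

print(freqs[spectrum > 0.1])           # → [2250. 5500.]
print(round(float(spectrum[225]), 6))  # → 0.5  (the 2250 Hz component)
print(round(float(spectrum[550]), 6))  # → 1.0  (the 5500 Hz component)
```

Only two bins carry any energy: the FFT has recovered exactly the spectral diagram described above.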
Extending the concept with a real-life signal
Our bird call, however, is much more complex than this two-frequency time series. It can contain an infinite number of frequencies mixed together, each contributing to the call’s unique tone color, and its frequency composition can evolve over time.
Instead of applying the Fourier Transform to the whole signal, we will apply it only locally, at a scale that is small enough for the signal to be “regular enough”, but long enough to contain several periods of oscillation.
For example, let’s zoom back in on the signal at t=0.0918s and t=0.229s and have a look at the spectral diagrams. The obtained Fourier Transforms are this time continuous, but they peak at certain frequencies, which match the calculations made in the previous chapter of this article.
Secondly, we can determine in more detail the composition of each portion of the signal. In particular, we see that the second slice contains multiple frequency peaks and is “richer” from a harmonic point of view, giving us new information about the “color” we talked about earlier.
Applying the Fourier Transform to a sub-part of the signal, as we did above, is usually referred to as the Short-Time Fourier Transform (STFT). It is a powerful tool that will be particularly useful to describe the sound locally and follow its evolution over time.
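As an illustration, a single STFT frame amounts to slicing the signal around a timestamp, applying a window, and taking the FFT of that slice. Here is a minimal sketch on a synthetic signal whose frequency changes halfway through (the signal, timestamps, and frame length are all made up for the example):

```python
import numpy as np

# Synthetic 1 s signal at 22.05 kHz whose content changes halfway
# through: a 1 kHz tone, then a 3 kHz tone (all values made up).
fs = 22050
t = np.arange(fs) / fs
x = np.where(t < 0.5,
             np.cos(2 * np.pi * 1000 * t),
             np.cos(2 * np.pi * 3000 * t))

def local_spectrum(x, fs, center_s, n_fft=2048):
    """FFT of a short, Hann-windowed slice centered on a timestamp."""
    c = int(center_s * fs)
    frame = x[c - n_fft // 2 : c + n_fft // 2] * np.hanning(n_fft)
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.arange(len(mag)) * fs / n_fft
    return freqs, mag

f1, m1 = local_spectrum(x, fs, 0.25)   # a slice in the first half
f2, m2 = local_spectrum(x, fs, 0.75)   # a slice in the second half
print(f1[np.argmax(m1)])   # peak close to 1000 Hz
print(f2[np.argmax(m2)])   # peak close to 3000 Hz
```

Each local spectrum peaks at the frequency that dominates that portion of the signal, which is exactly the “local description” we were after.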
From STFT to Spectrogram
We now have a tool that can identify the different primary components (amplitudes/frequencies) of a slice of a temporal signal locally. We can apply this method to the whole signal using a sliding window, which will extract the features of the sound over time. Note that instead of showing the spectral diagram as a scatter plot, we will now represent it as a heatmap, with the frequency axis displayed vertically and each pixel representing the intensity of that frequency.
Using this representation, we can now stack horizontally the STFTs calculated using the rolling window on the entire signal and visualize the evolution of the frequency spectrum over time through an image. The generated figure is called a spectrogram.
In the spectrogram above, each column of pixels represents the STFT of a small portion of the signal centered on a given timestamp.
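A bare-bones version of this sliding-window stacking can be sketched as follows (the frame and hop sizes are arbitrary choices for the example; in practice, libraries such as scipy.signal or librosa provide tuned implementations):

```python
import numpy as np

def spectrogram(x, fs, n_fft=1024, hop=256):
    """Stack magnitude STFTs of successive Hann-windowed frames into
    a (frequency x time) array: each column is one STFT 'slice'."""
    window = np.hanning(n_fft)
    frames = [x[s:s + n_fft] * window
              for s in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)).T   # rows = frequency bins

fs = 8000
t = np.arange(2 * fs) / fs                    # 2 s of synthetic signal
x = np.cos(2 * np.pi * (500 + 200 * t) * t)   # an upward-sweeping chirp
S = spectrogram(x, fs)
print(S.shape)   # → (513, 59): 513 frequency bins x 59 time frames
```

Rendering S as a heatmap (e.g. with matplotlib’s imshow) produces exactly the kind of image described above: for this chirp, the bright ridge climbs upward as the frequency sweeps over time.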
There are many types of spectrograms, with different scales and different hyper-parameters for the time-frequency transformation. Among them, we can mention the Log-Frequency Power Spectrogram, the Constant-Q Transform (CQT), the Mel Spectrogram, etc. Each has its own subtleties, but all work on the same basis: extracting the frequency features and representing them in a (time × frequency) heatmap that can be interpreted as an image.
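One detail shared by most of these variants is that the raw magnitudes are usually converted to a logarithmic (decibel) scale before display, since sound intensity is perceived logarithmically. A minimal sketch of that conversion (the -80 dB clipping floor is a common but arbitrary choice):

```python
import numpy as np

def to_db(S, floor_db=-80.0):
    """Convert a magnitude spectrogram to decibels relative to its
    peak, clipped to a floor so silence doesn't go to -infinity."""
    mag = np.maximum(S, 1e-10)          # avoid log10(0)
    db = 20 * np.log10(mag / mag.max())
    return np.maximum(db, floor_db)

S = np.array([[1.0, 0.1],
              [0.01, 0.0]])
print(to_db(S))
# peak → 0 dB, 0.1 → -20 dB, 0.01 → -40 dB, 0 → clipped to -80 dB
```

This compresses the huge dynamic range of real sounds into an image whose contrast is actually readable, both by humans and by a downstream model.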
A few examples
The advantage of the spectrogram is that it condenses all the important features of the sound into one simple image. Analyzing this image tells you about variations over time in the amplitude, pitch, and color of a sound, which is exactly what we (or an ML/DL algorithm) need to recognize its emitter.
Let’s have a look at a few sounds of 5s duration with their associated spectrograms.
Why Spectrograms + CNNs outperform LSTMs on sound classification
If you have never worked on a sound classification system before, you might have considered using a recurrent neural network like an LSTM to extract the relevant features directly from the sound time series.
This would be a bad idea: even though those models are designed to extract temporal dependencies, they are not efficient at extracting frequency features which, as we saw, are crucial for the sound identification task. LSTMs are also computationally expensive and inefficient by nature, because the data is processed sequentially, meaning far less data can be processed in a given amount of time compared to a standard CNN.
Converting the time series into a spectrogram, on the other hand, gives us a visual representation of frequency information over time, which allows us to use Convolutional Neural Networks (CNNs). CNNs are designed for image data and are very effective at capturing spatial patterns, which in the case of a spectrogram correspond to frequency patterns over time. This conversion can be seen as a natural “feature engineering” step, guaranteeing better efficiency for the sound classification task.