Spoken language recognition on Mozilla Common Voice — Audio Transformations. | by Sergey Vilov | Aug, 2023


Photo by Kelly Sikkema on Unsplash

This is the third article on spoken language recognition based on the Mozilla Common Voice dataset. In Part I, we discussed data selection and data preprocessing and in Part II we analysed performance of several neural network classifiers.

The final model achieved 92% accuracy and 97% pairwise accuracy. Since this model suffers from somewhat high variance, the accuracy could potentially be improved by adding more data. One very common way to get extra data is to synthesize it by performing various transformations on the available dataset.

In this article, we will consider 5 popular transformations for audio data augmentation: adding noise, changing speed, changing pitch, time masking, and cut & splice.

The tutorial notebook can be found here.

For illustration purposes, we will use the sample common_voice_en_100040 from the Mozilla Common Voice (MCV) dataset. It contains the sentence "The burning fire had been extinguished."

import numpy as np #needed for noise generation and masking below
import librosa as lr
import IPython.display

signal, sr = lr.load('./transformed/common_voice_en_100040.wav', res_type='kaiser_fast') #load signal

IPython.display.Audio(signal, rate=sr)

Original sample common_voice_en_100040 from MCV.
Original signal waveform (image by the author)

Adding noise is the simplest audio augmentation. The amount of noise is characterised by the signal-to-noise ratio (SNR), defined here as the ratio of the maximum signal amplitude to the standard deviation of the noise. We will generate several noise levels, each specified by its SNR, and see how they change the signal.

SNRs = (5,10,100,1000) #Signal-to-noise ratio: max amplitude over noise std

noisy_signal = {}

for snr in SNRs:

    noise_std = max(abs(signal))/snr #get noise std
    noise = noise_std*np.random.randn(len(signal),) #generate noise with given std

    noisy_signal[snr] = signal+noise #store the noisy copy for this SNR

IPython.display.display(IPython.display.Audio(noisy_signal[5], rate=sr))
IPython.display.display(IPython.display.Audio(noisy_signal[1000], rate=sr))

Signals obtained by superimposing noise with SNR=5 and SNR=1000 on the original MCV sample common_voice_en_100040 (generated by the author).
Signal waveform for several noise levels (image by the author)

So, SNR=1000 sounds almost like the unperturbed audio, while at SNR=5 one can only distinguish the strongest parts of the signal. In practice, the SNR level is a hyperparameter that depends on the dataset and the chosen classifier.
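Since the SNR is a hyperparameter, it may be convenient to wrap the noise step in a function and draw the SNR at random for each training sample. The snippet below is a minimal sketch under that assumption: the names add_noise and random_snr, the log-uniform sampling, and the synthetic test tone are illustrative choices, not part of the original notebook.

```python
import numpy as np

def add_noise(signal, snr, rng=None):
    """Add Gaussian noise; snr = max |signal| / noise std (the definition used above)."""
    rng = np.random.default_rng() if rng is None else rng
    noise_std = np.max(np.abs(signal)) / snr
    noise = noise_std * rng.standard_normal(len(signal))
    return signal + noise

def random_snr(low=5.0, high=1000.0, rng=None):
    """Draw an SNR log-uniformly in [low, high] (an assumed sampling scheme)."""
    rng = np.random.default_rng() if rng is None else rng
    return float(np.exp(rng.uniform(np.log(low), np.log(high))))

# Synthetic stand-in for a speech sample: a 440 Hz tone with amplitude 0.5
rng = np.random.default_rng(0)
t = np.arange(22050) / 22050
tone = 0.5 * np.sin(2 * np.pi * 440 * t)

noisy = add_noise(tone, snr=10, rng=rng)
measured_std = np.std(noisy - tone)  # should be close to 0.5 / 10

snr = random_snr(rng=rng)
augmented = add_noise(tone, snr, rng=rng)
```

Sampling on a log scale spreads the draws evenly between "barely audible" and "very noisy" settings; whether that helps depends on the dataset and the classifier.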

The simplest way to change the speed is to pretend that the signal has a different sampling rate. However, this also changes the pitch (how low or high in frequency the audio sounds): increasing the sampling rate makes the voice sound higher. To illustrate this, we will “increase” the sampling rate of our example by a factor of 1.5:

IPython.display.Audio(signal, rate=sr*1.5)
Signal obtained by using a false sampling rate for the original MCV sample common_voice_en_100040 (generated by the author).
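The effect of a false sampling rate can be checked with simple arithmetic: the duration shrinks by the rate ratio and every frequency grows by the same ratio. The sketch below uses a synthetic 220 Hz tone instead of the MCV sample; the helper name playback_properties is hypothetical.

```python
import numpy as np

sr = 22050          # true sampling rate
freq = 220.0        # frequency of the synthetic tone, Hz
t = np.arange(2 * sr) / sr
tone = np.sin(2 * np.pi * freq * t)   # 2 seconds of audio

def playback_properties(signal, true_sr, playback_sr, true_freq):
    """Duration and frequency heard when the samples are played at playback_sr."""
    duration = len(signal) / playback_sr
    heard_freq = true_freq * playback_sr / true_sr
    return duration, heard_freq

new_duration, new_freq = playback_properties(tone, sr, 1.5 * sr, freq)
semitones = 12 * np.log2(new_freq / freq)   # pitch change in semitones

print(f"duration: {new_duration:.2f} s, frequency: {new_freq:.0f} Hz, "
      f"shift: {semitones:.2f} semitones")
```

With a ratio of 1.5, the 2-second tone lasts about 1.33 seconds and sounds roughly 7 semitones higher.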

Changing the speed without affecting the pitch is more challenging. One needs to use the Phase Vocoder (PV) algorithm. In brief, the input signal is first split into overlapping frames. Then, the spectrum within each frame is computed by applying the Fast Fourier Transform (FFT). The playback speed is then modified by resynthesizing the frames at a different rate. Since the frequency content of each frame is not affected, the pitch remains the same. The PV interpolates between the frames and uses the phase information to achieve smoothness.
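The steps above can be sketched in plain NumPy. This is a simplified illustration, not the implementation used in this article: the function names (stft, istft, phase_vocoder), the Hann window, and the frame and hop sizes are all assumptions chosen for brevity.

```python
import numpy as np

def stft(x, n_fft=1024, hop=256):
    """Split x into overlapping windowed frames and take the FFT of each."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)           # shape: (n_frames, n_bins)

def istft(S, n_fft=1024, hop=256):
    """Overlap-add resynthesis with window-squared normalisation."""
    window = np.hanning(n_fft)
    out = np.zeros(n_fft + hop * (S.shape[0] - 1))
    norm = np.zeros_like(out)
    frames = np.fft.irfft(S, n=n_fft, axis=1)
    for i in range(S.shape[0]):
        out[i * hop:i * hop + n_fft] += frames[i] * window
        norm[i * hop:i * hop + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-8)

def phase_vocoder(S, rate, n_fft=1024, hop=256):
    """Resample frames in time by `rate` (>1 is faster) while keeping phase coherent."""
    n_frames, n_bins = S.shape
    steps = np.arange(0, n_frames - 1, rate)                # fractional frame positions
    omega = 2 * np.pi * hop * np.arange(n_bins) / n_fft     # expected phase advance per hop
    phase_acc = np.angle(S[0])
    out = np.empty((len(steps), n_bins), dtype=complex)
    for k, step in enumerate(steps):
        i = int(step)
        frac = step - i
        # Interpolate magnitudes between the two neighbouring frames
        mag = (1 - frac) * np.abs(S[i]) + frac * np.abs(S[i + 1])
        out[k] = mag * np.exp(1j * phase_acc)
        # Measured phase advance, with its deviation from omega wrapped to [-pi, pi]
        dphase = np.angle(S[i + 1]) - np.angle(S[i]) - omega
        dphase -= 2 * np.pi * np.round(dphase / (2 * np.pi))
        phase_acc = phase_acc + omega + dphase
    return out

# Check on a synthetic 440 Hz tone: shorter signal, same dominant frequency
sr = 22050
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
rate = 1.3
S_fast = phase_vocoder(stft(tone), rate)
tone_fast = istft(S_fast)
peak_hz = np.argmax(np.abs(np.fft.rfft(tone_fast))) * sr / len(tone_fast)
```

Note that S_fast is already a frame-by-frequency array, so spectrogram features could be taken from np.abs(S_fast) directly, without calling istft.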

For our experiments, we will use the stretch_wo_loop time stretching function from this PV implementation.

stretching_factor = 1.3

signal_stretched = stretch_wo_loop(signal, stretching_factor)
IPython.display.Audio(signal_stretched, rate=sr)

Signal obtained by varying the speed of the original MCV sample common_voice_en_100040 (generated by the author).
Signal waveform after speed increase (image by the author)

So, the duration of the signal decreased since we increased the speed. However, one can hear that the pitch has not changed. Note that when the stretching factor is substantial, the phase interpolation between frames might not work well. As a result, echo artefacts may appear in the transformed audio.

To alter the pitch without affecting the speed, we can use the same PV time stretch but pretend that the signal has a different sampling rate such that the total duration of the signal stays the same:

IPython.display.Audio(signal_stretched, rate=sr/stretching_factor)
Signal obtained by varying pitch of the original MCV sample common_voice_en_100040 (generated by the author).
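To see how large the resulting pitch change is, the stretching factor can be converted to semitones. The sketch below assumes the convention used above, where a stretching factor above 1 shortens the signal; the helper names are hypothetical.

```python
import math

def pitch_shift_semitones(stretching_factor):
    """Pitch change when a PV-stretched signal is played at sr / stretching_factor.

    Lowering the playback rate by stretching_factor restores the original
    duration and scales all frequencies by 1 / stretching_factor.
    """
    return 12 * math.log2(1 / stretching_factor)

def stretching_factor_for(semitones):
    """Stretching factor needed for a desired pitch shift in semitones."""
    return 2 ** (-semitones / 12)

shift = pitch_shift_semitones(1.3)      # about -4.5 semitones (lower pitch)
factor_up_2 = stretching_factor_for(2)  # below 1: slow down first, then play faster

print(f"factor 1.3 -> {shift:.2f} semitones; +2 semitones -> factor {factor_up_2:.3f}")
```

So the example above, with a stretching factor of 1.3, lowers the voice by roughly four and a half semitones. Raising the pitch requires a stretching factor below 1.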

Why bother with this PV when librosa already has time_stretch and pitch_shift functions? Well, these functions transform the signal back to the time domain. If you need to compute embeddings afterwards, you will lose time on redundant Fourier transforms. On the other hand, it is easy to modify the stretch_wo_loop function so that it yields the Fourier output without taking the inverse transform. One could probably also dig into the librosa source code to achieve similar results.

The last two transformations, time masking and cut & splice, were initially proposed in the frequency domain (Park et al. 2019). The idea was to save time on FFT by using precomputed spectra for audio augmentations. For simplicity, we will demonstrate how these transformations work in the time domain. The listed operations can be easily transferred to the frequency domain by replacing the time axis with frame indices.

Time masking

The idea of time masking is to cover up a random region of the signal. The neural network then has less chance of learning signal-specific temporal variations that do not generalize.

max_mask_length = 0.3 #maximum mask duration, proportion of signal length

L = len(signal)

mask_length = int(L*np.random.rand()*max_mask_length) #randomly choose mask length
mask_start = int((L-mask_length)*np.random.rand()) #randomly choose mask position

masked_signal = signal.copy()
masked_signal[mask_start:mask_start+mask_length] = 0

IPython.display.Audio(masked_signal, rate=sr)

Signal obtained by applying time mask transformation on the original MCV sample common_voice_en_100040 (generated by the author).
Signal waveform after time masking (the masked region is indicated with orange) (image by the author)
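The same masking can be applied to a precomputed spectrogram by indexing frames instead of samples. The following is a minimal sketch under assumptions: the function name mask_frames, the (frequency bins, frames) array layout, and the fill_value parameter are illustrative and not taken from the original notebook.

```python
import numpy as np

def mask_frames(spec, max_mask_fraction=0.3, fill_value=0.0, rng=None):
    """Mask a random run of frames in a (n_bins, n_frames) spectrogram."""
    rng = np.random.default_rng() if rng is None else rng
    n_frames = spec.shape[1]
    mask_length = int(n_frames * rng.random() * max_mask_fraction)  # random mask length
    mask_start = int((n_frames - mask_length) * rng.random())       # random mask position
    masked = spec.copy()
    masked[:, mask_start:mask_start + mask_length] = fill_value
    return masked, (mask_start, mask_length)

# Random stand-in for a spectrogram with 40 frequency bins and 100 frames
rng = np.random.default_rng(0)
spec = rng.random((40, 100)) + 0.1   # strictly positive values
masked, (start, length) = mask_frames(spec, rng=rng)
```

For log-scaled spectrograms, zero is not necessarily "silence", so a different fill_value (for example the spectrogram mean) may be more appropriate.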

Cut & splice

The idea is to replace a randomly selected region of the signal with a random fragment from another signal having the same label. The implementation is almost the same as for time masking, except that a piece of another signal is placed instead of the mask.

other_signal, sr = lr.load('./common_voice_en_100038.wav', res_type='kaiser_fast') #load second signal

max_fragment_length = 0.3 #maximum fragment duration, proportion of signal length

L = min(len(signal), len(other_signal))

mask_length = int(L*np.random.rand()*max_fragment_length) #randomly choose mask length
mask_start = int((L-mask_length)*np.random.rand()) #randomly choose mask position

synth_signal = signal.copy()
synth_signal[mask_start:mask_start+mask_length] = other_signal[mask_start:mask_start+mask_length]

IPython.display.Audio(synth_signal, rate=sr)

Synthetic signal obtained by applying cut&splice transformation on the original MCV sample common_voice_en_100040 (generated by the author).
Signal waveform after cut&splice transformation (the inserted fragment from the other signal is indicated with orange) (image by the author)
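In the code above, the fragment is taken from the same position in the other signal. A possible variation, sketched below, draws the source position independently of the target position. The function name cut_and_splice and its return values are assumptions made for illustration.

```python
import numpy as np

def cut_and_splice(signal, other_signal, max_fragment_fraction=0.3, rng=None):
    """Replace a random region of `signal` with a random fragment of `other_signal`."""
    rng = np.random.default_rng() if rng is None else rng
    L = min(len(signal), len(other_signal))
    fragment_length = int(L * rng.random() * max_fragment_fraction)
    # Target and source positions are drawn independently
    target_start = int((len(signal) - fragment_length) * rng.random())
    source_start = int((len(other_signal) - fragment_length) * rng.random())
    out = signal.copy()
    out[target_start:target_start + fragment_length] = \
        other_signal[source_start:source_start + fragment_length]
    return out, (target_start, source_start, fragment_length)

# Toy signals: zeros receive a fragment of ones, so the splice is easy to verify
rng = np.random.default_rng(0)
a = np.zeros(1000)
b = np.ones(800)
spliced, (t0, s0, n) = cut_and_splice(a, b, rng=rng)
```

Both signals should carry the same label (here, the same language), otherwise the synthetic sample would mix classes.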


