Spoken language recognition on Mozilla Common Voice — Part I. | by Sergey Vilov | Aug, 2023


One of the most challenging AI tasks is identifying the speaker’s language for the purpose of subsequent speech-to-text conversion. This problem might arise, for example, when people living in the same household and speaking different languages use the same voice-control device such as a garage lock or a smart home system.

In this series of articles, we will try to maximize spoken language recognition accuracy using the Mozilla Common Voice (MCV) Dataset. In particular, we will compare several neural network models trained to distinguish between German, English, Spanish, French, and Russian.

In this first part, we will discuss data selection, preprocessing, and embeddings.

MCV is by far the largest publicly available voice dataset, comprising short recordings (average duration 5.3 s) in as many as 112 languages.

For our language recognition task, we choose 5 languages: German, English, Spanish, French, and Russian. For German, English, Spanish, and French, we consider only accents labelled in MCV as Deutschland Deutsch, United States English, España, and Français de France, respectively. For each language, we select a subset of adult recordings among the validated samples.

We used a train/val/test split with 40K/5K/5K audio clips per language. To obtain an objective evaluation, we ensured that speakers (client_id) did not overlap between the three sets. When splitting the data, we first filled the test and validation sets with recordings from poorly represented speakers and then allocated the remaining data to the train set. This improved speaker diversity in the val/test sets and led to a more objective estimate of the generalization error. To prevent a single speaker from dominating the train set, we capped the number of recordings per client_id at 2000. On average, we obtained 26 recordings per speaker. We also made sure that the number of female recordings matched the number of male recordings. Finally, we upsampled the train set whenever the resulting number of recordings was below 40K. The final distribution of counts is depicted in the figure below.
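The splitting strategy above can be sketched in pure Python. This is an illustrative simplification (it omits the gender balancing and upsampling steps, and the record format is hypothetical), not the exact preprocessing code:

```python
from collections import Counter, defaultdict

def speaker_disjoint_split(records, n_test, n_val, max_per_speaker=2000):
    # records: list of (client_id, clip_path) pairs.
    # All clips of a given speaker go into exactly one split, so the
    # three sets never share a client_id.
    counts = Counter(cid for cid, _ in records)
    by_speaker = defaultdict(list)
    for rec in records:
        by_speaker[rec[0]].append(rec)

    test, val, train = [], [], []
    # Least represented speakers fill test first, then val;
    # everyone else goes to train, capped at max_per_speaker clips.
    for cid in sorted(counts, key=counts.get):
        recs = by_speaker[cid]
        if len(test) < n_test:
            test += recs
        elif len(val) < n_val:
            val += recs
        else:
            train += recs[:max_per_speaker]
    return train, val, test
```

Because whole speakers are assigned at once, the test and val sets may slightly overshoot their targets; in practice one would trim them back to exactly n_test and n_val clips.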

Class distribution in the train set (image by the author).

The resulting dataframe with indicated split is available here.

All MCV audio files are provided in .mp3 format. Although .mp3 is great for compact storage of music, it is not widely supported by audio processing libraries such as librosa in Python. So, we first need to convert all files to .wav. In addition, the original MCV sampling rate is 44 kHz, which implies a maximal encoded frequency of 22 kHz (by the Nyquist theorem). This would be overkill for a spoken language recognition task: in English, for example, most phonemes do not exceed 3 kHz in conversational speech. So, we can also lower the sampling rate down to 16 kHz. This not only reduces the file size but also accelerates the generation of embeddings.

Both operations can be executed within one command using ffmpeg:

ffmpeg -y -nostdin -hide_banner -loglevel error -i $input_file.mp3 -ar 16000 $output_file.wav
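For batch conversion, the same command can be wrapped in a short Python helper. This is a sketch (the directory layout is hypothetical; it assumes ffmpeg is on the PATH):

```python
import subprocess
from pathlib import Path

def ffmpeg_cmd(mp3_path, wav_path, rate=16000):
    # Mirrors the shell command above: decode .mp3, resample, write .wav
    return ["ffmpeg", "-y", "-nostdin", "-hide_banner", "-loglevel", "error",
            "-i", str(mp3_path), "-ar", str(rate), str(wav_path)]

def convert_dir(src_dir, dst_dir):
    # Convert every .mp3 under src_dir to a .wav with the same stem in dst_dir
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    for mp3 in Path(src_dir).glob("*.mp3"):
        subprocess.run(ffmpeg_cmd(mp3, dst_dir / (mp3.stem + ".wav")),
                       check=True)
```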

Relevant information is usually extracted from audio clips by computing embeddings. We will consider four embeddings commonly used for speech recognition / spoken language recognition tasks: mel spectrogram, MFCC, RASTA-PLP, and GFCC.

Mel Spectrogram

The principles of mel spectrograms have widely been discussed on Medium. An awesome step-by-step tutorial on mel spectrograms and MFCC can also be found here.

To obtain the mel spectrogram, the input signal is first subjected to pre-emphasis filtering. Then, a Fourier transform is performed on sliding windows applied consecutively to the resulting waveform. After that, the frequency axis is transformed to the mel scale, which is linear with respect to human perception of pitch intervals. Finally, a bank of overlapping triangular filters is applied to the power spectrum on the mel scale to mimic how the human ear perceives sound.
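The mel mapping and the triangular filter bank can be sketched in NumPy (in practice a library such as librosa computes all of this; the parameters below are illustrative):

```python
import numpy as np

def hz_to_mel(f):
    # Common HTK formula: roughly linear below ~1 kHz, logarithmic above
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filter_bank(n_filters=13, n_fft=512, sr=16000):
    # Triangular filters whose peaks are equally spaced on the mel scale
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):           # rising slope
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):          # falling slope
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb
```

Multiplying this matrix by an FFT power spectrum yields the mel band energies; a spectrogram of such energies over successive windows is the mel spectrogram.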


Mel coefficients are highly correlated, which might be undesirable for some machine learning algorithms (for example, it is more convenient to have a diagonal covariance matrix for Gaussian Mixture Models). To decorrelate the mel filter bank outputs, mel-frequency cepstral coefficients (MFCC) are obtained by computing the Discrete Cosine Transform (DCT) of the log filter bank energies. Usually, only the first few MFCCs are used. The exact steps are outlined here.
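The DCT step can be sketched as follows (an illustrative NumPy implementation of the orthonormal DCT-II, not the code used for the experiments):

```python
import numpy as np

def dct_ii(x, n_coeffs):
    # Orthonormal DCT-II along the last axis
    N = x.shape[-1]
    n = np.arange(N)
    basis = np.cos(np.pi * (n[None, :] + 0.5)
                   * np.arange(n_coeffs)[:, None] / N)
    scale = np.sqrt(2.0 / N) * np.ones(n_coeffs)
    scale[0] = np.sqrt(1.0 / N)   # the DC term gets a smaller weight
    return (x @ basis.T) * scale

def mfcc_from_mel(log_mel_energies, n_mfcc=13):
    # log_mel_energies: array of shape (frames, n_mel_filters);
    # keep only the first n_mfcc cepstral coefficients
    return dct_ii(log_mel_energies, n_mfcc)
```

A flat (constant) log spectrum maps entirely onto the zeroth coefficient, which is why the higher coefficients capture spectral shape rather than overall energy.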


Perceptual Linear Prediction (PLP) coefficients (Hermansky, 1990) are another way to compute embeddings for speech recordings.

The differences between PLP and MFCC lie in the filter banks, the equal-loudness pre-emphasis, the intensity-to-loudness conversion, and the application of linear prediction (Hönig et al., 2005).

Overview of PLP and MFCC techniques (from Hönig et al, 2005)

PLP features were reported (Woodland et al., 1996) to be more robust than MFCC when there is an acoustic mismatch between training and test data.

Compared to PLP, RASTA-PLP (Hermansky et al., 1991) performs additional filtering in the logarithmic spectral domain, which makes the method more robust to linear spectral distortions introduced by communication channels.


Gammatone frequency cepstral coefficients (GFCC) were reported to be less noise-sensitive than MFCC (Zhao et al., 2012; Shao et al., 2007). Compared to MFCC, GFCC use a bank of Gammatone filters spaced on the equivalent rectangular bandwidth (ERB) scale (instead of the mel scale), and a cubic root (instead of the logarithm) is applied before computing the DCT.
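The two frequency scales and compression functions can be contrasted in a few lines (a sketch; the ERB-number formula follows the standard Glasberg and Moore parametrization, which is an assumption on my part since the exact variant is not specified above):

```python
import numpy as np

def hz_to_mel(f):
    # Mel scale used by MFCC
    return 2595.0 * np.log10(1.0 + f / 700.0)

def hz_to_erb_number(f):
    # ERB-number scale used by GFCC: number of ERBs below frequency f (Hz)
    return 21.4 * np.log10(1.0 + 4.37 * f / 1000.0)

# Compression applied to band energies before the DCT:
log_compress = np.log       # MFCC
cubic_compress = np.cbrt    # GFCC
```

Both scales are monotone and compress high frequencies, but they place filter centers differently; the cubic root also grows more slowly than the logarithm shrinks near zero, which is one intuition for GFCC's robustness to low-energy noise.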

The figure below shows an example signal alongside with its different embeddings:

Example audio file and its embeddings (image by the author).

Comparing embeddings

To choose the most efficient embedding, we trained the Attention LSTM network from De Andrade et al., 2018. To save time, we trained the network on only 5K clips.

The figure below compares the validation accuracy for all embeddings.

Performance of different embeddings on the 5K dataset (image by the author).

So, mel spectrograms with the first 13 filter banks perform on par with RASTA-PLP with model_order=13.

It is interesting to note that mel spectrograms outperform MFCC. This fits previous claims (see here, and here) that mel spectrograms are a better choice for neural network classifiers.

Another observation is that performance usually drops for higher numbers of coefficients. This may be due to overfitting, since high-order coefficients often represent speaker-related features that do not generalize to the test set, where different speakers are used.

Due to time constraints, we did not test any combinations of embeddings, although it had previously been observed that such combinations may yield superior accuracy.

Since mel spectrograms are much faster to compute than RASTA-PLP, we will use these embeddings in further experiments.

In Part II we will run several neural network models and choose the one that best classifies the languages.


  • De Andrade, Douglas Coimbra, et al. “A neural attention model for speech command recognition.” arXiv preprint arXiv:1808.08929 (2018).
  • Hermansky, Hynek. “Perceptual linear predictive (PLP) analysis of speech.” The Journal of the Acoustical Society of America 87.4 (1990): 1738–1752.
  • Hönig, Florian, et al. “Revising perceptual linear prediction (PLP).” Ninth European Conference on Speech Communication and Technology. 2005.
  • Hermansky, Hynek, et al. “RASTA-PLP speech analysis.” Proc. IEEE Int’l Conf. Acoustics, speech and signal processing. Vol. 1. 1991.
  • Shao, Yang, Soundararajan Srinivasan, and DeLiang Wang. “Incorporating auditory feature uncertainties in robust speaker identification.” 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP’07. Vol. 4. IEEE, 2007.
  • Woodland, Philip C., Mark John Francis Gales, and David Pye. “Improving environmental robustness in large vocabulary speech recognition.” 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings. Vol. 1. IEEE, 1996.
  • Zhao, Xiaojia, Yang Shao, and DeLiang Wang. “CASA-based robust speaker identification.” IEEE Transactions on Audio, Speech, and Language Processing 20.5 (2012): 1608–1616.
