In this post, we explore Google’s innovative approach to training their remarkable text-to-music models, including MusicLM and Noise2Music. We’ll delve into the concept of “fake” datasets and how they were utilized in these breakthrough models. If you’re curious about the inner workings of these techniques and their impact on advancing music AI, you’ve come to the right place.
Large language models (LLMs) like ChatGPT or Bard are trained on huge amounts of unstructured text data. Although it can be computationally expensive to collect the content of millions of websites, there is an abundance of training data on the public web. In contrast, text-to-image models like DALL-E 2 require a totally different kind of dataset consisting of pairs of images with corresponding descriptions.
In the same way, text-to-music models rely on songs paired with descriptions of their musical content. However, unlike images, labeled music is really hard to find on the internet. Sometimes metadata like instrumentation, genre, or mood is available, but in-depth, full-text descriptions are exceptionally hard to obtain. This poses a serious problem for researchers and companies trying to collect data to train generative music models.
In early 2023, Google researchers created a lot of buzz around music AI with their breakthrough models, MusicLM and Noise2Music. However, among musicians, little is known about how the data for these models were collected. Let’s dive into this topic together and learn about some of the tricks applied in Google’s music AI research.
Weakly Associated Labels
For MusicLM and Noise2Music, Google relied on another one of their models called MuLan, which was trained to compute the similarity between any piece of music and any text description. To train MuLan, Google used what we call “weakly associated labels”. Instead of carefully curating a dataset of music with high-quality text descriptions, they purposefully took a different approach.
First, they extracted a 30-second snippet from each of 44 million music videos available on YouTube, resulting in roughly 370k hours of audio. The music was then labeled with various texts associated with the video: the video title and description, comments, the names of playlists featuring the video, and more. To reduce noise in this dataset, they employed a large language model to identify which associated text information had music-related content and discarded everything that did not.
In my opinion, weakly associated labels cannot yet be considered a “fake” dataset, because the text was still written by real humans and is undoubtedly associated with the music to some extent. However, this approach definitely prioritizes quantity over quality, which would have raised concerns among most machine learning researchers in the past. And Google was just getting started…
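The two figures in this paragraph are consistent with each other if one 30-second snippet is taken per video. A quick check, assuming exactly 30 seconds per snippet:

```python
# Figures quoted in the text above.
num_videos = 44_000_000
snippet_seconds = 30  # assumption: every snippet is exactly 30 seconds long

total_hours = num_videos * snippet_seconds / 3600
print(f"{total_hours:,.0f} hours")
```

The result lands just under 370k hours, which matches the rounded number reported for the dataset.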
Noise2Music is a generative music AI based on diffusion technology, which was also used in image generation models like DALL-E or Midjourney.
To train Noise2Music, Google took their previous approach to the extreme and transitioned from weakly associated labels to fully artificial labels. In what they refer to as “pseudo labeling”, the authors adopted a remarkable method to collect music description texts. They prompted a large language model (LaMDA) to write multiple descriptions for 150k popular songs, resulting in 4 million descriptions. Here is an example of such a description:
“Don’t Stop Me Now” by Queen: The energetic rock song builds on a piano, bass guitar, and drums. The singers are excited, ready to go, and uplifting.
Subsequently, the researchers removed the song and artist names to produce descriptions that could, in principle, apply to other songs as well. However, even with these descriptions in hand, the researchers still needed to match them with suitable songs to obtain a large labeled dataset. Here is where MuLan, their model trained on weakly associated labels, proved to be useful.
The researchers collected a large dataset of unlabeled music, totaling 340k hours. For each of these tracks, they used MuLan to identify the artificially generated description that best matched it. As a result, each piece of music is mapped not to a text describing that specific song, but to a description written for music similar to it.
In traditional machine learning, the labels assigned to each observation (in this case, a piece of music) should ideally represent an objective truth. However, music descriptions inherently lack objectivity, presenting the first problem. Additionally, because the descriptions are assigned by an audio-to-text matching model rather than written for each track, the labels no longer provide a “truthful” account of what is happening in the song. Given these apparent flaws, one may wonder why this approach still yields useful results.
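The name-removal step can be pictured with a small sketch. It assumes every generated description follows the "title by artist: text" layout of the example above. That layout and the helper name `strip_title_and_artist` are assumptions made for illustration, not details of Google's pipeline.

```python
def strip_title_and_artist(labeled_description):
    """Drop everything up to the first colon (assumed to be the 'title by artist' prefix)."""
    prefix, separator, body = labeled_description.partition(":")
    # Without a colon there is no recognizable prefix, so keep the text unchanged.
    return body.strip() if separator else labeled_description.strip()

example = (
    "“Don’t Stop Me Now” by Queen: The energetic rock song builds on a piano, "
    "bass guitar, and drums. The singers are excited, ready to go, and uplifting."
)
anonymous = strip_title_and_artist(example)
print(anonymous)
```

A song title that itself contains a colon would break this simple rule, so a real pipeline would need something more careful.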
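The matching step can be sketched as a nearest-neighbor lookup. The example below assumes that tracks and descriptions have already been turned into embedding vectors and that similarity is measured with cosine similarity. The tiny three-dimensional vectors, the track IDs, and the function names are all made up for illustration. They are not MuLan's actual outputs or API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_description(track_embedding, description_embeddings):
    """Return the description whose embedding is most similar to the track's."""
    return max(
        description_embeddings,
        key=lambda text: cosine_similarity(track_embedding, description_embeddings[text]),
    )

# Hypothetical text embeddings for three generated descriptions.
descriptions = {
    "energetic rock song with piano, bass guitar, and drums": [0.8, 0.2, 0.4],
    "slow, melancholic solo piano piece": [-0.5, 0.9, 0.0],
    "upbeat electronic dance track": [0.1, -0.7, 0.7],
}

# Hypothetical audio embeddings for two unlabeled tracks.
unlabeled_tracks = {
    "track_001": [0.9, 0.1, 0.3],
    "track_002": [-0.4, 0.8, 0.1],
}

pseudo_labels = {
    track_id: best_description(embedding, descriptions)
    for track_id, embedding in unlabeled_tracks.items()
}
for track_id, text in pseudo_labels.items():
    print(track_id, "->", text)
```

Each track gets the closest description available in the pool, even if no description in the pool fits it perfectly. With millions of descriptions, an exhaustive comparison like this one would likely be replaced by an approximate nearest-neighbor search.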
Bias vs. Noise
When a dataset’s labels are not accurately assigned, there can be two main causes: bias and noise. Bias refers to a consistent tendency for the labels to be untruthful in a particular way. For instance, if the dataset frequently labels instrumental pieces as songs but never identifies songs as instrumental pieces, it demonstrates a bias toward predicting the presence of vocals.
On the other hand, noise refers to general variability in the labels, regardless of direction. The two can occur independently. For example, if every track is labeled as a “sad piano piece,” the dataset is heavily biased, as it consistently provides an inaccurate label for many songs. However, since it applies the same label to every track, there is no variability and therefore no noise in the dataset.
By mapping tracks to descriptive texts written for other tracks, we introduce noise. This is because, for most tracks, it is unlikely that a perfect description for them exists in the dataset. Consequently, most labels are a little bit off, i.e., untruthful, which results in noise. However, are the labels biased?
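One simple way to make the distinction concrete is to compare two numbers on a toy vocals/instrumental dataset: the average signed error, which shows whether mistakes lean in one direction, and the plain error rate, which counts mistakes regardless of direction. These two measures and the eight example labels are illustrative choices, not definitions taken from Google's papers.

```python
def signed_error(true_labels, given_labels):
    """Mean of (given - true). A nonzero value means the errors lean one way."""
    return sum(g - t for t, g in zip(true_labels, given_labels)) / len(true_labels)

def error_rate(true_labels, given_labels):
    """Share of labels that are wrong, regardless of direction."""
    return sum(g != t for t, g in zip(true_labels, given_labels)) / len(true_labels)

# 1 = track has vocals, 0 = instrumental piece
truth  = [1, 1, 1, 1, 0, 0, 0, 0]
biased = [1, 1, 1, 1, 1, 1, 0, 0]  # two instrumentals labeled as songs, never the reverse
noisy  = [1, 0, 1, 1, 1, 0, 0, 0]  # one mistake in each direction

print("biased:", signed_error(truth, biased), error_rate(truth, biased))
print("noisy: ", signed_error(truth, noisy), error_rate(truth, noisy))
```

Both label sets get a quarter of the tracks wrong, but only the biased one has a signed error different from zero. In the noisy set, the mistakes cancel out on average.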
Since the available descriptions were generated for popular songs, it is reasonable to assume that the pool of descriptions is biased toward (western) popular music. Nevertheless, with 4 million descriptions based on 150k unique songs, one would expect a diverse range of descriptions to choose from. Additionally, most labeled music datasets exhibit the same bias, so this is not a unique disadvantage of this approach compared to others. What truly sets this approach apart is the introduction of added noise.
Why Noise Can Be O.K. in Machine Learning
Training a machine learning model on a biased dataset is usually not a desirable approach because it would result in the model learning and replicating a biased understanding of the task at hand. However, training a machine learning model on unbiased but noisy data can still yield impressive results. Allow me to illustrate this with an example.
Consider the figure below, which depicts two datasets consisting of orange and blue points. In the noise-free dataset, the blue and orange points are perfectly separable. However, in the noisy dataset, some orange points have shifted into the blue point cluster, and vice versa. Despite this added noise, if we examine the trained models, we observe that both models identify roughly the same patterns. This is because, even in the presence of noise, the AI learns to identify the optimal pattern that minimizes errors as much as possible.
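The same effect can be reproduced in a few lines with a one-dimensional toy version of the two datasets. This sketch fits the simplest possible model, a single threshold, once on clean labels and once on labels where every fifth one has been flipped. The data, the flipping pattern, and the function `fit_threshold` are assumptions chosen for illustration. Real label noise would be random, but a fixed pattern keeps the example reproducible.

```python
def fit_threshold(xs, labels):
    """Pick the cut between neighboring points that misclassifies the fewest labels."""
    points = sorted(xs)
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]

    def errors(threshold):
        # The model predicts "blue" (1) for every point right of the threshold.
        return sum((x > threshold) != bool(y) for x, y in zip(xs, labels))

    return min(candidates, key=errors)

# Orange points (label 0) sit left of zero, blue points (label 1) right of it.
xs = list(range(-30, -5)) + list(range(6, 31))
clean_labels = [0] * 25 + [1] * 25

# Flip every fifth label so that some "orange" labels land among the blue
# points and vice versa.
noisy_labels = [1 - y if i % 5 == 2 else y for i, y in enumerate(clean_labels)]

clean_threshold = fit_threshold(xs, clean_labels)
noisy_threshold = fit_threshold(xs, noisy_labels)
print(clean_threshold, noisy_threshold)
```

In this toy case, both fits choose the same boundary between the two clusters, even though 20% of the noisy labels are wrong. The noisy model simply accepts the flipped points as unavoidable errors.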
This example demonstrates that an AI can indeed learn from noisy datasets, such as the one generated by Google. However, the main challenge is that the noisier the dataset, the more training data is required to train the model effectively. The reason is that a noisy dataset contains less useful information than a noise-free dataset of the same size.
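A back-of-the-envelope calculation gives a feeling for how much more data is needed. It covers only the simplest case, a binary label $y \in \{-1, +1\}$ that is flipped independently with probability $\rho < 1/2$, which is an assumption made here for illustration and much simpler than noisy text descriptions. The final scaling is a rule of thumb, not a result from Google's papers.

```latex
% Expected value of the noisy label \tilde{y}, given the true label y:
\mathbb{E}[\tilde{y} \mid y] = (1 - \rho)\, y + \rho\, (-y) = (1 - 2\rho)\, y

% The useful signal shrinks by the factor (1 - 2\rho). Averaging over n samples
% reduces random fluctuations only by 1/\sqrt{n}, so matching the signal-to-noise
% ratio of n_0 clean samples takes roughly
n \approx \frac{n_0}{(1 - 2\rho)^2}

% Example: \rho = 0.2 gives 1 / 0.6^2 \approx 2.8 times as much data.
```

As $\rho$ approaches $1/2$, the labels carry almost no information and the required amount of data grows without bound.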
In conclusion, Google employed innovative techniques to address the challenge of limited labeled music data in training their generative music AI models. They utilized weakly associated labels for MuLan, leveraging text information from various sources related to music videos, and employed a language model to filter out irrelevant data. When developing Noise2Music, they introduced fake labels by generating multiple descriptions for popular songs and mapping them to suitable tracks using their pre-trained model.
While these approaches deviate from traditional labeling methods, they proved effective. Despite the added noise, the models were still able to learn and identify optimal patterns. Although the use of fake datasets may be considered unconventional, it highlights the immense potential of modern language models in creating large and valuable datasets for machine learning.