On June 13th, 2023, Meta (formerly Facebook) made waves in the music and AI communities with the release of their generative music model, MusicGen. This model not only surpasses Google’s MusicLM, which was launched earlier this year, in terms of capabilities but is also trained on licensed music data and open-sourced for non-commercial use.
In addition to generating audio from a text prompt, MusicGen can also generate music based on a given reference melody, a feature known as melody conditioning. In this blog post, I will demonstrate how Meta built this useful and fascinating functionality into their model. But before we delve into that, let’s first understand how melody conditioning works in practice.
The following is a short electronic music snippet that I produced for this article. It features electronic drums, two dominant 808 basses, and two syncopated synths. When listening to it, try to identify the “main melody” of the track.
Using MusicGen, I can now generate music in other genres that sticks to the same main melody. All I need is my base track and a text prompt describing how the new piece should sound.
A grand orchestral arrangement with thunderous percussion, epic brass fanfares, and soaring strings, creating a cinematic atmosphere fit for a heroic battle.
classic reggae track with an electronic guitar solo
smooth jazz, with a saxophone solo, piano chords, and snare full drums
How Good are the Results?
Although MusicGen doesn’t adhere closely to my text prompts, the generated pieces still clearly reflect the requested genre and, more importantly, each piece showcases its own interpretation of the main melody from the base track.
While the results are not perfect, I find the capabilities of this model to be quite impressive. The fact that MusicGen has been one of the most popular models on HuggingFace ever since its release further emphasizes its significance. With that said, let’s delve deeper into the technical aspects of how melody conditioning works.
Almost all current generative music models follow the same procedure during training. They are provided with a large database of music tracks accompanied by corresponding text descriptions. The model learns the relationship between words and sounds, as well as how to convert a given text prompt into a coherent and enjoyable piece of music. During the training process, the model optimizes its own compositions by comparing them to the real music tracks in the dataset. This enables the model to identify its strengths and areas that require improvement.
The issue lies in the fact that once a machine learning model is trained for a specific task, such as text-to-music generation, it is limited to that particular task. While it is possible to make MusicGen perform certain tasks that it was not explicitly trained for, like continuing a given piece of music, it cannot be expected to tackle every music generation request. For instance, it cannot simply take a melody and transform it into a different genre. This would be like throwing potatoes into a toaster and expecting fries to come out. Instead, a separate model must be trained to implement this functionality.
Let’s explore how Meta adapted the model training procedure to enable MusicGen to generate variations of a given melody based on a text prompt. However, there are several challenges associated with this approach. One of the primary obstacles is the ambiguity in identifying “the melody” of a song and representing it in a computationally meaningful way. Nonetheless, for the purpose of understanding the new training procedure at a broader level, let’s assume a consensus on what constitutes “the melody” and how it can be easily extracted and fed into the model. In this scenario, the adjusted training method can be outlined as follows:
For each track in the database, the first step is to extract its melody. Subsequently, the model is fed with both the track’s text description and its corresponding melody, prompting the model to recreate the original track. Essentially, this approach simplifies the original training objective, where the model was solely tasked with recreating the track based on text.
To understand why we do this, let’s ask ourselves what the AI model learns in this training procedure. In essence, it learns how a melody can be turned into a full piece of music based on a text description. This means that after the training, we can provide the model with a melody and request it to compose a piece of music with any genre, mood, or instrumentation. To the model, this is the same “semi-blind” generation task it has successfully accomplished countless times during training.
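The data-preparation side of this training setup can be sketched in a few lines of Python. Everything below is a toy stand-in for illustration: the MIDI pitch sequences, the one-hot chromagram, and the helper names are all made up here, not Meta’s actual pipeline.

```python
import numpy as np

def extract_melody_chroma(pitches):
    """Toy melody representation: fold a sequence of MIDI pitches
    into the 12 pitch classes, one column per time step."""
    chroma = np.zeros((12, len(pitches)))
    for t, midi in enumerate(pitches):
        chroma[midi % 12, t] = 1.0
    return chroma

def make_training_example(track_pitches, description):
    """Build one (text, melody, target) triple: during training the model
    sees the description and the extracted melody, and must recreate the
    original track."""
    return {
        "text": description,
        "melody": extract_melody_chroma(track_pitches),
        "target": track_pitches,
    }

example = make_training_example([60, 62, 64, 67], "upbeat electronic track")
print(example["melody"].shape)  # (12, 4)
```

At inference time, the same triple structure is reused, except that the melody now comes from the user’s reference track and the target is left for the model to generate.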
Having grasped the technique employed by Meta to teach the model melody-conditioned music generation, we still need to tackle the challenge of precisely defining what constitutes “the melody.”
The truth is, there is no objective method to determine or extract “the melody” of a polyphonic musical piece, except when all instruments are playing in unison. While there is often a prominent instrument such as a voice, guitar, or violin, it does not necessarily imply that the other instruments are not part of “the melody.” Take Queen’s “Bohemian Rhapsody” as an example. When you think of the song, you might first recall Freddie Mercury’s main vocal melodies. However, does that mean the piano in the intro, the background singers in the middle section, and the electric guitar before “So you think you can stone me […]” are not part of the melody?
One method for extracting “the melody” of a song is to treat the most dominant melody, typically the loudest one in the mix, as the main melody. The chromagram is a widely used representation that visually displays the most dominant musical notes throughout a track. Below, you can find the chromagram of the reference track, first with the complete instrumentation and then excluding drums and bass. On the left side, the most relevant notes for the melody (B, F#, G) are highlighted in blue.
Both chromagrams accurately depict the primary melody notes, with the version without drums and bass providing a clearer visualization of the melody. Meta’s researchers made the same observation, which led them to use their source separation tool (DEMUCS) to remove disturbing rhythmic elements from the track. This results in a sufficiently representative rendition of “the melody,” which can then be fed to the model.
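To make the chromagram idea concrete, here is a minimal hand-rolled sketch of how one can be computed, approximating what library routines such as librosa’s `chroma_stft` do: take a magnitude spectrum, map each frequency bin to its nearest pitch class, and accumulate the energy. The input here (a pure 440 Hz sine) is my own toy signal, not the reference track from this article.

```python
import numpy as np

def simple_chroma(y, sr):
    """Fold the magnitude spectrum of a signal into 12 pitch-class bins."""
    spectrum = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), 1.0 / sr)
    audible = freqs > 20.0                     # skip DC and sub-audio bins
    midi = 69 + 12 * np.log2(freqs[audible] / 440.0)
    classes = np.round(midi).astype(int) % 12  # C=0, C#=1, ..., A=9, B=11
    chroma = np.zeros(12)
    np.add.at(chroma, classes, spectrum[audible])
    return chroma

sr = 22050
t = np.arange(sr) / sr                  # one second of audio
chroma = simple_chroma(np.sin(2 * np.pi * 440.0 * t), sr)
print(chroma.argmax())                  # 9, i.e. pitch class A, as expected for A4
```

Stacking these 12-bin vectors over short, overlapping windows yields the time-frequency chromagram images shown above.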
In summary, we can now connect the pieces to understand the underlying process when requesting MusicGen to perform melody-conditioned generation. Here is a visual representation of the workflow:
While MusicGen shows promising advancements in melody-conditioning, it is important to acknowledge that the technology is still a work-in-progress. Chromagrams, even when drums and bass are removed, offer an imperfect representation of a track’s melody. One limitation is that chromagrams categorize all notes into the 12 western pitch classes, meaning they capture the transition between two pitch classes but not the direction (up or down) of the melody.
For instance, moving from C4 up to G4 (a perfect fifth) differs significantly from moving from C4 down to G3 (a perfect fourth). In a chromagram, however, both intervals look the same. The issue worsens with octave jumps, where the chromagram indicates that the melody stayed on the same note. Consider how a chromagram would misread the emotional octave jump performed by Céline Dion in “My Heart Will Go On” during the line “wher-e-ver you are” as a stable melodic movement. To see this, look at the chromagram for the chorus of A-ha’s “Take on Me” below. Does it reflect your idea of the song’s melody?
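The octave-folding problem is easy to demonstrate in code. The note names and standard MIDI numbering (C4 = 60) are real; the snippet itself is just an illustration of what a chromagram discards.

```python
def pitch_class(midi_note):
    """All a chromagram retains of a note: its pitch class, octave discarded."""
    return midi_note % 12

C4, G3, G4, C5 = 60, 55, 67, 72

# A perfect fifth up (C4 -> G4) and a perfect fourth down (C4 -> G3)
# land on the same pitch class, so a chromagram cannot tell them apart.
print(pitch_class(G4) == pitch_class(G3))  # True

# An octave jump (C4 -> C5) collapses entirely: the chromagram reads it
# as the melody standing still.
print(pitch_class(C4) == pitch_class(C5))  # True
```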
Another challenge is the inherent bias of the chromagram. It performs well in capturing the melody of some songs while completely missing the mark in others. This bias is systematic rather than random. Songs with dominant melodies, minimal interval jumps, and unison playing are better represented by the chromagram compared to songs with complex melodies spread across multiple instruments and featuring large interval jumps.
Furthermore, the limitations of the generative AI model itself are worth noting. The output audio still exhibits noticeable differences from human-made music, and maintaining a consistent style over a six-second interval remains a struggle. Moreover, MusicGen falls short in faithfully capturing the more intricate aspects of the text prompt, as evidenced by the examples provided earlier. It will require further technological advancements for melody-conditioned generation to reach a level where it can be used not only for amusement and inspiration but also for generating end-user-friendly music.
How can we improve the AI?
From my perspective, one of the primary concerns that future research should address regarding melody-conditioned music generation is the extraction and representation of “the melody” from a track. While the chromagram is a well-established and straightforward signal processing method, there are numerous newer and experimental approaches that utilize deep learning for this purpose. It would be exciting to witness companies like Meta drawing inspiration from these advancements, many of which are covered in a comprehensive 72-page review by Reddy et al. (2022).
Regarding the quality of the model itself, both the audio quality and the comprehension of text inputs can be enhanced through scaling up the size of the model and training data, as well as the development of more efficient algorithms for this specific task. In my opinion, the release of MusicLM in January 2023 resembles a “GPT-2 moment.” We are beginning to witness the capabilities of these models, but significant improvements are still needed across various aspects. If this analogy holds true, we can anticipate the release of a music generation model akin to GPT-3 sooner than we might expect.
How does this impact musicians?
As is often the case with generative music AI, concerns arise regarding the potential negative impact on the work and livelihoods of music creators. I expect that in the future, it will become increasingly challenging to earn a living by creating variations of existing melodies. This is particularly evident in scenarios such as jingle production, where companies can effortlessly generate numerous variations of a characteristic jingle melody at minimal cost for new ad campaigns or personalized advertisements. Undoubtedly, this poses a threat to musicians who rely on such activities as a significant source of income. I reiterate my plea for creatives involved in producing music valued for its objective musical qualities rather than subjective, human qualities (such as stock music or jingles) to explore alternative income sources to prepare for the future.
On the positive side, melody-conditioned music generation presents an incredible tool for enhancing human creativity. If someone develops a captivating and memorable melody, they can quickly generate examples of how it might sound in various genres. This process can help identify the ideal genre and style to bring the music to life. Moreover, it offers an opportunity to revisit past projects within one’s music catalogue, exploring their potential when translated into different genres or styles. Finally, this technology lowers the entry barrier for creatively inclined individuals without formal musical training to enter the field. Anyone can now come up with a melody, hum it into a smartphone microphone, and share remarkable arrangements of their ideas with friends, family, or even attempt to reach a wider audience.
The question of whether AI music generation is beneficial to our societies remains open for debate. However, I firmly believe that melody-conditioned music generation is one of the use cases of this technology that genuinely enhances the work of both professional and aspiring creatives. It adds value by offering new avenues for exploration. I am eagerly looking forward to witnessing further advancements in this field in the near future.