FURI | Fall 2023

Multimodal Self-Supervised Approach to Text-to-Music Generation


Discrete music language models can generate musical pieces in a variety of genres using discrete formats such as MIDI and piano roll, which offer flexibility and customizability. Recent work shows, however, that current discrete music language models are limited in how well they relate the text and music domains, which makes text-to-music generation challenging. This research therefore focuses on developing the ability to condition discrete music generation on text. It proposes a transformer model that merges an encoder text language model with a decoder music language model in order to capture the text-music relationship and learn text-to-music generation.
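The abstract does not say how the encoder and decoder are merged. One common way to connect a text encoder to a sequence decoder is cross-attention, in which each decoder position forms a weighted mix of the encoder's text states. The sketch below illustrates that idea only, under stated assumptions: the function names, the toy vectors, and the choice to use the text states directly as both keys and values (with no learned projections) are hypothetical and are not taken from the project.

```python
import math


def softmax(scores):
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(music_query, text_states):
    """Scaled dot-product attention for one decoder position.

    Assumption: the text states serve as both keys and values.
    Returns the text-conditioned context vector and the attention weights.
    """
    scale = math.sqrt(len(music_query))
    scores = [
        sum(q * k for q, k in zip(music_query, state)) / scale
        for state in text_states
    ]
    weights = softmax(scores)
    dim = len(text_states[0])
    context = [
        sum(w * state[i] for w, state in zip(weights, text_states))
        for i in range(dim)
    ]
    return context, weights


# Hypothetical toy values: three text-encoder states, one music-decoder query.
text_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
music_query = [2.0, 0.0]
context, weights = cross_attention(music_query, text_states)
```

In a full model, a context vector of this kind would be computed at every decoder layer and position, so that each predicted music token (for example, a MIDI event) can depend on the text prompt.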
