
The primary goal of the Instruct-MusicGen project, developed by researchers from C4DM (Queen Mary University of London), Sony AI, and Music X Lab (MBZUAI), is to address the challenge of text-to-music editing: using textual queries to modify music, such as changing its style or adjusting instrumental components. The aim is a more efficient and effective method that leverages pretrained models to perform high-quality music editing based on textual instructions.

The audio fusion module allows the model to accept and process external audio inputs. This is achieved by duplicating the self-attention modules and incorporating cross-attention between the original music and the conditional audio. By doing so, the module enables the model to perform precise audio editing conditioned on the external audio input.
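To make the mechanism concrete, the sketch below shows one possible shape of such a fusion block: a frozen self-attention path for the music stream, a duplicated self-attention path for the conditional audio, and a cross-attention layer that fuses the two. This is a minimal illustration under assumed dimensions and module names (e.g. `AudioFusionBlock`), not the official Instruct-MusicGen implementation.

```python
import torch
import torch.nn as nn

class AudioFusionBlock(nn.Module):
    """Minimal sketch of an audio fusion block (illustrative, not the official code).

    A duplicated copy of the transformer's self-attention processes the conditional
    audio, and a cross-attention layer lets the music stream attend to it.
    """

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Self-attention over the music token stream (stands in for the pretrained layer).
        self.music_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Duplicated self-attention over the conditional audio tokens.
        self.cond_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-attention: music tokens (queries) attend to conditional audio (keys/values).
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_music = nn.LayerNorm(d_model)
        self.norm_cond = nn.LayerNorm(d_model)
        self.norm_cross = nn.LayerNorm(d_model)

    def forward(self, music: torch.Tensor, cond_audio: torch.Tensor) -> torch.Tensor:
        # 1) Self-attention over the music being generated/edited.
        m, _ = self.music_self_attn(music, music, music)
        music = self.norm_music(music + m)
        # 2) Duplicated self-attention over the conditional (source) audio.
        c, _ = self.cond_self_attn(cond_audio, cond_audio, cond_audio)
        cond_audio = self.norm_cond(cond_audio + c)
        # 3) Cross-attention fuses the conditional audio into the music stream.
        x, _ = self.cross_attn(music, cond_audio, cond_audio)
        return self.norm_cross(music + x)


# Toy usage: batch of 2, 100 music tokens and 100 conditional-audio tokens, 512-dim embeddings.
music_h = torch.randn(2, 100, 512)
cond_h = torch.randn(2, 100, 512)
fused = AudioFusionBlock()(music_h, cond_h)
print(fused.shape)  # torch.Size([2, 100, 512])
```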

The text fusion module in Instruct-MusicGen handles the instruction inputs for music editing. It modifies the behavior of the text encoder so the model can interpret and execute text-based editing commands. By finetuning the cross-attention module between the text embedding and the music representations, the text fusion module allows the model to follow the semantic meaning of editing instructions, enabling tasks such as adding, separating, and removing stems from music audio. In combination with the audio fusion module, it ensures that Instruct-MusicGen can condition on precise audio inputs and text-based instructions simultaneously, yielding accurate music editing driven by textual queries.
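The sketch below illustrates the "finetune only the text-to-music cross-attention" recipe described above: every parameter is frozen except the cross-attention that injects the embedded editing instruction into the music stream. The class and helper names (`TextFusionBlock`, `mark_trainable`) are hypothetical, and the block is a minimal stand-in for the actual decoder layer, not the released implementation.

```python
import torch
import torch.nn as nn

class TextFusionBlock(nn.Module):
    """Minimal sketch of text fusion (illustrative, not the official code).

    Only the text-to-music cross-attention is left trainable; the backbone
    self-attention is kept frozen.
    """

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)       # frozen
        self.text_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # finetuned
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, music: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        m, _ = self.self_attn(music, music, music)
        music = self.norm1(music + m)
        # Music tokens attend to the embedded editing instruction ("add drums", "remove vocals", ...).
        t, _ = self.text_cross_attn(music, text_emb, text_emb)
        return self.norm2(music + t)


def mark_trainable(block: TextFusionBlock) -> None:
    """Freeze everything except the text cross-attention, mirroring the
    finetuning recipe described in the text above (illustrative only)."""
    for p in block.parameters():
        p.requires_grad = False
    for p in block.text_cross_attn.parameters():
        p.requires_grad = True


block = TextFusionBlock()
mark_trainable(block)
trainable = sum(p.numel() for p in block.parameters() if p.requires_grad)
total = sum(p.numel() for p in block.parameters())
print(f"trainable params: {trainable}/{total}")
```

Freezing the backbone and updating only the cross-attention keeps the number of trainable parameters small, which is consistent with the section's emphasis on efficiently adapting a pretrained model rather than retraining it.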