Diff-A-Riff enhances music production by generating high-quality instrumental accompaniments that integrate seamlessly with a given musical context. It offers versatile control, allowing users to condition generation on audio prompts, text prompts, or both, and produces 48 kHz pseudo-stereo output. The system also significantly reduces inference time and memory usage, making it a valuable tool for music producers and artists.
Diff-A-Riff ensures high-quality audio output by combining two components: a latent diffusion model and a consistency autoencoder. The input audio is first compressed into a compact latent representation by a pre-trained consistency autoencoder, whose generative decoder enables high-quality reconstruction. The latent diffusion model then generates new audio in this latent space, conditioned on the input musical context and on optional style references derived from text or audio embeddings.
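The two-stage pipeline described above can be sketched in miniature. The following is a hypothetical illustration only: the encoder, the "denoiser", the latent dimensionality, and the way conditioning signals are fused are all toy stand-ins, not Diff-A-Riff's actual components, which use trained neural networks throughout.

```python
import numpy as np

LATENT_DIM = 64          # toy latent size; the real model uses its own dimensionality
rng = np.random.default_rng(0)

def encode(audio: np.ndarray) -> np.ndarray:
    """Stand-in for the consistency autoencoder's encoder: compress a
    waveform into a compact latent via a fixed random projection."""
    proj = np.random.default_rng(1).standard_normal((LATENT_DIM, audio.size))
    return (proj @ audio) / np.sqrt(audio.size)

def denoise_step(z: np.ndarray, cond: np.ndarray, rate: float = 0.1) -> np.ndarray:
    """Toy denoiser: nudge the noisy latent toward a conditioning-dependent
    target. A real latent diffusion model would instead run a trained
    network that predicts and removes noise at each step."""
    return z + rate * (cond - z)

def generate(context_latent: np.ndarray, style_embedding: np.ndarray,
             steps: int = 50) -> np.ndarray:
    """Iteratively refine a random latent, conditioned on context and style."""
    cond = 0.5 * context_latent + 0.5 * style_embedding  # naive conditioning fusion
    z = rng.standard_normal(LATENT_DIM)                  # start from pure noise
    for _ in range(steps):
        z = denoise_step(z, cond)
    return z  # in the real system, this latent would go to the generative decoder

# Example: condition on a toy context track and a random style reference.
context_audio = np.sin(np.linspace(0.0, 20.0, 48_000))   # 1 s of toy "context" audio
context_latent = encode(context_audio)
style_embedding = np.random.default_rng(2).standard_normal(LATENT_DIM)
accompaniment_latent = generate(context_latent, style_embedding)
```

The separation matters for efficiency: the diffusion loop runs entirely in the low-dimensional latent space rather than on raw 48 kHz samples, which is what keeps inference time and memory usage low.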
The primary function of Sony's Diff-A-Riff is to generate high-quality instrumental accompaniments for a given piece of music, focusing on one instrument at a time. The tool is built on two deep-learning techniques, latent diffusion models and consistency autoencoders. It gives users versatile control by conditioning generation on audio and text prompts, and produces pseudo-stereo output at 48 kHz.