Google DeepMind's V2A technology synchronizes audio with video by encoding the video input into a compressed representation and then using a diffusion model to iteratively refine audio from random noise. The diffusion process is conditioned on the encoded visual input and on natural language prompts, producing realistic audio that is synchronized with the video and closely aligned with the prompt instructions.
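DeepMind has not released the V2A model or code, so the following Python sketch only illustrates the overall flow described above: a placeholder encoder compresses the video, and a placeholder denoiser refines random noise over several steps, conditioned on the video and prompt embeddings. Every function, name, and shape here is a hypothetical stand-in, not DeepMind's implementation.

```python
# Hypothetical sketch of a V2A-style generation loop. All components are
# placeholders; the real system uses learned networks that are not public.
import numpy as np

def encode_video(video_frames: np.ndarray) -> np.ndarray:
    """Stand-in for the video encoder: compress frames into a conditioning vector."""
    # Here we simply average pixels per frame; a real encoder is a learned network.
    return video_frames.reshape(video_frames.shape[0], -1).mean(axis=1)

def denoise_step(audio: np.ndarray, video_embedding: np.ndarray,
                 prompt_embedding: np.ndarray, step: int, total_steps: int) -> np.ndarray:
    """Stand-in for one reverse-diffusion step of a learned denoiser."""
    # A real model predicts noise from (noisy audio, conditioning, timestep).
    # This toy version just nudges the sample toward a conditioning-derived target.
    target = np.resize(video_embedding + prompt_embedding.mean(), audio.shape)
    blend = (step + 1) / total_steps
    return (1 - blend) * audio + blend * target

def generate_audio(video_frames, prompt_embedding, num_samples=48_000, steps=50, seed=0):
    rng = np.random.default_rng(seed)
    video_embedding = encode_video(video_frames)
    audio = rng.standard_normal(num_samples)   # start from pure random noise
    for step in range(steps):                  # iteratively refine toward clean audio
        audio = denoise_step(audio, video_embedding, prompt_embedding, step, steps)
    return audio

# Example usage with dummy inputs: 24 frames of 64x64 RGB video and a toy prompt embedding.
video = np.random.default_rng(1).random((24, 64, 64, 3))
prompt = np.random.default_rng(2).random(16)
waveform = generate_audio(video, prompt)
```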
Natural language prompts give V2A additional context about the desired audio output. Users can define "positive prompts" to guide the output toward desired sounds or "negative prompts" to steer it away from unwanted ones. This control lets users rapidly experiment with different soundtracks for a video and choose the one that best matches their creative vision.
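The announcement does not say how positive and negative prompts are applied internally. One common technique in diffusion systems is classifier-free guidance, where the model's prediction under the negative prompt is used as a baseline to push the sample toward the positive prompt. The sketch below assumes that approach purely for illustration; `predict_noise`, `guidance_scale`, and the toy embeddings are all hypothetical.

```python
# Hypothetical sketch of positive/negative prompt steering via
# classifier-free-guidance-style extrapolation; not DeepMind's mechanism.
import numpy as np

def guided_noise_prediction(predict_noise, noisy_audio, positive_embedding,
                            negative_embedding, guidance_scale=3.0):
    """Combine two conditional predictions so the sample moves toward the
    positive prompt and away from the negative prompt."""
    noise_pos = predict_noise(noisy_audio, positive_embedding)
    noise_neg = predict_noise(noisy_audio, negative_embedding)
    # Extrapolate from the "avoid" direction toward the "want" direction.
    return noise_neg + guidance_scale * (noise_pos - noise_neg)

# Toy denoiser and embeddings for demonstration only.
def toy_predict_noise(audio, embedding):
    return audio * 0.1 + embedding.mean()

audio = np.random.default_rng(0).standard_normal(1_000)
positive = np.ones(8)    # e.g. embedding of "cinematic thunder and rain"
negative = np.zeros(8)   # e.g. embedding of "wind noise, distortion"
prediction = guided_noise_prediction(toy_predict_noise, audio, positive, negative)
```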
The team explored both autoregressive and diffusion approaches to find the most effective AI architecture for video-to-audio generation. They found that the diffusion approach produced the most convincing and realistic results for synchronizing generated audio with visuals.
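For readers unfamiliar with the two families, the toy functions below contrast their generation loops: an autoregressive model emits the waveform one sample at a time conditioned on what it has already produced, while a diffusion model starts from noise over the whole clip and refines it jointly. Neither function resembles DeepMind's actual models; they only show the structural difference.

```python
# Illustrative contrast between autoregressive and diffusion generation loops.
# Both "models" are trivial placeholders used only to show the control flow.
import numpy as np

def autoregressive_generate(num_samples, context_size=16, seed=0):
    """Autoregressive: emit audio one sample at a time, conditioned on the past."""
    rng = np.random.default_rng(seed)
    audio = []
    for _ in range(num_samples):
        context = np.array(audio[-context_size:]) if audio else np.zeros(1)
        audio.append(0.9 * context.mean() + 0.1 * rng.standard_normal())
    return np.array(audio)

def diffusion_generate(num_samples, steps=50, seed=0):
    """Diffusion: start from noise over the whole clip and refine it jointly."""
    rng = np.random.default_rng(seed)
    audio = rng.standard_normal(num_samples)
    for _ in range(steps):
        audio = audio - (audio / steps)   # placeholder refinement step
    return audio
```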