MARS5 TTS differs from GPT and Gemini models in that it specializes in text-to-speech synthesis, using a two-stage architecture that pairs an autoregressive (AR) stage with a non-autoregressive (NAR) stage. While GPT and Gemini focus on text generation and understanding, MARS5 is built to produce high-quality, controllable speech. Two features set it apart in speech synthesis: a denoising diffusion probabilistic model (DDPM) in the NAR stage, and prosodic control through the formatting of the input text.
MARS5 TTS offers two inference modes: a fast "shallow clone" that does not require a transcript of the reference audio, and a slower but higher-quality "deep clone" that uses the transcript of the reference (prompt) audio. The trade-off is speed versus fidelity: shallow clone synthesizes more quickly, while deep clone improves output quality and cloning accuracy at the cost of longer processing time.
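The choice between the two modes can be sketched as a small decision rule. This is only an illustration of the constraint described above (deep clone needs a transcript, shallow clone does not); `select_clone_mode` is a hypothetical helper and not part of the MARS5 API.

```python
from typing import Optional


def select_clone_mode(ref_transcript: Optional[str]) -> str:
    """Pick an inference mode from the availability of a reference transcript.

    Hypothetical helper for illustration only; it is not a MARS5 function.
    Returns "deep" when a non-empty transcript is available, else "shallow".
    """
    # Deep clone relies on the transcript of the reference audio,
    # so a missing or whitespace-only transcript forces shallow clone.
    if ref_transcript is not None and ref_transcript.strip():
        return "deep"
    return "shallow"
```

In practice a caller might still prefer shallow clone when latency matters more than cloning accuracy, even if a transcript is available.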
The reference audio for MARS5 TTS should be between 2 and 12 seconds long, with samples of about 6 seconds yielding optimal results.
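The length guideline is easy to check before running synthesis. The sketch below assumes the reference audio is already loaded as a sample count plus a sample rate; the constant and function names are hypothetical and only encode the 2 to 12 second range and the 6 second target stated above.

```python
from typing import Tuple

# Limits taken from the guideline above; the names are illustrative assumptions.
MIN_REF_SECONDS = 2.0
MAX_REF_SECONDS = 12.0
IDEAL_REF_SECONDS = 6.0


def reference_duration(num_samples: int, sample_rate: int) -> float:
    """Return the duration in seconds of a clip with the given sample count."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    return num_samples / sample_rate


def check_reference(num_samples: int, sample_rate: int) -> Tuple[bool, float]:
    """Return (within_range, seconds_away_from_ideal) for a reference clip."""
    duration = reference_duration(num_samples, sample_rate)
    within_range = MIN_REF_SECONDS <= duration <= MAX_REF_SECONDS
    return within_range, abs(duration - IDEAL_REF_SECONDS)
```

For example, a clip of 144,000 samples at an assumed 24 kHz sample rate is exactly 6 seconds long, so `check_reference(144000, 24000)` reports it as within range and zero seconds away from the ideal length.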