The primary applications of Text-to-Speech (TTS) synthesis mentioned in the article include virtual assistants, audiobooks, and accessibility tools. Across these domains, the goal is to build systems that generate speech indistinguishable from a human voice.
VALL-E 2 introduces two key enhancements to improve zero-shot TTS capabilities: repetition-aware sampling and grouped code modeling.
Repetition-aware sampling improves on the random sampling used in VALL-E: at each decoding step it adaptively chooses between nucleus sampling and random sampling, based on how often the candidate token repeats in the recent decoding history. This stabilizes the decoding process and avoids the infinite-loop failures observed in VALL-E.
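To make the mechanism concrete, here is a minimal sketch of repetition-aware sampling in Python. The function names, the window size, the repetition threshold, and the top-p value are illustrative assumptions rather than the paper's actual implementation details.

```python
import numpy as np

def nucleus_sample(probs: np.ndarray, top_p: float, rng: np.random.Generator) -> int:
    """Sample from the smallest set of tokens whose cumulative probability reaches top_p."""
    order = np.argsort(probs)[::-1]                 # tokens sorted by descending probability
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    kept = order[:cutoff]
    return int(rng.choice(kept, p=probs[kept] / probs[kept].sum()))

def repetition_aware_sample(probs: np.ndarray, history: list[int],
                            top_p: float = 0.8, window: int = 10,
                            threshold: float = 0.1,
                            rng: np.random.Generator | None = None) -> int:
    """Sketch: prefer nucleus sampling, but fall back to random sampling over the
    full distribution when the candidate token already repeats too often in the
    recent decoding history (parameter values here are illustrative)."""
    rng = rng or np.random.default_rng()
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    repetition_ratio = recent.count(token) / max(len(recent), 1)
    if repetition_ratio >= threshold:
        # Break potential loops by sampling from the full distribution instead.
        token = int(rng.choice(len(probs), p=probs))
    return token
```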
Grouped code modeling, on the other hand, partitions the codec codes into groups, and each group is modeled as a single frame in the AR decoding process. This both accelerates inference by shortening the sequence and improves quality by mitigating the long-context modeling problem.
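The grouping step itself is simple to illustrate. The sketch below (with a hypothetical function name and a zero-padding assumption for the tail) shows how a sequence of codec codes can be reshaped into groups so that the AR model sees one frame per group, cutting the effective sequence length by the group size.

```python
import numpy as np

def group_codec_codes(codes: np.ndarray, group_size: int) -> np.ndarray:
    """Partition a 1-D sequence of codec codes into consecutive groups so that each
    group is treated as a single AR frame; sequence length shrinks by group_size."""
    pad = (-len(codes)) % group_size                  # pad the tail so length divides evenly
    padded = np.pad(codes, (0, pad), constant_values=0)
    return padded.reshape(-1, group_size)             # shape: (len(codes) // group_size, group_size)

codes = np.arange(10)                 # e.g. 10 codec codes
frames = group_codec_codes(codes, 2)  # 5 AR frames, each carrying 2 codes
```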
Together, these enhancements substantially strengthen VALL-E 2's zero-shot capabilities, enabling it to generate high-quality, natural speech with greater efficiency and robustness.
VALL-E 2 was trained on the Libriheavy dataset and evaluated on the LibriSpeech and VCTK datasets. The evaluation showed significant improvements in zero-shot TTS scenarios. On LibriSpeech, VALL-E 2 achieved a Word Error Rate (WER) of 4.2% with 3-second prompts and 3.8% with 5-second prompts, compared with a ground-truth WER of 4.5%. Its speaker similarity score (SIM) was 0.803 with single sampling and rose to 0.807 with five-time sampling, indicating highly accurate and natural speech synthesis. On VCTK, the model achieved a robustness score of 0.3 and a naturalness score of 4.47 for 3-second prompts, demonstrating strong performance across diverse speakers.
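For reference, the WER figures above follow the standard definition: the word-level edit distance between the ASR transcript and the reference, divided by the number of reference words. A minimal, self-contained computation looks like this (illustrative only, not the paper's evaluation code).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed with a standard Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # 2/6 ≈ 0.33
```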