AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation
1. Abstract
Speaker adaptation in text-to-speech synthesis (TTS) fine-tunes a pre-trained TTS model to adapt it to new target speakers with limited data. While much effort has been devoted to this task, little work has addressed low computational resource scenarios. In this paper, a tiny VITS-based TTS model, named AdaVITS, is proposed for low computing resource speaker adaptation. To effectively reduce parameters and computational complexity, an iSTFT-based decoder is proposed. In addition, the density estimate is shared across flow blocks and scaled-dot attention is replaced with linear attention, which further reduces the parameters and computational complexity. To deal with the instability caused by the simplified model, phonetic posteriorgram (PPG) is used as the linguistic feature via a text-to-PPG module. Experiments show that AdaVITS can generate stable and natural speech in speaker adaptation with 8.97M model parameters and 0.72 GFlops of computational complexity.
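To illustrate why the choice of attention affects computational complexity, the sketch below contrasts scaled-dot attention, which builds a T x T score matrix, with a kernelized linear attention that never forms that matrix. It is a minimal NumPy illustration, not the AdaVITS implementation: the `elu(x) + 1` feature map, the single-head layout, and all function names are assumptions chosen for the example.

```python
import numpy as np


def elu_feature_map(x):
    # Assumed feature map phi(x) = elu(x) + 1, which is strictly positive.
    # For x <= 0, elu(x) + 1 equals exp(x); for x > 0 it equals x + 1.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def scaled_dot_attention(q, k, v):
    # Builds a (T_q, T_k) score matrix, so cost grows with T_q * T_k.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def linear_attention(q, k, v, eps=1e-6):
    # Reorders the product as phi(Q) @ (phi(K)^T @ V), so the cost grows
    # linearly with sequence length for a fixed feature dimension.
    q_f = elu_feature_map(q)
    k_f = elu_feature_map(k)
    kv = k_f.T @ v                  # (d, d_v) summary of keys and values
    z = q_f @ k_f.sum(axis=0)       # (T_q,) per-query normalizer
    return (q_f @ kv) / (z[:, None] + eps)


# Hypothetical shapes: 5 queries, 7 keys/values, feature size 4, value size 3.
rng = np.random.default_rng(0)
q = rng.normal(size=(5, 4))
k = rng.normal(size=(7, 4))
v = rng.normal(size=(7, 3))
out_linear = linear_attention(q, k, v)
out_softmax = scaled_dot_attention(q, k, v)
```

Both functions return an array of shape (T_q, d_v). The two outputs are not numerically identical, since the linear variant replaces the softmax kernel with a different similarity function.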
2. Demos on VCTK
The pre-trained model is trained on LibriTTS. For speaker adaptation, 20 utterances are randomly selected for each speaker.
speaker | Record | AdaVITS v1 | AdaVITS v2 | AdaVITS-e | FS2-o+hifiganv1 | FS2-l+hifiganv2 | VITS |
---|---|---|---|---|---|---|---|
p225 | Text: She can scoop these things into three red bags, and we will go meet her Wednesday at the train station. | ||||||
p225 | Text: The rainbow is a division of white light into many beautiful colors. | ||||||
p227 | Text: Movement has to take place. | ||||||
p227 | Text: However, there is an issue, isn't there? | ||||||
p228 | Text: I am just trying to do my job. | ||||||
p228 | Text: The workers do not want to read about their futures in newspapers. | ||||||
p230 | Text: They can leave at any time. | ||||||
p230 | Text: We think all other measures are not exhausted. | ||||||
p231 | Text: The rate of growth in road traffic is already beginning to slow. | ||||||
p231 | Text: Neither side would reveal the details of the offer. | ||||||
p232 | Text: Many complicated ideas about the rainbow have been formed. | ||||||
p232 | Text: Finally, he paid for the movie. | ||||||
p233 | Text: Of course, we will need to strengthen the squad for Europe. | ||||||
p233 | Text: Nobody else was with me. | ||||||
p243 | Text: They always want to give a performance. | ||||||
p243 | Text: The fans pay for their season tickets. | ||||||
p254 | Text: My mother is a widow. | ||||||
p254 | Text: It was a breathtaking moment. | ||||||
p256 | Text: Her home is perhaps a couple of miles from the town centre. | ||||||
p256 | Text: Chelsea was a great club. | ||||||
Short summary: The proposed AdaVITS achieves better naturalness and lower computational complexity compared with FS2-l+hifiganv2. Although a gap remains relative to VITS and FS2-o+hifiganv1, it has better pronunciation stability, fewer parameters, and lower computational complexity. The result is shown in the table below.
3. Demos on Chinese
The pre-trained model is trained on a closed-source dataset. For speaker adaptation, 20 utterances are randomly selected for each speaker.
3.1 Demos on open source dataset: Blizzard Challenge 2019
speaker | proposed | ||||
---|---|---|---|---|---|
3.2 Demos on closed source dataset
speaker | proposed | ||||
---|---|---|---|---|---|