AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation

0. Contents

  1. Abstract
  2. Demos on VCTK
  3. Demos on Chinese


1. Abstract

Speaker adaptation in text-to-speech (TTS) synthesis aims to finetune a pre-trained TTS model so that it adapts to new target speakers with limited data. While much effort has been devoted to this task, little work has addressed low-computing-resource scenarios. In this paper, a tiny VITS-based TTS model for low-computing-resource speaker adaptation, named AdaVITS, is proposed. To effectively reduce parameters and computational complexity, an iSTFT-based decoder is proposed. In addition, sharing the density estimation across flow blocks and replacing scaled-dot attention with linear attention are introduced to further reduce parameters and computational complexity. To deal with the instability caused by the simplified model, phonetic posteriorgram (PPG) is utilized as the linguistic feature via a text-to-PPG module. Experiments show that AdaVITS can generate stable and natural speech in speaker adaptation with 8.97M model parameters and 0.72 GFLOPs of computational complexity.
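
To make the decoder idea above concrete, here is a minimal PyTorch sketch of an iSTFT-based output stage: it predicts a per-frame magnitude and phase spectrum from the latent frame sequence and reconstructs the waveform with torch.istft instead of transposed-convolution upsampling. All names and layer sizes (ISTFTDecoderSketch, in_channels=192, n_fft=1024, hop_length=256) are illustrative assumptions, not the actual AdaVITS configuration.

    import torch
    import torch.nn as nn

    class ISTFTDecoderSketch(nn.Module):
        """Toy iSTFT-based decoder: predict per-frame magnitude and phase
        spectra from the latent frame sequence, then reconstruct the waveform
        with an inverse STFT instead of learned upsampling layers."""

        def __init__(self, in_channels=192, hidden_channels=256,
                     n_fft=1024, hop_length=256):
            super().__init__()
            self.n_fft = n_fft
            self.hop_length = hop_length
            n_bins = n_fft // 2 + 1
            self.pre = nn.Conv1d(in_channels, hidden_channels, kernel_size=7, padding=3)
            self.mag = nn.Conv1d(hidden_channels, n_bins, kernel_size=7, padding=3)
            self.phase = nn.Conv1d(hidden_channels, n_bins, kernel_size=7, padding=3)

        def forward(self, z):
            # z: (batch, in_channels, frames) latent sequence from the acoustic model
            h = torch.relu(self.pre(z))
            mag = torch.exp(self.mag(h))     # keep the predicted magnitude positive
            phase = self.phase(h)            # predicted phase in radians
            spec = torch.polar(mag, phase)   # complex spectrogram: mag * exp(j * phase)
            window = torch.hann_window(self.n_fft, device=z.device)
            # output length is roughly (frames - 1) * hop_length samples
            return torch.istft(spec, n_fft=self.n_fft, hop_length=self.hop_length,
                               win_length=self.n_fft, window=window)

    # e.g. 100 latent frames -> roughly 25k waveform samples at hop_length=256
    wav = ISTFTDecoderSketch()(torch.randn(1, 192, 100))

Because the upsampling from frame rate to sample rate is handled by the fixed inverse STFT rather than by learned layers, such a decoder only needs enough convolutional capacity to predict the spectra, which is where the parameter and FLOP savings of this design come from.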



2. Demos on VCTK

The pre-trained model is trained on LibriTTS. For speaker adaptation, 20 utterances are randomly selected for each speaker.

Each row below corresponds to one utterance and provides audio samples from: Record, AdaVITS v1, AdaVITS v2, AdaVITS-e, FS2-o+hifiganv1, FS2-l+hifiganv2, and VITS.

p225: She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.
p225: The rainbow is a division of white light into many beautiful colors.
p227: Movement has to take place.
p227: However, there is an issue, isn't there?
p228: I am just trying to do my job.
p228: The workers do not want to read about their futures in newspapers.
p230: They can leave at any time.
p230: We think all other measures are not exhausted.
p231: The rate of growth in road traffic is already beginning to slow.
p231: Neither side would reveal the details of the offer.
p232: Many complicated ideas about the rainbow have been formed.
p232: Finally, he paid for the movie.
p233: Of course, we will need to strengthen the squad for Europe.
p233: Nobody else was with me.
p243: They always want to give a performance.
p243: The fans pay for their season tickets.
p254: My mother is a widow.
p254: It was a breathtaking moment.
p256: Her home is perhaps a couple of miles from the town centre.
p256: Chelsea was a great club.

Short summary: Compared with FS2-l+hifiganv2, the proposed AdaVITS achieves better naturalness with lower computational complexity. Although there is a naturalness gap with VITS and FS2-o+hifiganv1, AdaVITS has better pronunciation stability with fewer parameters and lower computational complexity.

3. Demos on Chinese

The pre-trained model is trained on a closed-source dataset. For speaker adaptation, 20 utterances are randomly selected for each speaker.

3.1 Demos on open source dataset: Blizzard Challenge 2019

speaker | proposed (audio samples)

3.2 Demos on closed source dataset

speaker | proposed (audio samples)