AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation

0. Contents

  1. Abstract
  2. Demos on VCTK
  3. Demos on Chinese


1. Abstract

Speaker adaptation in text-to-speech (TTS) synthesis aims to finetune a pre-trained TTS model so that it adapts to new target speakers with limited data. While much effort has been devoted to this task, little work has addressed low-computing-resource scenarios. In this paper, a tiny VITS-based TTS model for low-computing-resource speaker adaptation, named AdaVITS, is proposed. To effectively reduce parameters and computational complexity, an iSTFT-based decoder is proposed. In addition, sharing the density estimation across flow blocks and replacing scaled-dot attention with linear attention are introduced to further reduce parameters and computational complexity. To deal with the instability caused by the simplified model, phonetic posteriorgram (PPG) is utilized as the linguistic feature via a text-to-PPG module. Experiments show that AdaVITS can generate stable and natural speech in speaker adaptation with 8.97M model parameters and 0.72 GFLOPs of computational complexity.
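
To make the decoder idea above concrete, here is a minimal PyTorch sketch of an iSTFT-based output stage: it predicts a per-frame magnitude and phase spectrum from the latent frame sequence and reconstructs the waveform with torch.istft instead of transposed-convolution upsampling. All names and layer sizes (ISTFTDecoderSketch, in_channels=192, n_fft=1024, hop_length=256) are illustrative assumptions, not the actual AdaVITS configuration.

    import torch
    import torch.nn as nn

    class ISTFTDecoderSketch(nn.Module):
        """Toy iSTFT-based decoder: predict per-frame magnitude and phase
        spectra from the latent frame sequence, then reconstruct the waveform
        with an inverse STFT instead of learned upsampling layers."""

        def __init__(self, in_channels=192, hidden_channels=256,
                     n_fft=1024, hop_length=256):
            super().__init__()
            self.n_fft = n_fft
            self.hop_length = hop_length
            n_bins = n_fft // 2 + 1
            self.pre = nn.Conv1d(in_channels, hidden_channels, kernel_size=7, padding=3)
            self.mag = nn.Conv1d(hidden_channels, n_bins, kernel_size=7, padding=3)
            self.phase = nn.Conv1d(hidden_channels, n_bins, kernel_size=7, padding=3)

        def forward(self, z):
            # z: (batch, in_channels, frames) latent sequence from the acoustic model
            h = torch.relu(self.pre(z))
            mag = torch.exp(self.mag(h))     # keep the predicted magnitude positive
            phase = self.phase(h)            # predicted phase in radians
            spec = torch.polar(mag, phase)   # complex spectrogram: mag * exp(j * phase)
            window = torch.hann_window(self.n_fft, device=z.device)
            # output length is roughly (frames - 1) * hop_length samples
            return torch.istft(spec, n_fft=self.n_fft, hop_length=self.hop_length,
                               win_length=self.n_fft, window=window)

    # e.g. 100 latent frames -> roughly 25k waveform samples at hop_length=256
    wav = ISTFTDecoderSketch()(torch.randn(1, 192, 100))

Because the upsampling from frame rate to sample rate is handled by the fixed inverse STFT rather than by learned layers, such a decoder only needs enough convolutional capacity to predict the spectra, which is where the parameter and FLOP savings of this design come from.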



2. Demos on VCTK

The pre-trained model is trained on LibriTTS. For speaker adaptation, 20 utterances are randomly selected for each speaker.

Each row below corresponds to one utterance and provides audio samples from: Record, AdaVITS v1, AdaVITS v2, AdaVITS-e, FS2-o+hifiganv1, FS2-l+hifiganv2, and VITS.

p225: She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.
p225: The rainbow is a division of white light into many beautiful colors.
p227: Movement has to take place.
p227: However, there is an issue, isn't there?
p228: I am just trying to do my job.
p228: The workers do not want to read about their futures in newspapers.
p230: They can leave at any time.
p230: We think all other measures are not exhausted.
p231: The rate of growth in road traffic is already beginning to slow.
p231: Neither side would reveal the details of the offer.
p232: Many complicated ideas about the rainbow have been formed.
p232: Finally, he paid for the movie.
p233: Of course, we will need to strengthen the squad for Europe.
p233: Nobody else was with me.
p243: They always want to give a performance.
p243: The fans pay for their season tickets.
p254: My mother is a widow.
p254: It was a breathtaking moment.
p256: Her home is perhaps a couple of miles from the town centre.
p256: Chelsea was a great club.

Short summary: Compared with FS2-l+hifiganv2, the proposed AdaVITS achieves better naturalness with lower computational complexity. Although there is a naturalness gap with VITS and FS2-o+hifiganv1, AdaVITS has better pronunciation stability with fewer parameters and lower computational complexity.

3. Demos on Chinese

The pre-trained model is trained on a closed-source dataset. For speaker adaptation, 20 utterances are randomly selected for each speaker.

3.1 Demos on open source dataset: Blizzard Challenge 2019

speaker | proposed (audio samples)

3.2 Demos on closed source dataset

speaker | proposed (audio samples)