Audio samples for "FCL-taco2: Towards fast, controllable and lightweight text-to-speech synthesis"
Authors: Disong Wang, Liqun Deng, Yang Zhang, Nianzu Zheng, Yu Ting Yueng, Xiao Chen, Xunying Liu and Helen Meng
Contents
1. Method comparison
- GT: Ground-truth speech
- GT(Mel+PWG): Speech synthesized from GT mel-spectrograms with the Parallel WaveGAN vocoder
- Tacotron2: A state-of-the-art autoregressive TTS model
- Fastspeech2: A state-of-the-art non-autoregressive TTS model
- FCL-taco2-T: Proposed teacher FCL-taco2 model, similar in size to Tacotron2
- FCL-taco2-S: Proposed student FCL-taco2 model with a much smaller footprint
1.1. English experiments
No. | GT | GT(Mel+PWG) | Tacotron2 | Fastspeech2 | FCL-taco2-T | FCL-taco2-S | Text |
1 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | Agents are instructed that it is not their responsibility to investigate or evaluate a present danger. |
2 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | that Oswald had told him that he had worked and been married in the Soviet Union. |
3 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | unemployment, automation, and the use of military forces to suppress other populations. |
4 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | He stated several times that he was a Communist but apparently never joined any Communist Party. |
5 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | and becoming merely the recipient of information gathered by others would become limited solely to acts of physical alertness and personal courage. |
1.2. Chinese experiments
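No. | GT | GT(Mel+PWG) | Tacotron2 | Fastspeech2 | FCL-taco2-T | FCL-taco2-S | Text |
1 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | 不容易被激流冲走,还有利于它潜泳,所以它爱吞石块。 |
2 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | 当它们脱离原有运行轨道后,散落到地球表面,那就是陨石。 |
3 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | 是机器人,机器人不允许玩过山车。 |
4 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | 然后集中地反射出去,所以夜晚看起来好像会发光。 |
5 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | 分布在宇宙的细小物体,滑过大气层时会发光发热,这就是流星雨了。 |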
2. Impact of different knowledge distillation strategies:
- FCL-taco2-S: student model trained with all three proposed knowledge distillation strategies
- w/o MSD: without Mel-Spectrogram Distillation
- w/o HRD: without Hidden Representation Distillation
- w/o PD: without Prosody Distillation
- w/o MSD+PD: without Mel-Spectrogram Distillation and Prosody Distillation
- w/o KD: without Knowledge Distillation
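The ablation settings listed above can be written down as a small lookup table. The sketch below is illustrative only and is not the authors' training code: the names `ABLATIONS` and `distillation_loss`, the example loss values, and the unweighted sum of the enabled terms are all assumptions, since this page does not say how the distillation terms are combined.

```python
# Which distillation terms are active in each system (taken from the list above).
# MSD = Mel-Spectrogram Distillation, HRD = Hidden Representation Distillation,
# PD = Prosody Distillation.
ABLATIONS = {
    "FCL-taco2-S": {"MSD": True,  "HRD": True,  "PD": True},
    "w/o MSD":     {"MSD": False, "HRD": True,  "PD": True},
    "w/o HRD":     {"MSD": True,  "HRD": False, "PD": True},
    "w/o PD":      {"MSD": True,  "HRD": True,  "PD": False},
    "w/o MSD+PD":  {"MSD": False, "HRD": True,  "PD": False},
    "w/o KD":      {"MSD": False, "HRD": False, "PD": False},
}


def distillation_loss(terms, enabled):
    """Sum the distillation terms that are switched on.

    An unweighted sum is an assumption made for illustration only.
    """
    return sum(value for name, value in terms.items() if enabled[name])


# Hypothetical per-term loss values for a single training step.
example_terms = {"MSD": 0.5, "HRD": 0.25, "PD": 0.125}

totals = {
    system: distillation_loss(example_terms, enabled)
    for system, enabled in ABLATIONS.items()
}
```

With these made-up values, the full student model sums all three terms, "w/o MSD+PD" keeps only the HRD term, and "w/o KD" has no distillation term at all.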
2.1. English experiments
No. | FCL-taco2-S | w/o MSD | w/o HRD | w/o PD | w/o MSD+PD | w/o KD | Text |
1 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | and becoming merely the recipient of information gathered by others would become limited solely to acts of physical alertness and personal courage. |
2 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | Agents are instructed that it is not their responsibility to investigate or evaluate a present danger. |
3 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | that Oswald had told him that he had worked and been married in the Soviet Union. |
4 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | unemployment, automation, and the use of military forces to suppress other populations. |
5 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | He stated several times that he was a Communist but apparently never joined any Communist Party. |
2.2. Chinese experiments
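No. | FCL-taco2-S | w/o MSD | w/o HRD | w/o PD | w/o MSD+PD | w/o KD | Text |
1 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | 不容易被激流冲走,还有利于它潜泳,所以它爱吞石块。 |
2 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | 当它们脱离原有运行轨道后,散落到地球表面,那就是陨石。 |
3 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | 是机器人,机器人不允许玩过山车。 |
4 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | 然后集中地反射出去,所以夜晚看起来好像会发光。 |
5 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | 分布在宇宙的细小物体,滑过大气层时会发光发热,这就是流星雨了。 |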
3. Prosody manipulation:
3.1. Pitch manipulation: speech is generated from the predicted F0 multiplied by a ratio (r)
- r=1: F0 <- F0
- r=0.5: F0 <- F0 x 0.5
- r=0.75: F0 <- F0 x 0.75
- r=1.25: F0 <- F0 x 1.25
- r=1.5: F0 <- F0 x 1.5
- r=↗: r linearly increases from 0.5 to 1.5 phoneme by phoneme
- r=↘: r linearly decreases from 1.5 to 0.5 phoneme by phoneme
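The scaling rules above can be sketched in a few lines of Python. This is an illustration rather than the authors' code: it assumes the model predicts one F0 value per phoneme, and the names `ratio_schedule` and `scale_f0` as well as the example F0 values are hypothetical.

```python
def ratio_schedule(num_phonemes, start, end):
    """Return one ratio per phoneme, changing linearly from start to end."""
    if num_phonemes == 1:
        return [start]
    step = (end - start) / (num_phonemes - 1)
    return [start + i * step for i in range(num_phonemes)]


def scale_f0(f0, ratios):
    """Multiply each phoneme-level F0 value by its ratio.

    A single number is applied to every phoneme; a list must supply
    exactly one ratio per phoneme.
    """
    if isinstance(ratios, (int, float)):
        ratios = [ratios] * len(f0)
    if len(ratios) != len(f0):
        raise ValueError("need exactly one ratio per phoneme")
    return [value * ratio for value, ratio in zip(f0, ratios)]


# Hypothetical predicted F0 values in Hz, one per phoneme.
predicted_f0 = [200.0, 210.0, 190.0, 180.0, 220.0]

halved = scale_f0(predicted_f0, 0.5)  # r=0.5
rising = scale_f0(predicted_f0, ratio_schedule(len(predicted_f0), 0.5, 1.5))  # r=↗
falling = scale_f0(predicted_f0, ratio_schedule(len(predicted_f0), 1.5, 0.5))  # r=↘
```

For five phonemes the rising schedule is 0.5, 0.75, 1.0, 1.25, 1.5, so the first phoneme is lowered by half and the last is raised by half. The manipulated F0 would then be passed to the decoder in place of the predicted values.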
3.1.1. English experiments
No. | r=1 | r=0.5 | r=0.75 | r=1.25 | r=1.5 | r=↗ | r=↘ | Text |
1 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | In no characters is the contrast between the ugly and vulgar illegibility of the modern type. |
2 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | The due relation of letter to pictures and other ornament was thoroughly understood by the old printers; so that |
3 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | as it was occupied and appropriated in eighteen ten. |
3.1.2. Chinese experiments
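No. | r=1 | r=0.5 | r=0.75 | r=1.25 | r=1.5 | r=↗ | r=↘ | Text |
1 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | 就是嘛,摔跤手防守严密,无懈可击。 |
2 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | 为了能让政府继续资助。 |
3 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | 他们的射门击中门框次数多达6次。 |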
3.2. Duration manipulation: speech is generated from the predicted duration (d) multiplied by a ratio (r)
- r=1: d <- d
- r=0.5: d <- d x 0.5
- r=0.75: d <- d x 0.75
- r=1.25: d <- d x 1.25
- r=1.5: d <- d x 1.5
- r=↗: r linearly increases from 0.5 to 1.5 phoneme by phoneme
- r=↘: r linearly decreases from 1.5 to 0.5 phoneme by phoneme
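Duration scaling can be sketched the same way. The code below is a rough illustration, not the authors' implementation: it assumes durations are per-phoneme frame counts, rounds scaled durations to the nearest frame with a floor of one frame (an assumption, since this page does not describe the rounding), and uses the hypothetical names `ratio_schedule`, `scale_durations` and `expand` with made-up example values.

```python
def ratio_schedule(num_phonemes, start, end):
    """Return one ratio per phoneme, changing linearly from start to end."""
    if num_phonemes == 1:
        return [start]
    step = (end - start) / (num_phonemes - 1)
    return [start + i * step for i in range(num_phonemes)]


def scale_durations(durations, ratios):
    """Scale per-phoneme frame counts and round to whole frames.

    Rounding half up and keeping at least one frame per phoneme are
    illustrative choices, not something stated by the source.
    """
    if isinstance(ratios, (int, float)):
        ratios = [ratios] * len(durations)
    if len(ratios) != len(durations):
        raise ValueError("need exactly one ratio per phoneme")
    return [max(1, int(d * r + 0.5)) for d, r in zip(durations, ratios)]


def expand(phonemes, durations):
    """Repeat each phoneme for its number of frames (length regulation)."""
    frames = []
    for phoneme, count in zip(phonemes, durations):
        frames.extend([phoneme] * count)
    return frames


# Hypothetical predicted durations in frames, one per phoneme.
predicted_d = [4, 6, 3, 8, 5]

faster = scale_durations(predicted_d, 0.5)  # r=0.5, shorter phonemes
slowing = scale_durations(predicted_d, ratio_schedule(len(predicted_d), 0.5, 1.5))  # r=↗
speeding = scale_durations(predicted_d, ratio_schedule(len(predicted_d), 1.5, 0.5))  # r=↘
```

A ratio below 1 shortens each phoneme and so speeds the speech up, while a ratio above 1 slows it down. With r=↗ the utterance therefore starts fast and ends slow.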
3.2.1. English experiments
No. | r=1 | r=0.5 | r=0.75 | r=1.25 | r=1.5 | r=↗ | r=↘ | Text |
1 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | In no characters is the contrast between the ugly and vulgar illegibility of the modern type. |
2 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | The due relation of letter to pictures and other ornament was thoroughly understood by the old printers; so that |
3 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | as it was occupied and appropriated in eighteen ten. |
3.2.2. Chinese experiments
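No. | r=1 | r=0.5 | r=0.75 | r=1.25 | r=1.5 | r=↗ | r=↘ | Text |
1 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | 就是嘛,摔跤手防守严密,无懈可击。 |
2 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | 为了能让政府继续资助。 |
3 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | 他们的射门击中门框次数多达6次。 |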