Demo for Multi-speaker Video-to-Speech Synthesis
The Chinese University of Hong Kong, Tencent AI Lab
Video-to-Speech on GRID - Seen speakers (s1, s2, s4, and s29), where part of each speaker's data was used for training
Note: GL denotes the Griffin-Lim algorithm and PWG denotes the Parallel WaveGAN vocoder.
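For context on the "GL" condition, below is a minimal sketch of the classic Griffin-Lim algorithm using SciPy: it recovers a waveform from a magnitude spectrogram by alternating projections between the magnitude constraint and STFT consistency. The FFT size, hop length, and iteration count here are illustrative assumptions, not the settings used in this demo.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_fft=512, hop=128, n_iter=32, seed=0):
    """Reconstruct a waveform from a magnitude spectrogram by
    alternating projections (Griffin & Lim, 1984). Parameters are
    illustrative, not the demo's actual configuration."""
    rng = np.random.default_rng(seed)
    # Start from random phase; each iteration refines it.
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    x = None
    for _ in range(n_iter):
        # Invert the current complex spectrogram to a time-domain signal,
        _, x = istft(mag * phase, nperseg=n_fft, noverlap=n_fft - hop)
        # then keep only its phase, re-imposing the target magnitude.
        _, _, Z = stft(x, nperseg=n_fft, noverlap=n_fft - hop)
        phase = np.exp(1j * np.angle(Z))
    return x

# Example: round-trip the magnitude spectrogram of a synthetic 440 Hz tone.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
_, _, Z = stft(tone, nperseg=512, noverlap=512 - 128)
recon = griffin_lim(np.abs(Z))
```

In practice the demo's PWG condition replaces this iterative procedure with a trained neural vocoder, which generally yields more natural phase and fewer artifacts.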
s1 (Seen):
No. | Silent video | Ground-truth | XTS | Lip2Wav | VCVTS + GL (ours) | VCVTS + PWG (ours) | Text |
---|---|---|---|---|---|---|---|
1 | | | | | | | Bin blue at s three again |
2 | | | | | | | Place green with l one soon |
3 | | | | | | | Set white with i eight now |
s4 (Seen):
No. | Silent video | Ground-truth | XTS | Lip2Wav | VCVTS + GL (ours) | VCVTS + PWG (ours) | Text |
---|---|---|---|---|---|---|---|
1 | | | | | | | Bin blue in r six again |
2 | | | | | | | Place green at d six soon |
3 | | | | | | | Set white by b nine please |
s2 (Seen):
No. | Silent video | Ground-truth | XTS | Lip2Wav | VCVTS + GL (ours) | VCVTS + PWG (ours) | Text |
---|---|---|---|---|---|---|---|
1 | | | | | | | Bin blue in l four again |
2 | | | | | | | Lay green with a zero again |
3 | | | | | | | Place white at j five now |
s29 (Seen):
No. | Silent video | Ground-truth | XTS | Lip2Wav | VCVTS + GL (ours) | VCVTS + PWG (ours) | Text |
---|---|---|---|---|---|---|---|
1 | | | | | | | Bin green at k six now |
2 | | | | | | | Lay red by v one soon |
3 | | | | | | | Place green in n seven again |
s1 (Unseen):
No. | Silent video | Ground-truth | XTS | Lip2Wav | VCVTS + GL (ours) | VCVTS + PWG (ours) | Text |
---|---|---|---|---|---|---|---|
1 | | | | | | | Bin blue at s three again |
2 | | | | | | | Place green with l one soon |
3 | | | | | | | Set white with i eight now |
s4 (Unseen):
No. | Silent video | Ground-truth | XTS | Lip2Wav | VCVTS + GL (ours) | VCVTS + PWG (ours) | Text |
---|---|---|---|---|---|---|---|
1 | | | | | | | Bin blue in r six again |
2 | | | | | | | Place green at d six soon |
3 | | | | | | | Set white by b nine please |
s2 (Unseen):
No. | Silent video | Ground-truth | XTS | Lip2Wav | VCVTS + GL (ours) | VCVTS + PWG (ours) | Text |
---|---|---|---|---|---|---|---|
1 | | | | | | | Bin blue in l four again |
2 | | | | | | | Lay green with a zero again |
3 | | | | | | | Place white at j five now |
s29 (Unseen):
No. | Silent video | Ground-truth | XTS | Lip2Wav | VCVTS + GL (ours) | VCVTS + PWG (ours) | Text |
---|---|---|---|---|---|---|---|
1 | | | | | | | Bin green at k six now |
2 | | | | | | | Lay red by v one soon |
3 | | | | | | | Place green in n seven again |
Video-to-Speech on LRW - Unseen speakers
No. | Silent video | Ground-truth | Lip2Wav | VCVTS + GL (ours) | VCVTS + PWG (ours) | Text |
---|---|---|---|---|---|---|
1 | | | | | | Accused |
2 | | | | | | Believe |
3 | | | | | | Company |
4 | | | | | | Extra |
5 | | | | | | Further |
6 | | | | | | Growing |
7 | | | | | | Happen |
8 | | | | | | Impact |
9 | | | | | | National |
10 | | | | | | Return |
11 | | | | | | Scottish |
12 | | | | | | Yesterday |
Speaker identity control
For each video, the lip sequence is randomly selected from the test split of LRW; the waveform is synthesized with either Griffin-Lim or Parallel WaveGAN, as indicated below.
Using Griffin-Lim to synthesize the waveform:
No. | Ground-truth | Generate-from-gt-spk | Generate-from-{LRW-test-ANNOUNCED_00001} | Generate-from-{GRID-s2-bwwn5n} | Generate-from-{LRW-test-MEETING_00002} | Generate-from-{GRID-s29-prag2n} | Text |
---|---|---|---|---|---|---|---|
1 | | | | | | | Different |
2 | | | | | | | Increase |
3 | | | | | | | Labour |
4 | | | | | | | Nothing |
Using Parallel WaveGAN to synthesize the waveform:
No. | Ground-truth | Generate-from-gt-spk | Generate-from-{LRW-test-INFORMATION_00002} | Generate-from-{LRW-test-ISLAMIC_00001} | Generate-from-{GRID-s4-lrby7p} | Generate-from-{GRID-s1-pbivzp} | Text |
---|---|---|---|---|---|---|---|
1 | | | | | | | Authorities |
2 | | | | | | | Better |
3 | | | | | | | Education |
4 | | | | | | | Following |