Demo for Multi-speaker Video-to-Speech Synthesis

Disong Wang1      Shan Yang2      Dan Su2      Xunying Liu1      Dong Yu2      Helen Meng1     

1The Chinese University of Hong Kong, 2Tencent AI Lab



s1 (Seen):

No. Silent video Ground-truth XTS Lip2Wav VCVTS + GL (ours) VCVTS + PWG (ours) Text
1 Bin blue at s three again
2 Place green with l one soon
3 Set white with i eight now


s4 (Seen):

No. Silent video Ground-truth XTS Lip2Wav VCVTS + GL (ours) VCVTS + PWG (ours) Text
1 Bin blue in r six again
2 Place green at d six soon
3 Set white by b nine please


s2 (Seen):

No. Silent video Ground-truth XTS Lip2Wav VCVTS + GL (ours) VCVTS + PWG (ours) Text
1 Bin blue in l four again
2 Lay green with a zero again
3 Place white at j five now


s29 (Seen):

No. Silent video Ground-truth XTS Lip2Wav VCVTS + GL (ours) VCVTS + PWG (ours) Text
1 Bin green at k six now
2 Lay red by v one soon
3 Place green in n seven again



s1 (Unseen):

No. Silent video Ground-truth XTS Lip2Wav VCVTS + GL (ours) VCVTS + PWG (ours) Text
1 Bin blue at s three again
2 Place green with l one soon
3 Set white with i eight now


s4 (Unseen):

No. Silent video Ground-truth XTS Lip2Wav VCVTS + GL (ours) VCVTS + PWG (ours) Text
1 Bin blue in r six again
2 Place green at d six soon
3 Set white by b nine please


s2 (Unseen):

No. Silent video Ground-truth XTS Lip2Wav VCVTS + GL (ours) VCVTS + PWG (ours) Text
1 Bin blue in l four again
2 Lay green with a zero again
3 Place white at j five now


s29 (Unseen):

No. Silent video Ground-truth XTS Lip2Wav VCVTS + GL (ours) VCVTS + PWG (ours) Text
1 Bin green at k six now
2 Lay red by v one soon
3 Place green in n seven again



Video-to-Speech on LRW - Unseen speakers

  • LRW provides no speaker labels, so we assume that each video has a unique speaker identity and treat all test speakers as unseen.
  • LRW annotates each video only with the word spoken in it, so the 'Text' column of the following table shows only that word.
    No. Silent video Ground-truth Lip2Wav VCVTS + GL (ours) VCVTS + PWG (ours) Text
    1 Accused
    2 Believe
    3 Company
    4 Extra
    5 Further
    6 Growing
    7 Happen
    8 Impact
    9 National
    10 Return
    11 Scottish
    12 Yesterday



    Speaker identity control

    For each video, the lip sequence is randomly selected from the test split of LRW, and PWG is used to synthesize the waveform.

  • Ground-truth: Original video.
  • Generate-from-gt-spk: Speech generated from the ground-truth speaker.
  • Generate-from-{ref-utterance}: Speech generated by using {ref-utterance} to control the speaker identity.
    Using Griffin-Lim to synthesize waveform:

    No. Ground-truth Generate-from-gt-spk Generate-from-{LRW-test-ANNOUNCED_00001} Generate-from-{GRID-s2-bwwn5n} Generate-from-{LRW-test-MEETING_00002} Generate-from-{GRID-s29-prag2n} Text
    1 Different
    2 Increase
    3 Labour
    4 Nothing


    Using Parallel WaveGAN to synthesize waveform:

    No. Ground-truth Generate-from-gt-spk Generate-from-{LRW-test-INFORMATION_00002} Generate-from-{LRW-test-ISLAMIC_00001} Generate-from-{GRID-s4-lrby7p} Generate-from-{GRID-s1-pbivzp} Text
    1 Authorities
    2 Better
    3 Education
    4 Following
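
The reference-utterance labels in the column headers above (for example `GRID-s2-bwwn5n` or `LRW-test-ANNOUNCED_00001`) can be expanded into readable text. The following is a minimal sketch, not part of the authors' code: the function names are hypothetical, the six-character GRID code is decoded using the GRID corpus file-naming convention (command, color, preposition, letter, digit, adverb), and the `corpus-split-WORD_index` layout for LRW is an assumption inferred from the labels on this page.

```python
# Hypothetical helpers for reading the reference-utterance labels on this page.
# GRID utterance codes have six characters, one per word of the sentence.
COMMAND = {"b": "bin", "l": "lay", "p": "place", "s": "set"}
COLOR = {"b": "blue", "g": "green", "r": "red", "w": "white"}
PREPOSITION = {"a": "at", "b": "by", "i": "in", "w": "with"}
DIGIT = {
    "z": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}
ADVERB = {"a": "again", "n": "now", "p": "please", "s": "soon"}


def decode_grid_code(code):
    """Expand a GRID utterance code such as 'bwwn5n' into its sentence."""
    if len(code) != 6:
        raise ValueError(f"expected a 6-character GRID code, got {code!r}")
    command, color, preposition, letter, digit, adverb = code
    return " ".join([
        COMMAND[command],
        COLOR[color],
        PREPOSITION[preposition],
        letter,  # the letter is spoken as-is
        DIGIT[digit],
        ADVERB[adverb],
    ])


def parse_reference_label(label):
    """Split a label from the table header into corpus, source, and text."""
    corpus, rest = label.split("-", 1)
    if corpus == "GRID":
        # Assumed layout: GRID-<speaker>-<utterance code>
        speaker, code = rest.split("-", 1)
        return {"corpus": corpus, "speaker": speaker,
                "utterance": code, "text": decode_grid_code(code)}
    if corpus == "LRW":
        # Assumed layout: LRW-<split>-<WORD>_<index>
        split, utterance = rest.split("-", 1)
        word = utterance.rsplit("_", 1)[0]
        return {"corpus": corpus, "split": split,
                "utterance": utterance, "text": word.capitalize()}
    raise ValueError(f"unknown corpus in label {label!r}")


if __name__ == "__main__":
    for label in ["GRID-s2-bwwn5n", "LRW-test-ANNOUNCED_00001"]:
        print(label, "->", parse_reference_label(label))
```

Under these assumptions, `GRID-s2-bwwn5n` refers to speaker s2 saying "bin white with n five now", which follows the same sentence pattern as the GRID transcripts listed in the 'Text' column of the tables above.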