Demo for Multi-speaker Video-to-Speech Synthesis
The Chinese University of Hong Kong, Tencent AI Lab
Video-to-Speech on GRID - Seen speakers (s1, s2, s4, and s29), where part of each speaker's data was used for training
Note: GL denotes the Griffin-Lim algorithm and PWG denotes the Parallel WaveGAN vocoder.
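For context on the "GL" condition, below is a minimal sketch of the classic Griffin-Lim algorithm using SciPy: it recovers a waveform from a magnitude spectrogram by alternating projections between the magnitude constraint and STFT consistency. The FFT size, hop length, and iteration count here are illustrative assumptions, not the settings used in this demo.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_fft=512, hop=128, n_iter=32, seed=0):
    """Reconstruct a waveform from a magnitude spectrogram by
    alternating projections (Griffin & Lim, 1984). Parameters are
    illustrative, not the demo's actual configuration."""
    rng = np.random.default_rng(seed)
    # Start from random phase; each iteration refines it.
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    x = None
    for _ in range(n_iter):
        # Invert the current complex spectrogram to a time-domain signal,
        _, x = istft(mag * phase, nperseg=n_fft, noverlap=n_fft - hop)
        # then keep only its phase, re-imposing the target magnitude.
        _, _, Z = stft(x, nperseg=n_fft, noverlap=n_fft - hop)
        phase = np.exp(1j * np.angle(Z))
    return x

# Example: round-trip the magnitude spectrogram of a synthetic 440 Hz tone.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
_, _, Z = stft(tone, nperseg=512, noverlap=512 - 128)
recon = griffin_lim(np.abs(Z))
```

In practice the demo's PWG condition replaces this iterative procedure with a trained neural vocoder, which generally yields more natural phase and fewer artifacts.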
s1 (Seen):
No. | Silent video | Ground-truth | XTS | Lip2Wav | VCVTS + GL (ours) | VCVTS + PWG (ours) | Text |
---|---|---|---|---|---|---|---|
1 | | | | | | | Bin blue at s three again |
2 | | | | | | | Place green with l one soon |
3 | | | | | | | Set white with i eight now |
s4 (Seen):
No. | Silent video | Ground-truth | XTS | Lip2Wav | VCVTS + GL (ours) | VCVTS + PWG (ours) | Text |
---|---|---|---|---|---|---|---|
1 | | | | | | | Bin blue in r six again |
2 | | | | | | | Place green at d six soon |
3 | | | | | | | Set white by b nine please |
s2 (Seen):
No. | Silent video | Ground-truth | XTS | Lip2Wav | VCVTS + GL (ours) | VCVTS + PWG (ours) | Text |
---|---|---|---|---|---|---|---|
1 | | | | | | | Bin blue in l four again |
2 | | | | | | | Lay green with a zero again |
3 | | | | | | | Place white at j five now |
s29 (Seen):
No. | Silent video | Ground-truth | XTS | Lip2Wav | VCVTS + GL (ours) | VCVTS + PWG (ours) | Text |
---|---|---|---|---|---|---|---|
1 | | | | | | | Bin green at k six now |
2 | | | | | | | Lay red by v one soon |
3 | | | | | | | Place green in n seven again |
s1 (Unseen):
No. | Silent video | Ground-truth | XTS | Lip2Wav | VCVTS + GL (ours) | VCVTS + PWG (ours) | Text |
---|---|---|---|---|---|---|---|
1 | | | | | | | Bin blue at s three again |
2 | | | | | | | Place green with l one soon |
3 | | | | | | | Set white with i eight now |
s4 (Unseen):
No. | Silent video | Ground-truth | XTS | Lip2Wav | VCVTS + GL (ours) | VCVTS + PWG (ours) | Text |
---|---|---|---|---|---|---|---|
1 | | | | | | | Bin blue in r six again |
2 | | | | | | | Place green at d six soon |
3 | | | | | | | Set white by b nine please |
s2 (Unseen):
No. | Silent video | Ground-truth | XTS | Lip2Wav | VCVTS + GL (ours) | VCVTS + PWG (ours) | Text |
---|---|---|---|---|---|---|---|
1 | | | | | | | Bin blue in l four again |
2 | | | | | | | Lay green with a zero again |
3 | | | | | | | Place white at j five now |
s29 (Unseen):
No. | Silent video | Ground-truth | XTS | Lip2Wav | VCVTS + GL (ours) | VCVTS + PWG (ours) | Text |
---|---|---|---|---|---|---|---|
1 | | | | | | | Bin green at k six now |
2 | | | | | | | Lay red by v one soon |
3 | | | | | | | Place green in n seven again |
Video-to-Speech on LRW - Unseen speakers
No. | Silent video | Ground-truth | Lip2Wav | VCVTS + GL (ours) | VCVTS + PWG (ours) | Text |
---|---|---|---|---|---|---|
1 | | | | | | Accused |
2 | | | | | | Believe |
3 | | | | | | Company |
4 | | | | | | Extra |
5 | | | | | | Further |
6 | | | | | | Growing |
7 | | | | | | Happen |
8 | | | | | | Impact |
9 | | | | | | National |
10 | | | | | | Return |
11 | | | | | | Scottish |
12 | | | | | | Yesterday |
Speaker identity control
For each video, the lip sequence is randomly selected from the test split of LRW; the waveform is synthesized with either Griffin-Lim or Parallel WaveGAN, as indicated below.
Using Griffin-Lim to synthesize the waveform:
No. | Ground-truth | Generate-from-gt-spk | Generate-from-{LRW-test-ANNOUNCED_00001} | Generate-from-{GRID-s2-bwwn5n} | Generate-from-{LRW-test-MEETING_00002} | Generate-from-{GRID-s29-prag2n} | Text |
---|---|---|---|---|---|---|---|
1 | | | | | | | Different |
2 | | | | | | | Increase |
3 | | | | | | | Labour |
4 | | | | | | | Nothing |
Using Parallel WaveGAN to synthesize the waveform:
No. | Ground-truth | Generate-from-gt-spk | Generate-from-{LRW-test-INFORMATION_00002} | Generate-from-{LRW-test-ISLAMIC_00001} | Generate-from-{GRID-s4-lrby7p} | Generate-from-{GRID-s1-pbivzp} | Text |
---|---|---|---|---|---|---|---|
1 | | | | | | | Authorities |
2 | | | | | | | Better |
3 | | | | | | | Education |
4 | | | | | | | Following |