(1) The proposed method effectively removes incorrect articulation repetitions. For example, speaker M05 said 'whiskey key' for 'whiskey' and 'ab ablutions' for 'ablutions', while the reconstructed speech has the correct articulations 'whiskey' and 'ablutions'. Please listen to M05-No.2 and M05-No.4 for details.
(2) Compared with ASR-TTS, the proposed method preserves content more similar to the original speech. Please listen to M07-No.8, M07-No.10, F03-No.5 and F03-No.9 for details. For ASR-TTS, when the ASR results are wrong, the TTS generates speech with entirely wrong content. In contrast, the proposed method extracts and uses the appropriate linguistic representations to generate speech that preserves more of the original content, so the reconstructed speech sounds closer to the original.
Dysarthric speech reconstruction for different speakers
Four dysarthric speakers from four groups with different speech intelligibility are used in the experiments: F05 (high), M05 (mid), M07 (low) and F03 (very low). 'F' and 'M' denote female and male, respectively.
The proposed method can be extended to other conversion tasks, such as speaker identity, emotion, speaking style and accent conversion.
By replacing the proposed single-speaker TTS with a multi-speaker TTS, the system could generate high-quality speech that preserves both speaker identity and content; we leave this as future work.