(1) The proposed method effectively removes incorrect articulation repetitions. For example, speaker M05 said 'whiskey key' for 'whiskey' and 'ab ablutions' for 'ablutions', while the reconstructed speech has the correct articulations 'whiskey' and 'ablutions'. Please listen to M05-No.2 and M05-No.4 for details.
(2) Compared with ASR-TTS, the proposed method preserves content more similar to the original speech. Please listen to M07-No.8, M07-No.10, F03-No.5 and F03-No.9 for details. For ASR-TTS, when the ASR results are wrong, the TTS generates speech with entirely wrong content. In contrast, the proposed method extracts and uses the appropriate linguistic representations to generate speech that preserves more of the original content, so the reconstructed speech sounds closer to the original.
Dysarthric speech reconstruction for different speakers
Four dysarthric speakers from four groups with different speech intelligibility are used in the experiments: F05 (high), M05 (mid), M07 (low) and F03 (very low). 'F' and 'M' denote female and male, respectively.
The proposed method can be extended to other conversion tasks, such as speaker identity, emotion, speaking style and accent conversion.
By replacing the proposed single-speaker TTS with a multi-speaker TTS, the system could generate high-quality speech that preserves both speaker identity and content; we leave this as future work.