Comparison of Multi-Scale Speaker Vectors and S-Vectors for Zero-Shot Speech Synthesis

Abstract

We compare a novel speaker encoder model, Multi-Scale Speaker (MSS) vectors, with the state-of-the-art s-vectors model for zero-shot speech synthesis. The s-vectors model is built on a modified transformer self-attention network. The MSS vectors model extends the s-vectors model with a multi-scale approach. Results demonstrate that, in a zero-shot speech synthesis system, our model produces synthesized speech for unseen speakers that sounds more natural and more similar to the target speaker.

Department(s)

Computer Science

Document Type

Conference Proceeding

DOI

10.1109/ISM55400.2022.00055

Keywords

speaker adaptation, speaker embedding, speaker encoder, text to speech

Publication Date

1-1-2022

Journal Title

Proceedings - 2022 IEEE International Symposium on Multimedia, ISM 2022
