Multi-Scale Speaker Vectors for Zero-Shot Speech Synthesis
Abstract
Recent advances in deep learning have allowed the task of synthesizing speech from text, known as Text-to-Speech (TTS), to be addressed with more powerful and effective techniques than were previously possible. In this paper, we propose a novel speaker encoder model that uses some of those techniques, such as self-attention and transformers. The proposed model produces speaker embeddings called Multi-Scale Speaker (MSS) vectors. MSS vectors aim to improve on current state-of-the-art speaker embeddings, yielding synthesized speech that sounds more natural and more similar to the target voice for unseen speakers in a zero-shot speech synthesis model. The proposed architecture relies on encoder layers coupled with a multi-scale approach for learning both local and global mel-spectral features of a reference speaker. We compared our proposed model against a modified generalized end-to-end (GE2E) speaker encoder using mel-cepstral distortion and cosine similarity measures as well as mean opinion scores. Evaluation results confirm our encoder's advantages over a state-of-the-art speaker encoder model.
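As an illustration of the two objective measures named in the abstract, the following Python sketch computes cosine similarity between two embedding vectors and mel-cepstral distortion (MCD) between two mel-cepstrum sequences. It follows the commonly used textbook definitions and is not taken from the paper: the function names, the choice to skip the 0th (energy) coefficient, and the assumption that the two sequences are already time-aligned frame by frame are all illustrative assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mel_cepstral_distortion(
    ref: Sequence[Sequence[float]], syn: Sequence[Sequence[float]]
) -> float:
    """Mean MCD in dB over frames (assumed already time-aligned).

    Coefficient 0 of each frame is skipped, a common convention
    because it mostly reflects overall energy.
    """
    if len(ref) != len(syn) or len(ref) == 0:
        raise ValueError("sequences must be non-empty and equally long")
    scale = (10.0 / math.log(10.0)) * math.sqrt(2.0)
    total = 0.0
    for r, s in zip(ref, syn):
        if len(r) != len(s):
            raise ValueError("frames must have the same number of coefficients")
        # Euclidean distance over coefficients 1..D for this frame.
        total += math.sqrt(sum((x - y) ** 2 for x, y in zip(r[1:], s[1:])))
    return scale * total / len(ref)
```

Under these definitions, a higher cosine similarity between the embeddings of a reference and a synthesized utterance indicates closer speaker identity, and a lower MCD indicates closer spectral match. In practice the frames are usually aligned first (for example with dynamic time warping), which this sketch leaves out.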
Department(s)
Computer Science
Document Type
Conference Proceeding
DOI
10.1109/COMPSAC54236.2022.00093
Keywords
Neural network, speaker adaptation, speaker embedding, speaker encoder, text to speech
Publication Date
1-1-2022
Recommended Citation
Cory, Tristin and Iqbal, Razib, "Multi-Scale Speaker Vectors for Zero-Shot Speech Synthesis" (2022). Faculty Scholarship. 795.
https://bearworks.missouristate.edu/articles00/795
Journal Title
Proceedings - 2022 IEEE 46th Annual Computers, Software, and Applications Conference, COMPSAC 2022