Multi-Scale Speaker Vectors for Zero-Shot Speech Synthesis

Abstract

Recent advances in deep learning have made it possible to address text-to-speech (TTS) synthesis with more powerful and effective techniques than before. In this paper, we propose a novel speaker encoder model that builds on some of those techniques, such as self-attention and transformers. The proposed model produces speaker embeddings called Multi-Scale Speaker (MSS) vectors. MSS vectors aim to improve on current state-of-the-art speaker embeddings, yielding synthesized speech that sounds more natural and more similar to the target voice for unseen speakers in a zero-shot speech synthesis model. The proposed architecture relies on encoder layers coupled with a multi-scale approach that learns both local and global mel-spectral features of a reference speaker. We compared our proposed model against a modified generalized end-to-end (GE2E) speaker encoder using mel-cepstral distortion, cosine similarity, and mean opinion scores. Evaluation results confirm our encoder's advantages over a state-of-the-art speaker encoder model.
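
The two objective measures named in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code. It assumes speaker embeddings are plain lists of floats, that mel-cepstral frames are already time-aligned, and that distortion uses the common formulation that skips the 0th (energy) coefficient; the abstract does not state which variant the paper uses. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_a * norm_b)


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean mel-cepstral distortion in dB over time-aligned frames.

    Assumption: frames are already aligned (for example by dynamic time
    warping, not shown here) and coefficient 0 is excluded.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("need the same non-zero number of frames")
    scale = (10.0 / math.log(10.0)) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        # Squared Euclidean distance over coefficients 1..D.
        squared = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(squared)
    return total / len(ref_frames)
```

Under these assumptions, a higher cosine similarity between the embedding of a reference utterance and that of a synthesized utterance suggests a closer speaker match, while a lower distortion value suggests the synthesized spectrum is closer to the reference.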

Department(s)

Computer Science

Document Type

Conference Proceeding

DOI

10.1109/COMPSAC54236.2022.00093

Keywords

Neural network, speaker adaptation, speaker embedding, speaker encoder, text to speech

Publication Date

1-1-2022

Journal Title

Proceedings - 2022 IEEE 46th Annual Computers, Software, and Applications Conference, COMPSAC 2022
