Date of Graduation

Fall 2022

Degree

Master of Natural and Applied Science in Computer Science

Department

Computer Science

Committee Chair

Razib Iqbal

Abstract

For many people, spoken communication is an essential part of everyday life. Some individuals lose the ability to speak, and others are born without it. To function on a day-to-day basis, these individuals rely on other means of communication, one of which is adaptive speech synthesis. It recreates a user's previous voice or creates a voice that blends with their regional dialect. Current adaptive speech synthesis techniques that achieve human-like speech require thirty minutes to a few hours of high-quality audio recordings of the target speaker. People in need of a speech synthesis system rarely possess this much recorded audio. One adaptive speech synthesis technique, called zero-shot, requires only ten to thirty seconds of data. However, no current zero-shot speech synthesis method can produce human-like speech or replicate a speaker with a high degree of similarity. In this thesis, I propose a novel speaker encoder model to make zero-shot speech synthesis more human-like. The proposed model produces speaker embedding vectors called Multi-Scale Speaker (MSS) vectors. MSS-vectors aim to improve on current state-of-the-art speaker embeddings so that a zero-shot speech synthesis model produces more natural and similar-sounding speech for unseen speakers. The proposed architecture relies on encoder layers coupled with a multi-scale approach to learn both local and global mel-spectral features of a reference speaker. To evaluate the proposed approach, I compare the MSS-vectors model against a modified generalized end-to-end (GE2E) speaker encoder and against the s-vector speaker encoder. The comparison includes quantitative measures, such as mel-cepstral distortion and cosine similarity, as well as subjective mean opinion scores from human listening surveys. The experimental results from these evaluations indicate improvements over current state-of-the-art speaker encoder models, and thus a shift towards more human-like speech.
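
The two quantitative measures named in the abstract can be illustrated with a minimal sketch. This is not the thesis's evaluation code. The function names, the NumPy dependency, the exclusion of the 0th (energy) coefficient, and the assumption that the two mel-cepstral sequences are already time-aligned frame by frame are all assumptions made here for illustration. The distortion formula is the commonly used definition, which may differ in detail from the one used in the thesis.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two speaker embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def mel_cepstral_distortion(ref, syn):
    """Mean mel-cepstral distortion in dB between two aligned sequences.

    ref, syn: arrays of shape (frames, coefficients). The sequences are
    assumed to be time-aligned already (for example by dynamic time
    warping), and column 0 is assumed to hold the energy coefficient,
    which is excluded here.
    """
    ref = np.asarray(ref, dtype=float)
    syn = np.asarray(syn, dtype=float)
    if ref.shape != syn.shape:
        raise ValueError("ref and syn must have the same shape")
    diff = ref[:, 1:] - syn[:, 1:]
    # Per-frame distortion: (10 / ln 10) * sqrt(2 * sum of squared differences)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```

Under these assumptions, a cosine similarity closer to 1 between the embeddings of reference and synthesized speech suggests a closer speaker match, and a lower mel-cepstral distortion suggests that the synthesized spectra are closer to the reference.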

Keywords

speaker adaptation, speaker embedding, speaker encoder, text to speech, speech synthesis, zero-shot

Subject Categories

Biomedical | Digital Communications and Networking | Speech and Hearing Science | Systems and Communications

Copyright

© Tristin W. Cory

Open Access
