Date of Graduation

Summer 2022

Degree

Master of Natural and Applied Science in Computer Science

Department

Computer Science

Committee Chair

Razib Iqbal

Abstract

With the proliferation of smart home devices such as Google Home and Amazon Alexa, significant research effort is going into improving the user experience of interacting with these smart assistants. One dimension of this effort is ongoing research on detecting emotion from the short voice commands used in smart home environments. Alongside facial expression and body language, speech plays a pivotal role in classifying emotions in smart home applications. With accurate emotion recognition in place, smart devices will be able to suggest appropriate actions intelligently and empathetically, based on the users’ current emotional state. With that in focus, this research aims to advance the existing literature on emotion detection from voice commands in smart home applications. Initially, I chose two publicly available audio conversation datasets to identify the most effective classification algorithm for my application. Through a comparative analysis, I concluded that the Tree-based Pipeline Optimization Tool (TPOT) outperforms other machine learning techniques at detecting emotion accurately from audio. In a concurrent study, I observed that Mel Frequency Cepstral Coefficient (MFCC) in combination with Mel Spectrogram (MEL) results in higher classification accuracy than other audio feature combinations available in the literature. Based on these conclusions, I adapted TPOT, combined with the MEL and MFCC audio features, for our in-house smart home dataset. This Institutional Review Board (IRB) approved in-house dataset contains 5000 smart home voice commands from 12 different users, covering five distinct emotional states.
Moving forward, I proposed four new audio features, Chunk Gap Length (CGL), Mean Chunk Duration (MCD), Mean Word Duration Per Chunk (MWDPC), and Per Chunk Word Count (PCWC), to be used alongside the existing MFCC and MEL features to improve the accuracy of emotion detection. My evaluation results show that combining these custom features with MFCC and MEL detects the correct emotion more accurately than MFCC and MEL alone.
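
The abstract names the four proposed features but does not define them, so the following Python sketch is only an illustration under assumed definitions, not the thesis's method. It assumes a "chunk" is a run of speech bounded by silence, that CGL is the mean silence between consecutive chunks, that MCD is the mean chunk length, that MWDPC is the chunk length divided by its word count (averaged over chunks), and that PCWC is the mean number of words per chunk. The `Chunk` type and the `chunk_features` function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Chunk:
    """Hypothetical speech chunk: a run of speech bounded by silence."""
    start: float  # chunk start time in seconds
    end: float    # chunk end time in seconds
    words: int    # number of words spoken in the chunk


def chunk_features(chunks):
    """Compute the four chunk-level features under the ASSUMED definitions."""
    if not chunks:
        raise ValueError("at least one chunk is required")

    durations = [c.end - c.start for c in chunks]

    # Assumed CGL: silence between the end of one chunk and the start of the next.
    gaps = [nxt.start - cur.end for cur, nxt in zip(chunks, chunks[1:])]

    # Assumed MWDPC: seconds per word within each chunk, skipping empty chunks.
    word_durations = [d / c.words for d, c in zip(durations, chunks) if c.words > 0]

    return {
        "CGL": mean(gaps) if gaps else 0.0,
        "MCD": mean(durations),
        "MWDPC": mean(word_durations) if word_durations else 0.0,
        "PCWC": mean(c.words for c in chunks),
    }
```

For example, `chunk_features([Chunk(0.0, 1.0, 2), Chunk(1.5, 2.5, 4)])` describes a command with two one-second chunks separated by half a second of silence. The resulting values could then be appended to an MFCC and MEL feature vector before classification, although how the thesis combines them is not stated in the abstract.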

Keywords

speech emotion recognition, machine learning, smart home, voice command, Mel Frequency Cepstral Coefficient (MFCC), Mel Spectrogram (MEL), pitch, Sound Pressure Level (SPL), Chunk Gap Length (CGL), Tree-based Pipeline Optimization Tool (TPOT)

Subject Categories

Computer and Systems Architecture | Signal Processing | Speech and Rhetorical Studies

Copyright

© Sunanda Guha

Open Access
