Date of Graduation
Fall 2023
Degree
Master of Natural and Applied Science in Computer Science
Department
Computer Science
Committee Chair
Jamil Saquer
Abstract
Hate speech is widespread on social media. Some users feel entitled to engage in abusive conversations by sending abusive messages, tweets, or photos to other users. It is critical to detect hate speech and prevent innocent users from becoming victims. In this study, I explore the effectiveness and performance of various machine learning methods employing text processing techniques to create a robust system for hate speech identification. I assess the performance of Naïve Bayes, Support Vector Machines, Decision Trees, Random Forests, Logistic Regression, and K-Nearest Neighbors using three distinct datasets sourced from social media posts. To determine the best approach, I employ Term Frequency-Inverse Document Frequency (TF-IDF) with unigrams, bigrams, trigrams, a combination of unigrams and bigrams, and a combination of unigrams, bigrams, and trigrams as the text representations for the machine learning models. Given the imbalanced nature of the datasets, I implement both under-sampling and over-sampling techniques to investigate their impact on the results. I also investigate the performance of different deep learning algorithms on the three datasets. The results show that the Bidirectional Encoder Representations from Transformers (BERT) model gives the best performance among all the models on imbalanced datasets, achieving an F1-score of 90.6% on one of the datasets and F1-scores of 89.7% and 88.2% on the other two. Comparative analysis reveals that BERT and the Robustly Optimized BERT Pretraining Approach (RoBERTa) outperform traditional Machine Learning (ML) algorithms, with F1-scores approximately 20% higher. The investigation indicates that RoBERTa, with its enhanced training strategies, comes remarkably close to the performance of BERT. The outcomes show the transformative impact of deep learning and pretrained models on hate speech detection, with larger, more diverse datasets further enhancing model performance.
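As a rough illustration of two preprocessing steps named in the abstract, the sketch below extracts combined n-gram features and balances classes by random resampling. It is not taken from the thesis: the function names `ngrams` and `resample`, the toy data, and the choice of purely random over- and under-sampling are assumptions made for illustration, and TF-IDF weighting is left out for brevity.

```python
import random
from collections import Counter


def ngrams(tokens, n_min=1, n_max=1):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def resample(texts, labels, mode, seed=0):
    """Balance classes at random (hypothetical helper, not the thesis code).

    mode="over" duplicates minority examples up to the largest class size;
    mode="under" drops majority examples down to the smallest class size.
    """
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    sizes = [len(items) for items in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)
    balanced = []
    for label, items in by_class.items():
        if len(items) >= target:
            chosen = rng.sample(items, target)  # sample without replacement
        else:
            # keep every original example, then add random duplicates
            chosen = items + rng.choices(items, k=target - len(items))
        balanced.extend((text, label) for text in chosen)
    rng.shuffle(balanced)
    return balanced


# Toy data: three neutral posts (label 0) and one abusive post (label 1).
texts = ["have a nice day", "see you soon", "good game", "you are trash"]
labels = [0, 0, 0, 1]

# Unigrams and bigrams combined, one of the feature settings listed above.
features = ngrams("you are trash".split(), n_min=1, n_max=2)

over = Counter(label for _, label in resample(texts, labels, "over"))
under = Counter(label for _, label in resample(texts, labels, "under"))
```

Over-sampling keeps every original example and adds duplicates of the minority class, while under-sampling discards majority examples, so the two trade overfitting risk against information loss in different ways.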
Keywords
hate speech, machine learning, social media, BERT, deep learning, RoBERTa, pretrained models, text classification
Subject Categories
Computational Linguistics | Computer and Systems Architecture | Data Storage Systems | Science and Technology Studies
Copyright
© Nabil Shawkat
Recommended Citation
Shawkat, Nabil, "Evaluation of Different Machine Learning, Deep Learning and Text Processing Techniques for Hate Speech Detection" (2023). MSU Graduate Theses. 3913.
https://bearworks.missouristate.edu/theses/3913
Open Access
Included in
Computational Linguistics Commons, Computer and Systems Architecture Commons, Data Storage Systems Commons, Science and Technology Studies Commons