Date of Graduation

Fall 2023

Degree

Master of Natural and Applied Science in Computer Science

Department

Computer Science

Committee Chair

Jamil Saquer

Abstract

Social media has become a domain rife with hate speech. Some users feel entitled to engage in abusive conversations by sending hostile messages, tweets, or photos to other users. It is critical to detect hate speech and prevent innocent users from becoming victims. In this study, I explore the effectiveness and performance of various machine learning methods that employ text processing techniques to create a robust system for hate speech identification. I assess the performance of Naïve Bayes, Support Vector Machines, Decision Trees, Random Forests, Logistic Regression, and K-Nearest Neighbors on three distinct datasets sourced from social media posts. To gauge the optimal approach, I employ Term Frequency-Inverse Document Frequency (TF-IDF) with unigrams, bigrams, trigrams, a combination of unigrams and bigrams, and a combination of unigrams, bigrams, and trigrams as text representations for the machine learning models. Given the imbalanced nature of the datasets, I implement both under-sampling and over-sampling techniques to investigate their impact on the results. I also investigate the performance of different deep learning algorithms on the three datasets. The results show that the Bidirectional Encoder Representations from Transformers (BERT) model gives the best performance among all the models on the imbalanced datasets, achieving an F1-score of 90.6% on one dataset and F1-scores of 89.7% and 88.2% on the other two. Comparative analysis reveals that BERT and the Robustly Optimized BERT Pretraining Approach (RoBERTa) outperform traditional machine learning (ML) algorithms, with F1-scores approximately 20% higher. The investigation indicates that RoBERTa, with its enhanced training strategies, comes remarkably close to the performance of BERT. These outcomes demonstrate the transformative impact of deep learning and pretrained models on hate speech detection, with larger, more diverse datasets further enhancing model performance.
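The feature-extraction step summarized above, TF-IDF weighting over combined unigram and bigram features, can be sketched in plain Python. This is a minimal illustration only: the toy corpus is hypothetical, and the thesis presumably uses library implementations (with their own normalization and smoothing) rather than this hand-rolled version.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Contiguous n-grams of a token list, returned as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(corpus, orders=(1, 2)):
    """TF-IDF vectors over the requested n-gram orders.

    corpus is a list of token lists. Returns one {ngram: weight} dict
    per document, using raw term frequency and idf = log(N / df).
    """
    docs = [Counter(g for n in orders for g in ngrams(toks, n))
            for toks in corpus]
    n_docs = len(docs)
    df = Counter(g for d in docs for g in d)  # document frequency per n-gram
    vectors = []
    for d in docs:
        total = sum(d.values())
        vectors.append({g: (c / total) * math.log(n_docs / df[g])
                        for g, c in d.items()})
    return vectors

# Toy two-document corpus (hypothetical, for illustration only).
corpus = [
    "you are awful".split(),
    "you are great".split(),
]
vectors = tfidf(corpus)
```

Note that an n-gram appearing in every document gets an idf of log(N/N) = 0, so only terms that discriminate between documents receive nonzero weight; this is the property that makes TF-IDF useful as input to the classifiers listed above.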

Keywords

hate speech, machine learning, social media, BERT, deep learning, RoBERTa, pretrained models, text classification

Subject Categories

Computational Linguistics | Computer and Systems Architecture | Data Storage Systems | Science and Technology Studies

Copyright

© Nabil Shawkat

Open Access
