Abstract
With growing concerns over hate speech, social media platforms provide policies for monitoring hate content. Platforms such as Twitter and Facebook now rely on both humans and machines as content moderators. For machine moderation, many studies have proposed hate speech detection using machine learning approaches. This study investigated which pre-trained text embedding (Word2Vec, GloVe, FastText, ELMo, and BERT) performs best at each tokenization level (word, subword, and character) and which neural network architecture (CNN, LSTM, and CNN-LSTM) performs best as an encoding method for hate speech and offensive language detection. Character-level GloVe with CNN-LSTM performed best among all tested methods; character-level GloVe scored 93% for F1-score and 92% for accuracy. At the word level, BERT word embeddings with CNN-LSTM had the best classification scores, with a 90% F1-score and 91% accuracy. At the subword level, CNN-LSTM and CNN fared best with BERT word embeddings, reaching 86% for both accuracy and F1-score. These findings show that pre-trained embeddings at different tokenization levels capture different information. Moreover, with an average of 85% for F1-score and 86% for accuracy, CNN-LSTM yielded the best score for almost all text embeddings, regardless of tokenization level, compared with CNN and LSTM alone. These results indicate that the CNN and LSTM components complement each other, with the CNN capturing local patterns and the LSTM capturing sequential patterns in the input text.
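The abstract compares embeddings at three tokenization levels without showing what those levels look like. The following is a minimal sketch, not the authors' preprocessing: the function names, the regular expression, and the toy vocabulary `TOY_VOCAB` are all assumptions made for illustration, and the subword splitter is a simplified greedy longest-match routine in the style of WordPiece.

```python
import re

def word_tokens(text):
    # Lowercase and split into runs of letters, digits, or apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def char_tokens(text):
    # One token per character, spaces included.
    return list(text.lower())

def subword_tokens(word, vocab):
    # Greedy longest-match-first segmentation (WordPiece style).
    # Continuation pieces carry a "##" prefix; a word that cannot be
    # fully segmented maps to the single token "[UNK]".
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"un", "##friend", "##ly", "friend", "##s"}

def tokenize_all_levels(text, vocab=TOY_VOCAB):
    # Return the same text tokenized at word, subword, and character level.
    words = word_tokens(text)
    return {
        "word": words,
        "subword": [p for w in words for p in subword_tokens(w, vocab)],
        "character": char_tokens(text),
    }

if __name__ == "__main__":
    for level, tokens in tokenize_all_levels("Unfriendly friends").items():
        print(level, tokens)
```

In a pipeline of the kind the abstract describes, each token sequence would then be mapped to pre-trained embedding vectors before being passed to the encoder; that step depends on the specific embedding model and is not shown here.
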
| Original language | English |
|---|---|
| Title of host publication | Recent Advances on Soft Computing and Data Mining |
| Publisher | Springer |
| Pages | 186-195 |
| Number of pages | 10 |
| ISBN (Print) | 9783031669644 |
| Publication status | Published (VoR) - 30 Jul 2024 |
| Event | Soft Computing and Data Mining, Putrajaya, Malaysia. Duration: 21 Aug 2024 → 22 Aug 2024. Conference number: 6th |
Publication series
| Name | Lecture Notes in Networks and Systems |
|---|---|
| ISSN (Print) | 2367-3370 |
Conference
| Conference | Soft Computing and Data Mining |
|---|---|
| Abbreviated title | SCDM 2024 |
| Country/Territory | Malaysia |
| City | Putrajaya |
| Period | 21/08/24 → 22/08/24 |