Comparative Performance of Multi-level Pre-trained Embeddings on CNN, LSTM and CNN-LSTM for Hate Speech and Offensive Language Detection

Noor Azeera Abdul Aziz, Anazida Zainal, Bander Ali Saleh Al-Rimy, Fuad Abdulgaleel Abdoh Ghaleb, Rozaida Ghazali (Editor), Nazri Mohd Nawi (Editor), Mustafa Mat Deris (Editor), Jemal H. Abawajy (Editor), Nureize Arbaiy (Editor)

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

    Abstract

    With growing concerns over hate speech, social media platforms provide policies for monitoring hate content. Platforms like Twitter and Facebook now rely on both humans and machines as content moderators. For machine moderation, many studies have proposed hate speech detection using machine learning approaches. This study investigated which pre-trained text embedding (Word2Vec, GloVe, FastText, ELMo, and BERT) is best at each tokenization level (word, subword, and character) and which neural network architecture (CNN, LSTM, and CNN-LSTM) is best as an encoding method for hate speech and offensive language detection. The character-level GloVe with CNN-LSTM performed best among all tested methods. GloVe (character level) scored 93% for F1-score and 92% for accuracy. At the word level, BERT word embedding with CNN-LSTM had the best classification scores of 90% F1-score and 91% accuracy. At the subword level, CNN-LSTM and CNN fared best with BERT word embeddings, which had 86% for both accuracy and F1-score. The performance findings show that pre-trained embeddings at different tokenization levels capture diverse information. Moreover, with an average of 85% for F1-score and 86% for accuracy, CNN-LSTM yielded the best score for almost all text embeddings, regardless of the tokenization level, compared to CNN and LSTM alone. These results show that the CNN and LSTM components of CNN-LSTM complement each other in capturing local and sequential patterns in the input text.
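The three tokenization levels compared in the abstract can be illustrated with a minimal sketch. It is not the authors' preprocessing pipeline: the function names and `TOY_VOCAB` are hypothetical, and the subword splitter is a simplified greedy longest-match routine in the style of the WordPiece scheme used by BERT (continuation pieces carry a `##` prefix).

```python
# Hypothetical toy vocabulary for the subword example (an assumption,
# not taken from the paper or from any real BERT vocabulary file).
TOY_VOCAB = {"hate", "##ful", "speech", "off", "##ens", "##ive"}


def word_tokens(text):
    """Word level: lowercase and split on whitespace."""
    return text.lower().split()


def char_tokens(text):
    """Character level: one token per non-whitespace character."""
    return [ch for ch in text.lower() if not ch.isspace()]


def subword_tokens(text, vocab=TOY_VOCAB):
    """Subword level: greedy longest-match-first split of each word.

    A word that cannot be fully covered by vocabulary pieces is
    replaced by a single "[UNK]" token.
    """
    tokens = []
    for word in text.lower().split():
        start, pieces = 0, []
        while start < len(word):
            end, piece = len(word), None
            # Shrink the candidate from the right until it is in the vocab.
            while end > start:
                candidate = word[start:end]
                if start > 0:
                    candidate = "##" + candidate  # continuation piece
                if candidate in vocab:
                    piece = candidate
                    break
                end -= 1
            if piece is None:
                pieces = ["[UNK]"]
                break
            pieces.append(piece)
            start = end
        tokens.extend(pieces)
    return tokens


if __name__ == "__main__":
    sample = "Hateful speech"
    print(word_tokens(sample))
    print(subword_tokens(sample))
    print(char_tokens(sample))
```

Each level yields a different sequence length and vocabulary for the same input, which is one plausible reason why a given pre-trained embedding can behave differently across levels.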
    Original language: English
    Title of host publication: Recent Advances on Soft Computing and Data Mining
    Publisher: Springer
    Pages: 186-195
    Number of pages: 10
    ISBN (Print): 9783031669644
    Publication status: Published (VoR) - 30 Jul 2024
    Event: Soft Computing and Data Mining - Putrajaya, Malaysia
    Duration: 21 Aug 2024 to 22 Aug 2024
    Conference number: 6th

    Publication series

    Name: Lecture Notes in Networks and Systems
    ISSN (Print): 2367-3370

    Conference

    Conference: Soft Computing and Data Mining
    Abbreviated title: SCDM 2024
    Country/Territory: Malaysia
    City: Putrajaya
    Period: 21/08/24 to 22/08/24
