Efficient Textual Similarity using Semantic MinHashing

Waqas Nawaz, Maryam Baig, Kifayat Ullah Khan

    Research output: Contribution to conference › Paper › peer-review

    Abstract

    Quantifying the likeness between words, sentences, paragraphs, and documents plays a crucial role in various applications of natural language processing (NLP). As BERT, ELMo, and RoBERTa exemplify, contemporary methodologies leverage neural networks to generate embeddings, necessitating substantial data and training time for cutting-edge performance. Alternatively, semantic similarity metrics rely on knowledge bases such as WordNet, using approaches like the shortest path between words. MinHashing, a nimble technique, quickly approximates Jaccard similarity scores for document pairs. In this study, we propose employing MinHashing to gauge semantic scores by enhancing original documents with information from semantic networks, incorporating relationships such as synonyms, antonyms, hyponyms, and hypernyms. This augmentation improves lexical similarity based on semantic insights. The MinHash algorithm calculates compact signatures for the extended vectors, mitigating dimensionality concerns. The similarity of these signatures reflects the semantic score between the documents. Our method achieves approximately 64% accuracy on the MRPC and SICK datasets.
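    The sketch below illustrates the general idea described in the abstract, not the authors' implementation: each document's token set is augmented with related WordNet lemmas (synonyms, hypernyms, hyponyms), and MinHash signatures over the augmented sets approximate the Jaccard similarity. It assumes the nltk library (with the WordNet corpus downloaded) and the datasketch library; all function names are illustrative.

    # Illustrative sketch only; assumes `pip install nltk datasketch`
    # and nltk.download('wordnet') has been run beforehand.
    from nltk.corpus import wordnet as wn
    from datasketch import MinHash

    def augmented_tokens(text):
        """Tokenize and expand each word with related WordNet lemmas."""
        tokens = set(text.lower().split())
        expanded = set(tokens)
        for word in tokens:
            for synset in wn.synsets(word):
                # Synonyms from the synset itself
                expanded.update(l.name() for l in synset.lemmas())
                # Hypernyms and hyponyms of the synset
                for related in synset.hypernyms() + synset.hyponyms():
                    expanded.update(l.name() for l in related.lemmas())
        return expanded

    def minhash_signature(tokens, num_perm=128):
        """Build a compact MinHash signature for a token set."""
        m = MinHash(num_perm=num_perm)
        for t in tokens:
            m.update(t.encode("utf8"))
        return m

    doc_a = "The cat sat on the mat"
    doc_b = "A kitten rested on the rug"
    sig_a = minhash_signature(augmented_tokens(doc_a))
    sig_b = minhash_signature(augmented_tokens(doc_b))
    # Estimated Jaccard similarity of the semantically augmented documents
    print(sig_a.jaccard(sig_b))

    Because the signatures are fixed-length regardless of how many related terms the augmentation adds, the comparison cost stays bounded even as the expanded vectors grow, which is the dimensionality benefit the abstract refers to.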
    Original language: English
    Publication status: Published (VoR) - 11 Apr 2024
    Event: 2024 IEEE International Conference on Big Data and Smart Computing - Bangkok, Thailand
    Duration: 18 Feb 2024 – 21 Feb 2024

    Conference

    Conference: 2024 IEEE International Conference on Big Data and Smart Computing
    Country/Territory: Thailand
    City: Bangkok
    Period: 18/02/24 – 21/02/24

    Keywords

    • MinHashing
    • Semantic similarity
    • WordNet
    • Natural Language Processing (NLP)
    • Jaccard similarity
    • Algorithm
