Abstract
Quantifying the similarity between words, sentences, paragraphs, and documents plays a crucial role in many natural language processing (NLP) applications. As BERT, ELMo, and RoBERTa exemplify, contemporary methods leverage neural networks to generate embeddings, requiring substantial data and training time to reach state-of-the-art performance. Alternatively, semantic similarity metrics build on knowledge bases such as WordNet, using approaches like the shortest path between words. MinHashing is a lightweight technique that quickly approximates Jaccard similarity scores for document pairs. In this study, we propose using MinHashing to compute semantic scores by augmenting the original documents with information from semantic networks, incorporating relationships such as synonyms, antonyms, hyponyms, and hypernyms. This augmentation injects semantic knowledge into an otherwise lexical similarity measure. The MinHash algorithm computes compact signatures for the extended vectors, mitigating dimensionality concerns; the similarity of these signatures reflects the semantic score between the documents. Our method achieves approximately 64% accuracy on the MRPC and SICK datasets.
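The core idea of the abstract — expand each document's token set with related words from a semantic network, then compare MinHash signatures instead of raw sets — can be sketched as follows. This is a minimal illustration, not the authors' implementation: `expand_with_relations` uses a plain dict as a stand-in for WordNet lookups, and the hash-family parameters are arbitrary choices.

```python
import hashlib
import random

def _stable_hash(token):
    # Deterministic 64-bit hash (Python's built-in hash() is salted per process).
    return int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")

def expand_with_relations(tokens, relations):
    """Augment a document's token set with related words (synonyms,
    hypernyms, ...). `relations` is a toy dict standing in for a
    semantic network such as WordNet."""
    expanded = set(tokens)
    for t in tokens:
        expanded.update(relations.get(t, ()))
    return expanded

def minhash_signature(tokens, num_hashes=128, seed=42):
    """One signature entry per hash function: the minimum hashed value over
    the token set. Two sets' minima agree with probability equal to their
    Jaccard similarity."""
    rng = random.Random(seed)
    prime = (1 << 61) - 1  # modulus for the universal hash family h(x) = (a*x + b) % p
    params = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(num_hashes)]
    return [min((a * _stable_hash(t) + b) % prime for t in tokens) for a, b in params]

def estimated_similarity(sig_a, sig_b):
    """Fraction of agreeing signature components approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Without augmentation these two documents share no tokens; with synonym
# expansion they overlap, so the estimated score rises above zero.
relations = {"car": {"automobile", "vehicle"}, "quick": {"fast"}}
doc_a = expand_with_relations({"the", "quick", "car"}, relations)
doc_b = expand_with_relations({"a", "fast", "automobile"}, relations)
score = estimated_similarity(minhash_signature(doc_a), minhash_signature(doc_b))
```

The signature length (`num_hashes`) trades accuracy for space: the estimate's standard error shrinks as 1/√n, so 128 components keep it within a few percentage points of the true Jaccard score.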
Original language | English |
---|---|
Publication status | Published (VoR) - 11 Apr 2024 |
Event | 2024 IEEE International Conference on Big Data and Smart Computing, Bangkok, Thailand (18 Feb 2024 → 21 Feb 2024) |
Conference
Conference | 2024 IEEE International Conference on Big Data and Smart Computing |
---|---|
Country/Territory | Thailand |
City | Bangkok |
Period | 18/02/24 → 21/02/24 |
Keywords
- MinHashing
- Semantic similarity
- WordNet
- Natural Language Processing (NLP)
- Jaccard similarity
- Algorithm