Abstract |
Computational linguistics has influenced a wide range of problems in language-related disciplines. Rapid progress in machine learning and Natural Language Processing (NLP) has driven this advancement and brought substantial benefits to society, including machine translation, text summarization, sentence auto-completion, and sentiment analysis. However, extending these benefits to low-resourced languages is challenging because electronic datasets are scarce. This paper presents a lexicon-based relatedness analysis of low-resourced Ethiopian languages. Wolaita, Dawuro, Gamo, and Gofa belong to the Ethiopian Omotic language family and share rich linguistic cultures and similarities. However, the extent of their inter-relatedness remains unknown. To address this gap, we collected and prepared novel corpora from the Bible and academic texts. We used TF-IDF for feature extraction and cosine similarity to measure the similarity among these languages. In addition, we used Euclidean distance to measure the spatial distance between the languages. The experimental results showed that Wolaita and Gofa exhibited high relatedness (33.4%), while Dawuro and Gamo demonstrated low relatedness (12.1%). © 2024 The Authors. Published by Elsevier B.V. |
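
The pipeline described in the abstract (TF-IDF features, then cosine similarity and Euclidean distance between languages) can be illustrated with a minimal sketch. This is not the authors' implementation. The toy corpora, the language labels `lang_a`, `lang_b`, `lang_c`, the whitespace tokenizer, the smoothed IDF variant, and the L2 normalization are all assumptions made for illustration, and the numbers it produces have no relation to the reported results.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build L2-normalized TF-IDF vectors for a dict of {name: text}.

    Assumptions (not from the paper): whitespace tokenization and a
    smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1.
    """
    tokens = {name: text.lower().split() for name, text in docs.items()}
    vocab = sorted({t for toks in tokens.values() for t in toks})
    n_docs = len(docs)
    # Document frequency: number of corpora in which each term appears.
    df = Counter(t for toks in tokens.values() for t in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}

    vectors = {}
    for name, toks in tokens.items():
        counts = Counter(toks)
        vec = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec))
        vectors[name] = [v / norm for v in vec] if norm else vec
    return vocab, vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical placeholder corpora; real input would be the lexicons
# extracted from each language's texts.
toy_corpora = {
    "lang_a": "alpha beta gamma delta",
    "lang_b": "alpha beta gamma epsilon",
    "lang_c": "zeta eta theta delta",
}

vocab, vectors = tfidf_vectors(toy_corpora)
sim_ab = cosine_similarity(vectors["lang_a"], vectors["lang_b"])
sim_ac = cosine_similarity(vectors["lang_a"], vectors["lang_c"])
dist_ab = euclidean_distance(vectors["lang_a"], vectors["lang_b"])
dist_ac = euclidean_distance(vectors["lang_a"], vectors["lang_c"])

print(f"cosine(a, b) = {sim_ab:.3f}, euclidean(a, b) = {dist_ab:.3f}")
print(f"cosine(a, c) = {sim_ac:.3f}, euclidean(a, c) = {dist_ac:.3f}")
```

In this sketch, corpora that share more vocabulary receive a higher cosine similarity and a smaller Euclidean distance. Because the vectors are L2-normalized here, the two measures are linked by d² = 2 − 2·cos, so they rank language pairs identically; on unnormalized vectors they can disagree.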