Abstract |
Named Entity Recognition (NER) is a fundamental task that identifies entities in unstructured text and classifies them into predefined categories. Although textual data continues to grow and span diverse linguistic communities, NER is rarely studied as a multilingual task, particularly for low-resource languages. While many researchers have focused on NER for high-resource languages, only a few research efforts have addressed NER for the Urdu script, primarily because of a lack of resources and annotated datasets. Furthermore, previous research has mostly concentrated on monolingual techniques, leaving significant gaps in addressing multilingual challenges, especially for the Urdu language. To fill this gap, this study makes four key contributions. First, we created a multilingual dataset (UE-NER-2025) sourced from Twitter, which contains 182,411 tokens annotated with 8 distinct entity types. Second, we applied two novel techniques to the UE-NER-2025 dataset: 1) a joint multilingual approach and 2) a joint translation-based approach. Third, we conducted 30 experiments using 5-fold cross-validation, covering traditional supervised learning with token-based feature extraction, deep learning with pre-trained word embeddings such as FastText and GloVe, and advanced transfer learning models using contextual embeddings. These experiments evaluate how effectively each family of models improves NER performance for both English and Urdu, particularly given the challenges of low-resource and morphologically rich languages. Finally, we performed statistical analysis on our top-performing models to determine whether the differences in performance were statistically significant or occurred by chance. The results show that our transformer-based language model (XLM-RoBERTa-base) outperformed traditional supervised learning models. 
We observed performance improvements of 3.99% in the English translation-based approach, 3.72% in the multilingual approach, and 2.32% in the Urdu translation-based approach over the traditional supervised learning baseline (Random Forest, RF: Urdu = 0.927, English = 0.9258, multilingual = 0.9272). © 2013 IEEE. |
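
The abstract mentions checking whether differences between top-performing models are statistically significant under 5-fold cross-validation, but it does not name the test used. As a minimal sketch of one common option, the block below runs an exact paired sign-flip permutation test on per-fold scores. The function name `paired_permutation_test` and the per-fold scores are hypothetical and chosen only for illustration; they are not results from the study.

```python
from itertools import product


def paired_permutation_test(scores_a, scores_b):
    """Exact two-sided paired sign-flip permutation test.

    scores_a and scores_b hold one score per cross-validation fold,
    computed on the same folds for two models. Returns the p-value.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("need one score per fold for each model")

    # Per-fold differences between the two models.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))

    # Under the null hypothesis the sign of each difference is arbitrary,
    # so enumerate every sign assignment (2**n of them for n folds).
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_extreme += 1
    return at_least_as_extreme / total


# Hypothetical per-fold scores for two models (illustrative only).
model_a = [0.96, 0.95, 0.97, 0.96, 0.95]
model_b = [0.93, 0.92, 0.93, 0.92, 0.93]

p_value = paired_permutation_test(model_a, model_b)
print(f"p-value: {p_value:.4f}")
```

With only five folds there are 32 sign assignments, so the smallest two-sided p-value this exact test can return is 2/32 = 0.0625. For that reason, tests of this kind are often run on more paired samples, for example scores from repeated cross-validation.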