SABER

Autores
Hussain Nisar
Qasim Amna
Mehak Gull
Kolesnikova Olga
Gelbukh Alexander
Sidorov Grigori

Título	ORUD-Detect: A Comprehensive Approach to Offensive Language Detection in Roman Urdu Using Hybrid Machine Learning–Deep Learning Models with Embedding Techniques
Tipo	Revista
Sub-tipo	CONACYT
Descripción	Information
Resumen	With the rapid expansion of social media, detecting offensive language has become critically important for healthy online interactions. This poses a considerable challenge for low-resource languages such as Roman Urdu which are widely spoken on platforms like Facebook. In this paper, we perform a comprehensive study of offensive language detection models on Roman Urdu datasets using both Machine Learning (ML) and Deep Learning (DL) approaches. We present a dataset of 89,968 Facebook comments and extensive preprocessing techniques such as TF-IDF features, Word2Vec, and fastText embeddings to address linguistic idiosyncrasies and code-mixed aspects of Roman Urdu. Among the ML models, a linear kernel Support Vector Machine (SVM) model scored the best performance, with an F1 score of 94.76, followed by SVM models with radial and polynomial kernels. Even the use of BoW uni-gram features with naive Bayes produced competitive results, with an F1 score of 94.26. The DL models performed well, with Bi-LSTM returning an F1 score of 98.00 with Word2Vec embeddings and fastText-based Bi-RNN performing at 97.00, showcasing the inference of contextual embeddings and soft similarity. The CNN model also gave a good result, with an F1 score of 96.00. The CNN model also achieved an F1 score of 96.00. This study presents hybrid ML and DL approaches to improve offensive language detection approaches for low-resource languages. This research opens up new doors to providing safer online environments for widespread Roman Urdu users. © 2025 by the authors.
Observaciones	DOI 10.3390/info16020139
Lugar	Basel
País	Suiza
No. de páginas	Article number 139
Vol. / Cap.	v. 16 no. 2
Inicio	2025-02-01
Fin
ISBN/ISSN