Abstract |
With the rapid expansion of social media, detecting offensive language has become critical to healthy online interaction. The task is especially challenging for low-resource languages such as Roman Urdu, which is widely used on platforms like Facebook. In this paper, we present a comprehensive study of offensive language detection on Roman Urdu data using both Machine Learning (ML) and Deep Learning (DL) approaches. We introduce a dataset of 89,968 Facebook comments and apply extensive preprocessing, together with TF-IDF features and Word2Vec and fastText embeddings, to address the linguistic idiosyncrasies and code-mixed nature of Roman Urdu. Among the ML models, a Support Vector Machine (SVM) with a linear kernel performed best, with an F1 score of 94.76, followed by SVMs with radial and polynomial kernels. Even naive Bayes with bag-of-words (BoW) unigram features produced competitive results, with an F1 score of 94.26. The DL models performed well: a Bi-LSTM with Word2Vec embeddings achieved an F1 score of 98.00 and a fastText-based Bi-RNN achieved 97.00, highlighting the value of contextual embeddings and soft similarity. The CNN model also achieved an F1 score of 96.00. This study presents hybrid ML and DL approaches to improve offensive language detection for low-resource languages. This research opens the door to safer online environments for the many users of Roman Urdu. © 2025 by the authors. |