| Resumen |
With the rapid proliferation of social media platforms, cyberbullying has emerged as a critical social challenge, leading to severe economic, psychological, and social consequences. A significant proportion of users, particularly adolescents and young adults, report experiencing digital harassment, aggressive online behavior, and emotional distress due to cyberbullying. This study addresses the pressing issue of detecting cyberbullying using advanced machine learning techniques on text data extracted from platforms such as Twitter and Formspring. Given the prevalent issue of insufficient labeled data, manual annotation techniques were used to label an initially unlabeled corpus. Two additional publicly available labeled datasets were also incorporated to enhance model training and validation. Several classifiers, including Random Forest, Voting Classifier, Linear Support Vector Machine, Gaussian Naive Bayes, Standard Support Vector Machine, Convolutional Neural Network, Long-Short-Term Memory, and Bidirectional Encoder Representations from Transformers models, were trained and evaluated. Initial experiments on the raw data yielded suboptimal performance. To address this, the SelectKBest algorithm using the chi-square test was applied for feature dimensionality reduction, improving learning efficiency and model generalization. In the final phase, a hybrid model incorporating the transformer-based Bidirectional Encoder Representations from Transformers architecture and linguistic-lexical features was developed. This refined model achieved a high classification accuracy of 98. 6%, significantly outperforming the previous baselines. The proposed framework also demonstrated better performance in identifying challenging categories such as hate, threat, and sexuality, with F1 scores improving to over 98%. This research emphasizes the importance of annotated data, effective feature engineering, and deep learning techniques in addressing the nuanced and context-dependent nature of cyberbullying detection. Future work will focus on adapting the model to multilingual datasets, particularly for underrepresented languages such as Arabic, Urdu, and Roman Urdu, to broaden its applicability across diverse linguistic communities. |