Authors
Amjad Maaz
Ashraf Noman
Zhila Alisa
Sidorov Grigori
Gelbukh Alexander
Title Threatening Language Detection and Target Identification in Urdu Tweets
Type Journal
Subtype JCR
Description IEEE Access
Abstract Automatic detection of threatening language is an important task; however, most existing studies have focused on English as the target language, with limited work on low-resource languages. In this paper, we introduce and release a new dataset for threatening language detection in Urdu tweets to further research in this language. The proposed dataset contains 3,564 tweets manually annotated by human experts as either threatening or non-threatening. The threatening tweets are further classified by target into one of two types: threatening to an individual person or threatening to a group. This research follows a two-step approach: (i) classify a given tweet as threatening or non-threatening and (ii) classify whether a threatening tweet is used to threaten an individual or a group. We compare three forms of text representation: two count-based representations, in which the text is represented as feature vectors of either character n-gram counts or word n-gram counts, and a third based on fastText pre-trained word embeddings for Urdu. We perform several experiments using machine learning and deep learning classifiers, and our study shows that an MLP classifier with the combination of word n-gram features outperformed other classifiers in detecting threatening tweets. Further, an SVM classifier using fastText pre-trained word embeddings obtained the best results for the target identification task.
Notes DOI 10.1109/ACCESS.2021.3112500
Place New Jersey
Country United States
Pages 128302-128313
Vol. / Ch. v. 9
Start 2021-09-14
End
ISBN/ISSN