Authors
Calvo Castro Francisco Hiram
Title Simple TF·IDF is not the best you can get for regionalism classification
Type Journal
Sub-type ISI
Description Lecture Notes in Computer Science; 15th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2014
Abstract In widely spoken languages such as English or Spanish, some words are specific to a particular region. For example, cooker is typically used in the UK, while stove is preferred for the same concept in the US. Identifying the words that a particular region cultivates involves discriminating them from the set of words common to all regions. This raises a problem: a term’s frequency should be salient enough for the term to be considered important, while being common across regions tames this salience. This is the well-known problem of Term Frequency versus Inverse Document Frequency; nevertheless, typical TF·IDF applications do not include weighting factors. In this work we first propose several alternative formulae empirically, and then conclude that a broader search space must be explored; we therefore propose using Genetic Programming to find a suitable expression composed of TF and IDF terms that maximizes the discrimination of such terms, given a reduced bootstrapping set of examples labeled for each region (400). We present performance examples for the varieties of Spanish spoken across the Americas and Spain.
Notes (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Code 105034
Place Kathmandu
Country Nepal
Pages 92-101
Vol. / Ch. 8403
Start 2014-04-06
End 2014-04-12
ISBN/ISSN 978-3-642-54905-2