SABER

Authors
Calvo Castro Francisco Hiram

Title	Simple TF·IDF is not the best you can get for regionalism classification
Type	Journal
Subtype	ISI
Description	Lecture Notes in Computer Science; 15th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2014
Abstract	In widely spoken languages such as English or Spanish, some words are specific to a particular region. For example, cooker is typically used in the UK, while stove is preferred for the same concept in the US. Identifying the words a region cultivates involves discriminating them from the set of words common to all regions. This yields the problem where a term’s frequency should be salient enough to be considered important, while being a common term tames this salience. This is the well-known problem of Term Frequency versus Inverse Document Frequency; nevertheless, typical TF·IDF applications do not include weighting factors. In this work we empirically propose several alternative formulae, and then conclude that a broader search space must be explored; we therefore propose using Genetic Programming to find a suitable expression composed of TF and IDF terms that maximizes the discrimination of such terms given a reduced bootstrapping set of examples labeled for each region (400). We present performance examples for the Spanish variations across the Americas and Spain.
Notes	(including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Code 105034
Place	Kathmandu
Country	Nepal
Pages	92-101
Vol. / Ch.	8403
Start	2014-04-06
End	2014-04-12
ISBN/ISSN	978-3-642-54905-2