SINAI at Twitter-Normalization 2013

Arturo Montejo Ráez, M. Carlos Díaz Galiano, Eugenio Martínez Cámara, M. Teresa Martín Valdivia, Miguel A. García Cumbreras, L. Alfonso Ureña López
Universidad de Jaén
Campus Las Lagunillas, 23071
{amontejo, mcdiaz, emcamara, maite, magc, laurena}@ujaen.es

Abstract: In this paper, we present the Twitter-normalization system developed by the SINAI group. Our system performs a series of conversions on the text by means of translation lexicons and a spell checker. We obtain a poor result, only 37.6% accuracy, and the analysis of these results shows that our system should be improved in areas such as the treatment of diminutives and superlatives, entities and abbreviations.
Keywords: Twitter normalization, tokenization, translation, spell checking

1 Introduction and objectives

Twitter is a popular medium for broadcasting news, staying in touch with friends and sharing opinions. Several studies have focused on this new microblogging platform, which is changing the way people communicate.
However, tweets often contain highly irregular syntax and non-standard use of language. In addition, Twitter posts frequently include URLs as well as markup syntax, which further reduces the number of characters available for content. Because of these limits, users have created a novel syntax to communicate their messages with as much brevity as possible. While this brevity allows tweets to contain more information, it makes them harder to mine and analyze due to their lack of standardization.

Several works have studied the normalization problem for short text. For example, Kaufmann and Kalita (2010) describe a system that normalizes Twitter posts into standard English by taking a two-step approach: tweets are first preprocessed to remove as much noise as possible and then fed into a machine translation model that converts them into standard English. Han and Baldwin (2011) target out-of-vocabulary words in short text messages and propose a method for identifying and normalizing ill-formed words that does not require any annotations. They use a classifier to detect ill-formed words and generate correction candidates based on morphophonemic similarity.

On the other hand, most studies on short-text normalization deal only with English tweets, while other languages are increasingly used on Twitter. For example, there are some works dealing with Spanish tweets (Moreno-Ortiz and Pérez Hernández, 2013), but very few focus on the normalization process.

This paper describes a system that normalizes Spanish Twitter posts, converting them into a more standard form so that natural language processing (NLP) techniques can be applied to them more easily. The next section describes our approach, based on the use of translation lexicons and spell checking. Then the evaluation process is discussed, together with an analysis of the obtained results.

2 System Architecture

Our system performs a series of conversions on the text, which is transformed, step by step, into a final normalized form.
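As a rough illustration, the step-by-step cascade of conversions can be sketched as a simple function composition. The steps below are toy placeholders for illustration only, not the system's actual modules:

```python
# Minimal sketch of a normalization cascade: each step maps the evolving
# text to a slightly more normalized form. Steps here are illustrative.

def normalize(text, steps):
    """Apply each conversion step, in order, to the evolving text."""
    for step in steps:
        text = step(text)
    return text

# Hypothetical placeholder steps standing in for the real modules.
steps = [
    lambda t: t.strip(),                # basic cleanup
    lambda t: t.replace("q ", "que "),  # one toy abbreviation rule
]

print(normalize("  q tal  ", steps))  # -> "que tal"
```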
We have not considered annotation-based approaches like those followed by well-known systems such as GATE1 or proposed by recommendations like the UIMA specification2. Instead, we have chosen a straightforward solution: first, the text is tokenized with special attention to Twitter-related items (such as emoticons, mentions or hashtags), and then each token is converted into some sort of canonical form by means of translation lexicons and a spell checker. Details of each module are given in the following subsections.

2.1 Tokenization

Tokenization segments texts into their simplest units of meaning: terms. In our case, multi-word forms are not considered, so each term corresponds either to a word or to some other type of information, such as emoticons, HTML tags, telephone numbers, mentions, hashtags, dates, URLs, e-mail addresses and some other minor items. Case is preserved during the tokenization process and, as a result, we obtain a list of strings to feed the next modules.
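A Twitter-aware tokenizer of the kind just described can be sketched with a single prioritized regular expression. The patterns below are illustrative assumptions, not the ones actually used by the system:

```python
import re

# Sketch of a Twitter-aware tokenizer. Order matters: the more specific
# patterns (mentions, hashtags, URLs, emoticons) are tried before plain
# words, and case is preserved, as in the system described above.
TOKEN_RE = re.compile(r"""
    (?:@\w+)                          # mentions
  | (?:\#\w+)                         # hashtags
  | (?:https?://\S+)                  # URLs
  | (?:[:;=8][-o*']?[)\](\[dDpP])     # simple western emoticons
  | (?:\w+)                           # plain words
  | (?:\S)                            # any other single symbol
""", re.VERBOSE | re.UNICODE)

def tokenize(text):
    """Return the list of token strings found in the text."""
    return TOKEN_RE.findall(text)

print(tokenize("Hola @amigo :) mira #TweetNorm http://example.com"))
```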
2.2 Translation tables

A translation table allows certain string forms to be replaced by others. In this way, we can recognize some expressions and translate them to more convenient representations. In this step, the following translation tables have been considered:

1. Abbreviations. Expressions like "a2" are translated into "adiós", "q" into "que" and so on, up to twelve Spanish abbreviations commonly used in "texting" communication.

2. Laughter. This translation table makes intensive use of regular expressions in order to capture as many of the laughter expressions found in text as possible. In this way, "aajajajaaj" would be replaced by "ja", for example.

2.3 Spell checking

For this module we have used the GNU Aspell3 spell checker and its binding for Python, aspell-python4. GNU Aspell is an open-source spell checker that works well with Unicode strings, which makes it very suitable for multilingual texts. It also allows multiple dictionaries to be used concurrently and further vocabularies to be added as correct forms, so we can integrate more lexicons. Aspell works by converting a misspelled word (that is, a word not included in its dictionaries) into a sounds-like equivalent. It then proposes a list of words within one or two edit distances of the original word's sounds-like form. An edit distance is one replacement, insertion or deletion of a single character.

We have added the following lexicons to Aspell:

• Main provinces and cities in Spain, extracted from the INE (National Statistics Institute of Spain)5.

• Interjections like "ajá", "jolín" or "puf", among others. This list is a selection from the ones proposed in Wiktionary6.

• Twitter jargon and neologisms, with terms like "Facebook" or "tuiteo", selected from an on-line glossary7.

• Named entities, generated from Wikipedia and containing more than 650 different named entities. Political parties and the main political leaders have also been added to this list manually.
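The two kinds of translation table can be sketched as follows. The abbreviation entries and the laughter pattern are illustrative; in particular, the "xq" entry is an assumption for illustration, not taken from the system's twelve-entry table:

```python
import re

# Sketch of the abbreviation and laughter translation tables.
ABBREVIATIONS = {
    "a2": "adiós",
    "q": "que",
    "xq": "porque",  # assumed entry, for illustration only
}

# Laughter: a run of j/a/h letters containing at least one "j" and one
# "a" (e.g. "aajajajaaj", "Jajaja") collapses to the canonical "ja".
LAUGH_RE = re.compile(r"^[jah]*j[jah]*a[jah]*$", re.IGNORECASE)

def translate(token):
    """Replace a token using the abbreviation table, then the laughter rule."""
    if token.lower() in ABBREVIATIONS:
        return ABBREVIATIONS[token.lower()]
    if len(token) >= 4 and LAUGH_RE.match(token):
        return "ja"
    return token

print([translate(t) for t in ["q", "aajajajaaj", "hola"]])  # -> ['que', 'ja', 'hola']
```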
1 http://gate.ac.uk/
2 http://uima.apache.org
3 http://aspell.net/
4 http://0x80.pl/proj/aspell-python/
5 http://www.ine.es/daco/daco42/codmun/cod_provincia.htm
6 http://es.wiktionary.org/wiki/Categor%C3%ADa:ES:Interjecciones
7 http://estwitter.com/glosario/

2.4 Automatic spelling correction

After receiving a list of possible spelling corrections from the previous module, the system selects the most common term according to a list of words sorted by frequency, generated by (Vega et al., 2011). Although more sophisticated solutions could be used here (such as considering surrounding words as context for candidate selection), our attempts at applying techniques taken from word sense disambiguation did not lead to significant improvements.

To consider surrounding words as context, we first calculated a table with the normalized pointwise mutual information (NPMI) of lemmatized words occurring in the same sentence. To calculate this table, we used a dump of Spanish Wikipedia8 articles and computed the NPMI values of the 10,000 most frequent lemmas. Second, we computed the sum of the NPMI values of each candidate with the words of its context. Finally, we selected the candidate with the highest NPMI sum.
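Both candidate-selection strategies, the frequency-based one actually used and the NPMI-based context scoring that was tried, can be sketched with toy data. The frequencies and NPMI values below are invented for illustration; the real system uses the SUBTLEX-ESP frequency list and a table derived from Spanish Wikipedia:

```python
import math

# Toy frequency list standing in for the SUBTLEX-ESP word frequencies.
WORD_FREQ = {"que": 100000, "queso": 500, "quedo": 800}

def pick_by_frequency(candidates):
    """Select the most frequent candidate (the strategy actually used)."""
    return max(candidates, key=lambda w: WORD_FREQ.get(w, 0))

def npmi(p_xy, p_x, p_y):
    """NPMI(x, y) = ln(p(x,y) / (p(x) p(y))) / (-ln p(x,y)), in [-1, 1]."""
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)

def pick_by_context(candidates, context, npmi_table):
    """Alternative: select the candidate whose summed NPMI with the
    context words is highest (tried, without significant improvement)."""
    score = lambda w: sum(npmi_table.get((w, c), 0.0) for c in context)
    return max(candidates, key=score)

print(pick_by_frequency(["queso", "que", "quedo"]))  # -> "que"
table = {("que", "camino"): 0.3, ("queso", "camino"): 0.1}  # toy NPMI values
print(pick_by_context(["que", "queso"], ["camino"], table))  # -> "que"
```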
3 Evaluation and results

The performance reached by the system described above is not good according to the results published by the organization. After a deep analysis of the results, we have realized that we have to improve the following issues:

1. Diminutives and superlatives: We followed an approach based on a Spanish lemma dictionary, the one offered by the LingPipe9 project. This dictionary does not include a great number of diminutives and superlatives, so the detection of this kind of word is one of the weaknesses of our system, and a set of the errors is caused by them.

2. New words: Aspell is a dictionary-based spell checker. It is possible to add more word lists to Aspell in order to enlarge the tool's coverage; however, covering the whole of the Spanish language is not easy. Another problem is posed by the new Spanish words accepted by the RAE (Royal Spanish Academy), because these are difficult to find in classic spell-checking tools. Although we appended new Spanish words to the Aspell dictionaries, they were not enough, and the system failed on words such as "flipante" or "sobao".

3. Entities: The misclassification of entities has been another source of errors in our system. Entities without any error must be classified as 1 (CORRECT, NO VARIATION), but our system tried to correct them. Moreover, the entity recognition capability of our system is not strong, so some of the errors are related to this problem. A clear example is the entity Vallecas, which was not recognized by our system as an entity, so it was replaced by the word Vacas.

4. Abbreviations: Although we compiled a bag of abbreviations, after the publication of the results we realized that they are not enough and we need to add more.

We have also detected some errors in the organization's results. Laughter expressions like "jajaja" have been normalized in some tweets but not in others, so we do not know whether our normalization of some laughter expressions has been considered correct. Another example of a word that we think has to be normalized is "que", which some users write as "q". In tweets like "#Escorpio Puedes sentir q el camino es muy oscuro, será mejor q busques q alguien te ayude a iluminarlo puede ser algun amigo.", the organizers considered "q" well written, and we do not agree. The organizers also consider the word "días" without an accent to be well written, and it is not. For these reasons, we think that the test corpus has to be improved for future editions of the workshop.

These are some of the reasons why our system has reached only 37.6% accuracy.
4 Conclusions and ongoing work

In this paper, we have proposed a normalization system for tweets that performs a series of conversions on the text by means of translation lexicons and a spell checker. We found that most ill-formed words are based on morphophonemic variation and proposed a cascade method to convert each tweet. Our system has reached only 37.6% accuracy. Our future work will focus on resolving some of the problems discovered, such as the treatment of diminutives and superlatives, entities and abbreviations. Furthermore, we want to adapt our normalization system for subsequent processes such as sentiment analysis or text classification.

8 http://dumps.wikimedia.org/eswiki/
9 http://alias-i.com/lingpipe/

Acknowledgements

This work has been partially supported by a grant from the Fondo Europeo de Desarrollo Regional (FEDER), the TEXT-COOL 2.0 project (TIN2009-13391-C04-02) and the ATTOS project (TIN2012-38536-C03-0) from the Spanish Government. The AORESCU project (TIC-07684) from the regional government of the Junta de Andalucía partially supports this manuscript. This paper is also partially funded by the European Commission under the Seventh Framework Programme for Research and Technological Development (FP7 2007-2013) through the FIRST project (FP7-287607). This publication reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein.

References

Han, Bo and Timothy Baldwin. 2011. Lexical normalisation of short text messages: Makn sens a #twitter. In ACL, pages 368-378.

Kaufmann, Max and Jugal Kalita. 2010. Syntactic normalization of Twitter messages. In International Conference on Natural Language Processing, Kharagpur, India.

Moreno-Ortiz, Antonio and Chantal Pérez Hernández. 2013. Lexicon-based sentiment analysis of Twitter messages in Spanish. Procesamiento del Lenguaje Natural, 50(0).

Vega, Fernando Cuetos, María González Nosti, Analía Barbón Gutiérrez, and Marc Brysbaert. 2011. SUBTLEX-ESP: Spanish word frequencies based on film subtitles. Psicológica: Revista de metodología y psicología experimental, 32(2):133-143.