A Method for Lexical Normalisation of Tweets∗
Un método de normalización léxica de tweets

Pablo Gamallo and Marcos Garcia
CITIUS, Univ. de Santiago de Comp.
pablo.gamallo@usc.es

José Ramom Pichel
Imaxin Software
jramompichel@imaxin.com

Resumen: This article describes a strategy for the lexical normalisation of out-of-vocabulary (OOV) words in tweets written in Spanish. To correct ill-formed OOV, the normalisation system generates in-vocabulary (IV) candidates found in different lexical resources and selects the most suitable one. Our method generates two types of candidates, primary and secondary, which are ranked in different ways in the process of selecting the best candidate.
Palabras clave: Lexical normalisation, short text messages, tweet processing

Abstract: This paper describes a strategy to perform lexical normalisation of out-of-vocabulary (OOV) words in Spanish tweets. To correct any ill-formed OOV, the normalisation system generates in-vocabulary (IV) candidates found in several lexical resources, and selects the best one. Our method generates two types of candidates, primary and secondary IV candidates, which will be ranked in different ways to select the best candidate.
Keywords: Lexical Normalisation, Short Text Messages, Tweet Processing

∗ This work has been supported by Ministerio de Ciencia e Innovación, within the project OntoPedia, ref: FFI2010-14986.

1 Introduction

In this paper, we describe a strategy to perform lexical normalisation of out-of-vocabulary (OOV) words in Spanish tweets. The task can be described as follows: given an OOV, the algorithm must decide whether the OOV is correct or ill-formed and, in the latter case, it must propose an in-vocabulary (IV) word found in a lexical resource to restore the incorrect OOV.

There has been little work on lexical normalisation of short messages. So far, the most successful strategy to normalise English tweets is described in (Han and Baldwin, 2012b; Han and Baldwin, 2013). They propose merging two different strategies: normalisation dictionary lookup and selection of the best in-vocabulary (IV) candidate.

The first strategy simply consists in looking up a normalisation dictionary, which contains specific abbreviations and other types of lexical variants found in the Twitter language. Each lexical variant is associated with its standard form, for instance gl → girlfriend. The dictionary lookup method achieves very high precision, but with low recall. As recall relies on the size of the dictionary, (Han and Baldwin, 2012a) propose to build wide-coverage normalisation dictionaries in an automatic way, by considering that lexical variants occur in similar contexts to their standard forms. Such a normalisation dictionary should only contain unambiguous "variant-standard" pairs. Ambiguous variants will be tackled using the second strategy.

The second strategy is applied when the OOV is a lexical variant that has not been found in the normalisation dictionary. It consists of the following two tasks:

• Generation of IV candidates (standard forms) for each particular OOV (lexical variant).
• Selection of the best IV candidate.

The objective of the first task is to build, for each OOV, a list of standard forms derived from the OOV using different processes, for instance reduction of character repetitions (e.g., carrrr → car), or generation of those IV words whose edit distance with regard to the target OOV is within a given threshold.
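For illustration only, the following Python sketch shows what these two generation processes might look like; it is not the implementation of the system described in this paper, and the function names (reduce_repetitions, candidates_within_threshold) are ours.

import re


def reduce_repetitions(token):
    # Generate reduced variants of a token with repeated characters:
    # "carrrr" yields {"car", "carr"}; the vocabulary then decides which
    # of them, if any, is a valid IV candidate.
    single = re.sub(r'(.)\1+', r'\1', token)
    double = re.sub(r'(.)\1{2,}', r'\1\1', token)
    return {single, double} - {token}


def edit_distance(a, b):
    # Standard Levenshtein distance computed by dynamic programming.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]


def candidates_within_threshold(oov, vocabulary, threshold=1):
    # Linear scan for illustration: return the IV words whose edit distance
    # to the OOV does not exceed the given threshold.
    return [w for w in vocabulary if edit_distance(oov, w) <= threshold]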
The second task consists in selecting the best candidate from the list generated in the previous step. Two different selection methods can be used: string similarity and context inference. To compute string similarity between the OOV and the different IV candidates, several measures and strategies can be used: lexical edit distance, phonemic edit distance, the longest common subsequence, affix substrings, and so on. For context inference, the IV candidates of a given OOV can be ranked and then filtered on the basis of their local contexts, which are compared against a language model. The main problem of this method is that the local context of an OOV is often constituted by other incorrect lexical variants that are not found in the language model.

These two selection methods (string similarity and context inference) are complementary and can therefore be used together to select the best candidate.

There are at least two significant differences between the task evaluated in (Han and Baldwin, 2013) and that proposed at the Tweet Normalization Workshop at SEPLN 2013. On the one hand, the task in (Han and Baldwin, 2013) relies on the basic assumption that lexical variants have already been identified. This means that only ill-formed OOV are taken as input of the selection process. By contrast, the task defined by the Workshop guidelines includes the detection of ill-formed OOV. On the other hand, in (Han and Baldwin, 2013) one-to-several correspondences, for instance imo → in my opinion, are not considered. At the Workshop, by contrast, it is required to search for one-to-several correspondences, since the IV standard forms used to correct an OOV can be multiword expressions.

In sum, the task defined at the Tweet Normalization Workshop is more complex than that described in (Han and Baldwin, 2013). Finally, there are other approaches to SMS and tweet normalisation based on very different strategies. For instance, (Beaufort et al., 2010) and (Kaufmann and Kalita, 2010) make use of the Statistical Machine Translation framework, as well as of the noisy channel model, which is very common in speech processing. The main problem of these approaches is that they rely on large quantities of labelled training data, which are not available for microblogs.

2 The method

The normalisation method we propose combines the main strategies and tasks described in (Han and Baldwin, 2013), namely normalisation dictionary lookup, generation of IV candidates, and selection of the best IV candidate with context information. In addition, given the conditions of the Workshop, we also include ill-formed OOV detection in our algorithm.

The design of our algorithm was motivated by the conclusions we drew from the analysis of the development corpus. We observed that the most frequent types of incorrect Spanish OOV are the following: (1) uppercase/lowercase confusion: patri → Patri; (2) character repetition for emphasis: Buuenoo → Bueno; (3) language-dependent spelling problems, namely, for Spanish, missing accents and letter confusion (v/b, g/j, ll/y, h/∅, etc.).

These three types of errors can be solved using simple specific rules. For the remaining phenomena, which correspond to more heterogeneous problems, we make use of generic strategies such as those described in the previous section: dictionary lookup and selection of the best IV candidate. For the detection of correct/incorrect OOV, we use the following method: if no IV associated with an OOV is found using the specific rules or the generic strategies, then the OOV is considered correct; otherwise, it is taken as an ill-formed OOV. Text is lemmatised and PoS tagged using FreeLing (Padró and Stanilovsky, 2012).

Our method contains two modules: a set of lexical resources and an algorithm to detect and correct ill-formed OOV.
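The detection criterion can be summarised by the minimal Python sketch below. It is our own illustration rather than the authors' code; the lookup, generate_candidates and select_candidate callables stand for the processes defined in the next sections.

def is_ill_formed(oov, lookup, generate_candidates, select_candidate):
    # lookup(oov): True if the OOV itself is found in one of the lexical resources
    # generate_candidates(oov): IV candidates proposed by rules or generic strategies
    # select_candidate(candidates): the best IV candidate, or None if there is none
    if lookup(oov):
        return False   # a known form: the OOV is considered correct
    return select_candidate(generate_candidates(oov)) is not None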
2.1 Lexical resources

Our system makes use of three different lexical resources:

ND Normalisation dictionary, containing incorrect lexical variants and their standard forms.
SD Standard dictionary, a list of correct forms generated from the lemmas found in the Real Academia Española dictionary (DRAE).
PND Proper names dictionary, containing proper names extracted from the Spanish Wikipedia.

In the following, we describe how these three dictionaries have been built.

2.1.1 Normalisation Dictionary (ND)
It was mainly built using the development data distributed by the organisers of the Tweet Normalization Workshop at SEPLN 2013. We also used as sources of data the list of emoticons accessible from http://en.wikipedia.org/wiki/List_of_emoticons, as well as the list of Spanish abbreviations released at http://www.rae.es/dpd/apendices/apendice2.html. Our final normalisation dictionary contains 824 entries.

2.1.2 Standard Dictionary (SD)
The standard dictionary is constituted by all the forms automatically generated from the lemmas found in the DRAE. These lemmas have been extracted and freely distributed by the project http://olea.org/proyectos/lemarios. Verb forms were generated with the Cilenis verb conjugator (Gamallo et al., 2013), whereas we used specific morphological rules to generate noun and adjective forms. The final dictionary consists of 778,149 forms, which is significantly larger than that provided by the last version of FreeLing (556,509 Spanish forms in FreeLing 3.0).

2.1.3 Proper Names Dictionary (PND)
To make the detection of correct OOV easier (for instance, proper names and domain-specific terms that are not in a standard vocabulary), it is useful to rely on a large list of OOV extracted from an encyclopaedic resource, for instance Wikipedia. Several PND were automatically extracted. Finally, the PND allowing the best performance in the normalisation task was extracted as follows. First, using CorpusPedia (Gamallo and González, 2010), a simplified format derived from the original downloadable XML file (Wikipedia dump of May 2011), the names of articles belonging to categories related to persons, locations, and organisations were identified, using the strategy described in (Gamallo and Garcia, 2011). Then, these names were tokenized, and those unigrams whose lowercase variants are found in the standard dictionary (SD) were filtered out. The result is a list of 107,980 unigrams taking part in the names of persons, locations, and organisations.
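As an illustration, these resources can be represented in memory as a variant-to-standard map (ND) and two sets of forms (SD and PND). The sketch below is ours; the file names and the tab-separated format are assumptions made for illustration, not a description of the released resources.

def load_normalisation_dictionary(path):
    # One "variant<TAB>standard" pair per line; the standard form may be a multiword.
    nd = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            variant, standard = line.rstrip("\n").split("\t", 1)
            nd[variant] = standard
    return nd


def load_word_list(path):
    # One form per line (the same loader can be used for both SD and PND).
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


# Example usage (hypothetical file names):
#   nd = load_normalisation_dictionary("nd.tsv")   # 824 variant-standard pairs
#   sd = load_word_list("sd.txt")                  # 778,149 standard forms
#   pnd = load_word_list("pnd.txt")                # 107,980 proper-name unigrams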
2.2 The algorithm

The system takes a list of OOV as input. An OOV is considered correct if the Dictionary Lookup process is true. Dictionary Lookup is a process that consists in searching for a token in one of the three lexical dictionaries: ND, SD, or PND. If the OOV is found in one of them, then it is considered correct. However, even if Dictionary Lookup is false, the OOV will be considered correct if Affix Check is true. Affix Check is a process that extracts regular suffixes and prefixes from the OOV and verifies whether the stem of the OOV is part of an entry found in one of the three dictionaries. Otherwise, the OOV can be incorrect.

Given an incorrect OOV, we generate a list of variants. A variant of an OOV is an IV candidate if either Dictionary Lookup or Affix Check is true. We distinguish between primary and secondary variants.

2.2.1 Generation of primary variants
Primary variants of an OOV are its most likely IV candidates, according to the types of errors we found in the development corpus. Primary variants are favoured in the process of candidate selection: if at least one primary variant is found, then the system does not consider secondary variants.

Primary variants of an OOV are those IV candidates derived from the OOV that only differ from the source OOV with regard to one of these linguistic phenomena: uppercase/lowercase confusion, character repetition, or frequent Spanish spelling errors. The frequent spelling errors include not only typical problems with accents and frequent letter confusions (v/b, j/g, etc.), but also some phonemic conventions, namely the use of "x" for "ch" (e.g. xicle → chicle). Primary variants generated by simplifying repetition include the cases of interjection reduction: jejeeje → je. For uppercase and lowercase variation, we take into account that words can be written with only lowercase letters, with capitalisation (proper names or first position in the sentence), or with only uppercase letters (e.g. acronyms). For instance, given the OOV "pedro", two other variants are generated: "Pedro" and "PEDRO". If one of them is found in the lexical resources, then it is considered a primary IV candidate. Let us note that a primary variant is considered an IV candidate if either Dictionary Lookup or Affix Check is true.

2.2.2 Generation of secondary variants
If no primary variant is found as an IV candidate, then a large list of secondary variants is generated using edit distance. In our experiments, we only generate those variants that have edit distance 1 with regard to the original OOV. Dictionary Lookup and Affix Check allow us to identify the list of secondary IV candidates. In the next step, we select the best candidate.

2.2.3 Candidate selection
To select the best IV candidate of a given OOV, we compare the local context of each candidate against a language model containing bigrams of tokens found within a window of size 4 (2 tokens to the left and 2 to the right of a given token). More precisely, for each candidate, a chi-square measure is computed by considering the observed frequencies in the local context against the expected frequencies in the language model. The language model was built by selecting lemmas of the following PoS categories: nouns, verbs, adjectives, prepositions, and adverbs. Text was processed with FreeLing. We also introduced an important restriction that takes into account whether the IV candidate is a primary or a secondary variant. A primary variant is always selected even if its chi-square score is 0, that is, even if it is not found in the language model. By contrast, for secondary variants, the chi-square score must be higher than 0 to be selected. Candidates are ranked considering chi-square values and the above restriction. The IV candidate at the top of the ranking is selected and given as the correction of the OOV. At the end, we apply the capitalisation rule, which considers the position of the original ill-formed OOV in the sentence: if it is the first word in the sentence, then the selected IV candidate must be written with its first letter in uppercase. Finally, if no IV candidate (primary or secondary variant) is selected, then the OOV is considered correct. So, correct OOV are detected in two different ways: either Dictionary Lookup or Affix Check is true for the original OOV, or no IV candidate is selected.
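The scoring and ranking step can be pictured with the Python sketch below. The exact chi-square formulation is not spelled out above, so this is only one plausible reading (a one-cell chi-square of each candidate-context bigram, summed over the window), and all names are ours; the restriction on primary versus secondary variants follows the description above.

def chi_square_cell(bigram_count, cand_count, word_count, total_bigrams):
    # One-cell chi-square approximation for a (candidate, context word) bigram:
    # (observed - expected)^2 / expected, where the expected count is derived
    # from the unigram counts of the language model under independence.
    if total_bigrams == 0:
        return 0.0
    expected = cand_count * word_count / total_bigrams
    if expected == 0.0:
        return 0.0
    return (bigram_count - expected) ** 2 / expected


def context_score(candidate, window, bigrams, unigrams, total_bigrams):
    # window: up to 2 tokens to the left and 2 to the right of the OOV.
    return sum(
        chi_square_cell(
            bigrams.get((candidate, w), 0) + bigrams.get((w, candidate), 0),
            unigrams.get(candidate, 0),
            unigrams.get(w, 0),
            total_bigrams)
        for w in window)


def select_best(primary, secondary, window, bigrams, unigrams, total_bigrams):
    # Primary variants are always eligible, even with a score of 0;
    # secondary variants need a score strictly higher than 0.
    if primary:
        return max(primary,
                   key=lambda c: context_score(c, window, bigrams, unigrams, total_bigrams))
    scored = [(context_score(c, window, bigrams, unigrams, total_bigrams), c)
              for c in secondary]
    scored = [(s, c) for (s, c) in scored if s > 0]
    return max(scored)[1] if scored else None   # None: the OOV is left as correct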
3 Experiments

Some experiments were performed using as test set the development corpus provided by the organisers of the Tweet Normalization Workshop. This corpus contains 500 tweets and 651 manually corrected OOV. The language model used by our system was built from two text sources: the collection of 227,255 tweets provided by the Workshop, which were captured between April 1st and 2nd of 2013, and a collection of news from El Pais and El Mundo captured via RSS crawling. In sum, the language model was created from 50MB of text. The normalisation dictionary contains annotated information from the sample corpus of 100 tweets provided by the Workshop. For the final tests, this dictionary also includes the annotated pairs of the development corpus.

Two versions of our system were tested, "Standard" and "Restricted", and compared against two baselines, "Baseline1" and "Baseline2". The standard version has been described in the previous section. The restricted version includes a constraint on short proper names and short acronyms (with fewer than 5 letters): the constraint prevents short proper names and acronyms from being expanded with secondary variants. For instance, if the OOV is "BBC", the system does not create IV candidates such as "BBV", "ABC", and so on. In Baseline1, we do not separate primary from secondary variants, and all IV candidates are treated as primary variants. Baseline2 does not separate primary from secondary variants either, and all IV candidates are treated as secondary variants.

Table 1 shows the results obtained from the experiments performed on the development set. The best performance is achieved with "Restricted", which is based on the algorithm that makes use of restrictions on short proper names. The low scores reached by the baseline systems clearly show that candidates must be separated at different levels to be treated in different ways. On the test set, "Restricted" achieved 66.3% accuracy, the second best score among the 13 participants in the Tweet-Norm competition.

Systems      pos   neg   accuracy
Baseline1    273   378   41.80
Baseline2    288   363   44.10
Standard     444   207   67.99
Restricted   451   200   69.06

Table 1: Results from the development set
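The constraint used by the restricted version can be expressed as in the short sketch below; the helper name and the exact capitalisation test are our assumptions, not the authors' code.

def allow_secondary_variants(oov):
    # "Restricted" version: short acronyms and short capitalised proper names
    # (fewer than 5 letters) are never expanded with secondary variants.
    looks_like_acronym = oov.isupper()
    looks_like_proper_name = oov[:1].isupper() and not oov.isupper()
    if len(oov) < 5 and (looks_like_acronym or looks_like_proper_name):
        return False   # e.g. "BBC" is not expanded to "BBV", "ABC", ...
    return True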
References

Beaufort, Richard, Sophie Roekhaut, Louise-Amélie Cougnon, and Cédrick Fairon. 2010. A hybrid rule/model-based finite-state framework for normalizing SMS messages. In 48th Annual Meeting of the Association for Computational Linguistics, pages 770–779, Uppsala, Sweden.

Gamallo, P., M. Garcia, I. González, M. Muñoz, and I. del Río. 2013. Learning verb inflection using Cilenis conjugators. Eurocall Review, 21(1):12–19.

Gamallo, Pablo and Marcos Garcia. 2011. A resource-based method for named entity extraction and classification. LNCS, 7026:610–623.

Gamallo, Pablo and Isaac González. 2010. Wikipedia as a multilingual source of comparable corpora. In LREC 2010 Workshop on Building and Using Comparable Corpora, pages 19–26, Valletta, Malta.

Han, B. and T. Baldwin. 2012a. Automatically constructing a normalisation dictionary for microblogs. In Conference on Empirical Methods in Natural Language Processing and Natural Language Learning (EMNLP-CoNLL 2012), Jeju, Korea.

Han, B. and T. Baldwin. 2012b. Lexical normalisation of short text messages: Makn sens a twitter. In 49th Annual Meeting of the Association for Computational Linguistics, pages 368–378, Portland, Oregon, USA.

Han, B. and T. Baldwin. 2013. Lexical normalisation of social media text. ACM Transactions on Intelligent Systems and Technology, 4(1):15–27.

Kaufmann, J. and J. Kalita. 2010. Syntactic normalization of twitter messages. In Conference on Natural Language Processing, Kharagpur, India.

Padró, Lluís and Evgeny Stanilovsky. 2012. FreeLing 3.0: Towards wider multilinguality. In Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey.