unimelb: Spanish Text Normalisation
unimelb: Normalización de texto en español

Bo Han,1,2 Paul Cook1 and Timothy Baldwin1,2
1 Department of Computing and Information Systems, The University of Melbourne
2 NICTA Victoria Research Lab
hanb@student.unimelb.edu.au, paulcook@unimelb.edu.au, tb@ldwin.net

Resumen: El presente artículo describe una aproximación a la normalización de texto basada en léxico para tweets en español. En primer lugar se realiza una comparación entre la normalización de texto en español e inglés y se plantea la hipótesis de que se puede adaptar un enfoque similar ya planteado previamente para el inglés. Para ello, se construye un léxico de normalización a partir de un corpus, utilizando similaridad distribucional, y se combina con otros léxicos existentes (por ejemplo diccionarios de jerga de Internet en español). Estos léxicos permiten una solución rápida basada en búsquedas. Los resultados experimentales indican que el léxico derivado del corpus complementa bien a los léxicos existentes, pero que la solución puede mejorarse con un mejor manejo de ciertos tipos de palabras, como las entidades con nombre.
Palabras clave: Twitter, español, normalización de texto

Abstract: This paper describes a lexicon-based text normalisation approach for Spanish tweets. We first compare English and Spanish text normalisation, and hypothesise that an approach previously proposed for English can be adapted to Spanish. A corpus-derived normalisation lexicon is built using distributional similarity, and is combined with existing lexicons (e.g., containing Spanish Internet slang). These lexicons enable a very fast, look-up based approach to text normalisation. Experimental results indicate that the corpus-derived lexicon complements existing lexicons, but that the approach could be improved through better handling of certain word types, such as named entities.
Keywords: Twitter, Spanish, Text Normalisation

1 Introduction

A tremendous amount of user-generated text is produced on social media sites such as Twitter and Facebook, and can be leveraged for natural language processing (NLP) tasks such as sentiment analysis (Jiang et al., 2011) and event detection (Weng and Lee, 2011). However, this user-generated text is noisy, and contains various non-standard words, e.g., jajaja ("ja") and queee ("que"). These non-standard words are not recognised by off-the-shelf NLP tools, and may consequently degrade the utility of NLP on social media. One way to tackle this challenge is text normalisation — restoring non-standard words to their canonical forms, e.g., transforming jajaja to "ja" and queee to "que" (Eisenstein, 2013; Han, Cook, and Baldwin, 2013).

This paper proposes a lexicon-based approach to Spanish text normalisation. In particular, we adapt the method of Han, Cook, and Baldwin (2012) to build a normalisation lexicon that maps non-standard words to their standard forms relative to a vocabulary, i.e., out-of-vocabulary (OOV) words are mapped deterministically to in-vocabulary (IV) words. This enables a very fast, look-up based approach to text normalisation. In our approach, an OOV word is first looked up in an automatically-derived normalisation lexicon that is complemented with entries from Spanish Internet slang dictionaries and the development data. If the OOV word is found in this lexicon it is normalised according to its entry; otherwise it is left unchanged. During this normalisation step, OOV words and the resulting normalisations are down-cased, so a final case restoration step is performed to appropriately capitalise the lowercased normalisations.
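The look-up step described above can be sketched as follows. This is a minimal illustration only: the IV vocabulary and lexicon entries below are toy stand-ins for the resources introduced later, not the actual Freeling dictionary or derived lexicons.

```python
# Minimal sketch of look-up based normalisation: down-case each token,
# leave IV words alone, map known OOVs via the lexicon, and leave
# unknown OOVs unchanged. All data here is illustrative.

IV_VOCAB = {"cayendo", "que", "ja", "pedro"}          # toy stand-in for the IV dictionary
NORM_LEXICON = {"callendo": "cayendo", "queee": "que", "jajaja": "ja"}

def normalise_token(token):
    """Down-case the token; if it is OOV, replace it via the lexicon
    when an entry exists, otherwise leave it unchanged."""
    lowered = token.lower()
    if lowered in IV_VOCAB:      # IV words are not normalised
        return lowered
    return NORM_LEXICON.get(lowered, lowered)

def normalise_tweet(tokens):
    return [normalise_token(t) for t in tokens]

print(normalise_tweet(["Callendo", "queee", "jajaja"]))
# -> ['cayendo', 'que', 'ja']
```

Because normalisation is a single dictionary look-up per token, the whole step runs in time linear in the tweet length, which is what makes the approach so fast.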
2 Comparing English and Spanish Text Normalisation

The lexicon-based normalisation approach of Han, Cook, and Baldwin (2012) was evaluated on English tweets. In this section we consider the plausibility of adapting their method from English to Spanish, and identify the following key factors:

Orthography: if we consider diacriticised letters as single characters, Spanish has more characters than English, and diacritics can lead to differences in meaning, e.g., más means "more", and mas means "but". The method of Han, Cook, and Baldwin (2012) uses Levenshtein distance to measure string similarity. We simply convert all characters to fused Unicode code points (treating á and a as different characters) and compute Levenshtein distance over these forms.

Word segmentation: Spanish and English words both largely use whitespace segmentation, so similar tokenisation strategies can be used.

Morphophonemics: Phonetic modeling of words — a component of the method of Han, Cook, and Baldwin (2012) — is available for Spanish using an off-the-shelf Double Metaphone implementation.1

1 https://github.com/amsqr/Spanish-Metaphone

Lexical resources: A lexicon and slang dictionary — key resources for the method of Han, Cook, and Baldwin (2012) — are available for Spanish.

Overall, English and Spanish text share important features, and we hypothesise that adapting a lexicon-based English normalisation system to Spanish is feasible.

One important component of this Spanish normalisation task is case restoration: e.g., maria as a name should be normalised to "Maria". Most previous English Twitter normalisation tasks have focused on lowercase words and ignored capitalisation.

3 System Description

The system consists of two steps: (1) down-case all OOVs and normalise them based on a normalisation lexicon which combines entries from existing lexicons (Section 3.1) and entries automatically learnt from a Twitter corpus (Section 3.2); (2) restore case for normalised words (Section 3.3).

3.1 Resources

Our normalisation transforms OOV forms to IV words, and thus a Spanish lexicon is required to determine what is OOV. To this end, we use the Freeling 3.0 Spanish dictionary (Padró and Stanilovsky, 2012), which contains 669k words.

We collected 146 Spanish Internet slang expressions and cell phone abbreviations from the web (Slang Lexicon).2 We further extracted normalisation pairs from the development data (Dev Lexicon).

2 http://goo.gl/wgCFSs and http://goo.gl/xsYkDe, both accessed on 26/06/2013

Analysing the development data, we noticed that many person names are not correctly capitalised. We formed Name Lexicon from a list of 277 common Spanish names.3 This lexicon maps lowercase person names to their correctly capitalised forms.

3 https://en.wikipedia.org/wiki/Spanish_naming_customs

3.2 Corpus-derived Lexicon

The small, manually-crafted normalisation lexicons from Section 3.1 have low coverage of non-standard words. To improve coverage, we automatically derive a much larger normalisation lexicon based on distributional similarity (Dist Lexicon) by adapting the method of Han, Cook, and Baldwin (2012).

We collected 283 million Spanish tweets via the Twitter Streaming API4 from 21/09/2011–28/02/2012. Spanish tweets were identified using langid.py (Lui and Baldwin, 2012). The tweets were tokenised using a simplified English Twitter tokeniser (O'Connor, Krieger, and Ahn, 2010). Excessive repetitions of characters (i.e., ≥ 3) in words are shortened to one character to ensure different variations of the same pattern are merged. To improve coverage, we removed the restriction from the original work that only OOVs with ≥ 4 letters were considered as candidates for normalisation.

4 https://dev.twitter.com

For a given OOV, we define its confusion set to be all IV words with Levenshtein distance ≤ 2 in terms of characters or ≤ 1 in terms of Double Metaphone code. We rank the items in the confusion set according to their distributional similarity to the OOV. Han, Cook, and Baldwin (2012) considered many configurations of distributional similarity for normalisation of English tweets. We use the same settings they selected: context is represented by positionally-indexed bigrams using a window size of ±2 tokens; similarity is measured using KL divergence. An entry in the normalisation dictionary then consists of the OOV and its top-ranked IV.
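The confusion-set construction can be sketched as follows. Only the character-level Levenshtein condition (distance ≤ 2) is shown; the Double Metaphone condition and the KL-divergence ranking over tweet contexts are omitted for brevity, and the IV word list is a toy stand-in for the Freeling dictionary.

```python
# Sketch of confusion-set construction: all IV words within edit
# distance 2 of the OOV. Diacriticised letters count as distinct
# characters, matching the fused-code-point treatment described above.

def levenshtein(a, b):
    """Standard dynamic-programming edit distance over characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def confusion_set(oov, iv_words, max_dist=2):
    """All IV words within Levenshtein distance <= max_dist of the OOV."""
    return {w for w in iv_words if levenshtein(oov, w) <= max_dist}

IV_WORDS = ["cayendo", "saliendo", "fallando", "valiendo", "que"]
print(sorted(confusion_set("callendo", IV_WORDS)))
# -> ['cayendo', 'fallando', 'saliendo', 'valiendo']
```

In the full method, the members of this set would then be ranked by the KL divergence between their context distributions and that of the OOV, and the top-ranked IV word becomes the lexicon entry.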
From the development data, we observe that in many cases when a correct normalisation is identified, there is a large difference in KL divergence between the first- and second-ranked IVs. Conversely, if the KL divergence of the first- and second-ranked normalisation candidates is similar, the normalisation is often less reliable. As shown in Table 1, only callendo ("cayendo") is a correctly-derived (OOV, IV) pair; guau ("y") is not.

Rank   callendo          guau
1      cayendo   0.713   y     1.756
2      saliendo  3.896   que   1.873
3      fallando  4.303   la    2.488
4      rallando  6.761   a     2.649
5      valiendo  6.878   no    3.206

Table 1: The KL divergence for the top-five candidates for callendo and guau.

Motivated by this observation, we filter the derived (OOV, IV) pairs by the KL divergence ratio of the first- and second-ranked IV words for the OOV. Setting a high threshold on this KL divergence ratio increases the reliability of the derived lexicon, but reduces its coverage. This ratio was tested for values from 1.0 to 3.0 with a step size of 0.1 over the development data and Slang Lexicon. As shown in Figure 1, the best precision (94.0%) is achieved when the ratio is 1.9.5 We directly use this setting to derive the final lexicon, instead of further re-ranking the (OOV, IV) pairs using string similarity.

5 Here precision is defined as #correct normalisations / #normalisations.

Figure 1: KL divergence ratio cut-off vs. precision of the derived normalisation lexicon on the development data and Slang Lexicon.

3.3 Case Restoration

We set the case of each token that was normalised in the previous step (and is thus down-cased at this stage) to its most-frequent casing in our corpus of Spanish tweets. We also capitalise all normalised tokens occurring at the beginning of a tweet, or following a period or question mark.

4 Results and Discussion

We evaluated the lexicons using classification accuracy, the official metric for this shared task, on the tweet-norm test data. This metric divides the number of correct proposals — OOVs correctly normalised or left unchanged — by the number of OOVs in the collection. This is termed "precision" by the task organisers, but a true measure of precision would be based on the number of OOVs that were actually normalised. We therefore use the term "accuracy" here.

We submitted two runs for the task. The first, Combined Lexicon (Table 2), uses the combination of lexicons from Section 3, and achieves an accuracy of 0.52. The second run builds on Combined Lexicon but incorporates normalisation based on character edit distance for words with many repeated characters. We observed that such words are often non-standard, and tend not to occur in the lexicons because of their relatively low frequency. For words with ≥ 3 repeated characters, we remove all but one of the repeated characters, and then select the most similar IV word according to character-based Levenshtein distance. The accuracy of this run is 0.54 (+ Edit distance, Table 2).

Lexicon            Accuracy
Combined Lexicon   0.52
− Slang Lexicon    0.51
− Dev Lexicon      0.46
− Dist Lexicon     0.42
− Name Lexicon     0.51
+ Edit distance    0.54
Baseline           0.20

Table 2: Accuracy of lexicon-based normalisation systems. "−" indicates the removal of a particular lexicon.
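The repeated-character handling in the "+ Edit distance" run can be sketched as follows. The IV list and helper names are illustrative assumptions; the regular expression and edit-distance selection mirror the procedure described above.

```python
import re

# Sketch of the repeated-character fallback: runs of three or more
# identical characters are collapsed to one, and the most similar IV
# word by character edit distance is then selected. Toy data only.

def squash_repeats(word):
    """Collapse runs of >= 3 identical characters to a single character."""
    return re.sub(r"(.)\1{2,}", r"\1", word)

def edit_distance(a, b):
    """Plain character-level Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def normalise_repeated(word, iv_words):
    """Squash repeats, then pick the closest IV word (ties broken
    alphabetically for determinism)."""
    squashed = squash_repeats(word)
    return min(sorted(iv_words), key=lambda w: edit_distance(squashed, w))

IV_WORDS = ["que", "queso", "ja"]
print(normalise_repeated("queeee", IV_WORDS))   # -> que
# Irregular repetitions such as uajajajaa are not caught, since no single
# character occurs three or more times consecutively:
print(squash_repeats("uajajajaa"))              # -> uajajajaa
```

The second example anticipates the false-negative analysis below: simple run-collapsing misses repetitions of multi-character sequences.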
We further consider an ablative analysis of the component lexicons of Combined Lexicon. As shown in Table 2, when Slang Lexicon (− Slang Lexicon) or Name Lexicon (− Name Lexicon) is excluded, accuracy declines only slightly. Although this suggests that existing resources play only a minor role in the normalisation of Spanish tweets, this is likely due in part to the relatively small size of Slang Lexicon, which is much smaller than similar English resources that have been effectively exploited in normalisation — i.e., 145 Spanish entries versus 5k English entries used by Han and Baldwin (2011). Furthermore, Slang Lexicon might have little impact due to differences between Spanish Twitter and SMS, the latter being the primary focus of Slang Lexicon.

On the other hand, normalisation lexicons derived from tweets — whether based on the development data (Dev Lexicon) or automatically learnt (Dist Lexicon) — have a substantial impact on accuracy (− Dev Lexicon and − Dist Lexicon). These findings for the automatically derived Dist Lexicon are in line with previous findings for English Twitter normalisation (Han, Cook, and Baldwin, 2012), which indicate that such lexicons can substantially improve recall with little impact on precision.

We considered an experiment in which we used Combined Lexicon but ignored case in the evaluation; the accuracy was 0.56. This corresponds to the upper bound on accuracy if our system performed case restoration perfectly, and suggests that improving the case restoration of our system would not lead to substantial gains in accuracy.

In the final row of Table 2 we show results for a baseline method which makes no attempt to normalise the input. All lexicon-based methods improve substantially over this baseline.

To further analyse our lexicon-based normalisation approach, we categorise the errors for both false positives (OOVs that were normalised, but incorrectly so) and false negatives (OOVs that were not normalised, but should have been). As shown in Table 3, 37% of false positives are incorrect lexical forms, e.g., algerooo is normalised to "algero" and not its correct form "alegra". Further examination shows that 23% of these cases are incorrectly normalised to "que", suggesting that distributional similarity alone is insufficient to capture normalisations for some non-standard words.

Error type              Number  Percentage
Incorrect lexical form  22      37%
Not available           19      32%
Accent error            10      17%
Case error              5       8%
One to many             2       3%
Annotation error        1       2%

Table 3: Categorisation of false positives.

Surprisingly, we found some OOVs included in the test data, but excluded from the gold-standard annotations (due to tweet deletions), or present in the test data, but not found in the tweets, and excluded from the gold standard. These error types are denoted "Not available" in Table 3, and account for the second largest source of false positives.

Incorrect accents and casing account for 17% and 8% of false positives, respectively. In both of these cases, contextual information, which is not incorporated in the proposed approach, could be helpful. Finally, we identified two one-to-many normalisations (which are outside the scope of our normalisation system), and one case we judged to be an annotation error.

We analysed a random sample of 20 of the 280 false negatives, and found irregular character repetitions and named entities to be the main sources of errors, e.g., uajajajaa ("ja") and Pedroo ("Pedro").6 The lexicon-based approach could be improved, for example, by using additional regular expressions to capture repetitions of character sequences. Errors involving named entities reveal the limitations of using the Freeling 3.0 Spanish dictionary as the IV lexicon, as it has limited coverage of named entities. A corpus-derived lexicon (e.g., from Wikipedia) could help improve this coverage.

6 Pedro is not in our collected list of Spanish names.
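The case-restoration step (Section 3.3), which the case-error analysis above bears on, can be sketched as follows. The casing counts below are toy assumptions, not corpus statistics, and the function names are illustrative.

```python
from collections import Counter

# Sketch of case restoration: each normalised (lowercased) token is
# restored to its most frequent casing in a tweet corpus, and normalised
# tokens at the start of a tweet or after "." / "?" are capitalised.

CASING_COUNTS = {
    "maria": Counter({"Maria": 90, "maria": 40, "MARIA": 5}),
    "que": Counter({"que": 1000, "Que": 50}),
}

def restore_case(tokens, normalised_positions):
    """Restore casing for the token positions normalised earlier."""
    out = []
    sentence_start = True  # the first token of the tweet
    for i, tok in enumerate(tokens):
        if i in normalised_positions:
            counts = CASING_COUNTS.get(tok)
            if counts:
                tok = counts.most_common(1)[0][0]  # most frequent casing
            if sentence_start:
                tok = tok.capitalize()
        sentence_start = tok in {".", "?"}
        out.append(tok)
    return out

print(restore_case(["maria", "dijo", "que", "si"], {0, 2}))
# -> ['Maria', 'dijo', 'que', 'si']
```

Only normalised tokens are touched; unnormalised tokens keep whatever casing the tweet already had, matching the description in Section 3.3.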
5 Summary

In this paper, we applied a lexicon-based approach to normalise non-standard words in Spanish tweets. Our analysis suggests that the corpus-derived lexicon based on distributional similarity improves accuracy, but that this approach is limited in terms of flexibility (e.g., to capture accent variation) and lexicon coverage (e.g., of named entities). In future work, we plan to expand the IV lexicon, and incorporate contextual information to improve normalisation involving accents and casing.

Acknowledgements

NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy, and the Australian Research Council through the ICT Centre of Excellence programme. The authors would like to thank the anonymous reviewers for their valuable feedback and language expertise.

References

Eisenstein, Jacob. 2013. What to do about bad language on the internet. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2013), pages 359–369, Atlanta, USA.

Han, Bo and Timothy Baldwin. 2011. Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL HLT 2011), pages 368–378, Portland, Oregon, USA.

Han, Bo, Paul Cook, and Timothy Baldwin. 2012. Automatically constructing a normalisation dictionary for microblogs. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 421–432, Jeju Island, Korea.

Han, Bo, Paul Cook, and Timothy Baldwin. 2013. Lexical normalisation of short text messages. ACM Transactions on Intelligent Systems and Technology, 4(1):5:1–5:27.

Jiang, Long, Mo Yu, Ming Zhou, Xiaohua Liu, and Tiejun Zhao. 2011. Target-dependent Twitter sentiment classification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL HLT 2011), pages 151–160, Portland, Oregon, USA.

Lui, Marco and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012) Demo Session, pages 25–30, Jeju, Republic of Korea.

O'Connor, Brendan, Michel Krieger, and David Ahn. 2010. TweetMotif: Exploratory search and topic summarization for Twitter. In Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media, pages 384–385, Washington, USA.

Padró, Lluís and Evgeny Stanilovsky. 2012. Freeling 3.0: Towards wider multilinguality. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), pages 2473–2479, Istanbul, Turkey.

Weng, Jianshu and Bu-Sung Lee. 2011. Event detection in Twitter. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media, Barcelona, Spain.