unimelb: Spanish Text Normalisation
unimelb: Normalización de texto en español

Bo Han,1,2 Paul Cook1 and Timothy Baldwin1,2
1 Department of Computing and Information Systems, The University of Melbourne
2 NICTA Victoria Research Lab
hanb@student.unimelb.edu.au, paulcook@unimelb.edu.au, tb@ldwin.net

Resumen: El presente artículo describe una aproximación a la normalización de texto basada en léxico para tweets en español. En primer lugar se realiza una comparación entre la normalización de texto en español e inglés y se plantea la hipótesis de que se puede adaptar un enfoque similar ya planteado previamente para el inglés. Para ello, se construye un léxico de normalización a partir de un corpus, utilizando similaridad distribucional, y se combina con otros léxicos existentes (por ejemplo diccionarios de jerga de Internet en español). Estos léxicos permiten una solución rápida basada en búsquedas. Los resultados experimentales indican que el léxico derivado del corpus complementa bien a los léxicos existentes, pero que la solución puede mejorarse con un mejor manejo de ciertos tipos de palabras, como las entidades con nombre.
Palabras clave: Twitter, español, normalización de texto

Abstract: This paper describes a lexicon-based text normalisation approach for Spanish tweets. We first compare English and Spanish text normalisation, and hypothesise that an approach previously proposed for English can be adapted to Spanish. A corpus-derived normalisation lexicon is built using distributional similarity, and is combined with existing lexicons (e.g., containing Spanish Internet slang). These lexicons enable a very fast, look-up based approach to text normalisation. Experimental results indicate that the corpus-derived lexicon complements existing lexicons, but that the approach could be improved through better handling of certain word types, such as named entities.
Keywords: Twitter, Spanish, Text Normalisation

1 Introduction

A tremendous amount of user-generated text is produced on social media sites such as Twitter and Facebook, and can be leveraged for natural language processing (NLP) tasks such as sentiment analysis (Jiang et al., 2011) and event detection (Weng and Lee, 2011). However, this user-generated text is noisy, and contains various non-standard words, e.g., jajaja ("ja") and queee ("que"). These non-standard words are not recognised by off-the-shelf NLP tools, and may consequently degrade the utility of NLP on social media. One way to tackle this challenge is text normalisation — restoring non-standard words to their canonical forms, e.g., transforming jajaja to "ja" and queee to "que" (Eisenstein, 2013; Han, Cook, and Baldwin, 2013).

This paper proposes a lexicon-based approach to Spanish text normalisation. In particular, we adapt the method of Han, Cook, and Baldwin (2012) to build a normalisation lexicon that maps non-standard words to their standard forms relative to a vocabulary, i.e., out-of-vocabulary (OOV) words are mapped deterministically to in-vocabulary (IV) words. This enables a very fast, look-up based approach to text normalisation. In our approach, an OOV word is first looked up in an automatically-derived normalisation lexicon that is complemented with entries from Spanish Internet slang dictionaries and the development data. If the OOV word is found in this lexicon it is normalised according to its entry; otherwise it is left unchanged. During this normalisation step, OOV words and the resulting normalisations are down-cased, so a final case restoration step is performed to appropriately capitalise the lowercased normalisations.
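The look-up step described above can be sketched as follows. This is a minimal illustration only: the IV vocabulary and lexicon entries below are toy stand-ins for the resources introduced later, not the actual Freeling dictionary or derived lexicons.

```python
# Minimal sketch of look-up based normalisation: down-case each token,
# leave IV words alone, map known OOVs via the lexicon, and leave
# unknown OOVs unchanged. All data here is illustrative.

IV_VOCAB = {"cayendo", "que", "ja", "pedro"}          # toy stand-in for the IV dictionary
NORM_LEXICON = {"callendo": "cayendo", "queee": "que", "jajaja": "ja"}

def normalise_token(token):
    """Down-case the token; if it is OOV, replace it via the lexicon
    when an entry exists, otherwise leave it unchanged."""
    lowered = token.lower()
    if lowered in IV_VOCAB:      # IV words are not normalised
        return lowered
    return NORM_LEXICON.get(lowered, lowered)

def normalise_tweet(tokens):
    return [normalise_token(t) for t in tokens]

print(normalise_tweet(["Callendo", "queee", "jajaja"]))
# -> ['cayendo', 'que', 'ja']
```

Because normalisation is a single dictionary look-up per token, the whole step runs in time linear in the tweet length, which is what makes the approach so fast.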
2 Comparing English and Spanish Text Normalisation

The lexicon-based normalisation approach of Han, Cook, and Baldwin (2012) was evaluated on English tweets. In this section we consider the plausibility of adapting their method from English to Spanish, and identify the following key factors:

Orthography: if we consider diacriticised letters as single characters, Spanish has more characters than English, and diacritics can lead to differences in meaning, e.g., más means "more", and mas means "but". The method of Han, Cook, and Baldwin (2012) uses Levenshtein distance to measure string similarity. We simply convert all characters to fused Unicode code points (treating á and a as different characters) and compute Levenshtein distance over these forms.

Word segmentation: Spanish and English words both largely use whitespace segmentation, so similar tokenisation strategies can be used.

Morphophonemics: Phonetic modeling of words — a component of the method of Han, Cook, and Baldwin (2012) — is available for Spanish using an off-the-shelf Double Metaphone implementation.1

1 https://github.com/amsqr/Spanish-Metaphone

Lexical resources: A lexicon and slang dictionary — key resources for the method of Han, Cook, and Baldwin (2012) — are available for Spanish.

Overall, English and Spanish text share important features, and we hypothesise that adapting a lexicon-based English normalisation system to Spanish is feasible.

One important component of this Spanish normalisation task is case restoration: e.g., maria as a name should be normalised to "Maria". Most previous English Twitter normalisation tasks have focused on lowercase words and ignored capitalisation.

3 System Description

The system consists of two steps: (1) down-case all OOVs and normalise them based on a normalisation lexicon which combines entries from existing lexicons (Section 3.1) and entries automatically learnt from a Twitter corpus (Section 3.2); (2) restore case for normalised words (Section 3.3).

3.1 Resources

Our normalisation transforms OOV forms to IV words, and thus a Spanish lexicon is required to determine what is OOV. To this end, we use the Freeling 3.0 Spanish dictionary (Padró and Stanilovsky, 2012), which contains 669k words.

We collected 146 Spanish Internet slang expressions and cell phone abbreviations from the web (Slang Lexicon).2 We further extracted normalisation pairs from the development data (Dev Lexicon).

2 http://goo.gl/wgCFSs and http://goo.gl/xsYkDe, both accessed on 26/06/2013

Analysing the development data, we noticed that many person names are not correctly capitalised. We formed Name Lexicon from a list of 277 common Spanish names.3 This lexicon maps lowercase person names to their correctly capitalised forms.

3 https://en.wikipedia.org/wiki/Spanish_naming_customs

3.2 Corpus-derived Lexicon

The small, manually-crafted normalisation lexicons from Section 3.1 have low coverage of non-standard words. To improve coverage, we automatically derive a much larger normalisation lexicon based on distributional similarity (Dist Lexicon) by adapting the method of Han, Cook, and Baldwin (2012).

We collected 283 million Spanish tweets via the Twitter Streaming API4 from 21/09/2011–28/02/2012. Spanish tweets were identified using langid.py (Lui and Baldwin, 2012). The tweets were tokenised using a simplified English Twitter tokeniser (O'Connor, Krieger, and Ahn, 2010). Excessive repetitions of characters (i.e., ≥ 3) in words are shortened to one character to ensure different variations of the same pattern are merged. To improve coverage, we removed the restriction from the original work that only OOVs with ≥ 4 letters were considered as candidates for normalisation.

4 https://dev.twitter.com

For a given OOV, we define its confusion set to be all IV words with Levenshtein distance ≤ 2 in terms of characters or ≤ 1 in terms of Double Metaphone code. We rank the items in the confusion set according to their distributional similarity to the OOV. Han, Cook, and Baldwin (2012) considered many configurations of distributional similarity for normalisation of English tweets. We use the same settings they selected: context is represented by positionally-indexed bigrams using a window size of ±2 tokens; similarity is measured using KL divergence. An entry in the normalisation dictionary then consists of the OOV and its top-ranked IV.
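The confusion-set construction can be sketched as follows. Only the character-level Levenshtein condition (distance ≤ 2) is shown; the Double Metaphone condition and the KL-divergence ranking over tweet contexts are omitted for brevity, and the IV word list is a toy stand-in for the Freeling dictionary.

```python
# Sketch of confusion-set construction: all IV words within edit
# distance 2 of the OOV. Diacriticised letters count as distinct
# characters, matching the fused-code-point treatment described above.

def levenshtein(a, b):
    """Standard dynamic-programming edit distance over characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def confusion_set(oov, iv_words, max_dist=2):
    """All IV words within Levenshtein distance <= max_dist of the OOV."""
    return {w for w in iv_words if levenshtein(oov, w) <= max_dist}

IV_WORDS = ["cayendo", "saliendo", "fallando", "valiendo", "que"]
print(sorted(confusion_set("callendo", IV_WORDS)))
# -> ['cayendo', 'fallando', 'saliendo', 'valiendo']
```

In the full method, the members of this set would then be ranked by the KL divergence between their context distributions and that of the OOV, and the top-ranked IV word becomes the lexicon entry.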
From the development data, we observe that in many cases when a correct normalisation is identified, there is a large difference in KL divergence between the first- and second-ranked IVs. Conversely, if the KL divergence of the first- and second-ranked normalisation candidates is similar, the normalisation is often less reliable. As shown in Table 1, only callendo ("cayendo") is a correctly-derived (OOV, IV) pair; guau ("y") is not.

Rank   callendo          guau
1      cayendo   0.713   y     1.756
2      saliendo  3.896   que   1.873
3      fallando  4.303   la    2.488
4      rallando  6.761   a     2.649
5      valiendo  6.878   no    3.206

Table 1: The KL divergence for the top-five candidates for callendo and guau.

Motivated by this observation, we filter the derived (OOV, IV) pairs by the KL divergence ratio of the first- and second-ranked IV words for the OOV. Setting a high threshold on this KL divergence ratio increases the reliability of the derived lexicon, but reduces its coverage. This ratio was tested for values from 1.0 to 3.0 with a step size of 0.1 over the development data and Slang Lexicon. As shown in Figure 1, the best precision (94.0%) is achieved when the ratio is 1.9.5 We directly use this setting to derive the final lexicon, instead of further re-ranking the (OOV, IV) pairs using string similarity.

5 Here precision is defined as #correct normalisations / #normalisations.

Figure 1: KL divergence ratio cut-off vs. precision of the derived normalisation lexicon on the development data and Slang Lexicon.

3.3 Case Restoration

We set the case of each token that was normalised in the previous step (and is thus down-cased at this stage) to its most-frequent casing in our corpus of Spanish tweets. We also capitalise all normalised tokens occurring at the beginning of a tweet, or following a period or question mark.

4 Results and Discussion

We evaluated the lexicons using classification accuracy, the official metric for this shared task, on the tweet-norm test data. This metric divides the number of correct proposals — OOVs correctly normalised or left unchanged — by the number of OOVs in the collection. This is termed "precision" by the task organisers, but a true measure of precision would be based on the number of OOVs that were actually normalised. We therefore use the term "accuracy" here.

We submitted two runs for the task. The first, Combined Lexicon (Table 2), uses the combination of lexicons from Section 3, and achieves an accuracy of 0.52. The second run builds on Combined Lexicon but incorporates normalisation based on character edit distance for words with many repeated characters. We observed that such words are often non-standard, and tend not to occur in the lexicons because of their relatively low frequency. For words with ≥ 3 repeated characters, we remove all but one of the repeated characters, and then select the most similar IV word according to character-based Levenshtein distance. The accuracy of this run is 0.54 (+ Edit distance, Table 2).

Lexicon            Accuracy
Combined Lexicon   0.52
− Slang Lexicon    0.51
− Dev Lexicon      0.46
− Dist Lexicon     0.42
− Name Lexicon     0.51
+ Edit distance    0.54
Baseline           0.20

Table 2: Accuracy of lexicon-based normalisation systems. "−" indicates the removal of a particular lexicon.
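The repeated-character handling in the "+ Edit distance" run can be sketched as follows. The IV list and helper names are illustrative assumptions; the regular expression and edit-distance selection mirror the procedure described above.

```python
import re

# Sketch of the repeated-character fallback: runs of three or more
# identical characters are collapsed to one, and the most similar IV
# word by character edit distance is then selected. Toy data only.

def squash_repeats(word):
    """Collapse runs of >= 3 identical characters to a single character."""
    return re.sub(r"(.)\1{2,}", r"\1", word)

def edit_distance(a, b):
    """Plain character-level Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def normalise_repeated(word, iv_words):
    """Squash repeats, then pick the closest IV word (ties broken
    alphabetically for determinism)."""
    squashed = squash_repeats(word)
    return min(sorted(iv_words), key=lambda w: edit_distance(squashed, w))

IV_WORDS = ["que", "queso", "ja"]
print(normalise_repeated("queeee", IV_WORDS))   # -> que
# Irregular repetitions such as uajajajaa are not caught, since no single
# character occurs three or more times consecutively:
print(squash_repeats("uajajajaa"))              # -> uajajajaa
```

The second example anticipates the false-negative analysis below: simple run-collapsing misses repetitions of multi-character sequences.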
We further consider an ablative analysis of the component lexicons of Combined Lexicon. As shown in Table 2, when Slang Lexicon (− Slang Lexicon) or Name Lexicon (− Name Lexicon) is excluded, accuracy declines only slightly. Although this suggests that existing resources play only a minor role in the normalisation of Spanish tweets, this is likely due in part to the relatively small size of Slang Lexicon, which is much smaller than similar English resources that have been effectively exploited in normalisation — i.e., 145 Spanish entries versus 5k English entries used by Han and Baldwin (2011). Furthermore, Slang Lexicon might have little impact due to differences between Spanish Twitter and SMS, the latter being the primary focus of Slang Lexicon.

On the other hand, normalisation lexicons derived from tweets — whether based on the development data (Dev Lexicon) or automatically learnt (Dist Lexicon) — have a substantial impact on accuracy (− Dev Lexicon and − Dist Lexicon). These findings for the automatically derived Dist Lexicon are in line with previous findings for English Twitter normalisation (Han, Cook, and Baldwin, 2012), which indicate that such lexicons can substantially improve recall with little impact on precision.

We considered an experiment in which we used Combined Lexicon but ignored case in the evaluation; the accuracy was 0.56. This corresponds to the upper bound on accuracy if our system performed case restoration perfectly, and suggests that improving the case restoration of our system would not lead to substantial gains in accuracy.

In the final row of Table 2 we show results for a baseline method which makes no attempt to normalise the input. All lexicon-based methods improve substantially over this baseline.

To further analyse our lexicon-based normalisation approach, we categorise the errors for both false positives (OOVs that were normalised, but incorrectly so) and false negatives (OOVs that were not normalised, but should have been). As shown in Table 3, 37% of false positives are incorrect lexical forms, e.g., algerooo is normalised to "algero" and not its correct form "alegra". Further examination shows that 23% of these cases are incorrectly normalised to "que", suggesting that distributional similarity alone is insufficient to capture normalisations for some non-standard words.

Error type              Number  Percentage
Incorrect lexical form  22      37%
Not available           19      32%
Accent error            10      17%
Case error              5       8%
One to many             2       3%
Annotation error        1       2%

Table 3: Categorisation of false positives.

Surprisingly, we found some OOVs included in the test data, but excluded from the gold-standard annotations (due to tweet deletions), or present in the test data, but not found in the tweets, and excluded from the gold standard. These error types are denoted "Not available" in Table 3, and account for the second largest source of false positives.

Incorrect accents and casing account for 17% and 8% of false positives, respectively. In both of these cases, contextual information, which is not incorporated in the proposed approach, could be helpful. Finally, we identified two one-to-many normalisations (which are outside the scope of our normalisation system), and one case we judged to be an annotation error.

We analysed a random sample of 20 of the 280 false negatives, and found irregular character repetitions and named entities to be the main sources of errors, e.g., uajajajaa ("ja") and Pedroo ("Pedro").6 The lexicon-based approach could be improved, for example, by using additional regular expressions to capture repetitions of character sequences. Errors involving named entities reveal the limitations of using the Freeling 3.0 Spanish dictionary as the IV lexicon, as it has limited coverage of named entities. A corpus-derived lexicon (e.g., from Wikipedia) could help improve this coverage.

6 Pedro is not in our collected list of Spanish names.
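The case-restoration step (Section 3.3), which the case-error analysis above bears on, can be sketched as follows. The casing counts below are toy assumptions, not corpus statistics, and the function names are illustrative.

```python
from collections import Counter

# Sketch of case restoration: each normalised (lowercased) token is
# restored to its most frequent casing in a tweet corpus, and normalised
# tokens at the start of a tweet or after "." / "?" are capitalised.

CASING_COUNTS = {
    "maria": Counter({"Maria": 90, "maria": 40, "MARIA": 5}),
    "que": Counter({"que": 1000, "Que": 50}),
}

def restore_case(tokens, normalised_positions):
    """Restore casing for the token positions normalised earlier."""
    out = []
    sentence_start = True  # the first token of the tweet
    for i, tok in enumerate(tokens):
        if i in normalised_positions:
            counts = CASING_COUNTS.get(tok)
            if counts:
                tok = counts.most_common(1)[0][0]  # most frequent casing
            if sentence_start:
                tok = tok.capitalize()
        sentence_start = tok in {".", "?"}
        out.append(tok)
    return out

print(restore_case(["maria", "dijo", "que", "si"], {0, 2}))
# -> ['Maria', 'dijo', 'que', 'si']
```

Only normalised tokens are touched; unnormalised tokens keep whatever casing the tweet already had, matching the description in Section 3.3.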
5 Summary

In this paper, we applied a lexicon-based approach to normalise non-standard words in Spanish tweets. Our analysis suggests that the corpus-derived lexicon based on distributional similarity improves accuracy, but that this approach is limited in terms of flexibility (e.g., to capture accent variation) and lexicon coverage (e.g., of named entities). In future work, we plan to expand the IV lexicon, and incorporate contextual information to improve normalisation involving accents and casing.

Acknowledgements

NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy, and the Australian Research Council through the ICT Centre of Excellence programme. The authors would like to thank the anonymous reviewers for their valuable feedback and language expertise.

References

Eisenstein, Jacob. 2013. What to do about bad language on the internet. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2013), pages 359–369, Atlanta, USA.

Han, Bo and Timothy Baldwin. 2011. Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL HLT 2011), pages 368–378, Portland, Oregon, USA.

Han, Bo, Paul Cook, and Timothy Baldwin. 2012. Automatically constructing a normalisation dictionary for microblogs. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 421–432, Jeju Island, Korea.

Han, Bo, Paul Cook, and Timothy Baldwin. 2013. Lexical normalisation of short text messages. ACM Transactions on Intelligent Systems and Technology, 4(1):5:1–5:27.

Jiang, Long, Mo Yu, Ming Zhou, Xiaohua Liu, and Tiejun Zhao. 2011. Target-dependent Twitter sentiment classification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL HLT 2011), pages 151–160, Portland, Oregon, USA.

Lui, Marco and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012) Demo Session, pages 25–30, Jeju, Republic of Korea.

O'Connor, Brendan, Michel Krieger, and David Ahn. 2010. TweetMotif: Exploratory search and topic summarization for Twitter. In Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media, pages 384–385, Washington, USA.

Padró, Lluís and Evgeny Stanilovsky. 2012. Freeling 3.0: Towards wider multilinguality. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), pages 2473–2479, Istanbul, Turkey.

Weng, Jianshu and Bu-Sung Lee. 2011. Event detection in Twitter. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media, Barcelona, Spain.