A Method for Lexical Normalisation of Tweets∗
Un método de normalización léxica de tweets

Pablo Gamallo and Marcos Garcia
CITIUS, Univ. de Santiago de Comp.
pablo.gamallo@usc.es

José Ramom Pichel
Imaxin Software
jramompichel@imaxin.com

Resumen: This article describes a strategy for the lexical normalisation of out-of-vocabulary (OOV) words in tweets written in Spanish. To correct ill-formed OOV, the normalisation system generates in-vocabulary (IV) candidates found in different lexical resources and selects the most suitable one. Our method generates two types of candidates, primary and secondary, which are ranked in different ways in the process of selecting the best candidate.
Palabras clave: Lexical normalisation, short text messages, tweet processing

Abstract: This paper describes a strategy to perform lexical normalisation of out-of-vocabulary (OOV) words in Spanish tweets. To correct any ill-formed OOV, the normalisation system generates in-vocabulary (IV) candidates found in several lexical resources, and selects the best one. Our method generates two types of candidates, primary and secondary IV candidates, which will be ranked in different ways to select the best candidate.
Keywords: Lexical Normalisation, Short Text Messages, Tweet Processing

∗ This work has been supported by Ministerio de Ciencia e Innovación, within the project OntoPedia, ref: FFI2010-14986.

1 Introduction

In this paper, we describe a strategy to perform lexical normalisation of out-of-vocabulary (OOV) words in Spanish tweets. The task can be described as follows: given an OOV, the algorithm must decide whether the OOV is correct or ill-formed and, in the latter case, it must propose an in-vocabulary (IV) word found in a lexical resource to restore the incorrect OOV.

There has been little work on lexical normalisation of short messages. So far, the most successful strategy to normalise English tweets is described in (Han and Baldwin, 2012b; Han and Baldwin, 2013). They propose merging two different strategies: normalisation dictionary lookup and selection of the best in-vocabulary (IV) candidate.

The first strategy simply consists in looking up a normalisation dictionary, which contains specific abbreviations and other types of lexical variants found in the Twitter language. Each lexical variant is associated with its standard form, for instance gl → girlfriend. The dictionary lookup method achieves very high precision, but with low recall. As recall relies on the size of the dictionary, (Han and Baldwin, 2012a) propose to build wide-coverage normalisation dictionaries in an automatic way, by considering that lexical variants occur in similar contexts to their standard forms. Such a normalisation dictionary should only contain unambiguous "variant-standard" pairs. Ambiguous variants will be tackled using the second strategy.

The second strategy is applied when the OOV is a lexical variant that has not been found in the normalisation dictionary. It consists of the following two tasks:

• Generation of IV candidates (standard forms) for each particular OOV (lexical variant).
• Selection of the best IV candidate.

The objective of the first task is to build, for each OOV, a list of standard forms derived from the OOV using different processes, for instance reduction of character repetitions (e.g., carrrr → car), or generation of those IV words whose edit distance with regard to the target OOV is within a given threshold.
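For illustration only, the following Python sketch shows what these two generation processes might look like; it is not the implementation of the system described in this paper, and the function names (reduce_repetitions, candidates_within_threshold) are ours.

import re


def reduce_repetitions(token):
    # Generate reduced variants of a token with repeated characters:
    # "carrrr" yields {"car", "carr"}; the vocabulary then decides which
    # of them, if any, is a valid IV candidate.
    single = re.sub(r'(.)\1+', r'\1', token)
    double = re.sub(r'(.)\1{2,}', r'\1\1', token)
    return {single, double} - {token}


def edit_distance(a, b):
    # Standard Levenshtein distance computed by dynamic programming.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]


def candidates_within_threshold(oov, vocabulary, threshold=1):
    # Linear scan for illustration: return the IV words whose edit distance
    # to the OOV does not exceed the given threshold.
    return [w for w in vocabulary if edit_distance(oov, w) <= threshold]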
The second task consists in selecting the best candidate from the list generated in the previous step. Two different selection methods can be used: string similarity and context inference. To compute string similarity between the OOV and the different IV candidates, several measures and strategies can be used: lexical edit distance, phonemic edit distance, the longest common subsequence, affix substrings, and so on. For context inference, the IV candidates of a given OOV can be ranked and then filtered on the basis of their local contexts, which are compared against a language model. The main problem of this method is that the local context of an OOV is often constituted by other incorrect lexical variants that are not found in the language model.

These two selection methods (string similarity and context inference) are complementary and can therefore be used together to select the best candidate.

There are at least two significant differences between the task evaluated in (Han and Baldwin, 2013) and that proposed at the Tweet Normalization Workshop at SEPLN 2013. On the one hand, the task in (Han and Baldwin, 2013) relies on the basic assumption that lexical variants have already been identified. This means that only ill-formed OOV are taken as input of the selection process. By contrast, the task defined by the Workshop guidelines includes the detection of ill-formed OOV. On the other hand, in (Han and Baldwin, 2013) one-to-several correspondences, for instance imo → in my opinion, are not considered. At the Workshop, by contrast, it is required to search for one-to-several correspondences, since the IV standard forms used to correct an OOV can be multiword expressions.

In sum, the task defined at the Tweet Normalization Workshop is more complex than that described in (Han and Baldwin, 2013). Finally, there are other approaches to SMS and tweet normalisation based on very different strategies. For instance, (Beaufort et al., 2010) and (Kaufmann and Kalita, 2010) make use of the Statistical Machine Translation framework, as well as of the noisy channel model, which is very common in speech processing. The main problem of these approaches is that they rely on large quantities of labelled training data, which are not available for microblogs.

2 The method

The normalisation method we propose combines the main strategies and tasks described in (Han and Baldwin, 2013), namely normalisation dictionary lookup, generation of IV candidates, and selection of the best IV candidate with context information. In addition, given the conditions of the Workshop, we also include ill-formed OOV detection in our algorithm.

The design of our algorithm was motivated by the conclusions we drew from the analysis of the development corpus. We observed that the most frequent types of incorrect Spanish OOV are the following: (1) uppercase/lowercase confusion: patri → Patri; (2) character repetition for emphasis: Buuenoo → Bueno; (3) language-dependent spelling problems, namely, for Spanish, missing accents and letter confusion (v/b, g/j, ll/y, h/∅, etc.).

These three types of errors can be solved using simple specific rules. For the remaining phenomena, which correspond to more heterogeneous problems, we make use of generic strategies such as those described in the previous section: dictionary lookup and selection of the best IV candidate. For the detection of correct/incorrect OOV, we use the following method: if no IV associated with an OOV is found using the specific rules or the generic strategies, then the OOV is considered correct; otherwise, it is taken as an ill-formed OOV. Text is lemmatised and PoS tagged using FreeLing (Padró and Stanilovsky, 2012).

Our method contains two modules: a set of lexical resources and an algorithm to detect and correct ill-formed OOV.
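The detection criterion can be summarised by the minimal Python sketch below. It is our own illustration rather than the authors' code; the lookup, generate_candidates and select_candidate callables stand for the processes defined in the next sections.

def is_ill_formed(oov, lookup, generate_candidates, select_candidate):
    # lookup(oov): True if the OOV itself is found in one of the lexical resources
    # generate_candidates(oov): IV candidates proposed by rules or generic strategies
    # select_candidate(candidates): the best IV candidate, or None if there is none
    if lookup(oov):
        return False   # a known form: the OOV is considered correct
    return select_candidate(generate_candidates(oov)) is not None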
2.1 Lexical resources

Our system makes use of three different lexical resources:

ND Normalisation dictionary, containing incorrect lexical variants and their standard forms.
SD Standard dictionary, a list of correct forms generated from the lemmas found in the Real Academia Española dictionary (DRAE).
PND Proper names dictionary, containing proper names extracted from the Spanish Wikipedia.

In the following, we describe how these three dictionaries have been built.

2.1.1 Normalisation Dictionary (ND)
It was mainly built using the development data distributed by the organisers of the Tweet Normalization Workshop at SEPLN 2013. We also used as sources of data the list of emoticons accessible from http://en.wikipedia.org/wiki/List_of_emoticons, as well as the list of Spanish abbreviations released at http://www.rae.es/dpd/apendices/apendice2.html. Our final normalisation dictionary contains 824 entries.

2.1.2 Standard Dictionary (SD)
The standard dictionary is constituted by all the forms automatically generated from the lemmas found in the DRAE. These lemmas have been extracted and freely distributed by the project http://olea.org/proyectos/lemarios. Verb forms were generated with the Cilenis verb conjugator (Gamallo et al., 2013), whereas we used specific morphological rules to generate noun and adjective forms. The final dictionary consists of 778,149 forms, which is significantly larger than that provided by the last version of FreeLing (556,509 Spanish forms in FreeLing 3.0).

2.1.3 Proper Names Dictionary (PND)
To make the detection of correct OOV easier (for instance, proper names and domain-specific terms that are not in a standard vocabulary), it is useful to rely on a large list of OOV extracted from an encyclopaedic resource, for instance Wikipedia. Several PND were automatically extracted. Finally, the PND allowing the best performance in the normalisation task was extracted as follows. First, using CorpusPedia (Gamallo and González, 2010), a simplified format derived from the original downloadable XML file (Wikipedia dump of May 2011), the names of articles belonging to categories related to persons, locations, and organisations were identified, using the strategy described in (Gamallo and Garcia, 2011). Then, these names were tokenized, and those unigrams whose lowercase variants are found in the standard dictionary (SD) were filtered out. The result is a list of 107,980 unigrams taking part in the names of persons, locations, and organisations.
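As an illustration, these resources can be represented in memory as a variant-to-standard map (ND) and two sets of forms (SD and PND). The sketch below is ours; the file names and the tab-separated format are assumptions made for illustration, not a description of the released resources.

def load_normalisation_dictionary(path):
    # One "variant<TAB>standard" pair per line; the standard form may be a multiword.
    nd = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            variant, standard = line.rstrip("\n").split("\t", 1)
            nd[variant] = standard
    return nd


def load_word_list(path):
    # One form per line (the same loader can be used for both SD and PND).
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


# Example usage (hypothetical file names):
#   nd = load_normalisation_dictionary("nd.tsv")   # 824 variant-standard pairs
#   sd = load_word_list("sd.txt")                  # 778,149 standard forms
#   pnd = load_word_list("pnd.txt")                # 107,980 proper-name unigrams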
2.2 The algorithm

The system takes a list of OOV as input. An OOV is considered correct if the Dictionary Lookup process is true. Dictionary Lookup is a process that consists in searching for a token in one of the three lexical dictionaries: ND, SD, or PND. If the OOV is found in one of them, then it is considered correct. However, even if Dictionary Lookup is false, the OOV will be considered correct if Affix Check is true. Affix Check is a process that extracts regular suffixes and prefixes from the OOV and verifies whether the stem of the OOV is part of an entry found in one of the three dictionaries. Otherwise, the OOV can be incorrect.

Given an incorrect OOV, we generate a list of variants. A variant of an OOV is an IV candidate if either Dictionary Lookup or Affix Check is true. We distinguish between primary and secondary variants.

2.2.1 Generation of primary variants
Primary variants of an OOV are its most likely IV candidates, according to the types of errors we found in the development corpus. Primary variants are favoured in the process of candidate selection: if at least one primary variant is found, then the system does not consider secondary variants.

Primary variants of an OOV are those IV candidates derived from the OOV that only differ from the source OOV with regard to one of these linguistic phenomena: uppercase/lowercase confusion, character repetition, or frequent Spanish spelling errors. The frequent spelling errors include not only typical problems with accents and frequent letter confusions (v/b, j/g, etc.), but also some phonemic conventions, namely the use of "x" for "ch" (e.g. xicle → chicle). Primary variants generated by simplifying repetition include the cases of interjection reduction: jejeeje → je. For uppercase and lowercase variation, we take into account that words can be written with only lowercase letters, with capitalisation (proper names or first position in the sentence), or with only uppercase letters (e.g. acronyms). For instance, given the OOV "pedro", two other variants are generated: "Pedro" and "PEDRO". If one of them is found in the lexical resources, then it is considered a primary IV candidate. Let us note that a primary variant is considered an IV candidate if either Dictionary Lookup or Affix Check is true.

2.2.2 Generation of secondary variants
If no primary variant is found as an IV candidate, then a large list of secondary variants is generated using edit distance. In our experiments, we only generate those variants that have edit distance 1 with regard to the original OOV. Dictionary Lookup and Affix Check allow us to identify the list of secondary IV candidates. In the next step, we select the best candidate.

2.2.3 Candidate selection
To select the best IV candidate of a given OOV, we compare the local context of each candidate against a language model containing bigrams of tokens found within a window of size 4 (2 tokens to the left and 2 to the right of a given token). More precisely, for each candidate, a chi-square measure is computed by considering the observed frequencies in the local context against the expected frequencies in the language model. The language model was built by selecting lemmas of the following PoS categories: nouns, verbs, adjectives, prepositions, and adverbs. Text was processed with FreeLing. We also introduced an important restriction that takes into account whether the IV candidate is a primary or a secondary variant. A primary variant is always selected even if its chi-square score is 0, that is, even if it is not found in the language model. By contrast, for secondary variants, the chi-square score must be higher than 0 to be selected. Candidates are ranked considering chi-square values and the above restriction. The IV candidate at the top of the ranking is selected and given as the correction of the OOV. At the end, we apply the capitalisation rule, which considers the position of the original ill-formed OOV in the sentence: if it is the first word in the sentence, then the selected IV candidate must be written with its first letter in uppercase. Finally, if no IV candidate (primary or secondary variant) is selected, then the OOV is considered correct. So, correct OOV are detected in two different ways: either Dictionary Lookup or Affix Check is true for the original OOV, or no IV candidate is selected.
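The scoring and ranking step can be pictured with the Python sketch below. The exact chi-square formulation is not spelled out above, so this is only one plausible reading (a one-cell chi-square of each candidate-context bigram, summed over the window), and all names are ours; the restriction on primary versus secondary variants follows the description above.

def chi_square_cell(bigram_count, cand_count, word_count, total_bigrams):
    # One-cell chi-square approximation for a (candidate, context word) bigram:
    # (observed - expected)^2 / expected, where the expected count is derived
    # from the unigram counts of the language model under independence.
    if total_bigrams == 0:
        return 0.0
    expected = cand_count * word_count / total_bigrams
    if expected == 0.0:
        return 0.0
    return (bigram_count - expected) ** 2 / expected


def context_score(candidate, window, bigrams, unigrams, total_bigrams):
    # window: up to 2 tokens to the left and 2 to the right of the OOV.
    return sum(
        chi_square_cell(
            bigrams.get((candidate, w), 0) + bigrams.get((w, candidate), 0),
            unigrams.get(candidate, 0),
            unigrams.get(w, 0),
            total_bigrams)
        for w in window)


def select_best(primary, secondary, window, bigrams, unigrams, total_bigrams):
    # Primary variants are always eligible, even with a score of 0;
    # secondary variants need a score strictly higher than 0.
    if primary:
        return max(primary,
                   key=lambda c: context_score(c, window, bigrams, unigrams, total_bigrams))
    scored = [(context_score(c, window, bigrams, unigrams, total_bigrams), c)
              for c in secondary]
    scored = [(s, c) for (s, c) in scored if s > 0]
    return max(scored)[1] if scored else None   # None: the OOV is left as correct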
3 Experiments

Some experiments were performed using as test set the development corpus provided by the organisers of the Tweet Normalization Workshop. This corpus contains 500 tweets and 651 manually corrected OOV. The language model used by our system was built from two text sources: the collection of 227,255 tweets provided by the Workshop, which were captured between April 1st and 2nd of 2013, and a collection of news from El Pais and El Mundo captured via RSS crawling. In sum, the language model was created from 50MB of text. The normalisation dictionary contains annotated information from the sample corpus of 100 tweets provided by the Workshop. For the final tests, this dictionary also includes the annotated pairs of the development corpus.

Two versions of our system were tested, "Standard" and "Restricted", and compared against two baselines, "Baseline1" and "Baseline2". The standard version has been described in the previous section. The restricted version includes a constraint on short proper names and short acronyms (with fewer than 5 letters): the constraint prevents short proper names and acronyms from being expanded with secondary variants. For instance, if the OOV is "BBC", the system does not create IV candidates such as "BBV", "ABC", and so on. In Baseline1, we do not separate primary from secondary variants, and all IV candidates are treated as primary variants. Baseline2 does not separate primary from secondary variants either, and all IV candidates are treated as secondary variants.

Table 1 shows the results obtained from the experiments performed on the development set. The best performance is achieved with "Restricted", which is based on the algorithm that makes use of restrictions on short proper names. The low scores reached by the baseline systems clearly show that candidates must be separated at different levels to be treated in different ways. On the test set, "Restricted" achieved 66.3% accuracy, the second best score among the 13 participants in the Tweet-Norm competition.

Systems      pos   neg   accuracy
Baseline1    273   378   41.80
Baseline2    288   363   44.10
Standard     444   207   67.99
Restricted   451   200   69.06

Table 1: Results from the development set
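The constraint used by the restricted version can be expressed as in the short sketch below; the helper name and the exact capitalisation test are our assumptions, not the authors' code.

def allow_secondary_variants(oov):
    # "Restricted" version: short acronyms and short capitalised proper names
    # (fewer than 5 letters) are never expanded with secondary variants.
    looks_like_acronym = oov.isupper()
    looks_like_proper_name = oov[:1].isupper() and not oov.isupper()
    if len(oov) < 5 and (looks_like_acronym or looks_like_proper_name):
        return False   # e.g. "BBC" is not expanded to "BBV", "ABC", ...
    return True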
References

Beaufort, Richard, Sophie Roekhaut, Louise-Amélie Cougnon, and Cédrick Fairon. 2010. A hybrid rule/model-based finite-state framework for normalizing SMS messages. In 48th Annual Meeting of the Association for Computational Linguistics, pages 770–779, Uppsala, Sweden.

Gamallo, P., M. Garcia, I. González, M. Muñoz, and I. del Río. 2013. Learning verb inflection using Cilenis conjugators. Eurocall Review, 21(1):12–19.

Gamallo, Pablo and Marcos Garcia. 2011. A resource-based method for named entity extraction and classification. LNCS, 7026:610–623.

Gamallo, Pablo and Isaac González. 2010. Wikipedia as a multilingual source of comparable corpora. In LREC 2010 Workshop on Building and Using Comparable Corpora, pages 19–26, Valletta, Malta.

Han, B. and T. Baldwin. 2012a. Automatically constructing a normalisation dictionary for microblogs. In Conference on Empirical Methods in Natural Language Processing and Natural Language Learning (EMNLP-CoNLL 2012), Jeju, Korea.

Han, B. and T. Baldwin. 2012b. Lexical normalisation of short text messages: Makn sens a twitter. In 49th Annual Meeting of the Association for Computational Linguistics, pages 368–378, Portland, Oregon, USA.

Han, B. and T. Baldwin. 2013. Lexical normalisation of social media text. ACM Transactions on Intelligent Systems and Technology, 4(1):15–27.

Kaufmann, J. and J. Kalita. 2010. Syntactic normalization of twitter messages. In Conference on Natural Language Processing, Kharagpur, India.

Padró, Lluís and Evgeny Stanilovsky. 2012. FreeLing 3.0: Towards wider multilinguality. In Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey.