=Paper= {{Paper |id=Vol-1086/10-paper |storemode=property |title=Word Normalization in Twitter Using Finite-state Transducers |pdfUrl=https://ceur-ws.org/Vol-1086/paper10.pdf |volume=Vol-1086 |dblpUrl=https://dblp.org/rec/conf/sepln/PortaS13 }} ==Word Normalization in Twitter Using Finite-state Transducers== https://ceur-ws.org/Vol-1086/paper10.pdf
    Word normalization in Twitter using finite-state transducers


                          Jordi Porta and José Luis Sancho
                     Centro de Estudios de la Real Academia Española
                            c/ Serrano 197-198, Madrid 28002
                                 {porta,sancho}@rae.es

      Abstract: This paper presents a linguistic approach based on weighted finite-state
      transducers for the lexical normalisation of Spanish Twitter messages. The system
      developed consists of transducers that are applied to out-of-vocabulary tokens.
      The transducers implement linguistic models of variation that generate sets of
      candidates according to a lexicon. A statistical language model is used to obtain
      the most probable sequence of words. The article includes a description of the
      components and an evaluation of the system and some of its parameters.
      Keywords: Tweet messages, lexical normalisation, finite-state transducers,
      statistical language models.

1 Introduction

Text messaging (or texting) exhibits a considerable degree of departure from the writing norm, including spelling. There are many reasons for this deviation: the informality of the communication style, the characteristics of the input devices, etc. Although many people consider that these communication channels are "deteriorating" or even "destroying" languages, many scholars claim that even in these channels communication obeys maxims and that spelling is also principled. Moreover, it seems that, in general, the processes underlying variation are not new to languages. Under these considerations, the modelling of spelling variation, and also its normalisation, can be addressed. Normalisation of text messaging is seen as a necessary preprocessing task before applying other natural language processing tools designed for standard language varieties.

Few works dealing with Spanish text messaging can be found in the literature. To the best of our knowledge, the most relevant and recent published works are Mosquera and Moreda (2012), Pinto et al. (2012), Gomez Hidalgo, Caurcel Díaz, and Íñiguez del Rio (2013) and Oliva et al. (2013).

2 Architecture and components of the system

The system has four main components that are applied sequentially: an analyser performing tokenisation and lexical analysis on standard word forms and on other expressions like numbers, dates, etc.; a component generating word candidates for out-of-vocabulary (OOV) tokens; a statistical language model used to obtain the most likely sequence of words; and finally, a truecaser giving proper capitalisation to common words assigned to OOV tokens.

Freeling (Atserias et al., 2006), with a special configuration designed for this task, is used to tokenise the message and identify, among other tokens, standard word forms. The generation of candidates, i.e., the confusion set of an OOV token, is performed by components inspired by modules used to analyse words found in historical texts, where other kinds of spelling variation can be found (Porta, Sancho, and Gómez, 2013). The approach to historical variation was based on weighted finite-state transducers over the tropical semiring implementing linguistically motivated models. Some experiments were conducted in order to assess the task of assigning to old word forms their corresponding modern lemmas. For each old word, lemmas were assigned via the possible modern forms predicted by the model. Results were comparable to those obtained with the Levenshtein distance (Levenshtein, 1966) in terms of recall, but were better in terms of accuracy, precision and F. As for old words, the confusion set of an OOV token is generated by applying the shortest-paths algorithm to the following expression:

                  W ◦ E ◦ L

where W is the automaton representing the OOV token, E is an edit transducer generating possible variations on tokens, and L is the set of target words. The composition of these three modules is performed using an on-line implementation of the efficient three-way composition algorithm of Allauzen and Mohri (2008).

3 Resources employed

In this section, the resources employed by the components of the system are described: the edit transducers, the lexical resources and the language model.

3.1 Edit transducers

We follow the classification of Crystal (2008) for texting features that are also present in Twitter messages. In order to deal with these features, several transducers were developed. Transducers are expressed as regular expressions and context-dependent rewrite rules of the form α → β / γ _ δ (Chomsky and Halle, 1968) that are compiled into weighted finite-state transducers using the OpenGrm Thrax tools (Tai, Skut, and Sproat, 2011).

3.1.1 Logograms and pictograms

Some letters are used as logograms with a phonetic value. They are dealt with by optional rewrites altering the orthographic form of tokens:

ReplaceLogograms = (x (→) por ) ◦
 (2 (→) dos) ◦ (@ (→) a|o) ◦ . . .

Laughs, which are very frequent, are also considered logograms, since they represent sounds associated with actions. The multiple ways they are realised, including plurals, are easily described with regular expressions.

Pictograms like emoticons entered by means of ready-to-use icons in input devices are not treated by our system, since they are not textual representations. However, textual representations of emoticons like :DDD or xDDDDDD are recognised by regular expressions and mapped to their canonical form by means of simple transducers.

3.1.2 Initialisms, shortenings, and letter omissions

The string operations for initialisms (or acronymisation) and shortenings are difficult to model without incurring an overgeneration of candidates. For this reason, only common initialisms, e.g., sq (es que), tk (te quiero) or sa (se ha), and common shortenings, e.g., exam (examen) or nas (buenas), are listed.

For the omission of letters, several transducers are implemented. The simplest and most conservative one is a transducer introducing just one letter at any position of the token string. Consonantal writing is a special case of letter omission. This kind of writing relies on the assumption that consonants carry much more information than vowels do, which in fact is the norm in some languages, such as the Semitic languages. Some rewrite rules are applied to OOV tokens in order to restore vowels:

InsertVowels = invert(RemoveVowels)
RemoveVowels = Vowels (→) ε

3.1.3 Standard non-standard spellings

We consider non-standard spellings standard when they are widely used. These include spellings representing regional or informal speech, or choices sometimes conditioned by input devices, such as non-accented writing. Accents and tildes are restored using a cascade of optional rewrite rules like the following:

RestoreAccents = (n|ni|ny|nh (→) ñ) ◦
 (a (→) á) ◦ (e (→) é) ◦ . . .

Words containing k instead of c or qu, which appear frequently in protest writing, are also standardised with simple transducers. Some other changes are made to endings in order to recover the standard form. There are complete paradigms like the following, which relates non-standard to standard endings:

                    -a    → -ada
                    -as   → -adas
                    -ao   → -ado
                    -aos  → -ados
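As a rough illustration of how these candidate generators behave, the optional rewrites and the vowel restoration above can be sketched in plain Python. The rule table and lexicons below are toy assumptions, not the system's actual Thrax grammars, and enumerated string sets stand in for the finite-state composition W ◦ E ◦ L:

```python
from itertools import product

# Toy rule table (illustrative assumptions, not the authors' rule set):
# each character may optionally be rewritten to one of its expansions.
LOGOGRAMS = {"x": ("por",), "q": ("que",), "2": ("dos",), "@": ("a", "o")}
VOWELS = "aeiou"

def apply_optional(rewrites, token):
    """Optionally rewrite each character, enumerating every combination
    (a finite approximation of an optional rewrite transducer)."""
    candidates = {""}
    for ch in token:
        options = (ch,) + rewrites.get(ch, ())
        candidates = {prefix + opt for prefix in candidates for opt in options}
    return candidates

def confusion_set(token, lexicon):
    """W ◦ E ◦ L in miniature: variants of the token produced by the edit
    rules, restricted to words of the target lexicon."""
    return {cand for cand in apply_optional(LOGOGRAMS, token) if cand in lexicon}

def restore_vowels(token, lexicon):
    """invert(RemoveVowels): re-insert at most one vowel in each gap of a
    consonantal spelling and keep only the results that are lexicon words."""
    gaps = len(token) + 1
    out = set()
    for choice in product(("",) + tuple(VOWELS), repeat=gaps):
        cand = "".join(v + c for v, c in zip(choice, list(token) + [""]))
        if cand in lexicon:
            out.add(cand)
    return out
```

For instance, `confusion_set("xq", {"porque"})` yields `{"porque"}`, and `restore_vowels("tngo", {"tengo"})` yields `{"tengo"}`; an in-lexicon token is always among its own candidates, since every rewrite is optional.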
We also consider phonetic writing as a kind of non-standard writing in which the phonetic form of a word is alphabetically and syllabically approximated. The transducers used for generating standard words from their phonetic and graphical variants are:

DephonetiseWriting =
 invert(PhonographemicVariation)

PhonographemicVariation =
 GraphemeToPhoneme ◦
 PhoneConflation ◦
 PhonemeToGrapheme ◦
 GraphemeVariation

In the previous definitions, PhoneConflation makes phonemes equivalent, as for example the IPA phonemes /ʎ/ and /ʝ/. Linguistic phenomena such as seseo and ceceo, in which several phonemes were conflated by the 16th century, still remain in spoken variants and are also reflected in texting. The GraphemeVariation transducer models, among others, the writing of ch as x, which could be due to the influence of other languages.

3.1.4 Juxtapositions

Spacing in texting is also non-standard. In the normalisation task, some OOV tokens are in fact juxtaposed words. The possible decompositions of a word into a sequence of possible words are given by shortest-paths(W ◦ SplitConjoinedWords ◦ L(␣L)+), where W is the word to be analysed, L(␣L)+ represents the valid sequences of words, and SplitConjoinedWords is a transducer introducing blanks (␣) between letters and optionally undoing possible fused vowels:

SplitConjoinedWords = invert(JoinWords)

JoinWords =
 (a␣a (→) a<1>) ◦ . . . ◦ (u␣u (→) u<1>) ◦
 (␣ (→) ε)

Note that in the previous definition, some rules are weighted with a unit cost <1>. These costs are used by the shortest-paths algorithm as a preference mechanism to select non-fused over fused sequences when both cases are possible.

3.1.5 Other transducers

Expressive lengthening, which consists in repeating a letter in order to convey emphasis, is dealt with by means of rules removing a varying number of consecutive occurrences of the same letter. An example of a rule dealing with repetitions of the letter a is a (→) ε / a _. A transducer is generated for the whole alphabet.

Because messages are keyboarded, some errors found in words are due to letter transpositions and confusions between adjacent letters in the same row of the keyboard. These changes are also implemented with a transducer.

Finally, a Levenshtein transducer with a maximum distance of one has also been implemented.

3.2 The lexicon

The lexicon for OOV token normalisation contains mainly Spanish standard words, proper names and some frequent English words. These constitute the set of target words. We used the DRAE (RAE, 2001) as the source for Spanish standard words in the lexicon. Besides inflected forms, we have added verbal forms with clitics attached and derivative forms not found as entries in the DRAE: -mente adverbs, appreciatives, etc. The list of proper names was compiled from many sources and contains first names, surnames, aliases, cities, country names, brands, organisations, etc. Special attention was paid to hypocorisms, i.e., shorter or diminutive forms of a given name, as well as nicknames or calling names, since communication in channels such as Twitter tends to be interpersonal (or between members of a group) and affective. A list of common hypocorisms is provided to the system. For English words, we selected the 100,000 most frequent words of the BNC (BNC, 2001).

3.3 Language model

We use a language model to decode the word graph and thus obtain the most probable word sequence. The model is estimated from a corpus of webpages compiled with Wacky (Baroni et al., 2009). The corpus contains about 11,200,000 tokens coming from about 21,000 URLs. We used as seeds the types found in the development set (about 2,500). Backed-off n-gram models, used as language models, are implemented with the OpenGrm NGram toolkit (Roark et al., 2012).
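The decoding step can be sketched as a Viterbi-style search through the per-token confusion sets of a message. The bigram counts and the floor smoothing below are toy assumptions standing in for the backed-off OpenGrm models:

```python
import math

# Hypothetical bigram counts standing in for the backed-off n-gram model.
COUNTS = {("<s>", "porque"): 5, ("<s>", "que"): 2, ("porque", "no"): 4}

def bigram_logp(prev, word):
    # Crude additive floor instead of real back-off smoothing.
    return math.log(COUNTS.get((prev, word), 0) + 0.1)

def decode(confusion_sets, logp, start="<s>"):
    """Most probable word sequence through the confusion sets of a message
    (the word-graph decoding of Section 3.3), by dynamic programming:
    for each position, keep the best-scoring path ending in each word."""
    beams = {start: (0.0, [])}  # last word -> (log score, sequence so far)
    for candidates in confusion_sets:
        new_beams = {}
        for prev, (score, seq) in beams.items():
            for word in candidates:
                s = score + logp(prev, word)
                if word not in new_beams or s > new_beams[word][0]:
                    new_beams[word] = (s, seq + [word])
        beams = new_beams
    return max(beams.values(), key=lambda item: item[0])[1]
```

With the toy counts above, `decode([{"porque", "que"}, {"no"}], bigram_logp)` returns `["porque", "no"]`: the path through porque dominates because both its transitions have been observed.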
3.4 Truecasing

The restoration of case information in badly-cased text has been addressed in (Lita et al., 2003) and has been included as part of the normalisation task. Part of this process, for proper names, is performed by the application of the language model to the word graph. Words at message-initial position are not always uppercased, since doing so yielded contradictory results after some experimentation. A simple heuristic is implemented to uppercase a normalisation candidate when the OOV token is also uppercased.

4 Settings and evaluation

In order to generate the confusion sets, we used two edit transducers applied in a cascade. If neither of the two is able to relate a token with a word, the token is assigned to itself.

The first transducer generates candidates according to the expansion of abbreviations, the identification of acronyms and pictograms, and words resulting from the following composition of edit transducers combining some of the features of texting:

    RemoveSymbols ◦
    LowerCase ◦
    Deaccent ◦
    RemoveReduplicates ◦
    ReplaceLogograms ◦
    StandardiseEndings ◦
    DephonetiseWriting ◦
    Reaccent ◦
    MixCase

The second edit transducer analyses tokens that did not receive analyses from the first editor. This second editor implements consonantal writing, typing-error recovery, approximate matching using a Levenshtein distance of one, and the splitting of juxtaposed words. In all cases, case, accents and reduplications are also considered. This second transducer makes use of an extended lexicon containing sequences of simple words.

Several experiments were conducted in order to evaluate some parameters of the system, in particular the effect of the order of the n-grams in the language model and the effect of generating confusion sets for OOV tokens only versus generating confusion sets for all tokens. For all the experiments we used the test set provided with the tokenisation delivered by Freeling.

In the first series of experiments, tokens identified as standard words by Freeling receive the same token as analysis, and OOV tokens are analysed with the system. Recall on OOV tokens is 89.40 %. Confusion set size follows a power-law distribution, with an average size of 5.48 for OOV tokens that goes down to 1.38 if we average over the rest of the tokens. Precision for 2- and 4-gram language models is 78.10 %, but the best result is obtained with 3-grams, with a precision of 78.25 %.

There is a number of non-standard forms that were wrongly recognised as in-vocabulary words because they clash with other standard words. In the second series of experiments, a confusion set is generated for every word in order to correct potentially wrong assignments. The average size of the confusion sets increases to 5.56 (we removed from this calculation the token mes de abril, which receives 308,017 different analyses due to the combination of multiple editions and segmentations). Precision for the 2-gram language model is 78.25 %, but 3- and 4-grams both reach a precision of 78.55 %.

From a quantitative point of view, it seems that slightly better results are obtained using a 3-gram language model and generating confusion sets not only for OOV tokens but for all the tokens in the message. In a qualitative evaluation of errors, several categories show up. The most populated categories are those having to do with case restoration and wrong decoding by the language model. Some errors are related to particularities of the DRAE, from which the lexicon was derived (dispertar or malaleche). Non-standard morphology is observed in tweets, as in derivatives (tranquileo or loquendera). Lack of abbreviation expansion is also observed (Hum). Faulty application of segmentation accounts for a few errors (mencantaba). Finally, some errors are not in our output but in the reference (Hojo).
5 Conclusions and future work

No attention has been paid to multilingualism, since the task explicitly excluded tweets from bilingual areas of Spain. However, given that quite a few Spanish speakers (both in Europe and America) are bilingual or live in bilingual areas, mechanisms should be provided to deal with languages other than English in order to make the system more robust.

We plan to build a corpus of lexically standard tweets via the Twitter streaming API to determine whether or not n-grams observed in a Twitter-only corpus improve decoding, as a side effect of the syntax also being non-standard.

Qualitative analysis of the results showed that there is room for improvement by experimenting with selective deactivation of items in the lexicon and further development of the segmenting module.

However, initialisms and shortenings are features of texting that are difficult to model without causing overgeneration. Acronyms like FYQ, which corresponds to the school subject Física y Química, are domain specific and difficult to foresee, and therefore to have listed in the resources.

References

Allauzen, Cyril and Mehryar Mohri. 2008. 3-way composition of weighted finite-state transducers. In Proc. of the 13th Int. Conf. on Implementation and Application of Automata (CIAA-2008), pages 262–273, San Francisco, California, USA.

Atserias, Jordi, Bernardino Casas, Elisabet Comelles, Meritxell González, Lluís Padró, and Muntsa Padró. 2006. FreeLing 1.3: Syntactic and semantic services in an open-source NLP library. In Proc. of the 5th Int. Conf. on Language Resources and Evaluation (LREC-2006), pages 48–55, Genoa, Italy, May.

Baroni, Marco, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.

BNC. 2001. The British National Corpus, version 2 (BNC World). Distributed by Oxford University Computing Services on behalf of the BNC Consortium. http://www.natcorp.ox.ac.uk.

Chomsky, Noam and Morris Halle. 1968. The Sound Pattern of English. Harper & Row, New York.

Crystal, David. 2008. Txtng: The Gr8 Db8. Oxford University Press.

Gomez Hidalgo, José María, Andrés Alfonso Caurcel Díaz, and Yovan Íñiguez del Rio. 2013. Un método de análisis de lenguaje tipo SMS para el castellano. Linguamática, 5(1):31–39, July.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8):707–710.

Lita, Lucian Vlad, Abe Ittycheriah, Salim Roukos, and Nanda Kambhatla. 2003. tRuEcasIng. In Proc. of the 41st Annual Meeting on ACL - Volume 1, ACL '03, pages 152–159, Stroudsburg, PA, USA.

Mosquera, Alejandro and Paloma Moreda. 2012. TENOR: A lexical normalisation tool for Spanish Web 2.0 texts. In Petr Sojka, Aleš Horák, Ivan Kopeček, and Karel Pala, editors, Text, Speech and Dialogue, volume 7499 of LNCS, pages 535–542. Springer.

Oliva, J., J. I. Serrano, M. D. Del Castillo, and Á. Iglesias. 2013. A SMS normalization system integrating multiple grammatical resources. Natural Language Engineering, 19(1):121–141.

Pinto, David, Darnes Vilariño Ayala, Yuridiana Alemán, Helena Gómez, Nahun Loya, and Héctor Jiménez-Salazar. 2012. The Soundex phonetic algorithm revisited for SMS text representation. In Petr Sojka, Aleš Horák, Ivan Kopeček, and Karel Pala, editors, Text, Speech and Dialogue, volume 7499 of LNCS, pages 47–55. Springer.

Porta, Jordi, José-Luis Sancho, and Javier Gómez. 2013. Edit transducers for spelling variation in Old Spanish. In Proc. of the Workshop on Computational Historical Linguistics at NODALIDA 2013, NEALT Proc. Series 18, pages 70–79, Oslo, Norway, May 22-24.

RAE. 2001. Diccionario de la lengua española. Espasa, Madrid, 22nd edition.

Roark, Brian, Richard Sproat, Cyril Allauzen, Michael Riley, Jeffrey Sorensen, and Terry Tai. 2012. The OpenGrm open-source finite-state grammar software libraries. In Proc. of the ACL 2012 System Demonstrations, pages 61–66, Jeju Island, Korea, July.

Tai, Terry, Wojciech Skut, and Richard Sproat. 2011. Thrax: An open source grammar compiler built on OpenFst. In ASRU 2011, Waikoloa Resort, Hawaii, December.