Weighted and unweighted transducers for tweet normalization

Mans Hulden
University of Helsinki
mans.hulden@helsinki.fi

Jerid Francom
Wake Forest University
francojc@wfu.edu

Resumen: En este artículo se presentan dos estrategias basadas en transductores de estados finitos para la normalización de tweets. La primera de ellas se basa en reglas de corrección creadas manualmente y diseñadas para capturar las erratas y abreviaturas utilizadas más comúnmente, mientras que la segunda intenta construir automáticamente un modelo de errores a partir de un corpus etiquetado (gold standard) de tweets previamente normalizados.
Palabras clave: Normalización de tweets, transductores de estados finitos, reglas fonológicas, canal con ruido.

Abstract: We present two simple finite-state transducer based strategies for tweet normalization. One relies on hand-written correction rules designed to capture commonly occurring misspellings and abbreviations, while the other tries to automatically induce an error model from a gold standard corpus of normalized tweets.
Keywords: Tweet normalization, finite-state transducers, phonological rules, noisy channel

1 Introduction

We present two different finite-state transducer (FST) based strategies for the correction and normalization of tweets. While both methods use FSTs in the recovery of normalized forms of words, the two employ different strategies for normalization. The first method is entirely rule-based; that is, a linguist provides rules for the correction and normalization of tweets based on observations of commonly recurring patterns of abbreviations or misspellings. This is accomplished by compiling correction rules into unweighted FSTs in a hierarchical pattern where simple corrections are attempted before more complex ones. The second method attempts to induce the normalization patterns directly from development data by assuming a noisy channel model. Here, an error model of character-to-character transformations is learned from the data and, combined with a language model also represented as a transducer, the most probable correction can be calculated.

2 Tools

We have employed two different toolkits in our correction strategy. The unweighted transducers for the manually designed correction rules were compiled using the foma toolkit¹ (Hulden, 2009). For manipulating the weighted transducers employed in the noisy channel model approach, the Kleene² (Beesley, 2009) finite-state programming language was used. Both approaches rely on a dictionary taken from the FreeLing suite (Carreras et al., 2004).

¹ http://foma.googlecode.com
² http://kleene-lang.org

3 Correction rules as FSTs

The goal of the handwritten FST correction rules is to attempt a battery of corrections on unknown input words in order to transform them into normalized forms found in a dictionary. The corrections are expressed as contextually conditioned phonological replacement rules, converted into transducers, and combined in a hierarchical fashion in conjunction with a dictionary. They reflect phenomena such as:

• diacritic restoration (a → á, ny → ñ, ...)
• commonly occurring spelling errors (v → b, d → ∅, ...)
• repetition removal (aaa... → a)
• common abbreviations/errors (ke → qué/que, tqm/tkm → te quiero mucho)

The correction rules themselves range from the very specific to the highly generic. For example, a simple rule that attempts to restore missing d-characters in past participles looks as follows in foma notation:

define RULEPart [..] (->) d || [a|i] _ o (s) .#. ;

reflecting the idea that d-characters are to be inserted when preceded by a or i and followed by o, and optionally s, at the end of a word.

At the same time, a very specific rule that only addresses the high-frequency word form Twitter and its various misspellings and alternative forms appears as

define RuleTW1 [[t|T]+ [w|u]+ Vow+ t+ Vow+/h (r)]: {Twitter};

in effect accepting forms such as Tuiter, tuiitah, twittr, etc., and mapping all of these to Twitter.

In total, about 20 rules were designed. Some rules count as single rules in the combination hierarchy: for example, the diacritic restoration rules (a → á, e → é, etc.) are considered a single rule for the purposes of combination.

The rules could also be contextually conditioned outside the current word, which may be useful for performing corrections based on syntactic context. However, for the experiments conducted here, all rules refer only to contexts within the same word.

The rules are combined in a hierarchical order using a type of if-then-else logic. That is, we can, for example, first apply corrections to (1) known abbreviations. If that fails to produce a legitimate word in the dictionary, we (2) apply diacritic restoration. If that fails, we apply some common error pattern (3). Should that fail, we apply both (2) and (3), and so on. This type of hierarchical replacement strategy can be implemented through standard regular expression operators available in conjunction with a dictionary encoded as an automaton (Beesley and Karttunen, 2003).
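To make the if-then-else hierarchy concrete, here is a minimal procedural sketch in Python of the same priority logic. It is only an illustration under assumed data, not the foma implementation used here: the dictionary, the three toy correction steps, and all function names are hypothetical, and rule combinations are only tried up to depth two.

import itertools
import re

DICTIONARY = {"qué", "que", "adiós", "pasado", "quiero", "casa", "cansado"}

def expand_abbreviations(word):
    # known abbreviations (ke -> qué/que, tqm -> te quiero mucho, ...)
    table = {"ke": ["qué", "que"], "tqm": ["te quiero mucho"]}
    yield from table.get(word, [])

def restore_diacritics(word):
    # diacritic restoration (a -> á, e -> é, ...): propose every form
    # obtainable by accenting a single vowel
    accents = {"a": "á", "e": "é", "i": "í", "o": "ó", "u": "ú"}
    for i, ch in enumerate(word):
        if ch in accents:
            yield word[:i] + accents[ch] + word[i + 1:]

def remove_repetitions(word):
    # repetition removal (aaa... -> a)
    yield re.sub(r"(.)\1+", r"\1", word)

# rules ordered from cheap to more drastic, as in the rule hierarchy
RULES = [expand_abbreviations, restore_diacritics, remove_repetitions]

def correct(word):
    if word in DICTIONARY:
        return word
    # try single rules first, then sequences of two different rules
    for depth in (1, 2):
        for combo in itertools.permutations(RULES, depth):
            candidates = [word]
            for rule in combo:
                candidates = [c for w in candidates for c in rule(w)]
            for cand in candidates:
                if cand in DICTIONARY:
                    return cand
    return word   # no correction found: leave the word unchanged

print(correct("ke"))         # -> qué
print(correct("cansaado"))   # -> cansado

The point mirrored from the FST setup is the priority ordering: a candidate produced by an earlier, simpler step is accepted as soon as it is found in the dictionary, before more drastic combinations of steps are attempted.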
4 Noisy channel FST modeling

The fundamental idea behind noisy channel modeling is to choose the w that maximizes the probability P(w|s), where s is the "scrambled" word and w the normalized, intended form. This quantity is proportional to P(s|w)P(w), where the first factor is called the error model (the probability that a word w was perturbed to yield s) and the second factor the language model (the probability that w occurs).

Weighted FSTs provide a convenient calculus for addressing the above task, assuming we can somehow encode the probabilities of words P(w) and of perturbations P(s|w) as probabilistic or weighted transducers. When the error model (EM), the language model (LM), and a scrambled input word s are available to us as weighted transducers, we may calculate the composition

    W = s ◦ EM ◦ LM                                      (1)

and subsequently find the least-cost string in the output projection of W. Here, the transducer s is simply an acceptor that repeats the to-be-corrected input word with probability 1 (or cost 0, as we are operating with negative log probabilities, or weights). Since transducer composition, projection, and least-cost paths are easily calculated, what remains to be done is to construct the transducers that assign a cost (probability) to character perturbations (EM) and to words (LM).
4.1 Inducing an error model

To assign probabilities to perturbations, we first align the individual characters of word pairs in the development corpus using a Chinese Restaurant Process (CRP) (Ferguson, 1973) aligner. This is very similar to aligning the word pairs by minimum edit distance (MED). After the alignment is performed, we work under the assumption that the character-to-character correspondences are all independent, that is, not conditioned upon other transformations in the same word or elsewhere, and estimate the probability of each input symbol x being transformed into y as:

    p(x → y) = (c(x:y) + α) / (N + |Γ|α)                 (2)

using the smoothing parameter α (set to 0.1) to avoid zero counts, N being the number of observed character pairs, and Γ our alphabet of possible symbol pairs. Note that zeroes, representing deletions and insertions, are also possible.

For example, our aligner outputs alignments such as:

    pasa_o    t__o    Tambien    Adioos
    pasado    todo    También   Adió_s

From this, we may calculate the probabilities of inserting an o (_:o), inserting a d (_:d), copying a p (p:p), etc., and construct a transducer representing the error model (EM) that performs such perturbations with the intended probabilities. Note that while we assume a context-independent probability for changes, there is no principled reason why complex contexts could not be accommodated as well.³

³ For example, inserting a d to yield a past participle, as in pasao → pasado, should reasonably be more likely than inserting a d in other contexts.
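As an illustration of equation (2), the following Python sketch (assumed data and names only) collects character-pair counts from alignments in the format shown above, with _ marking an epsilon, and applies the add-α smoothing with α = 0.1. In the actual system the resulting probabilities are encoded as weights on the EM transducer; here they are simply returned by a function, and Γ is restricted to the observed pairs.

import math
from collections import Counter

ALPHA = 0.1   # smoothing parameter alpha, as in the text

# (scrambled, normalized) character alignments of equal length;
# '_' marks an epsilon (insertion/deletion), as in the examples above
aligned_pairs = [
    ("pasa_o", "pasado"),
    ("t__o", "todo"),
    ("Tambien", "También"),
    ("Adioos", "Adió_s"),
]

# c(x:y): counts of observed character-to-character correspondences
counts = Counter()
for scrambled, normalized in aligned_pairs:
    assert len(scrambled) == len(normalized)
    for x, y in zip(scrambled, normalized):
        counts[(x, y)] += 1

N = sum(counts.values())   # N: number of observed character pairs
gamma = set(counts)        # stand-in for Gamma; only the observed
                           # symbol pairs, not the full alphabet

def p(x, y):
    # p(x -> y) = (c(x:y) + alpha) / (N + |Gamma| * alpha), equation (2)
    return (counts[(x, y)] + ALPHA) / (N + len(gamma) * ALPHA)

def weight(x, y):
    # negative log probability, i.e. the weight carried by the EM transducer
    return -math.log(p(x, y))

print(p("_", "d"))   # probability of inserting a d
print(p("e", "é"))   # probability of restoring an accent on e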
4.2 Calculating corrections

For the language model, LM, we have chosen to rely mainly on simple unigram frequencies for the words in the FreeLing lexicon, where the frequencies were collected from a Wikipedia dump. Such a model is particularly simple to produce a transducer from but, again, there is no principled reason why more complex models could not be used.

Assuming the existence of transducers labeled $word, representing the current word to be corrected, an error model $EM, and a language model $LM, as described above, the correction can be described as a function in the Kleene programming language as follows:

$^correct($word) {
    return $^shortestPath(
        $^lowerside($word _o_ $EM _o_ $LM)
    );
}

This reflects exactly the calculation given in (1), and the entire function is equivalent to finding argmax_w P(s|w) P(w) in the noisy channel model presented above. That is, we compose the input word with the error model and the language model, extract the output projection, and find the least-cost (most probable) path.

When this strategy is employed, we also need to establish a cutoff probability to decide when to simply accept a word as is. Obviously, the error model allows any word to change into any other word, albeit with a low probability, and we do not want to allow extreme changes just to produce a word in the dictionary, but would rather accept the unknown word without change.

Apart from calculating corrections based on a language model and an error model, we can also incorporate previously "known" corrections collected from a gold-standard set of corrected words, if such a resource is available. If we assume access to a set of already known word-word pairs that represent corrected input words, we can build another transducer that directly maps such cases to their correct forms at a low cost. This can be helpful because, for example, abbreviations such as tqm → te quiero mucho would likely incur a high cost under the character-to-character noisy channel corrector because of the large number of character changes required to produce the target form. However, this can be addressed by building a transducer EX that models such previously known word-word pairs and assigns these corrections a low cost, so that the entire calculation can be performed as:

    W = s ◦ ((EM ◦ LM) ∪ EX)                             (3)

This then models both a lookup from a "known" list of cheap corrections and the noisy-channel error-model and language-model corrections.
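Procedurally, the effect of equations (1) and (3) can be pictured with a brute-force search in place of weighted-transducer composition and shortest paths: for every dictionary candidate, add its best character-alignment cost under the error model to its language-model cost, consult a table of known corrections at a fixed low cost, and apply a cutoff so that extreme corrections are rejected. The Python sketch below does this with invented toy weights; all names, costs, and the cutoff value are illustrative, and the actual system performs the computation with weighted transducers in Kleene.

import math

# toy error model: -log p(x -> y); '_' is epsilon. In practice these
# weights would come from the smoothed estimates of equation (2).
def em_cost(x, y):
    if x == y:
        return 0.1            # keeping a character is cheap
    return 3.0                # any substitution, insertion or deletion

# toy unigram language model: -log P(w) for a few dictionary words
LM_COST = {"casa": 2.0, "cansado": 6.0, "pasado": 5.0, "te quiero mucho": 8.0}

# previously known corrections (the EX transducer), at a fixed low cost
KNOWN = {"tqm": "te quiero mucho", "tkm": "te quiero mucho"}
KNOWN_COST = 1.0

CUTOFF = 15.0                 # reject corrections costlier than this

def alignment_cost(s, w):
    # minimum cost of rewriting s into w under the error model
    # (standard edit-distance dynamic programming with EM weights)
    n, m = len(s), len(w)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + em_cost(s[i - 1], "_")        # deletion
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + em_cost("_", w[j - 1])        # insertion
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + em_cost(s[i - 1], w[j - 1]),  # keep/substitute
                d[i - 1][j] + em_cost(s[i - 1], "_"),           # delete
                d[i][j - 1] + em_cost("_", w[j - 1]),           # insert
            )
    return d[n][m]

def correct(s):
    candidates = []
    if s in KNOWN:                          # the EX branch of equation (3)
        candidates.append((KNOWN_COST, KNOWN[s]))
    for w, lm_cost in LM_COST.items():      # the EM o LM branch
        candidates.append((alignment_cost(s, w) + lm_cost, w))
    cost, best = min(candidates)
    return best if cost <= CUTOFF else s    # cutoff: accept the word as is

print(correct("cansao"))   # -> casa with these toy weights
print(correct("tqm"))      # -> te quiero mucho, via the known corrections

With these toy weights the sketch also reproduces the cansao → casa behaviour discussed in Section 6: the low unigram cost of casa outweighs the two deletions needed to reach it.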
5 Evaluation

To evaluate the performance of the methods, we divided the gold standard corpus into five parts in order to perform cross-validation. This is necessary in particular for the noisy channel model, where the learning data must necessarily come from a subset of the gold corpus of normalized tweets.

The results are given in Tables 1 and 2. As can be seen, the manual correction rules slightly outperform the corrections produced by the weighted transducer method.

    System           Accuracy
    Rules            63.1
    Noisy Channel    61.4

Table 1: Results from the development set.

    System           Accuracy
    Rules            60.4

Table 2: Results from the test set.

6 Discussion & Future Work

We have compared and evaluated two different approaches to tweet normalization, one being more of an expert system developed on the basis of observation of tweets and the other a statistical one based on inducing correction rules from a gold standard corpus. As it stands, the noisy channel corrector suffers from the difficulty of constructing a reliable error model from the limited number of gold-standard corrections and from the fact that the unigram language model is relatively poor. This sometimes leads to drastically incorrect conclusions. For example, the corrector would produce from the input cansao the output casa and not the desired cansado, since in a unigram context the high likelihood of casa outweighs the cost of having to delete two characters. Such errors are frequent and could probably be addressed both by a richer language model and by a more refined error model.

Apart from obvious improvements to the language model, there is also potential for combining the hand-written and data-driven approaches to yield a hybrid system.

One perhaps profitable avenue for further experimentation is to take the contextual rules developed by the linguist and learn probabilities for those rules. For example, under the current noisy channel model, the perturbation ∅ → u (insert u) is learned without reference to context and is deemed equally likely regardless of context. However, the hand-written rule allows that transformation primarily following a q, reflecting errors such as *qiero and *aqi.

Quick improvements also appear possible by investing in a good dictionary of named entities, as tweets appear to contain proper names with a much higher frequency than running text in general. For the current work, entities such as movie names, companies, celebrities, etc. were not used in the development. Automatic induction of such a database from large numbers of tweets should also be considered.

References

Beesley, Kenneth R. 2009. The Kleene language for weighted finite-state programming. In Finite-State Methods and Natural Language Processing: Post-proceedings of the 7th International Workshop FSMNLP, edited by Jakub Piskorski, Bruce Watson and Anssi Yli-Jyrä, volume 191, page 27. IOS Press.

Beesley, Kenneth R. and Lauri Karttunen. 2003. Finite State Morphology. CSLI Publications, Stanford, CA.

Carreras, Xavier, Isaac Chao, Lluís Padró, and Muntsa Padró. 2004. FreeLing: An open-source suite of language analyzers. In LREC Proceedings.

Ferguson, T. S. 1973. A Bayesian analysis of some nonparametric problems. Ann. Stat., 1:209–230.

Hulden, Mans. 2009. Foma: a finite-state compiler and library. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 29–32. Association for Computational Linguistics.