<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Weighted and unweighted transducers for tweet normalization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mans Hulden</string-name>
          <email>mans.hulden@helsinki</email>
          <email>mans.hulden@helsinki.</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jerid Francom</string-name>
          <email>francojc@wfu.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Helsinki</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Wake Forest University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present two simple nite-state transducer based strategies for tweet normalization. One relies on hand-written correction rules designed to capture commonly occurring misspellings and abbreviations, while the other tries to automatically induce an error model from a gold standard corpus of normalized tweets.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>We present two di erent nite-state
transducer (FST) based strategies for correction
and normalization of tweets. While both
methods use FSTs in the recovery of
normalized forms of words, the two employ di
erent strategies for normalization. The rst
method is entirely rule-based; that is, the
goal is that a linguist provides rules for
correction and normalization of tweets based on
observations of commonly recurring patterns
of abbreviations or misspellings. This is
accomplished by compiling correction rules into
unweighted FSTs in a hierarchical pattern
where simple corrections are attempted
before more complex ones. The second method
attempts to induce the normalization
patterns from development data directly by
assuming a noisy channel model. Here, an error
model of character-to-character
transformations is learned from the data and, combined
with a language model also represented as a
transducer, the most probable correction can
be calculated.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Tools</title>
      <p>
        We have employed two di erent toolkits in
our correction strategy. The unweighted
transducers for manually designed
correction rules were compiled using the
fomatoolkit1
        <xref ref-type="bibr" rid="ref5">(Hulden, 2009)</xref>
        . For manipulating
the weighted transducers employed in the
noisy channel model approach, the Kleene2
        <xref ref-type="bibr" rid="ref1">(Beesley, 2009)</xref>
        nite-state programming
language was used. Both approaches rely on
a dictionary taken from the FreeLing suite
        <xref ref-type="bibr" rid="ref3">(Carreras et al., 2004)</xref>
        .
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Correction rules as FSTs</title>
      <p>The goal with handwritten FST correction
rules is to attempt a battery of corrections
to unknown input words to transform them
into normalized forms found in a dictionary.
The corrections are expressed as contextually
conditioned phonological replacement rules,
converted into transducers, and combined in
a hierarchical fashion in conjunction with a
dictionary. They re ect phenomena such as:
1http://foma.googlecode.com
2http://kleene-lang.org
diacritic restoration (a ! a, ny ! n~, . . . )
commonly occurring spelling errors (v !
b, d ! ;, . . . )
repetition removal (aaa. . . ! a)
common abbreviations/errors (ke !
que/que, tqm/tkm ! te quiero mucho)
The correction rules themselves range
from the very speci c to highly generic. For
example, a simple rule to attempt to restore
missing d-characters in past participles looks
in the foma notation as follows:
define RULEPart [..] (-&gt;) d || [a|i] _ o (s) .#. ;
re ecting the idea that d-characters are to
be inserted when preceded by a or i, and
followed by o, and optionally s at the end of a
word.</p>
      <p>At the same time, a very speci c rule that
only addresses the high-frequency word-form
Twitter and its various misspellings and
alternative forms appears as
define RuleTW1 [[t|T]+ [w|u]+ Vow+ t+ Vow+/h (r)]:
{Twitter};
in e ect accepting such forms as Tuiter,
tuiitah, twittr, etc., and mapping all these to
Twitter.</p>
      <p>In total, about 20 rules were designed.
Some rules count as single rules in the
combination hierarchy: for example, diacritic
restoration rules (a ! a, e ! e, etc.) are
considered a single rule for the purposes of
combination.</p>
      <p>The rules themselves could be
contextually conditioned outside the current word as
well. This may be useful for performing
corrections based on syntactic context.
However, for the experiments conducted here, all
rules refer only to contexts within the same
word.</p>
      <p>
        The rules in question are combined in a
hierarchical order using a type of if-then-else
logic. That is, we can, for example, rst
apply corrections to (1) known abbreviations. If
that fails to produce a legitimate word in the
dictionary, we (2) apply diacritic restoration.
If that fails, we apply some common error
pattern (3). Should that fail, apply both (2)
and (3), etc. etc. This type of a
hierarchical replacement strategy can be implemented
through standard regular expression
operators available in conjunction with a
dictionary encoded as an automaton
        <xref ref-type="bibr" rid="ref2">(Beesley and
Karttunen, 2003)</xref>
        .
4
      </p>
    </sec>
    <sec id="sec-4">
      <title>Noisy channel FST modeling</title>
      <p>The fundamental idea behind noisy
channel modeling is to choose a w to
maximize the probability P (wjs), where s is the
\scrambled" word and w the normalized,
intended form. This quantity is proportional to
P (sjw)P (w), where the rst factor is called
the error model|the probability that a word
w was perturbed to yield s|and the second
factor the language model, the probability
that w occurs.</p>
      <p>Weighted FSTs provide a convenient
calculus for addressing the above task,
assuming we can encode the probabilities of words
P (w) and perturbations P (sjw) as
probabilistic or weighted transducers somehow.
When the error model (EM), the language
model (LM), and a scrambled input word s
are available to us as weighted transducers,
we may calculate the composition</p>
      <p>W = s</p>
      <p>EM</p>
      <p>LM
(1)
and subsequently nd the least-cost string in
the output projection of W . Here, the
transducer s is simply an acceptor that repeats the
to-be-corrected input word with probability 1
(or cost 0, as we are operating with negative
log probabilities, or weights). Since
transducer composition, projection, and least-cost
paths are easily calculated, what remains to
be done is to construct the transducers that
assign a cost (probability) to character
perturbations (EM) and words (LM).
4.1</p>
      <sec id="sec-4-1">
        <title>Inducing an error model</title>
        <p>
          To assign probabilities to perturbations, we
rst align the individual characters of
wordpairs in the development corpus using a
Chinese Restaurant Process (CRP)
          <xref ref-type="bibr" rid="ref4">(Ferguson,
1973)</xref>
          aligner. This is very similar to
aligning the word-pairs by minimum edit
distance (MED). After the alignment is
performed, we work under the assumption that
the character-to-character correspondences
are all independent|that is, not conditioned
upon other transformations in the same word
or elsewhere|and estimate the probability of
each input symbol x being transformed to y
as:
c(x : y) +
p(x ! y) =
        </p>
        <p>N + j j
using the smoothing parameter (set to 0.1)
to avoid zero counts, N being the number of
observed character pairs, and our alphabet
of possible symbol pairs. Note that zeroes,
representing deletions and insertions are also
possible.</p>
        <p>For example, our aligner outputs
alignments such as:
(2)
pasa_o
pasado</p>
        <sec id="sec-4-1-1">
          <title>Tambien Tambien t__o todo</title>
        </sec>
        <sec id="sec-4-1-2">
          <title>Adioos Adio_s</title>
          <p>From this, we may calculate
probabilities of inserting an o ( :o), inserting a d,
( :d), repeating a p (p:p), etc., and construct
a transducer representing the error model
(EM ) that performs such perturbations with
the intended probabilities. Note that while
we assume a context-independent
probability for changes, there is no principled reason
why complex contexts could not be
accommodated as well.3
4.2</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>Calculating corrections</title>
        <p>For the language model, LM , we have chosen
to rely mainly on simple unigram
frequencies for words in the FreeLing lexicon, where
frequencies were collected from a Wikipedia
dump. Such a model is particularly simple to
produce a transducer from, but again, there
is no principled reason why more complex
models could not be used.</p>
        <p>Assuming the existence of transducers
labeled $word, representing the current word
to be corrected, an error model $EM, and a
language model $LM, as described above, the
correction can be described as a function in
the Kleene programming language as follows:
3For example, inserting a d to yield a past
participle, as in pasao!pasado should reasonably be more
likely than inserting a d in other contexts.
$^correct($word) {
return $^shortestPath(</p>
        <p>$^lowerside($word _o_ $EM _o_ $LM)
}
);</p>
        <p>This re ects exactly the calculation given
in (1) and the entire function is equivalent
to nding argmax P (sjw) P (w) in the noisy
channel model presented above. That is, we
compose the input word with the error model
and language model, extract the output
projection and nd the least-cost (most
probable) path.</p>
        <p>When this strategy is employed, we also
need to establish a cuto probability to
choose when to simply accept a word as is.
Obviously, the error model allows for any
word to change into any other word, albeit at
a low probability, and we do not want to allow
extreme changes to produce a word in the
dictionary, but rather accept the unknown word
without change.</p>
        <p>Apart from calculating corrections based
on a language model and error model, we can
also incorporate previously \known"
corrections collected from a gold-standard set of
corrected words, if such a resource is
available. If we assume access to a set of
already known word-word pairs that represent
corrected input words, we can build another
transducer that directly maps such cases to
their correct form at a low cost. This can be
helpful because, for example, abbreviations
such as tqm ! te quiero mucho would likely
produce a high cost using the
characterto-character based noisy channel corrector
because of the large number of character
changes required to produce the target form.
However, this can be addressed by building
a transducer EX that models such previously
known word-word pairs and assigns these
corrections a low cost, and the entire calculation
can be performed as:</p>
        <p>W = s ((EM</p>
        <p>LM ) [ EX)
(3)</p>
        <p>This would then model both a lookup from
a \known" list of cheap corrections as well as
the noisy-channel error-model and language
model corrections.
Rules 63.1</p>
        <p>Noisy Channel 61.4
5</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Evaluation</title>
      <p>To evaluate the performance of the methods,
we divided the gold standard corpus into ve
parts to be able to perform cross-validation.
This is necessary, in particular, for the noisy
channel model where the learning data must
necessarily come from a subset of the gold
corpus with normalized tweets.</p>
      <p>The results are given in tables 1 and 2. As
is seen, the manual correction rules slightly
outperform the corrections produced by the
weighted transducer method.
6</p>
    </sec>
    <sec id="sec-6">
      <title>Discussion &amp; Future Work</title>
      <p>We have compared and evaluated two di
erent approaches to tweet normalization, one
being more of an expert system developed
based on observation of tweets and the other
a statistical one based on inducing
correction rules from a gold standard corpus. As
it stands, the noisy channel corrector su ers
from the di culty of constructing a reliable
error model from the limited number of
goldstandard corrections and the fact that the
unigram language model is relatively poor.
This sometimes leads to drastically
incorrect conclusions. For example, the corrector
would produce from the input cansao the
output casa and not the desired cansado,
since in a unigram context, the high
likelihood of casa outweighs the cost of having
to delete two characters. Such errors are
frequent and could probably be addressed by
both a richer language model and a more
rened error model.</p>
      <p>Apart from obvious improvements to the
language model, there is also potential for
combining the hand-written and data-driven
approaches to yield a hybrid system.</p>
      <p>One perhaps pro table avenue of further
experimentation is to use the contextual rules
developed by the linguist and learn
probabilities for those rules. For example, under the
current noisy channel model, the
perturbation ; ! u (insert u) is learned without
reference to context and deemed equally likely
regardless of context. However, the
handwritten rule allows that transformation
primarily following a q, re ecting errors such as
*qiero and *aqi.</p>
      <p>Quick improvements also appear possible
by investing in a good dictionary of named
entities as tweets appear to contain proper
names with much higher frequency than in
running text in general. For the current
work, entities such as movie names,
companies, celebrities, etc. were not used in the
development. Automatic induction of such a
database from large numbers of tweets should
also be considered.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Beesley</surname>
            ,
            <given-names>Kenneth R.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>The Kleene language for weighted nite-state programming</article-title>
          .
          <source>In Finite-state Methods and Natural Language Processing: Postproceedings of the 7th International Workshop FSMNLP;</source>
          Edited by Jakub
          <string-name>
            <surname>Piskorski</surname>
          </string-name>
          ,
          <article-title>Bruce Watson and Anssi Yli-Jyra</article-title>
          , volume
          <volume>191</volume>
          , page 27. IOS Press.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Beesley</surname>
          </string-name>
          ,
          <string-name>
            <surname>Kenneth R</surname>
            . and
            <given-names>Lauri</given-names>
          </string-name>
          <string-name>
            <surname>Karttunen</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Finite State Morphology</article-title>
          .
          <source>CSLI Publications</source>
          , Stanford, CA.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Carreras</surname>
            , Xavier, Isaac Chao, Lluis Padro, and
            <given-names>Muntsa</given-names>
          </string-name>
          <string-name>
            <surname>Padro</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>FreeLing: An open-source suite of language analyzers</article-title>
          .
          <source>In LREC Proceedings.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Ferguson</surname>
            ,
            <given-names>T. S.</given-names>
          </string-name>
          <year>1973</year>
          .
          <article-title>A Bayesian analysis of some nonparametric problems</article-title>
          . Ann. Stat.,
          <volume>1</volume>
          :
          <fpage>209</fpage>
          {
          <fpage>230</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Hulden</surname>
          </string-name>
          , Mans.
          <year>2009</year>
          .
          <article-title>Foma: a nite-state compiler and library</article-title>
          .
          <source>In Proceedings of the 12th conference of the European Chapter of the Association for Computational Linguistics</source>
          ,, pages
          <fpage>29</fpage>
          {
          <fpage>32</fpage>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>