<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Word normalization in Twitter using finite-state transducers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jordi Porta</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>This paper presents a linguistic approach based on weighted finite-state transducers for the lexical normalisation of Spanish Twitter messages. The system developed consists of transducers that are applied to out-of-vocabulary tokens. Transducers implement linguistic models of variation that generate sets of candidates according to a lexicon. A statistical language model is used to obtain the most probable sequence of words. The article includes a description of the components and an evaluation of the system and some of its parameters.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Text messaging (or texting) exhibits a
considerable degree of departure from the
writing norm, including spelling. There are many
reasons for this deviation: the informality
of the communication style, the
characteristics of the input devices, etc. Although
many people consider that these
communication channels are "deteriorating" or even
"destroying" languages, many scholars claim
that even in these channels
communication obeys maxims and that spelling is also
principled. Moreover, it seems that, in
general, the processes underlying variation are
not new to languages. It is under these
considerations that the modelling of spelling
variation, and also its normalisation, can be
addressed. Normalisation of text messaging
is seen as a necessary preprocessing task
before applying other natural language
processing tools designed for standard language
varieties.</p>
      <p>
        Few works dealing with Spanish text
messaging can be found in the literature. To
the best of our knowledge, the most
relevant and recent published works are
        <xref ref-type="bibr" rid="ref10">Mosquera and Moreda (2012)</xref>
        ,
        <xref ref-type="bibr" rid="ref12">Pinto et al.
(2012)</xref>
        , Gómez Hidalgo, Caurcel Díaz, and
Íñiguez del Río (2013) and
        <xref ref-type="bibr" rid="ref11">Oliva et al.
(2013)</xref>
        .
      </p>
    </sec>
    <sec id="sec-2">
      <title>Architecture and components of the system</title>
      <p>The system has four main components that
are applied sequentially: an analyser
performing tokenisation and lexical analysis on
standard word forms and on other
expressions like numbers, dates, etc.; a
component generating word candidates for
out-of-vocabulary (OOV) tokens; a statistical
language model used to obtain the most
likely sequence of words; and finally, a
truecaser giving proper capitalisation to common
words assigned to OOV tokens.</p>
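      <p>As an illustration, the sequential architecture above can be sketched in Python. Everything here is a hypothetical stand-in, not the actual Freeling, OpenFst or OpenGrm modules used by the system:</p>
```python
# Sketch of the four-stage pipeline: tokenise, generate candidates for
# OOV tokens, decode with a language model, truecase. All components
# are hypothetical stand-ins for the system's actual modules.

def normalise(message, lexicon, generate_candidates, decode, truecase):
    tokens = message.split()  # stand-in for the Freeling tokeniser
    # In-vocabulary tokens map to themselves; OOV tokens get a confusion set.
    confusion_sets = [
        [t] if t.lower() in lexicon else generate_candidates(t)
        for t in tokens
    ]
    words = decode(confusion_sets)  # language-model decoding
    return [truecase(w, t) for w, t in zip(words, tokens)]

result = normalise(
    "q te kiero",
    {"que", "te", "quiero"},
    generate_candidates=lambda t: {"q": ["que"], "kiero": ["quiero"]}.get(t, [t]),
    decode=lambda sets: [s[0] for s in sets],  # trivial decoder for the demo
    truecase=lambda w, t: w,
)
print(result)  # ['que', 'te', 'quiero']
```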
      <p>
        Freeling
        <xref ref-type="bibr" rid="ref2">(Atserias et al., 2006)</xref>
        with a
special configuration designed for this task is
used to tokenise the message and identify,
among other tokens, standard word forms.
The generation of candidates, i.e., the
confusion set of an OOV token, is performed
by components inspired by other modules
used to analyse words found in historical
texts, where other kinds of spelling variation
can be found
        <xref ref-type="bibr" rid="ref13 ref7">(Porta, Sancho, and Gomez,
2013)</xref>
        . The approach to historical variation
was based on weighted finite-state
transducers over the tropical semiring implementing
linguistically motivated models. Some
experiments were conducted in order to assess
the task of assigning to old word forms their
corresponding modern lemmas. For each old
word, lemmas were assigned via the possible
modern forms predicted by the model.
Results were comparable to the results obtained
with the Levenshtein distance
        <xref ref-type="bibr" rid="ref8">(Levenshtein,
1966)</xref>
        in terms of recall, but were better in
terms of accuracy, precision and F-measure. As for
old words, the confusion set of an OOV token
is generated by applying the shortest-paths
algorithm to the composition
W ∘ E ∘ L,
      </p>
      <p>
        where W is the automaton representing the
OOV token, E is an edit transducer
generating possible variations on tokens, and L is
the set of target words. The composition of
these three modules is performed using an
on-line implementation of the efficient
three-way composition algorithm of
        <xref ref-type="bibr" rid="ref1">Allauzen and
Mohri (2008)</xref>
        .
      </p>
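      <p>A set-based (non-FST) analogue of this construction, for the special case in which the edit transducer is a Levenshtein transducer of distance one, can be sketched as follows; the names are illustrative:</p>
```python
# Analogue of shortest-paths over (token ∘ edit model ∘ lexicon):
# generate all distance-1 strings and intersect with the lexicon.
ALPHABET = "abcdefghijklmnopqrstuvwxyzáéíóúñü"

def edits1(token):
    # All strings at Levenshtein distance 1 from the token.
    splits = [(token[:i], token[i:]) for i in range(len(token) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    substitutes = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    return set(deletes + inserts + substitutes)

def confusion_set(token, lexicon):
    # Cost 0: the token is already a word; cost 1: distance-1 neighbours.
    if token in lexicon:
        return {token}
    return edits1(token) & lexicon

print(confusion_set("exament", {"examen", "buenas", "quiero"}))  # {'examen'}
```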
    </sec>
    <sec id="sec-3">
      <title>Resources employed</title>
      <p>In this section, the resources employed by the
components of the system are described: the
edit transducers, the lexical resources and the
language model.</p>
      <sec id="sec-3-1">
        <title>Edit transducers</title>
        <p>
          We follow the classification of
          <xref ref-type="bibr" rid="ref6">Crystal (2008)</xref>
          for texting features present also in Twitter
messages. In order to deal with these features,
several transducers were developed.
Transducers are expressed as regular expressions
and context-dependent rewrite rules of the
form φ → ψ / λ _ ρ
          <xref ref-type="bibr" rid="ref5">(Chomsky and Halle,
1968)</xref>
          that are compiled into weighted
finite-state transducers using the OpenGrm Thrax
tools
          <xref ref-type="bibr" rid="ref16">(Tai, Skut, and Sproat, 2011)</xref>
          .
        </p>
        <sec id="sec-3-1-1">
          <title>3.1.1 Logograms and Pictograms</title>
          <p>Some letters are used as logograms,
with a phonetic value. They are dealt with
by optional rewrites altering the orthographic
form of tokens:
ReplaceLogograms = (x (→) por)</p>
          <p>(2 (→) dos) (@ (→) a|o) . . .</p>
          <p>Also laughs, which are very frequent, are
considered logograms, since they represent
sounds associated with actions. The
multiple ways they are realised, including plurals,
are easily described with regular expressions.</p>
          <p>Pictograms like emoticons entered by
means of ready-to-use icons in input devices
are not treated by our system since they are
not textual representations. However, textual
representations of emoticons like :DDD or
xDDDDDD are recognised by regular
expressions and mapped to their canonical form by
means of simple transducers.</p>
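          <p>A minimal sketch of such an emoticon canonicaliser, with illustrative patterns and canonical forms (not the system's actual rules):</p>
```python
import re

# Hypothetical canonicalisation of textual emoticons such as :DDD or
# xDDDDDD; the canonical forms chosen here are illustrative.
EMOTICON_CANON = [
    (re.compile(r"^:D+$"), ":D"),
    (re.compile(r"^[xX]D+$"), "xD"),
]

def canonicalise_emoticon(token):
    for pattern, canon in EMOTICON_CANON:
        if pattern.match(token):
            return canon
    return token

print(canonicalise_emoticon("xDDDDDD"))  # xD
```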
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2 Initialisms, shortenings, and letter omissions</title>
          <p>The string operations for initialisms (or
acronymisation) and shortenings are difficult
to model without incurring
overgeneration of candidates. For this reason, only
common initialisms, e.g., sq (es que), tk (te
quiero) or sa (se ha), and common shortenings,
e.g., exam (examen) or nas (buenas), are
listed.</p>
          <p>For the omission of letters, several
transducers are implemented. The simplest and
most conservative one is a transducer
introducing just one letter at any position of the
token string. Consonantal writing is a
special case of letter omission. This kind of
writing relies on the assumption that consonants
carry much more information than vowels do,
which in fact is the norm in some languages,
such as the Semitic languages. Some rewrite rules are
applied to OOV tokens in order to restore
vowels:
InsertVowels = invert(RemoveVowels)
RemoveVowels = Vowels (→) ε</p>
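          <p>The inversion idea can be emulated without transducers: rather than inserting vowels in every position, accept a lexicon word as a candidate exactly when deleting its vowels yields the observed consonantal skeleton. A sketch:</p>
```python
# Emulation of InsertVowels = invert(RemoveVowels): a lexicon word is
# a candidate iff removing its vowels gives the observed skeleton.
VOWELS = set("aeiouáéíóú")

def remove_vowels(word):
    return "".join(ch for ch in word if ch not in VOWELS)

def restore_vowels(skeleton, lexicon):
    return {w for w in lexicon if remove_vowels(w) == skeleton}

print(restore_vowels("qr", {"quiero", "buenas"}))  # {'quiero'}
```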
        </sec>
        <sec id="sec-3-1-3">
          <title>3.1.3 Standard non-standard spellings</title>
          <p>We consider non-standard spellings standard
when they are widely used. These include
spellings for representing regional or informal
speech, or choices sometimes conditioned by
input devices, such as non-accented writing. For
the case of accents and tildes, they are
restored using a cascade of optional rewrite rules
like the following:
RestoreAccents = (n|ni|ny|nh (→) ñ)
(a (→) á) (e (→) é) . . .</p>
          <p>Also, words containing k instead of c or qu,
which appear frequently in protest writings,
are standardised with simple transducers.
Some other changes are made to some endings to
recover the standard ending. There are
complete paradigms like the following, which
relate non-standard to standard endings:
-a -ada
-as -adas
-ao -ado
-aos -ados</p>
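          <p>The ending paradigm above can be sketched as a longest-match-first rewrite; in the real system the resulting candidates are additionally filtered against the lexicon:</p>
```python
# The -ao/-ado ending paradigm as a longest-match-first rewrite.
ENDINGS = [("aos", "ados"), ("as", "adas"), ("ao", "ado"), ("a", "ada")]

def standardise_ending(token):
    for nonstd, std in ENDINGS:
        if token.endswith(nonstd):
            return token[: -len(nonstd)] + std
    return token

print(standardise_ending("cansao"))  # cansado
```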
          <p>We also consider phonetic writing as a
kind of non-standard writing in which a
phonetic form of a word is alphabetically
and syllabically approximated. The
transducers used for generating standard words from
their phonetic and graphical variants are:
DephonetiseWriting =
invert(PhonographemicVariation)
PhonographemicVariation =
GraphemeToPhoneme ∘
PhoneConflation ∘
PhonemeToGrapheme ∘
GraphemeVariation</p>
          <p>In the previous definitions,
PhoneConflation makes phonemes equivalent, as for
example the IPA phonemes /ʎ/ and /ʝ/.
Linguistic phenomena such as seseo and ceceo, in
which several phonemes were conflated by
the 16th century, still remain in spoken variants
and are also reflected in texting. The
GraphemeVariation transducer models, among
others, the writing of ch as x, which could be
due to the influence of other languages.</p>
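          <p>A grapheme-level sketch of the conflation idea, using a small illustrative subset of rules (yeísmo, seseo/ceceo, and ch written as x); two strings are phonetic variants when their conflated forms coincide:</p>
```python
import re

# Collapse spellings conflated by yeísmo, seseo/ceceo and the
# ch-as-x variation, then compare conflated forms. Illustrative only.
CONFLATIONS = [
    (re.compile(r"ll"), "y"),           # yeísmo: /ʎ/ ~ /ʝ/
    (re.compile(r"z|c(?=[ei])"), "s"),  # seseo/ceceo
    (re.compile(r"x"), "ch"),           # ch written as x
]

def conflate(word):
    for pattern, repl in CONFLATIONS:
        word = pattern.sub(repl, word)
    return word

def phonetic_candidates(token, lexicon):
    return {w for w in lexicon if conflate(w) == conflate(token)}

print(phonetic_candidates("xaval", {"chaval", "calle"}))  # {'chaval'}
```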
        </sec>
        <sec id="sec-3-1-4">
          <title>3.1.4 Juxtapositions</title>
          <p>Spacing in texting is also non-standard. In
the normalisation task, some OOV tokens
are in fact juxtaposed words. The
possible decompositions of a word into a
sequence of possible words are given by
shortest-paths(W ∘ SplitConjoinedWords ∘ L (␣ L)+),
where W is
the word to be analysed, L (␣ L)+ represents
the valid sequences of words, and
SplitConjoinedWords is a transducer introducing blanks
(␣) between letters and optionally undoing
possible fused vowels:
SplitConjoinedWords = invert(JoinWords)
JoinWords =
(a a (→) a&lt;1&gt;) . . . (u u (→) u&lt;1&gt;)
(␣ (→) ε)
Note that in the previous definition, some
rules are weighted with a unit cost &lt;1&gt;.
These costs are used by the shortest-paths
algorithm as a preference mechanism to select
non-fused over fused sequences when both
cases are possible.</p>
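          <p>A dynamic-programming sketch of the same shortest-path preference, splitting a token into lexicon words and preferring the fewest words; the fused-vowel rules of JoinWords are omitted here:</p>
```python
from functools import lru_cache

# Minimum-split decomposition of a token into lexicon words,
# analogous to the shortest-paths decoding described above.
def best_split(token, lexicon):
    @lru_cache(maxsize=None)
    def solve(i):
        # Shortest list of lexicon words covering token[i:], or None.
        if i == len(token):
            return []
        best = None
        for j in range(i + 1, len(token) + 1):
            if token[i:j] in lexicon:
                rest = solve(j)
                if rest is not None and (best is None or len(best) > len(rest) + 1):
                    best = [token[i:j]] + rest
        return best
    return solve(0)

print(best_split("tequiero", {"te", "quiero"}))  # ['te', 'quiero']
```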
        </sec>
        <sec id="sec-3-1-5">
          <title>3.1.5 Other transducers</title>
          <p>Expressive lengthening, which consists in
repeating a letter in order to convey emphasis,
is dealt with by means of rules removing a
varying number of consecutive occurrences of
the same letter. An example of a rule dealing
with letter a repetitions is a (→) ε / a _.
A transducer is generated for the whole alphabet.</p>
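          <p>A regular-expression sketch of this de-lengthening: collapse letter runs to one and to two survivors and keep whichever results are lexicon words (Spanish has legitimate doubles such as rr, ll or cc):</p>
```python
import re

# Collapse expressive lengthening, keeping lexicon-validated results.
def delengthen(token, lexicon):
    one = re.sub(r"(.)\1+", r"\1", token)     # runs -> one letter
    two = re.sub(r"(.)\1{2,}", r"\1\1", token)  # runs -> two letters
    return {w for w in (one, two) if w in lexicon}

print(delengthen("holaaaa", {"hola"}))  # {'hola'}
```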
          <p>Because messages are keyboarded, some
errors found in words are due to letter
transpositions and confusions between adjacent
letters in the same row of the keyboard.
These changes are also implemented with a
transducer.</p>
          <p>Finally, a Levenshtein transducer with a
maximum distance of one has also been
implemented.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>The lexicon</title>
        <p>
          The lexicon for OOV token normalisation
contains mainly Spanish standard words,
proper names and some frequent English
words. These constitute the set of target
words. We used the DRAE
          <xref ref-type="bibr" rid="ref14">(RAE, 2001)</xref>
          as
the source for Spanish standard words in the
lexicon. Besides inflected forms, we have
added verbal forms with clitics attached and
derivative forms not found as entries in the
DRAE: -mente adverbs, appreciatives, etc.
The list of proper names was compiled from
many sources and contains first names,
surnames, aliases, cities, country names, brands,
organisations, etc. Special attention was
paid to hypocorisms, i.e., shorter or
diminutive forms of a given name, as well as
nicknames or calling names, since communication in
channels such as Twitter tends to be interpersonal
(or between members of a group) and
affective. A list of common hypocorisms is provided
to the system. For English words, we have
selected the 100,000 most frequent words of the
BNC
          <xref ref-type="bibr" rid="ref4">(BNC, 2001)</xref>
          .
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>Language model</title>
        <p>
          We use a language model to decode the word
graph and thus obtain the most probable
word sequence. The model is estimated from
a corpus of webpages compiled with WaCky
          <xref ref-type="bibr" rid="ref3">(Baroni et al., 2009)</xref>
          . The corpus contains
about 11,200,000 tokens coming from about
21,000 URLs. We used as seeds the types
found in the development set (about 2,500).
Backed-off n-gram models, used as language
models, are implemented with the OpenGrm
NGram toolkit
          <xref ref-type="bibr" rid="ref15">(Roark et al., 2012)</xref>
          .
        </p>
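        <p>The decoding step can be illustrated with a toy Viterbi search over confusion sets under a bigram score; this stands in for the OpenGrm n-gram decoding, and the probabilities below are invented for the example:</p>
```python
# Toy Viterbi decoder over confusion sets with a bigram log-probability.
def decode(confusion_sets, bigram_logprob):
    # best[w] = (score of the best path ending in w, that path)
    best = {w: (0.0, [w]) for w in confusion_sets[0]}
    for column in confusion_sets[1:]:
        best = {
            w: max(
                ((score + bigram_logprob(prev, w), path + [w])
                 for prev, (score, path) in best.items()),
                key=lambda sp: sp[0],
            )
            for w in column
        }
    return max(best.values(), key=lambda sp: sp[0])[1]

LOGPROBS = {("te", "quiero"): -0.5, ("te", "quedo"): -3.0}
lp = lambda a, b: LOGPROBS.get((a, b), -10.0)
print(decode([["te"], ["quiero", "quedo"]], lp))  # ['te', 'quiero']
```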
      </sec>
      <sec id="sec-3-4">
        <title>Truecasing</title>
        <p>
          Restoring case information in
badly-cased text has been addressed in
          <xref ref-type="bibr" rid="ref9">(Lita et
al., 2003)</xref>
          and has been included as part of
the normalisation task. Part of this process,
for proper names, is performed by the
application of the language model to the word
graph. Words at message-initial position are
not always uppercased, since doing so
yielded contradictory results after some
experimentation. A simple heuristic is implemented
to uppercase a normalisation candidate when
the OOV token is also uppercased.
        </p>
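        <p>The uppercase heuristic amounts to a one-liner; a sketch:</p>
```python
# Capitalise the normalisation candidate when the OOV token itself
# was capitalised; otherwise leave it untouched.
def truecase(candidate, oov_token):
    if oov_token[:1].isupper():
        return candidate[:1].upper() + candidate[1:]
    return candidate

print(truecase("que", "Ke"))  # Que
```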
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Settings and evaluation</title>
      <p>In order to generate the confusion sets we
used two edit transducers applied in a
cascade. If neither of the two is able to relate a
token with a word, the token is assigned to
itself.</p>
      <p>The first transducer generates
candidates according to the expansion of
abbreviations, the identification of acronyms and
pictograms, and words which result from the
following composition of edit transducers
combining some of the features of texting:
RemoveSymbols ∘
LowerCase ∘
Deaccent ∘
RemoveReduplicates ∘
ReplaceLogograms ∘
StandardiseEndings ∘
DephonetiseWriting ∘
Reaccent ∘
MixCase</p>
      <p>The second edit transducer analyses
tokens that did not receive analyses from the
first one. This second transducer implements
consonantal writing, typing-error recovery,
approximate matching using a Levenshtein
distance of one, and the splitting of
juxtaposed words. In all cases, case, accents and
reduplications are also considered. This second
transducer makes use of an extended lexicon
containing sequences of simple words.</p>
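      <p>The cascade logic itself is simple; a sketch with hypothetical generator functions:</p>
```python
# Two-pass cascade: the second, more aggressive generator is consulted
# only when the first yields no candidates; a token with no analysis
# at all is assigned to itself.
def cascade(token, first_pass, second_pass):
    candidates = first_pass(token)
    if not candidates:
        candidates = second_pass(token)
    return candidates or {token}

print(cascade("xyz", lambda t: set(), lambda t: set()))  # {'xyz'}
```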
      <p>Several experiments were conducted in
order to evaluate some parameters of the
system: in particular, the effect of the order of
the n-grams in the language model and the
effect of generating confusion sets for OOV
tokens only versus generating
confusion sets for all tokens. For all the
experiments we used the test set provided, with the
tokenization delivered by Freeling.</p>
      <p>For the first series of experiments, tokens
identified as standard words by Freeling
receive the same token as analysis, and OOV
tokens are analysed with the system. Recall
on OOV tokens is 89.40 %. Confusion-set
size follows a power-law distribution with an
average size of 5.48 for OOV tokens that goes
down to 1.38 if we average over the rest of
the tokens. Precision for 2- and 4-gram
language models is 78.10 %, but the best result
is obtained with 3-grams, with a precision
of 78.25 %.</p>
      <p>There are a number of non-standard
forms that were wrongly recognised as
in-vocabulary words because they clash with
other standard words. In the second series
of experiments, a confusion set is generated
for each word in order to correct potentially
wrong assignments. The average size of
confusion sets increases to 5.56. Precision
for the 2-gram language model is 78.25 %,
but 3- and 4-grams both reach a precision of
78.55 %.</p>
      <p>From a quantitative point of view, it seems
that slightly better results are obtained using
a 3-gram language model and generating
confusion sets not only for OOV tokens but for
all the tokens in the message. In a
qualitative evaluation of errors, several categories show
up. The most populated categories are those
having to do with case restoration and wrong
decoding by the language model. Some errors
are related to particularities of the DRAE, from
which the lexicon was derived (dispertar or
malaleche). Non-standard morphology is
observed in tweets, as in derivatives (tranquileo
or loquendera). Lack of abbreviation
expansion is also observed (Hum). Faulty
application of segmentation accounts for a few errors
(mencantaba). Finally, some errors are not in
our output but in the reference (Hojo).</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and future work</title>
      <p>No attention has been paid to
multilingualism, since the task explicitly excluded tweets
from bilingual areas of Spain. However,
given that many Spanish speakers (both in
Europe and America) are bilingual or live in
bilingual areas, mechanisms should be
provided to deal with languages other than English
to make the system more robust.</p>
      <p>We plan to build a corpus of lexically
standard tweets via the Twitter streaming API
to determine whether n-grams observed in a
Twitter-only corpus improve decoding or not,
as a side effect of syntax also being
non-standard. (In the average confusion-set
sizes above, we removed from the calculation
the token mes de abril, which receives 308,017
different analyses due to the combination of
multiple edits and segmentations.)</p>
      <p>Qualitative analysis of results showed that
there is room for improvement experimenting
with selective deactivation of items in the
lexicon and further development of the
segmenting module.</p>
      <p>However, initialisms and shortenings are
features of texting difficult to model without
causing overgeneration. Acronyms like FYQ,
which corresponds to the school subject of
Física y Química, are domain-specific and
difficult to foresee, and therefore to have
listed in the resources.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Allauzen</surname>
            , Cyril and
            <given-names>Mehryar</given-names>
          </string-name>
          <string-name>
            <surname>Mohri</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>3-way composition of weighted finite-state transducers</article-title>
          .
          <source>In Proc. of the 13th Int. Conf. on Implementation and Application of Automata (CIAA-2008)</source>
          , pages
          <fpage>262</fpage>
          -
          <lpage>273</lpage>
          , San Francisco, California, USA.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Atserias</surname>
            , Jordi, Bernardino Casas, Elisabet Comelles, Meritxell González, Lluís Padró, and
            <given-names>Muntsa</given-names>
          </string-name>
          <string-name>
            <surname>Padró</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>FreeLing 1.3: Syntactic and semantic services in an open-source NLP library</article-title>
          .
          <source>In Proc. of the 5th Int. Conf. on Language Resources and Evaluation (LREC-2006)</source>
          , pages
          <fpage>48</fpage>
          -
          <lpage>55</lpage>
          , Genoa, Italy, May.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Baroni</surname>
            , Marco, Silvia Bernardini, Adriano Ferraresi, and
            <given-names>Eros</given-names>
          </string-name>
          <string-name>
            <surname>Zanchetta</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>The WaCky wide web: A collection of very large linguistically processed web-crawled corpora</article-title>
          .
          <source>Language Resources and Evaluation</source>
          ,
          <volume>43</volume>
          (
          <issue>3</issue>
          ):
          <fpage>209</fpage>
          -
          <lpage>226</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>BNC.</surname>
          </string-name>
          <year>2001</year>
          .
          <article-title>The British National Corpus, version 2 (BNC World)</article-title>
          .
          <source>Distributed by Oxford University Computing Services on behalf of the BNC Consortium</source>
          . http://www.natcorp.ox.ac.uk.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Chomsky</surname>
            , Noam and
            <given-names>Morris</given-names>
          </string-name>
          <string-name>
            <surname>Halle</surname>
          </string-name>
          .
          <year>1968</year>
          .
          <article-title>The sound pattern of English</article-title>
          . Harper &amp; Row, New York.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Crystal</surname>
          </string-name>
          , David.
          <year>2008</year>
          .
          <article-title>Txtng: The Gr8 Db8</article-title>
          . Oxford University Press.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Gomez</given-names>
            <surname>Hidalgo</surname>
          </string-name>
          , José María,
          <string-name>Andrés Alfonso Caurcel Díaz, and Yovan Íñiguez del Río</string-name>
          .
          <year>2013</year>
          .
          <article-title>Un método de análisis de lenguaje tipo SMS para el castellano</article-title>
          .
          <source>Linguamática</source>
          ,
          <volume>5</volume>
          (
          <issue>1</issue>
          ):
          <fpage>31</fpage>
          -
          <lpage>39</lpage>
          , July.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Levenshtein</surname>
            ,
            <given-names>Vladimir I.</given-names>
          </string-name>
          <year>1966</year>
          .
          <article-title>Binary codes capable of correcting deletions, insertions, and reversals</article-title>
          .
          <source>Soviet Physics Doklady</source>
          ,
          <volume>10</volume>
          (
          <issue>8</issue>
          ):
          <fpage>707</fpage>
          -
          <lpage>710</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Lita</surname>
            , Lucian Vlad, Abe Ittycheriah, Salim Roukos, and
            <given-names>Nanda</given-names>
          </string-name>
          <string-name>
            <surname>Kambhatla</surname>
          </string-name>
          .
          <year>2003</year>
          . tRuEcasIng.
          <source>In Proc. of the 41st Annual Meeting on ACL - Volume 1, ACL '03</source>
          , pages
          <fpage>152</fpage>
          -
          <lpage>159</lpage>
          , Stroudsburg, PA, USA.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Mosquera</surname>
            , Alejandro and
            <given-names>Paloma</given-names>
          </string-name>
          <string-name>
            <surname>Moreda</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>TENOR: A lexical normalisation tool for Spanish Web 2.0 texts</article-title>
          . In Petr Sojka, Aleš Horák, Ivan Kopeček, and Karel Pala, editors,
          <source>Text, Speech and Dialogue</source>
          , volume
          <volume>7499</volume>
          <source>of LNCS</source>
          , pages
          <fpage>535</fpage>
          -
          <lpage>542</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Oliva</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>J. I.</given-names>
            <surname>Serrano</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. D. Del Castillo</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Iglesias</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>A SMS normalization system integrating multiple grammatical resources</article-title>
          .
          <source>Natural Language Engineering</source>
          ,
          <volume>19</volume>
          :
          <fpage>121</fpage>
          -
          <lpage>141</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Pinto</surname>
          </string-name>
          , David, Darnes Vilariño Ayala, Yuridiana Alemán, Helena Gómez, Nahun Loya, and
          <string-name>
            <surname>Héctor</surname>
          </string-name>
          Jiménez-Salazar.
          <year>2012</year>
          .
          <article-title>The Soundex phonetic algorithm revisited for SMS text representation</article-title>
          . In Petr Sojka, Aleš Horák, Ivan Kopeček, and Karel Pala, editors,
          <source>Text, Speech and Dialogue</source>
          , volume
          <volume>7499</volume>
          <source>of LNCS</source>
          , pages
          <fpage>47</fpage>
          -
          <lpage>55</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Porta</surname>
          </string-name>
          , Jordi, José-Luis
          <string-name>
            <surname>Sancho</surname>
            , and
            <given-names>Javier</given-names>
          </string-name>
          <string-name>
            <surname>Gómez</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Edit transducers for spelling variation in Old Spanish</article-title>
          .
          <source>In Proc. of the workshop on computational historical linguistics at NODALIDA</source>
          <year>2013</year>
          .
          <source>NEALT Proc. Series 18</source>
          , pages
          <fpage>70</fpage>
          -
          <lpage>79</lpage>
          , Oslo, Norway, May 22-24.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>RAE.</surname>
          </string-name>
          <year>2001</year>
          .
          <article-title>Diccionario de la lengua española</article-title>
          . Espasa, Madrid, 22nd edition.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Roark</surname>
            , Brian, Richard Sproat, Cyril Allauzen, Michael Riley, Jeffrey Sorensen, and
            <given-names>Terry</given-names>
          </string-name>
          <string-name>
            <surname>Tai</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>The OpenGrm open-source finite-state grammar software libraries</article-title>
          .
          <source>In Proc. of the ACL 2012 System Demonstrations</source>
          , pages
          <fpage>61</fpage>
          -
          <lpage>66</lpage>
          , Jeju Island, Korea, July.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Tai</surname>
            , Terry,
            <given-names>Wojciech</given-names>
          </string-name>
          <string-name>
            <surname>Skut</surname>
          </string-name>
          , and Richard Sproat.
          <year>2011</year>
          .
          <article-title>Thrax: An Open Source Grammar Compiler Built on OpenFst</article-title>
          .
          <source>In ASRU</source>
          <year>2011</year>
          ,
          Waikoloa Resort, Hawaii, December.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>