<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SINAI at Twitter-Normalization 2013</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Arturo Montejo Ráez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>M. Carlos Díaz Galiano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eugenio Martínez Cámara</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>M. Teresa Martín Valdivia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miguel A. García Cumbreras</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>L. Alfonso Ureña López</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad de Jaén, Campus Las Lagunillas, 23071</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <abstract>
        <p>In this paper, we present the Twitter-normalization system developed by the SINAI group. Our system performs a series of conversions on the text using translation lexicons and a spell checker. We obtained a poor result, only 37.6% accuracy, and analysis of these results shows that our system should be improved in areas such as the treatment of diminutives and superlatives, entities and abbreviations.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Twitter is a popular medium for broadcasting news, staying in touch with friends and sharing opinions. A considerable amount of research has focused on this microblogging platform, which is changing the way people communicate. However, tweets often contain highly irregular syntax and non-standard use of language. In addition, Twitter posts frequently include URLs as well as markup syntax, which further reduces the number of characters available for content. Because of these limits, users have created a novel syntax to communicate their messages with as much brevity as possible. While this brevity allows tweets to pack in more information, it makes them harder to mine and analyze due to the lack of standardization.</p>
      <p>
        Several works have studied the normalization problem for short text. For example,
        <xref ref-type="bibr" rid="ref2">(Kaufmann and Kalita, 2010)</xref>
        describe a novel system which normalizes Twitter posts into a standard form of English by taking a two-step approach: first preprocessing tweets to remove as much noise as possible, and then feeding them into a machine translation model to convert them into standard English.
        <xref ref-type="bibr" rid="ref1">(Han
and Baldwin, 2011)</xref>
        target out-of-vocabulary words in short text messages and propose a method for identifying and normalizing ill-formed words that does not require any annotations. They use a classifier to detect ill-formed words, and generate correction candidates based on morphophonemic similarity.
      </p>
      <p>
        On the other hand, most studies on short text normalization deal only with English tweets, while other languages are increasingly used on Twitter. For example, there are some works dealing with Spanish tweets
        <xref ref-type="bibr" rid="ref3">(Moreno-Ortiz and
Hernández, 2013)</xref>
        but very few focus on the normalization process.
      </p>
      <p>This paper describes a system which normalizes Spanish Twitter posts, converting them into a more standard form so that natural language processing (NLP) techniques can be applied to them more easily. The next section describes our approach, based on the use of translation lexicons and spell checking. Then the evaluation process is discussed and an analysis of the obtained results is presented.</p>
      <p>Each tweet is processed until it is transformed into a final normalized form. We have not considered annotation-based approaches like those followed by well-known systems such as GATE (http://gate.ac.uk/) or proposed by recommendations like the UIMA specification (http://uima.apache.org). Instead, we have chosen a straightforward solution, where first the text is tokenized with special attention to Twitter-related items (such as emoticons, mentions or hashtags) and then each token is converted into some sort of canonical form by the use of translation lexicons and a spell checker. Details of each module are given in the following subsections.</p>
    </sec>
    <sec id="sec-2">
      <title>Tokenization</title>
      <p>Tokenization segments texts into their simplest units of meaning: terms. In our case, multi-word forms are not considered, so each term corresponds either to a word or to another type of information, such as emoticons, HTML tags, telephone numbers, mentions, hashtags, dates, URLs, e-mail addresses and some other minor items. Case is preserved during the tokenization process and, as a result, we obtain a list of strings that feeds the next modules.</p>
    </sec>
    <sec id="sec-3">
      <title>Translation tables</title>
      <p>A translation table allows certain string forms to be replaced by others. In this way, we can recognize some expressions and translate them into more convenient representations. In this step, the following translation tables have been considered:
1. Abbreviations. Expressions like "a2" are translated into "adiós", "q" into "que", and so on, up to twelve Spanish abbreviations commonly used in "texting" communication.
2. Laughings. This translation table makes intensive use of regular expressions in order to capture most possible forms of laughing expressions found in text. In this way, "aajajajaaj" would be replaced by "ja", for example.</p>
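      <p>The two tables above can be sketched as follows. The dictionary entries and the laughing pattern are illustrative assumptions; our actual table holds twelve abbreviations and broader regular expressions.</p>

```python
import re

# Small excerpt of the abbreviation table (illustrative entries only).
ABBREVIATIONS = {
    "a2": "adiós",
    "q": "que",
}

# Laughing expressions: any run of j/a/h of length >= 3 collapses to "ja".
LAUGH = re.compile(r"^[jah]{3,}$", re.IGNORECASE)

def translate(token):
    """Apply the abbreviation table, then the laughing pattern."""
    low = token.lower()
    if low in ABBREVIATIONS:
        return ABBREVIATIONS[low]
    if LAUGH.match(low) and "j" in low and "a" in low:
        return "ja"
    return token
```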
    </sec>
    <sec id="sec-4">
      <title>Spell checking</title>
      <p>For this module we have used the GNU Aspell spell checker (http://aspell.net/) and its binding for Python, aspell-python (http://0x80.pl/proj/aspell-python/). GNU Aspell is an open source spell checker that works well with Unicode strings, which makes it very suitable for multilingual texts. It also allows multiple dictionaries to be used concurrently, and further vocabularies can be added as correct forms, so we can integrate more lexicons. Aspell works by converting the misspelled word (that is, a word not included in its dictionaries) into a sounds-like equivalent. It then proposes a list of words within one or two edit distances of the original word's sounds-like form. An edit distance is one replacement, insertion or deletion of a single character.</p>
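      <p>The notion of edit distance used here can be illustrated with a small generator of distance-1 variants. This is a simplified sketch under our own assumptions: real Aspell searches its sounds-like index rather than enumerating candidate strings.</p>

```python
SPANISH = "abcdefghijklmnopqrstuvwxyzáéíóúñü"

def edits1(word, alphabet=SPANISH):
    """All strings at edit distance 1: one deletion, replacement or insertion."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return (deletes | replaces | inserts) - {word}

def candidates(word, dictionary):
    """Dictionary words reachable within one edit, as Aspell-style suggestions."""
    return sorted(edits1(word) & dictionary)
```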
      <p>We have added the following lexicons to Aspell:</p>
      <p>Main provinces and cities in Spain, extracted from the INE, the National Statistics Institute of Spain (http://www.ine.es/daco/daco42/codmun/cod_provincia.htm).</p>
      <p>Interjections like "ajá", "jolín" or "puf", among others. This list is a selection from the ones proposed in Wiktionary (http://es.wiktionary.org/wiki/Categoría:ES:Interjecciones).</p>
      <p>Twitter jargon and neologisms, with terms like "Facebook" or "tuiteo", selected from an on-line glossary (http://estwitter.com/glosario/).</p>
      <p>Named entities, generated from Wikipedia and containing more than 650 different named entities. Political parties and main political leaders have also been added to this list manually.</p>
    </sec>
    <sec id="sec-5">
      <title>Automatic spelling correction</title>
      <p>
        After receiving a list of possible spelling corrections from the previous module, the system selects the most common term, according to a list of words sorted by frequency generated by
        <xref ref-type="bibr" rid="ref4">(Vega et al., 2011)</xref>
        . Although more sophisticated solutions could be used here (such as considering surrounding words as context for candidate selection), our attempts at applying techniques taken from word sense disambiguation approaches did not lead to significant improvements.
      </p>
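      <p>The frequency-based selection step reduces to a lookup in the frequency list. In the sketch below the frequencies are placeholder values; in the real system they come from the SUBTLEX-ESP list.</p>

```python
# Placeholder frequencies; the real values come from the SUBTLEX-ESP list.
FREQ = {"que": 100000, "quedar": 5000, "queso": 800}

def pick_candidate(cands, freq=FREQ):
    """Choose the most frequent candidate; unknown words count as frequency 0."""
    return max(cands, key=lambda w: freq.get(w, 0)) if cands else None
```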
      <p>To consider surrounding words as context, we first calculated a table with the normalized pointwise mutual information (NPMI) of lemmatized words appearing in the same sentence.</p>
      <p>To calculate this table we used a dump of Spanish Wikipedia articles (http://dumps.wikimedia.org/eswiki/) and computed the NPMI values of the 10,000 most frequent lemmas. Second, we computed the sum of the NPMI values of a candidate with each word of the context. Finally, we selected the candidate with the highest NPMI sum.</p>
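      <p>NPMI is pointwise mutual information normalized into [-1, 1]: npmi(x, y) = log(p(x,y) / (p(x) p(y))) / (-log p(x,y)). The context scoring can be sketched as follows; the probabilities and the table entries below are toy values, not figures from our Wikipedia counts.</p>

```python
import math

def npmi(p_xy, p_x, p_y):
    """Normalized pointwise mutual information, in [-1, 1]."""
    if p_xy == 0:
        return -1.0
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)

def context_score(candidate, context, npmi_table):
    """Sum of NPMI values between a candidate and each context lemma."""
    return sum(npmi_table.get((candidate, w), 0.0) for w in context)

def best_candidate(cands, context, npmi_table):
    """Pick the candidate whose NPMI sum over the context is highest."""
    return max(cands, key=lambda c: context_score(c, context, npmi_table))
```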
      <sec id="sec-5-1">
        <title>Evaluation and results</title>
        <p>The performance reached by the system described above is not good according to the results published by the organization. After a deep analysis of the results, we have realized that we have to improve the following issues:
1. Diminutives and superlatives: We have followed an approach based on a Spanish lemma dictionary, the one offered by the LingPipe project (http://alias-i.com/lingpipe/). This dictionary does not include a great number of diminutives and superlatives, so the detection of this kind of word is one of the weaknesses of our system, and a subset of the errors is caused by them.
2. New words: Aspell is a dictionary-based spell checker. It is possible to add more word lists to Aspell with the aim of enlarging the tool's coverage. However, covering the whole of the Spanish language is not easy. Another problem is the new Spanish words admitted by the RAE (Royal Spanish Language Academy), because those are difficult to find in classic spell-checking tools. Although we have appended new Spanish words to the Aspell dictionaries, they have not been enough, and the system has failed on words such as "flipante" or "sobao".
3. Entities: The misclassification of entities has been another source of error in our system. Entities without any error must be classified as 1 (CORRECT, NO VARIATION), but our system did not always mark them as such. Moreover, the entity recognition power of our system is not strong, so some of the errors are related to this problem. A clear example is the entity Vallecas, which was not recognized by our system as an entity and was therefore replaced by the word Vacas.
4. Abbreviations: Although we have compiled a bag of abbreviations, after the publication of the results we have realized that they are not enough and we need to add more.</p>
        <p>We have also detected some errors in the organization's results. Laughing expressions like "jajaja" have been normalized in some tweets but not in others, so we do not know whether our correct normalization of some laughing expressions has been counted as correct. Another example of a word that we think should be normalized is "que", which some users write as "q". In some tweets, like "#Escorpio Puedes sentir q el camino es muy oscuro, sera mejor q busques q alguien te ayude a iluminarlo puede ser algun amigo.", the organizers considered "q" well written, and we do not agree. The organizers also consider the word "dias" without an accent to be well written, and it is not. For these reasons, we think the test corpus should be improved for future editions of the workshop.</p>
        <p>These are some of the reasons why our system has reached only 37.6% accuracy.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Conclusions and ongoing work</title>
        <p>In this paper, we have presented a normalization system for tweets that performs a series of conversions on the text using translation lexicons and a spell checker. We found that most ill-formed words are based on morphophonemic variation and proposed a cascade method to convert each tweet. Our system has reached only 37.6% accuracy.</p>
        <p>Our future work will focus on resolving some of the problems discovered, such as the treatment of diminutives and superlatives, entities and abbreviations. Furthermore, we want to adapt our normalization system to subsequent processes such as sentiment analysis and text classification.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Acknowledgements</title>
        <p>This work has been partially supported by a grant from the Fondo Europeo de Desarrollo Regional (FEDER), the TEXT-COOL 2.0 project (TIN2009-13391-C04-02) and the ATTOS project (TIN2012-38536-C03-0) from the Spanish Government. The AORESCU project (TIC-07684) from the regional government, the Junta de Andalucía, partially supports this manuscript. This paper is also partially funded by the European Commission under the Seventh Framework Programme for Research and Technological Development (FP7 2007-2013) through the FIRST project (FP7-287607). This publication reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name><surname>Han</surname>, <given-names>Bo</given-names></string-name>
          and
          <string-name><given-names>Timothy</given-names> <surname>Baldwin</surname></string-name>
          .
          <year>2011</year>
          .
          <article-title>Lexical normalisation of short text messages: makn sens a #twitter</article-title>
          .
          <source>In ACL</source>
          , pages
          <fpage>368</fpage>-<lpage>378</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name><surname>Kaufmann</surname>, <given-names>Max</given-names></string-name>
          and
          <string-name><given-names>Jugal</given-names> <surname>Kalita</surname></string-name>
          .
          <year>2010</year>
          .
          <article-title>Syntactic normalization of Twitter messages</article-title>
          .
          <source>In International Conference on Natural Language Processing</source>
          , Kharagpur, India.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name><surname>Moreno-Ortiz</surname>, <given-names>Antonio</given-names></string-name>
          and
          <string-name><given-names>Chantal</given-names> <surname>Pérez Hernández</surname></string-name>
          .
          <year>2013</year>
          .
          <article-title>Lexicon-based sentiment analysis of Twitter messages in Spanish</article-title>
          .
          <source>Procesamiento del Lenguaje Natural</source>
          ,
          <volume>50</volume>
          (
          <issue>0</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name><surname>Vega</surname>, <given-names>Fernando Cuetos</given-names></string-name>
          , María González Nosti, Analía Barbón Gutiérrez, and
          <string-name><given-names>Marc</given-names> <surname>Brysbaert</surname></string-name>
          .
          <year>2011</year>
          .
          <article-title>SUBTLEX-ESP: Spanish word frequencies based on film subtitles</article-title>
          .
          <source>Psicológica: Revista de metodología y psicología experimental</source>
          ,
          <volume>32</volume>
          (
          <issue>2</issue>
          ):
          <fpage>133</fpage>-<lpage>143</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>