       Resource-based lexical approach to TWEET-NORM task


    Juan M. Cotelo Moya                  Fermín L. Cruz                 Jose A. Troyano
      Universidad de Sevilla           Universidad de Sevilla         Universidad de Sevilla
    Avda. Reina Mercedes s/n.       Avda. Reina Mercedes s/n.      Avda. Reina Mercedes s/n.
          41012 Sevilla                     41012 Sevilla                  41012 Sevilla
          jcotelo@us.es                      fcruz@us.es                  troyano@us.es



       Abstract: This paper proposes a resource-based lexical approach to the
       TWEET-NORM task. The proposed system exposes a simple but extensible modular
       architecture in which each analysis module independently proposes correction
       candidates for each OOV word. Each analysis module addresses a specific problem
       and works in a very different way from the others. The resources are the main
       component of the OOV detection system, and they also support the validation and
       filtering of candidates.
       Keywords: Twitter, resources, modular architecture, candidates





1     Introduction and objectives

One of the most important challenges we face today is how to process and analyze the large amount of information on the Internet, especially on social networking sites like Twitter, where millions of people express ideas and opinions daily on any topic of interest. These texts, called tweets, are characterized by their short length (140 characters), very small compared with the size of traditional genres.
    Consequently, users of these networks have developed a new form of expression that includes SMS-style abbreviations, lexical variants, letter repetitions, emoticons, etc. As a result, current NLP tools may have problems processing and understanding these short and noisy texts unless they are normalized first.
    The TWEET-NORM lexical normalization task proposes the automatic "cleansing" of a set of tweets by identifying and normalizing abbreviations, words with repeated letters and, in general, any out-of-vocabulary (OOV) word, regardless of syntactic or stylistic variants.
    Before performing any normalization, we characterized the existing phenomena, since it is easier to address the underlying causes of OOVs. To perform this characterization we used a previously collected dataset composed of 3.1 million tweets related to the 2012 UEFA European Football Championship.
    Table 1 shows that characterization and provides examples for each phenomenon. Figure 1 also shows the ratio of each phenomenon in a clearer way. We observed that most errors fit into 5 major categories and that most of them are associated with the fast and informal writing style on Twitter, usually done from a mobile device.
    The categories proposed for this task are coarser than our characterization: Orthographic errors, Texting language and Character repetitions fit into the Variation category; Free inflections and correct words fit into the Correct category; Other language and Ascii art would fit into the NS/NC category.
          Phenomenon             Ratio   Examples
          Orthographic errors    28%     sacalo → sácalo, trapirar → transpirar, . . .
          Texting language       22%     x2 → por dos, q → que, aro → claro, . . .
          Character repetition   15%     siiiiiiiii → si, quiiiiieeeeroooo → quiero, . . .
          Ascii art              14%     « ¤ oO._.Oo . . .
          Free inflections        7%     besote, gatino, bonico, . . .
          Other errors            7%     htt, asdafawecas, engoriles, . . .
          Other language          4%     ow, ftw, great, lol, . . .
          Multiple phenomena      3%     diass → días, artooo → rato, . . .

       Table 1: Characterization of error phenomena commonly found in Twitter media




     Figure 1: Ratio of characterized error phenomena commonly found in Twitter media

   The system proposed in this paper is based only on lexical approaches, and it is mainly composed of three types of components:

  • Resources: Lexicons and similar language resources, including resources containing specific knowledge of the media used.

  • Rules: Rules for handling common phenomena found in this type of media, such as excessive character repetition, acronyms or homophonic errors.

  • Lexical distance analysis: Traditional lexical distance analysis for handling common orthographic errors.

   In essence, our system works straightforwardly: it examines each word at the lexical level, determines whether it is an OOV using the available knowledge, generates possible correction candidates and selects the best one.

2    Architecture and components of the system

The architecture of the system we propose for this task is quite straightforward. It is composed of several main components:

   • Preprocessing module
   • OOV/IV detection module
   • OOV analyzer modules
   • Candidate generator module
   • Candidate scoring and selection module

   Figure 2 shows this whole process and how the components are interconnected in a single diagram.
   The preprocessing module performs the typical initial processing step of lexical analysis, generating a stream of tokens from tweets while taking into account hashtags and usernames, numerals and dates, and preserving emoticons during the splitting.
              Figure 2: Architecture and processing steps of the proposed system
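   To make this preprocessing step concrete, the following is a minimal sketch of a Twitter-aware tokenizer in Python. The token patterns and their ordering are our own illustrative assumptions; the paper does not publish the actual ones.

    import re

    # Ordered token patterns; earlier alternatives take precedence.
    # These patterns are illustrative, not the system's actual ones.
    TOKEN_PATTERNS = re.compile(r"""
        (?P<user>@\w+)                                   # usernames, e.g. @pepe
      | (?P<hashtag>\#\w+)                               # hashtags, e.g. #Euro2012
      | (?P<url>https?://\S+)                            # URLs, kept whole
      | (?P<emoticon>[:;=8][-o*']?[)\](\[dDpP/}{@|\\])   # common emoticons
      | (?P<number>\d+(?:[.,:/]\d+)*)                    # numerals, dates, times
      | (?P<word>\w+)                                    # ordinary word forms
    """, re.VERBOSE | re.UNICODE)

    def tokenize(tweet):
        """Yield (token, kind) pairs, preserving emoticons and Twitter entities."""
        for m in TOKEN_PATTERNS.finditer(tweet):
            yield m.group(), m.lastgroup

    print(list(tokenize("@ana q riiico el partido :) #Euro2012")))
    # [('@ana', 'user'), ('q', 'word'), ('riiico', 'word'), ('el', 'word'),
    #  ('partido', 'word'), (':)', 'emoticon'), ('#Euro2012', 'hashtag')]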

    The detection module determines whether a token is an OOV or not. It performs this detection using the resources, checking whether the token belongs to any of them. We used a set of lexicons, each one providing known forms used on Twitter, Spanish word forms, well-known emoticons or even colloquial inflections.
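   A minimal sketch of this lexicon-membership test follows, assuming one raw-text file per lexicon with one entry per line (as described in Section 3); the file names are hypothetical, since the paper only names the lexicon types.

    from pathlib import Path

    def load_lexicon(path):
        """Load a raw-text lexicon with one entry per line (see Section 3)."""
        with Path(path).open(encoding="utf-8") as handle:
            return {line.strip().lower() for line in handle if line.strip()}

    # Hypothetical file names for the lexicons described in Table 3.
    LEXICONS = [load_lexicon(name)
                for name in ("spanish.txt", "genre.txt", "emoticons.txt")]

    def is_oov(token):
        """A token is in-vocabulary (IV) if it belongs to any lexicon."""
        return not any(token.lower() in lexicon for lexicon in LEXICONS)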
    Given an OOV token, an analyzer module performs some error-guessing process and tries to estimate corrections from it. The specific process varies for each analyzer, and every analyzer provides some kind of basic score expressing a degree of confidence in each proposed correction. The analyzers used for this task were the following:

  • Transformation rules: This analyzer holds a collection of hand-crafted rules, each one representing some kind of well-defined error, and transforms the token into a correction candidate. It is possible to generate more than one candidate when multiple rules match, but the number is usually limited to a few. These rules are intended to address phenomena that the edit distance module does not handle correctly, such as Character repetition or Texting language. Table 2 shows some example rules, using Python's regular expressions; a sketch of their application follows the table.

  • Edit distance: This module works very similarly to the distance-based suggestion scheme commonly found in spell checkers. The main difference is that it takes multiple lexicons into account instead of a monolithic one.

  • Language: This module tries to identify which language the OOV token actually belongs to.

    Notice that the language analyzer does not actually perform any correction: if the token comes from another language, it only has to be marked, not corrected. This module uses a Python 3 implementation of a trigram language-guessing module (Phi-Long, 2012) as its backend.
    The candidate generation module asks each analyzer for candidates, performing a validation and filtering step that removes candidates incorrectly generated by the transformation rules, according to validation rules and the language resources used. It also removes duplicates.
 Matching                   Processing             Example                             Phenomenon
 [ck]n$                     con                    kn, cn → con                        Texting language
 x([aeiouáéíóú])            ch\1                   xaval, coxe → chaval, coche         Texting language
 ((\w)(\w))\1+(\2|\3)?      \g<1>                  sisisisisisi, nonononono → si, no   Character repetition
 t[qk]m+$                   te quiero mucho        tkm, tqm → te quiero mucho          Texting language

                 Table 2: Extract of the transformation rules used in our system
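   A minimal sketch of how such rules can be applied with Python's re module, using the patterns from Table 2; the handling of per-rule confidence values is omitted, because the paper does not detail it.

    import re

    # (matching, processing) pairs taken directly from Table 2.
    RULES = [
        (r"[ck]n$", "con"),
        (r"x([aeiouáéíóú])", r"ch\1"),
        (r"((\w)(\w))\1+(\2|\3)?", r"\g<1>"),
        (r"t[qk]m+$", "te quiero mucho"),
    ]

    def rule_candidates(token):
        """Yield one correction candidate per rule whose pattern matches."""
        for pattern, replacement in RULES:
            if re.search(pattern, token):
                yield re.sub(pattern, replacement, token)

    print(list(rule_candidates("tqm")))         # ['te quiero mucho']
    print(list(rule_candidates("nonononono")))  # ['no']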

    The candidate selector module applies a normalizing scoring function to the confidence values provided for each candidate, sorts the candidates and selects the best one.
    In summary, the system generates a token stream from the tweet using the preprocessing module. For each token, the detection module determines whether it is an OOV or not. If the token is an IV, no further processing is done, because it is a valid form. If the token is an OOV, the candidate generator module creates a tentative list of corrections using the analyzers previously described. As a final step, the candidate selector module selects the best candidate for the correction.
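   Under the assumptions of the previous sketches, the overall control flow could look as follows; the uniform rule confidence of 1.0 is a placeholder of our own, not the system's actual scoring.

    def normalize(tweet):
        """End-to-end sketch of the pipeline described above."""
        output = []
        for token, kind in tokenize(tweet):
            if kind != "word" or not is_oov(token):
                output.append(token)        # IV tokens and non-words pass through
                continue
            # Gather scored candidates from the analyzers, e.g. the rule module.
            candidates = [(c, 1.0) for c in rule_candidates(token)]
            # ... the edit distance and language analyzers would add more ...
            if candidates:
                best, _ = max(candidates, key=lambda pair: pair[1])
                output.append(best)
            else:
                output.append(token)        # no correction found; leave as-is
        return output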
3    Resources employed

We have used several lexicons for the detection and analysis stages of our system. All of them are in raw text format, with one entry per line. Table 3 shows statistics about the lexicons used.

     Lexicon      Entries    Description
     Spanish      1250796    Common forms from Spanish. Based on LibreOffice dictionaries.
     Genre             40    Common forms related to Twitter. Handcrafted.
     Emoticons        320    Commonly used emoticons. Handcrafted.

          Table 3: Lexicons used for our proposed system

   For the transformation rule module, we crafted a ruleset of 71 rules. The syntax used in the ruleset file vaguely resembles the CSV format: there is one rule per line, and each line holds information about the matching and the transformation process.
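   For illustration only, a ruleset file in that spirit might look as follows; the semicolon separator and the field layout are our assumptions, as the paper does not specify the exact syntax.

    import csv, io

    # Hypothetical ruleset excerpt: one rule per line, CSV-like fields.
    RULESET = io.StringIO("""\
    [ck]n$;con;Texting language
    t[qk]m+$;te quiero mucho;Texting language
    """.replace("    ", ""))

    def load_rules(handle):
        """Parse (matching, processing, phenomenon) triples, one rule per line."""
        return [tuple(row) for row in csv.reader(handle, delimiter=";") if row]

    print(load_rules(RULESET))
    # [('[ck]n$', 'con', 'Texting language'),
    #  ('t[qk]m+$', 'te quiero mucho', 'Texting language')]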
   The language detector module uses a trigram character language model implementation as its backend, and it falls back to dictionaries in case of insufficient data for language estimation.
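   A toy sketch of trigram-based language guessing in the spirit of such modules; the training snippets and the out-of-place distance measure are illustrative, not the cited implementation.

    from collections import Counter

    def trigram_profile(text, size=300):
        """Rank the most frequent character trigrams of a training text."""
        counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
        return {gram: rank
                for rank, (gram, _) in enumerate(counts.most_common(size))}

    # Toy training snippets; a real guesser is trained on large corpora.
    PROFILES = {
        "es": trigram_profile("que de la el los una para por con esta cuando donde"),
        "en": trigram_profile("the of and to in that it is was for with as when"),
    }

    def guess_language(token):
        """Pick the profile with the smallest out-of-place trigram distance."""
        grams = [token[i:i + 3] for i in range(len(token) - 2)]
        def distance(profile):
            # Unseen trigrams get the maximum penalty.
            return sum(profile.get(g, len(profile)) for g in grams)
        return min(PROFILES, key=lambda lang: distance(PROFILES[lang]))

    print(guess_language("cuando"), guess_language("when"))   # es en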
4    Settings and evaluation

Table 4 shows the performance of the system on the provided corpus with different analyzer modules activated. We observe that the accuracy of the system improves significantly as more modules are activated.

    Modules             Accuracy     Align Error
    Distance module     0.2036       0.2526
    Rule module         0.3307       0.3231
    Distance + Rule     0.3905       0.2281
    Full system         0.5053       0.1684

       Table 4: System performance with different modules activated

   However, the performance of our implementation is hindered by the large differences between the preprocessing used to build the corpus for this task and our own detection and preprocessing system. Our preprocessing module leads to different sets of OOVs being considered, often producing outputs that differ in length from the ones provided with the task. These discrepancies result in a high rate of what the provided test script counts as align errors.
   It is worth mentioning that we used a threshold of k ≤ 2 for the edit distance module, because most correct candidates lie within that threshold. Although selecting a higher k includes more candidates, most of the newly included candidates are not valid solutions; they usually receive a low confidence score and are not selected.
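   A minimal sketch of such a threshold-based lookup over multiple lexicons with k ≤ 2, using a standard dynamic-programming edit distance; the brute-force scan over the lexicons is for clarity only, and the confidence formula is a placeholder of our own.

    def levenshtein(a, b):
        """Standard dynamic-programming edit distance between two strings."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                  # deletion
                                curr[j - 1] + 1,              # insertion
                                prev[j - 1] + (ca != cb)))    # substitution
            prev = curr
        return prev[-1]

    def distance_candidates(token, lexicons, k=2):
        """Candidates within edit distance k, drawn from every lexicon."""
        for lexicon in lexicons:
            for entry in lexicon:
                d = levenshtein(token, entry)
                if d <= k:
                    yield entry, 1.0 / (1.0 + d)   # illustrative confidence

    toy = [{"casa", "cosa", "masa", "perro"}]
    print(sorted(distance_candidates("kasa", toy)))   # casa, cosa and masa qualify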
5    Conclusions and future work

We provide a resource-aided lexical solution for the proposed task, using an extensible architecture made of independent modules. Our system still has much room for improvement, such as adding an n-gram segmenter, including context during the analysis, or using automatic methods to improve the candidate scoring and selection.
References

Han, Bo and Timothy Baldwin. 2011. Lexical normalisation of short text messages: makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 368–378, Stroudsburg, PA, USA. Association for Computational Linguistics.

Han, Bo, Paul Cook, and Timothy Baldwin. 2013. Lexical normalization for social media text. ACM Trans. Intell. Syst. Technol., 4(1):5:1–5:27, February.

Pennell, Deana and Yang Liu. 2011. A character-level machine translation approach for normalization of SMS abbreviations. In IJCNLP, pages 974–982.

Phi-Long. 2012. Python 3.3+ implementation of the language guessing module made by Jacob R. Rideout for KDE.

Xue, Zhenzhen, Dawei Yin, and Brian D. Davison. 2011. Normalizing microtext. In Analyzing Microtext.