Exploiting web-based collective knowledge for micropost normalisation

Uso del conocimiento colectivo recogido en recursos de la Web para la normalización de textos cortos publicados en Twitter

Óscar Muñoz-García
Havas Media Group, Madrid, Spain
oscar.munoz@havasmg.com

Silvia Vázquez, Nuria Bel
Universitat Pompeu Fabra, Barcelona, Spain
silvia.vazquez@upf.edu, nuria.bel@upf.edu

Resumen: La tarea de normalización de contenido publicado por el usuario es un paso fundamental previo al análisis de las publicaciones en los medios sociales, especialmente en Twitter. En este trabajo se presenta un método para la normalización morfológica de tweets mediante el uso de recursos publicados en la Web y desarrollados de manera colectiva, entre los que se encuentran la Wikipedia y un diccionario de SMS. Los resultados obtenidos demuestran que estos recursos son una fuente de conocimiento muy valiosa para la generación de los diccionarios utilizados en la tarea de normalización.
Palabras clave: medios sociales, normalización de contenidos, Twitter, tweet-norm

Abstract: The task of normalising user-generated content is a crucial step before analysing social media posts, particularly on Twitter. This paper presents a method for the morphological normalisation of tweets through the use of online, collectively developed resources, including Wikipedia and an SMS lexicon. The results obtained demonstrate that these resources are a valuable source of knowledge for generating the dictionaries used in the normalisation task.
Keywords: social media, micropost normalisation, Twitter, tweet-norm

1 Introduction and objectives

Microposts published on social media are characterised by informality, brevity, frequent grammatical errors and misspellings, and by the use of abbreviations, acronyms, and emoticons. These features add difficulties to text mining processes, which frequently make use of tools designed for dealing with texts that conform to the canons of standard grammar and spelling (Hovy et al., 2013).

The micropost normalisation task enhances the accuracy of NLP tools when applied to short fragments of text published in social media; e.g., the syntactic normalisation of tweets may improve the accuracy of existing part-of-speech taggers (Codina and Atserias, 2012).

The collective knowledge freely available on the Web, and particularly Wikipedia, has been used in different NLP tasks, such as text categorization (Gabrilovich and Markovitch, 2006), topic identification (Coursey, Mihalcea, and Moen, 2009), measuring the semantic similarity between texts (Gabrilovich and Markovitch, 2007), and word sense disambiguation (Mihalcea, 2007), among others.

This paper presents a technique for the morphological normalisation of microposts through the use of two open data sources, namely Wikipedia and the SMS dictionary of the Spanish Association of Internet Users (AUI, 2013).

The paper is structured as follows. Section 2 describes the architecture and the components of the system. Section 3 describes the linguistic resources that we have reused for constructing the normalisation tool. Section 4 presents the evaluation results. Finally, Section 5 presents the conclusions and future lines of work.

2 Architecture and components of the system

Figure 1 shows the process followed by the proposed micropost normaliser. The specific components involved in the overall process are described below.

[Figure 1: Normalisation process]

2.1 Tokeniser

This component receives the text to be normalised and breaks it into words, Twitter metalanguage elements (e.g., hash-tags, user IDs), emoticons, URLs, etc. The output (i.e., the list of tokens) is sent to the Token Classifier component.

2.2 Token Classifier

The input of this component is the list of tokens generated by the Tokeniser. It classifies each of them into one of the following categories (a classification sketch follows the list):

• Twitter metalanguage elements (i.e., hash-tags, user IDs, RTs and URLs). Such elements are detected by matching regular expressions against the token (e.g., if a token starts with the symbol "#", then it is a hash-tag). Each token classified in this category is sent to the Twitter Metalanguage Normaliser component.

• Words contained in a standard language dictionary, excluding proper nouns. Each token classified in this category is sent to the Normalised Forms Concatenator component.

• Out-Of-Vocabulary (OOV) words, i.e., words that are neither found in a standard dictionary nor Twitter metalanguage elements. Each token classified in this category is sent to the OOV Word Classifier component.
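For illustration, the following Python sketch shows how such a regular-expression-based classification could look. The patterns, function names, and standalone design are our own assumptions for exposition; the actual system performs this classification with Freeling's tokenisation rules and POS-tagging dictionary (see Section 3).

import re

# Illustrative patterns for Twitter metalanguage elements
# (assumed for this sketch, not the system's actual expressions).
METALANGUAGE_PATTERNS = {
    'hashtag': re.compile(r'^#\w+$'),
    'user_id': re.compile(r'^@\w+$'),
    'retweet': re.compile(r'^RT$'),
    'url': re.compile(r'^https?://\S+$'),
}

def classify_token(token, standard_dictionary):
    """Classify a token as metalanguage, in-vocabulary, or OOV."""
    for category, pattern in METALANGUAGE_PATTERNS.items():
        if pattern.match(token):
            # Routed to the Twitter Metalanguage Normaliser.
            return category
    if token.lower() in standard_dictionary:
        # Routed to the Normalised Forms Concatenator.
        return 'in_vocabulary'
    # Routed to the OOV Word Classifier.
    return 'oov'

For example, classify_token('#sepln2013', vocabulary) returns 'hashtag', while an unknown form such as 'loooool' falls through to 'oov'.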
2.3 OOV Word Classifier

This component receives every token previously classified as OOV by the Token Classifier and detects whether it is correct, wrong, or unknown. If the token is wrong, the component returns its correct form. The OOV Word Classifier executes the following process (a lookup sketch for step 1 follows the list):

1. Firstly, the token is looked up in a dictionary of correct OOV words. The search disregards both case and accents.

(a) If an exact match of the token is found in the dictionary (e.g., both forms are capitalised), then the token is classified as Correct and sent to the Normalised Forms Concatenator component with no variation.

(b) If the token is found with variations of case or accentuation, then it is classified as Variation and its correct form is sent to the Normalised Forms Concatenator component.

(c) If the token is not found in the dictionary, then the process continues in step 2.

2. The token is looked up in an SMS dictionary which contains tuples pairing each SMS term with its corresponding correct form. The search is case-insensitive and does not consider accent marks.

(a) If the token is found in the SMS dictionary, then it is classified as Variation and its correct form is retrieved and sent to the Normalised Forms Concatenator component.

(b) If the token is not found in the dictionary, then it is sent to the Spell Checker and Corrector component.
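The case- and accent-insensitive lookup of step 1 can be sketched as follows. This is a minimal illustration under assumed names (fold, lookup_oov, and an in-memory dictionary); the actual system queries an HBase store of Wikipedia titles (see Section 3).

import unicodedata

def fold(form):
    # Lower-case the form and strip diacritics (combining marks),
    # so that e.g. 'México' and 'mexico' map to the same key.
    decomposed = unicodedata.normalize('NFD', form.lower())
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

def lookup_oov(token, correct_oov_forms):
    """Return (classification, normalised form) for an OOV token.

    correct_oov_forms maps folded keys to canonical surface forms,
    e.g. fold('México') -> 'México'."""
    canonical = correct_oov_forms.get(fold(token))
    if canonical is None:
        return ('not_found', None)    # continue with the SMS dictionary
    if canonical == token:
        return ('correct', token)     # exact match: no variation
    return ('variation', canonical)   # case/accent variation corrected

With a dictionary entry mapping fold('México') to 'México', the token 'mexico' is classified as a variation and corrected to 'México'.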
2.4 Spell Checker and Corrector

This component checks the spelling of the received token and returns its correct form when possible. To do so, it executes the following process (a sketch of the repetition-reduction step follows the list):

1. Firstly, the token is matched against regular expressions to find whether it contains characters (or sequences of characters) repeated more than twice (e.g., "loooooollll" and "jajaja").

(a) If the token contains repeated characters (or sequences of characters), the repetitions are removed (e.g., yielding "lol" and "ja"), and the resulting form is sent back to the OOV Word Classifier, since the new form may be included in the correct words set.

(b) If the token does not contain repeated characters (or sequences of characters), then the process continues in step 2.

2. The token is sent to an existing spell checking and correction implementation reused by this component.

(a) If the spelling is correct, the token is classified as Correct and sent to the Normalised Forms Concatenator component without a variation.

(b) If the spelling is not correct, the token is classified as Variation, and the first correct form returned by the spelling corrector is sent to the Normalised Forms Concatenator.

(c) If the spell checker is not able to propose a correct form, the token is classified as Unknown and sent to the Normalised Forms Concatenator without a variation.
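The repetition-reduction step can be approximated with two regular expressions, as in the sketch below; the exact patterns used by the component are not specified in the paper, so these are assumptions.

import re

# A single character repeated more than twice ('looool'), and a short
# sequence repeated more than twice ('jajaja'); both patterns assumed.
REPEATED_CHAR = re.compile(r'(.)\1{2,}')
REPEATED_SEQ = re.compile(r'(..+?)\1{2,}')

def collapse_repetitions(token):
    """Reduce runs such as 'loooooollll' -> 'lol' and 'jajaja' -> 'ja'."""
    collapsed = REPEATED_CHAR.sub(r'\1', token)
    collapsed = REPEATED_SEQ.sub(r'\1', collapsed)
    return collapsed

The collapsed form is then re-submitted to the OOV Word Classifier, as described in step 1(a) above.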
2.5 Twitter Metalanguage Normaliser

This component performs a syntactic normalisation of Twitter metalanguage elements. Specifically, it executes a set of rules previously proposed by Kaufmann and Kalita (2010): (1) Remove the sequence of characters "RT" followed by a mention of a Twitter user (marked by the symbol "@") and, optionally, by a colon punctuation mark; (2) Remove user IDs that are not preceded by a coordinating or subordinating conjunction, a preposition, or a verb; (3) Remove the word "via" followed by a user mention at the end of the tweet; (4) Remove all the hash-tags found at the end of the tweet; (5) Remove the "#" symbol from the hash-tags that are maintained; (6) Remove all the hyper-links contained within the tweet; (7) Remove ellipsis points that are at the end of the tweet, followed by a hyper-link; (8) Replace underscores with blank spaces; (9) Divide camel-cased words into multiple words (e.g., "BarackObama" is converted to "Barack Obama"). A sketch approximating some of these rules follows.
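The following sketch approximates a subset of these rules (1, 5, 6, 8 and 9) with regular expressions of our own; it is an illustration, not the implementation referenced above.

import re

def normalise_metalanguage(tweet):
    """Approximate rules 1, 5, 6, 8 and 9 of the normaliser."""
    # (1) Remove 'RT' followed by a user mention and an optional colon.
    tweet = re.sub(r'\bRT\s+@\w+:?\s*', '', tweet)
    # (6) Remove hyper-links.
    tweet = re.sub(r'https?://\S+', '', tweet)
    # (5) Keep the hash-tag word but drop the '#' symbol.
    tweet = tweet.replace('#', '')
    # (8) Replace underscores with blank spaces.
    tweet = tweet.replace('_', ' ')
    # (9) Split camel-cased words, e.g. 'BarackObama' -> 'Barack Obama'.
    tweet = re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', tweet)
    return tweet.strip()

Applied to 'RT @user: Check #BarackObama http://t.co/xyz', the sketch yields 'Check Barack Obama'.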
2.6 Normalised Forms Concatenator

This component receives the normalised form of each token and amends the micropost accordingly.

3 Resources employed

The system described makes use of the following resources.

We use Freeling (Padró and Stanilovsky, 2012) for micropost tokenisation. Its specific tokenisation rules and its user map module were adapted for dealing with smileys and with particular elements typically used on Twitter, such as hash-tags, RTs, and user IDs.

In addition, we use the POS-tagging module of Freeling within the Token Classifier component. As we deactivate Freeling's probability-assignment and unknown-word-guesser modules, all the words which are not contained in Freeling's POS-tagging dictionary are left untagged and considered OOV words. Our standard vocabulary is, thus, the Freeling dictionary itself.

We have populated the correct OOV words dictionary (used by the OOV Word Classifier component) by making use of the list of article titles from Wikipedia (Wikipedia, 2013). To speed up the process of querying the 2,447,932 Wikipedia article titles, we uploaded them to an HBase store (Apache, 2013).

In order to increase the coverage of the correct OOV words dictionary, we incorporated into it a list of first names from the Spanish National Institute of Statistics (INE, 2013). This list contains 18,679 male names and 19,817 female names.

Additionally, we have populated the SMS dictionary, with its corresponding correct forms, from the SMS dictionary of the Spanish Association of Internet Users (AUI, 2013), which contains 53,281 entries for Spanish.

Finally, the Spell Checker and Corrector component makes use of Jazzy (Jazzy, 2013), an open-source Java library. For the creation of the spell checker dictionary used by Jazzy, we made use of the Spanish and Mexican dictionaries available on JazzyDicts (JazzyDicts, 2013). The resulting dictionary contains 683,436 terms.

4 Settings and evaluation

The evaluation of the technique previously described was done by using two development corpora and a test corpus provided by the organisation of the Tweet Normalisation Workshop at SEPLN 2013. Specifically, we evaluated the performance of the OOV identification, classification and correction tasks. The accuracy of the normalisation task for the Twitter metalanguage elements was not evaluated, since it was out of the scope of the workshop challenge.

Table 1 shows the results of the evaluation, including the size of each evaluation corpus (column 2), the precision obtained by using either Wikipedia or the SMS dictionary separately (columns 3 and 4, respectively), and the overall precision achieved by exploiting both dictionaries (column 5).

Corpus     Size   Wikipedia   SMS     Both
Devel. 1   100    0.336       0.631   0.688
Devel. 2   500    0.317       0.634   0.660
Test       600    0.361       0.516   0.548

Table 1: Precision of the normalisation tool

As Table 1 reflects, both dictionaries help to improve the final precision score, with the SMS dictionary being the one that contributes the most. This can be explained by the coverage of OOV words by each of the dictionaries, shown in Table 2: the SMS dictionary covers a bigger percentage of the OOV words than the dictionary populated with Wikipedia titles.

Corpus          Wikipedia   SMS
Development 1   20.661%     47.107%
Development 2   20.436%     51.188%
Test            27.497%     28.115%

Table 2: Coverage of OOV words by dictionary

5 Conclusions and future work

We have presented a method for tweet normalisation that relies on existing, collectively developed web resources, finding that such resources, useful for many NLP tasks, are also valid for the task of micropost normalisation.

With respect to future lines of work, we plan to adapt the normaliser to new languages by incorporating the corresponding dictionaries, and to improve the existing lexicons by using additional available resources, such as the anchor texts from intra-wiki links.

Additionally, we plan to improve the normalisation of multiword expressions, in which different words should be transformed into just one (e.g., "a cerca de" should be transformed into "acerca de"), as well as cases where joined words should be split (e.g., "realmadrid"), by using existing word-breaking techniques, such as the one described in (Wang, Thrasher, and Hsu, 2011).

Finally, we will study how the normalisation process affects different opinion mining tasks, including sentiment analysis and topic identification.

Acknowledgements

This research is partially supported by the Spanish Centre for the Development of Industrial Technology under the CENIT program, project CEN-20101037, "Social Media" (http://www.cenitsocialmedia.es). We are very grateful to AUI (Asociación de Usuarios de Internet) for making the textese dictionary used in this work available to us.

References

Apache. 2013. HBase. http://hbase.apache.org. [Online; accessed 25-July-2013].

AUI. 2013. Asociación de Usuarios de Internet. http://aui.es. [Online; accessed 24-July-2013].

Codina, Joan and Jordi Atserias. 2012. What is the text of a tweet? In Proceedings of @NLP can u tag #user generated content?! via lrec-conf.org, Istanbul, Turkey, May. ELRA.

Coursey, K., R. Mihalcea, and W. Moen. 2009. Using encyclopedic knowledge for automatic topic identification. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, pages 210–218. Association for Computational Linguistics.

Gabrilovich, E. and S. Markovitch. 2006. Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge. In Proceedings of the 21st National Conference on Artificial Intelligence, volume 2, page 1301. AAAI Press / MIT Press.

Gabrilovich, E. and S. Markovitch. 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 6–12.

Hovy, Eduard, Vita Markman, Craig Martell, and David Uthus. 2013. Analyzing microtext. In Papers from the 2013 AAAI Spring Symposium. Association for the Advancement of Artificial Intelligence, March.

INE. 2013. INEbase: Operaciones estadísticas: clasificación por temas. http://www.ine.es/inebmenu/indice.htm. [Online; accessed 8-April-2013].

Jazzy. 2013. Jazzy. http://jazzy.sourceforge.net. [Online; accessed 25-July-2013].

JazzyDicts. 2013. JazzyDicts. http://sourceforge.net/projects/jazzydicts. [Online; accessed 25-July-2013].

Kaufmann, Max and Jugal Kalita. 2010. Syntactic normalization of Twitter messages. In Proceedings of the International Conference on Natural Language Processing (ICON-2010).

Mihalcea, R. 2007. Using Wikipedia for automatic word sense disambiguation. In Proceedings of NAACL HLT 2007.

Padró, Lluís and Evgeny Stanilovsky. 2012. FreeLing 3.0: Towards wider multilinguality. In Proceedings of the Language Resources and Evaluation Conference (LREC 2012), Istanbul, Turkey, May. ELRA.

Wang, Kuansan, Christopher Thrasher, and Paul Bo-June Hsu. 2011. Web scale NLP: A case study on URL word breaking. In Proceedings of the 20th International Conference on World Wide Web, pages 357–366. ACM.

Wikipedia. 2013. Wikipedia:Database download. http://en.wikipedia.org/wiki/Wikipedia:Database_download. [Online; accessed 23-May-2013].