=Paper=
{{Paper
|id=Vol-1228/paper4
|storemode=property
|title=A Language Identification Method Applied to Twitter Data
|pdfUrl=https://ceur-ws.org/Vol-1228/tweetlid-4-singh.pdf
|volume=Vol-1228
|dblpUrl=https://dblp.org/rec/conf/sepln/SinghG14
}}
==A Language Identification Method Applied to Twitter Data==
* Anil Kumar Singh, IIT (BHU), Varanasi, India (nlprnd@gmail.com)
* Pratya Goyal, NIT, Surat, India (goyalpratya@gmail.com)

Resumen: This paper presents the results of several experiments that use a simple, heuristic-guided algorithm for the purpose of identifying the language of Twitter data. These experiments are part of the shared task centred on this problem. The algorithm is based on a distance metric computed from n-grams and had previously been evaluated successfully on normal text. The distance metric used in this case is symmetric cross entropy.

Palabras clave: language identification, symmetric cross entropy, microblogging

Abstract: This paper presents the results of some experiments on using a simple algorithm, aided by a few heuristics, for the purposes of language identification on Twitter data. These experiments were a part of a shared task focused on this problem. The core algorithm is an n-gram based distance metric algorithm. This algorithm has previously been shown to work very well on normal text. The distance metric used is symmetric cross entropy.

Keywords: Language identification, symmetric cross entropy, microblogging

===1 Introduction and Objectives===

Language identification was perhaps the first natural language processing task for which a statistical method was used successfully (Beesley, 1988). Over the years, many algorithms have become available that work very well with normal text (Dunning, 1994; Combrinck and Botha, 1994; Jiang and Conrath, 1997; Teahan and Harper, 2001; Martins and Silva, 2005). However, with the recent global spread of social media, the need for language identification algorithms that work well with the data available on such media has been felt increasingly. There has been a special focus on microblogging data, for at least two main reasons. The first is that microblogs have too little data for traditional algorithms to work well directly, and the second is that microblogs use a kind of abbreviated language in which, for example, many words are not fully spelled out. Other facts about such data, such as the multilinguality of many microbloggers, only make the problem harder.

Our goal was to take one of the algorithms that has been shown to work very well for normal text, add some heuristics to it, and see how far it goes in performing language identification for microblog data.

===2 Architecture and Components of the System===

The system we have used is quite simple. There are only two components in the system. At its core there is a language identifier for normal text. The only other module is a preprocessing module, which implements some heuristics. Two main heuristics are implemented. The first one is based on the knowledge that word boundaries are an important source of linguistic information that can help a language processing system perform better. We simply wrap every word (more accurately, every token) inside two special symbols, one for word beginning and the other for word ending. The effect of this heuristic is that it not only provides additional information, it also 'expands' the short microblogging text a little, which is statistically important.

The other heuristic relates to cleaning up the data. Microblogging text, particularly Twitter text, contains extra-textual tokens such as hashtags, mentions, retweet symbols and URLs. This heuristic removes such extra-textual tokens from the data before training as well as before language identification. A minimal sketch of this preprocessing step is given below.
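The sketch below illustrates the two heuristics under stated assumptions: the boundary markers "^" and "$" and the regular expression for extra-textual tokens are hypothetical stand-ins, since the paper does not specify the exact symbols or cleanup rules used.

<pre>
import re

# Hypothetical boundary markers; the paper does not name the exact symbols it uses.
BEGIN, END = "^", "$"

# Extra-textual tokens: mentions, hashtags, retweet markers and URLs (assumed patterns).
NOISE = re.compile(r"@\w+|#\w+|\bRT\b|https?://\S+")

def preprocess(tweet):
    """Remove extra-textual tokens, then wrap every remaining token in boundary markers."""
    cleaned = NOISE.sub(" ", tweet)
    return " ".join(BEGIN + tok + END for tok in cleaned.split())

# preprocess("RT @user mira esto http://t.co/x #genial")  ->  "^mira$ ^esto$"
</pre>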
The intuitive basis of our algorithm is similar to the unique n-gram based approach, which was first used for human identification (Ingle, 1976) and later for automatic identification (Newman, 1987). The insight behind these methods is as old as the time of Ibn ad-Duraihim, who lived in the 14th century.

It is worth noting that when n-grams are used for language identification, normally no distinction is made between the orders of the n-grams; that is, unigrams, bigrams, trigrams etc. are all given the same status. Further, when using vector space based distance measures, n-grams of all orders are merged together and a single vector is formed. It is this vector over which the distance measures are applied.

===3 The Core Algorithm===

The core algorithm that we have used (Singh, 2006) is an adaptation of the one used by Cavnar and Trenkle (1994). The main difference is that instead of using the sum of the differences of ranks, we use symmetric cross entropy as the similarity or distance measure.

The algorithm can be described as follows:

# Train the system by preparing character based and (optionally) word based n-grams from the training data.
# Combine the n-grams of all orders (O<sub>c</sub> for characters and O<sub>w</sub> for words).
# Sort them by rank.
# Prune by selecting only the top N<sub>c</sub> character n-grams and N<sub>w</sub> word n-grams for each language-encoding.
# For the given test data or string, calculate the character n-gram based score sim<sub>c</sub> with every model for which the system has been trained.
# Select the t most likely language-encoding pairs (training models) based on this score.
# For each of the t best training models, calculate the score with the test model as score = sim<sub>c</sub> + a · sim<sub>w</sub> (1), where c and w represent character based and word based n-grams, respectively, and a is the weight given to the word based n-grams. In our experiments, this weight was 1 when word n-grams were considered and 0 when they were not.
# Select the most likely language-encoding pair out of the t ambiguous pairs, based on the combined score obtained from the word and character based models.

The parameters in the above algorithm are:

# Character based n-gram models P<sub>c</sub> and Q<sub>c</sub>
# Word based n-gram models P<sub>w</sub> and Q<sub>w</sub>
# Orders O<sub>c</sub> and O<sub>w</sub> of the n-gram models
# Numbers N<sub>c</sub> and N<sub>w</sub> of retained top n-grams (the pruning ranks for character based and word based n-grams, respectively)
# Number t of character based models to be disambiguated by the word based models
# Weight a of the word based models

In our case, for the Twitter data, we have not used word based n-grams, as they do not seem to help: adding them does not improve the results. Perhaps the reason is that there is too little data in terms of word n-grams. So the parameters for our case are:

O<sub>c</sub> = 7, O<sub>w</sub> = 0, N<sub>c</sub> = 1000, N<sub>w</sub> = 0, a = 0

We used an existing implementation of this algorithm, which is available as part of a library called Sanchay (version 0.3.0; http://sanchay.co.in). The parameters were selected based on repeated experiments; the ones selected are those which gave the best results. The length of the n-grams was selected as 7, and we did find that increasing the n-gram length improves the results.

In this paper we have used this technique for monolingual identification, in accordance with the task definition, but it can also be used for multilingual identification (Singh and Gorla, 2007), although the accuracies are not likely to be high when it is used directly.
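As an illustration of the method described in this section, the following is a minimal sketch under the parameter values above (O<sub>c</sub> = 7, N<sub>c</sub> = 1000, a = 0, so word n-grams are ignored). The particular normalisation and the exact form of the symmetric cross entropy are assumptions of this sketch, not a description of the Sanchay implementation.

<pre>
import math
from collections import Counter

O_C, N_C = 7, 1000   # Oc and Nc as above; word n-grams are not used (a = 0)

def ngram_profile(text, order=O_C, top_n=N_C):
    """Character n-grams of all orders 1..order, pruned to the top_n most frequent
    and normalised into a probability distribution."""
    counts = Counter(text[i:i + n]
                     for n in range(1, order + 1)
                     for i in range(len(text) - n + 1))
    top = dict(counts.most_common(top_n))
    total = sum(top.values())
    return {g: c / total for g, c in top.items()}

def symmetric_cross_entropy(p, q):
    """Similarity over the n-grams shared by two profiles (one common formulation);
    values are negative, and values closer to zero mean more similar."""
    shared = p.keys() & q.keys()
    if not shared:
        return float("-inf")
    return sum(p[g] * math.log(q[g]) + q[g] * math.log(p[g]) for g in shared)

def identify(text, models):
    """Return the language whose trained profile is most similar to the test profile."""
    test = ngram_profile(text)
    return max(models, key=lambda lang: symmetric_cross_entropy(test, models[lang]))

# Training: models = {lang: ngram_profile(corpus[lang]) for lang in corpus}
</pre>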
===4 Resources Employed===

For the experiments reported here we have used only the training data provided; we have not used any other resources. We have also, so far, not used any additional tools such as a named entity recognizer. We have implemented some heuristics, as described in the previous section.

===5 Setup and Evaluation===

We evaluated with two different setups. Before the test data for the shared task was released, we had randomly divided the training data into two sets by the usual 80-20 split: one for training and one for evaluation. We also used two evaluation methods. One was simple precision based on microaverages, while the other was the evaluation script provided by the organizers, which was based on macroaverages. The two views are sketched below.
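The sketch below contrasts the two evaluation views in simplified form. It is only an approximation for orientation: the organizers' script may differ in details, in particular in how multi-label classes such as 'en+pt' are handled.

<pre>
def micro_precision(gold, pred):
    """Fraction of tweets whose predicted label matches the gold label exactly
    (each multi-label combination such as 'en+pt' counts as a single class)."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_scores(gold, pred):
    """Per-label precision, recall and F-measure, averaged over all labels."""
    labels = sorted(set(gold) | set(pred))
    prec, rec, f1 = [], [], []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        p_ = tp / (tp + fp) if tp + fp else 0.0
        r_ = tp / (tp + fn) if tp + fn else 0.0
        f_ = 2 * p_ * r_ / (p_ + r_) if p_ + r_ else 0.0
        prec.append(p_); rec.append(r_); f1.append(f_)
    n = len(labels)
    return sum(prec) / n, sum(rec) / n, sum(f1) / n
</pre>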
Under the 80-20 setup, on repeated runs, the algorithm described earlier gave, out of the box, a microaverage based precision of a little more than 70%. On adding the word boundary heuristic to the data, the precision increased to around 78%. On further adding the cleaning heuristic, the precision reached 80.80%. The corresponding macroaverage based F-score was 68.26%.

However, once the test data for the shared task was released and we used it with our algorithm, along with the heuristics, the macroaverage based F-score was 61.5%. This increased a little after we slightly improved the implementation of the preprocessing module. The corresponding microaverage based precision was 77.47%. Looking at the results for each language, we find that the performance was best for Spanish (89.38% F-measure) and worst for Galician (33.99% F-measure). These results are presented in Table 1.

{| class="wikitable"
|+ Table 1: Language-wise results in percentages (macroaverages)
! rowspan="2" | Language !! colspan="3" | Training 80-20 split !! colspan="3" | Test set
|-
! Precision !! Recall !! F-measure !! Precision !! Recall !! F-measure
|-
| Spanish || 91.62 || 82.05 || 86.57 || 93.12 || 85.93 || 89.38
|-
| Catalan || 74.84 || 84.27 || 79.28 || 63.43 || 81.99 || 71.52
|-
| Portuguese || 86.79 || 73.95 || 79.86 || 65.03 || 88.53 || 74.98
|-
| Galician || 34.97 || 55.34 || 42.86 || 25.71 || 50.12 || 33.99
|-
| Basque || 66.67 || 71.15 || 68.83 || 49.30 || 76.74 || 60.03
|-
| English || 80.53 || 80.53 || 80.53 || 71.44 || 76.53 || 73.90
|-
| Undefined || 42.11 || 16.67 || 23.88 || 42.53 || 7.84 || 13.24
|-
| Ambiguous || 1.00 || 69.62 || 82.09 || 1.00 || 78.08 || 87.69
|-
| Global || 72.19 || 67.20 || 68.26 || 63.82 || 68.25 || 63.10
|}

Tables 2 and 3 list the most frequent single label errors for the two cases (the 80-20 split of the training data and the test set). While some of the results are as expected, others are surprising. For example, Galician and Portuguese are very similar, and they are confused for one another; similarly for Spanish and Catalan. But it is surprising that Catalan is identified as English and Basque as Spanish. Also, although Galician and Portuguese are similar, the results for them are quite different. These discrepancies become a little clearer if we notice that the results differ in many ways between the two cases, the 80-20 split and the test set. The most probable reason is that, since this method is based purely on distributional similarity, differences between the training and testing distributions cause unexpected errors. The fact that there is far more data available for some languages (Spanish and Portuguese) than for others (Galician, Catalan and Basque) also contributes to these discrepancies. It may further be noted that the results were much better in terms of microaverage based precision because, in that case, our evaluation method took multi-label classification such as 'en+pt' into account: each multi-label combination was treated as a single class, both for code switching and for ambiguity. As a result, many (around half) of the errors were of the kind where, for example, 'en' was identified as 'en+pt'. This also contributed to making our results lower as evaluated by the script provided by the organizers.

{| class="wikitable"
|+ Table 2: Top single label errors on the training 80-20 split
! Language !! Identified as !! No. of times
|-
| Spanish || Catalan || 212
|-
| Portuguese || Spanish || 72
|-
| Galician || Portuguese || 37
|-
| Undef || Basque || 31
|-
| Catalan || Spanish || 29
|-
| Basque || Spanish || 20
|-
| English || Spanish || 13
|-
| Other || Spanish || 6
|}

{| class="wikitable"
|+ Table 3: Top single label errors on the test set
! Language !! Identified as !! No. of times
|-
| Spanish || Catalan || 1879
|-
| Undef || Galician || 494
|-
| Other || Portuguese || 382
|-
| Catalan || English || 214
|-
| Portuguese || Galician || 212
|-
| Galician || Portuguese || 209
|-
| Basque || Spanish || 59
|}

===6 Conclusions and Future Work===

We have presented the results of our experiments on using an existing algorithm for language identification on the Twitter data provided for the shared task. We tried the algorithm as it is and also with some heuristics. The two main heuristics were adding word boundaries to the data in the form of special symbols, and cleaning up hashtags, mentions, etc. The results were not state of the art for Twitter data (Zubiaga et al., 2014), but they may show how far an out-of-the-box, well-performing algorithm can go for this purpose. Also, the results were significantly worse for the test data than they were for the 80-20 split on the provided training data. This means that either the algorithm lacks robustness when it comes to microblogging data, or there is a data shift between the training and test data. Perhaps one important conclusion from the experiments is that adding word boundary markers to the data can significantly improve performance.

For future work, we plan to experiment with techniques along the lines suggested in recent work on language identification for Twitter data (Kiciman, 2010; Carter, Weerkamp, and Tsagkias, 2013; Lui and Baldwin, 2014).
===References===

* Adams, Gary and Philip Resnik. 1997. A language identification application built on the Java client-server platform. In Jill Burstein and Claudia Leacock, editors, From Research to Commercial Applications: Making NLP Work in Practice. Association for Computational Linguistics, pages 43–47.
* Beesley, K. 1988. Language identifier: A computer program for automatic natural-language identification of on-line text.
* Carter, Simon, Wouter Weerkamp, and Manos Tsagkias. 2013. Microblog language identification: overcoming the limitations of short, unedited and idiomatic text. Language Resources and Evaluation, 47(1):195–215.
* Cavnar, William B. and John M. Trenkle. 1994. N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, pages 161–175, Las Vegas, US.
* Combrinck, H. and E. Botha. 1994. Automatic language identification: Performance vs. complexity. In Proceedings of the Sixth Annual South Africa Workshop on Pattern Recognition.
* Dunning, Ted. 1994. Statistical identification of language. Technical Report CRL MCCS-94-273, Computing Research Lab, New Mexico State University, March.
* Ingle, Norman C. 1976. A language identification table. The Incorporated Linguist, 15(4).
* Jiang, Jay J. and David W. Conrath. 1997. Semantic similarity based on corpus statistics and lexical taxonomy.
* Kiciman, Emre. 2010. Language differences and metadata features on Twitter. In Web N-gram Workshop at SIGIR 2010. ACM, July.
* Lui, Marco and Timothy Baldwin. 2014. Accurate language identification of Twitter messages. In Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM), pages 17–25, Gothenburg, Sweden, April. Association for Computational Linguistics.
* Martins, Bruno and Mario J. Silva. 2005. Language identification in web pages. In Proceedings of ACM-SAC-DE, the Document Engineering Track of the 20th ACM Symposium on Applied Computing.
* Newman, Patricia. 1987. Foreign language identification - first step in the translation process. In Proceedings of the 28th Annual Conference of the American Translators Association, pages 509–516.
* Simon, Kranig. 2005. Evaluation of language identification methods. BA Thesis, Universität Tübingen.
* Singh, Anil Kumar. 2006. Study of some distance measures for language and encoding identification. In Proceedings of the ACL 2006 Workshop on Linguistic Distances, Sydney, Australia. Association for Computational Linguistics.
* Singh, Anil Kumar and Jagadeesh Gorla. 2007. Identification of languages and encodings in a multilingual document. In Proceedings of the 3rd ACL SIGWAC Workshop on Web As Corpus, Louvain-la-Neuve, Belgium.
* Teahan, W. J. and D. J. Harper. 2001. Using compression based language models for text categorization. In J. Callan, B. Croft and J. Lafferty (eds.), Workshop on Language Modeling and Information Retrieval. ARDA, Carnegie Mellon University, pages 83–88.
* Zubiaga, Arkaitz, Iñaki San Vicente, Pablo Gamallo, José Ramom Pichel, Iñaki Alegria, Nora Aranberri, Aitzol Ezeiza, and Víctor Fresno. 2014. Overview of TweetLID: Tweet Language Identification at SEPLN 2014. In Proceedings of TweetLID @ SEPLN 2014, Girona, Spain.