Language Segmentation of Twitter Tweets using Weakly Supervised Language Model Induction

Segmentación de Twitter Tweets con Modelos de Lenguaje Inducidos

David Alfter
University of Trier
Universitätsring 15
s2daalft@uni-trier.de

Resumen: En este artículo presentamos los primeros resultados de la inducción de modelos de lenguaje de manera semi supervisada para la segmentación por idioma de textos multilingües con especial interés en textos cortos.
Palabras clave: modelo de lenguaje, inducción, semi supervisada, segmentación, textos cortos, tuits

Abstract: This paper presents early results of a weakly supervised language model induction approach for language segmentation of multilingual texts with a special focus on short texts.
Keywords: language model, induction, weakly supervised, short text, tweet, segmentation

1 Motivation

Twitter tweets often contain non-standard language and they are limited to 140 characters. While not a problem in itself, these restrictions can pose difficulties for natural language processing systems (Lui, Lau, and Baldwin, 2014). Furthermore, tweets may be written in more than one language (Zubiaga et al., 2014). This typically happens when multilingual speakers switch between the languages known to them, between or inside sentences (Jain and Bhat, 2014). The resulting text is said to be code-switched (Jain and Bhat, 2014; Solorio et al., 2014). This further complicates matters for natural language processing systems that need at least a certain degree of knowledge of the language at hand, such as part-of-speech taggers, parsers, or machine translation (Beesley, 1988; Jain and Bhat, 2014; Zubiaga et al., 2014). The performance of “traditional” monolingual natural language processing components on mixed-language data tends to be poor, making it necessary to identify the languages in a multilingual text in order to get acceptable results (Jain and Bhat, 2014). Even when the degradation is less severe, language identification and segmentation can significantly increase the accuracy of natural language processing tools (Alex, Dubey, and Keller, 2007).

Supervised methods perform well on the task of language identification in general (King and Abney, 2013; Lui, Lau, and Baldwin, 2014) and on tweets (Mendizabal, Carandell, and Horowitz, 2014; Porta, 2014), but they cannot always be applied. For one, tweets often contain non-standard and ad hoc spellings that may or may not be due to the imposed character limit. This can be problematic if the supervised methods have only seen standard spelling in training. Also, tweets may contain languages for which there is insufficient data to train a supervised method. In these cases, unsupervised approaches might yield better results than supervised approaches.

Language segmentation consists in identifying the language borders within a multilingual text (Yamaguchi and Tanaka-Ishii, 2012). Language segmentation is not the same as language identification; the main difference is that language identification identifies the languages in a text, whereas language segmentation “only” separates the text into monolingual segments (Yamaguchi and Tanaka-Ishii, 2012). Language segmentation can be useful when direct language identification is not available.

2 Language Model Induction

King and Abney (2013) consider the task of language identification as a sequence labeling task, and Lui, Lau, and Baldwin (2014) as a multi-label classification task.
In contrast, the proposed system uses a clustering approach. The system induces n-gram language models from the text iteratively and assigns each word of the text to one of the induced language models. One induction step consists of the following steps:

• Forward generation: Generate language models by moving forward through the text
• Backward generation: Generate language models by moving backwards through the text
• Model merging: Merge the two most similar models from the forward and backward generation based on the unigram distribution

Generation starts at the beginning of the text, takes the first word and decomposes it into uni-, bi- and trigrams. These n-grams are then added to the initial language model, which is empty at the start. For each following word, the existing language models evaluate the word in question. The highest-ranking model is updated with the word. If no model scores higher than the threshold value for model creation, a new model is created. Backward generation works exactly the same way, but starts at the end of the text and moves towards its beginning. Finally, the two models that have the most similar unigram distribution are merged. This way, the language models iteratively amass information about different languages.

The induction step is repeated at least twice. At the end of the induction, while there are two models whose similarity is greater than a certain threshold value, these models are merged.

Language segmentation is then performed by assigning each word in the text to the model that yields the highest probability for the word in question.
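The following Python sketch illustrates one such induction step. It is a minimal illustration under stated assumptions, not the exact implementation described above: the character n-gram overlap used as the scoring function, the creation threshold of 0.4 and the cosine similarity over unigram counts are choices made for the example only.

from collections import Counter


def char_ngrams(word, n_max=3):
    """Decompose a word into character uni-, bi- and trigrams."""
    word = word.lower()
    return [word[i:i + n] for n in range(1, n_max + 1)
            for i in range(len(word) - n + 1)]


def score(model, word):
    """Fraction of the word's n-grams already present in the model (assumed scoring)."""
    grams = char_ngrams(word)
    return sum(1 for g in grams if g in model) / len(grams) if grams else 0.0


def generate(words, threshold=0.4):
    """One generation pass: update the best-scoring model with the word's
    n-grams, or open a new model if nothing reaches the creation threshold."""
    models = []
    for word in words:
        best_score, best_model = 0.0, None
        for model in models:
            s = score(model, word)
            if s > best_score:
                best_score, best_model = s, model
        if best_model is not None and best_score >= threshold:
            best_model.update(char_ngrams(word))
        else:
            models.append(Counter(char_ngrams(word)))
    return models


def unigram_similarity(m1, m2):
    """Cosine similarity of the unigram (single-character) distributions."""
    u1 = {g: c for g, c in m1.items() if len(g) == 1}
    u2 = {g: c for g, c in m2.items() if len(g) == 1}
    dot = sum(u1[g] * u2[g] for g in u1.keys() & u2.keys())
    norm = (sum(c * c for c in u1.values()) ** 0.5
            * sum(c * c for c in u2.values()) ** 0.5)
    return dot / norm if norm else 0.0


def induction_step(words):
    """Forward and backward generation, then merge the two models with the
    most similar unigram distributions."""
    models = generate(words) + generate(list(reversed(words)))
    if len(models) > 1:
        _, i, j = max((unigram_similarity(a, b), i, j)
                      for i, a in enumerate(models)
                      for j, b in enumerate(models) if i < j)
        models[i].update(models.pop(j))
    return models


def segment(words, models):
    """Assign each word to the index of the highest-scoring induced model."""
    return [max(range(len(models)), key=lambda i: score(models[i], w))
            for w in words]


if __name__ == "__main__":
    tweet = "my dad comes back from poland with two crates of strawberries".split()
    models = induction_step(tweet)
    print(segment(tweet, models))

Repeating induction_step at least twice and merging any remaining models whose similarity exceeds a final threshold, as described above, would complete the induction before segmentation.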
3 Results

Table 1 shows the results for a set of example tweets manually collected from Twitter. For each tweet, a gold standard segmentation has been created manually and the system output has been evaluated against it. The evaluation is that of a clustering task; the words of a text are clustered around the different induced language models. Whenever the language model induction outperformed the supervised language models, the score is indicated in bold.

Besides the F score (F1), the F5 score is also indicated. This score sets β to 5 in F_β = (1 + β²)PR / (β²P + R), weighting recall higher than precision. This means that splitting pairs that occur together in the gold standard is penalized more strongly than throwing together pairs that are separate in the gold standard (Manning, Raghavan, and Schütze, 2008).

              Induction           Supervised
              F1        F5        F1        F5
Tweet 1       0.5294    0.4775    0.7441    0.8757
Tweet 2       0.7515    0.9325    0.7570    0.8121
Tweet 3       0.4615    0.8185    0.6060    0.8996
Tweet 4       0.5172    0.7587    0.7360    0.9545
Tweet 5       1.0000    1.0000    0.2500    0.4642

Table 1: F1 and F5 scores for the induction approach and the supervised approach

For comparison purposes, a supervised approach as described by Dunning (1994) has been implemented. For the supervised approach, language models for all relevant languages have been trained on Wikipedia dumps from June and July 2015 in the languages occurring in the data, namely Greek, English, French, Polish and Amharic. Since the Amharic Wikipedia is written in the Ge’ez script and the data only contains transliterated Amharic, all Amharic texts were transliterated prior to training. Tweets have then been segmented by assigning each word to the model with the highest probability. Training on a corpus of Twitter data, separated by language, might yield better results for the supervised approach; however, such a corpus would have to be compiled first.

For this small set of examples, the results show that the language model induction seems to work reasonably well, with scores comparable to the supervised approach, sometimes even performing better than it.

Closer inspection of the results reveals that the language model induction tends to generate too many clusters for a single language, resulting in a degradation of the accuracy, while on the other hand also being able to separate the different languages surprisingly well.

For example, the first tweet “Μόλις ψήφισα αυτή τη λύση Internet of Things, στο διαγωνισμό BUSINESS IT EXCELLENCE.” is decomposed into two English clusters and two Greek clusters, with one erroneous inclusion of ‘EXCELLENCE.’ in the Greek cluster.

• Things,
• Μόλις λύση διαγωνισμό EXCELLENCE.
• Internet of BUSINESS IT
• ψήφισα αυτή τη στο

The second tweet “Demain #dhiha6 Keynote 18h @dhiparis “The collective dynamics of science-publish or perish; is it all that counts?” par David” is decomposed as shown below. It is clear that we have one English cluster and one French cluster, and two other clusters, one of which could be labeled a ‘Named Entity’ cluster and the other possibly ‘English with erroneous inclusion of @dhiparis’. Interestingly, the French way of notating time, ‘18h’, is also included in the French cluster.

• Keynote “The collective of science-publish or perish; it all that counts?”
• Demain 18h par
• #dhiha6 David
• @dhiparis dynamics is

The third tweet “Food and breuvages in Edmonton are ready to go, just waiting for the fans #FWWC2015 #bilingualism” is split into one acronym group, three English clusters and one French cluster with the erroneous inclusion of ‘go’.

• #FWWC2015
• breuvages, go
• Food, Edmonton, to, for, the
• in, waiting, #bilingualism
• and, are, ready, just, fans

The fourth tweet “my dad comes back from poland with two crates of strawberries, żubrówka and adidas jackets omg” is again split into two English clusters and one Polish cluster with the erroneous inclusion of ‘back’.

• comes, from, with, two, crates, of, strawberries, jackets, omg
• my, dad, poland, and, adidas
• back, żubrówka

Finally, the last tweet “Buna dabo naw (coffee is our bread).” is decomposed as follows. The English words are split across four clusters while the transliterated Amharic text is clustered together. The splitting is due to the structure of the tweet; there is not enough overlapping information to build a single English cluster.

• (coffee
• bread).
• is
• our
• Buna dabo naw

4 Conclusion

The paper has presented the early findings of a weakly supervised approach for language segmentation that works on short texts. By taking the text itself as the basis for the induced language models, there is no need for training data. As the approach does not rely on external language knowledge, it is language independent.

The results seem promising, but the approach has to be tested on more data. Still, being able to achieve results comparable to supervised approaches with a weakly supervised method is encouraging.

5 Future work

Future work should concern the reduction of the number of generated clusters, ideally arriving at one cluster per language. Alternatively, it would be possible to smooth the frequent switching of language models by taking context into account.

Also, since the structure of the text strongly influences the presented approach, some form of text normalization could be used to increase the robustness of the system.
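As a supplement to the evaluation described in Section 3, the following minimal Python sketch shows one way to compute pairwise clustering F_β scores of the kind reported in Table 1; the pair-based precision/recall definitions and the toy clusters are illustrative assumptions, with β = 1 giving F1 and β = 5 giving F5.

from itertools import combinations


def cluster_pairs(clusters):
    """All unordered pairs of words that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(c, 2)}


def pairwise_f(predicted, gold, beta=1.0):
    """Pairwise F_beta between a predicted clustering and the gold standard.

    A pair is correct if both words share a cluster in the prediction and
    in the gold standard; beta > 1 weights recall higher than precision."""
    pred, ref = cluster_pairs(predicted), cluster_pairs(gold)
    if not pred or not ref:
        return 0.0
    correct = len(pred & ref)
    precision, recall = correct / len(pred), correct / len(ref)
    if precision + recall == 0:
        return 0.0
    return ((1 + beta ** 2) * precision * recall
            / (beta ** 2 * precision + recall))


if __name__ == "__main__":
    # Toy clusterings for illustration only, not taken from the evaluated tweets.
    gold = [["a", "b", "c"], ["x", "y"]]
    predicted = [["a", "b"], ["c", "x"], ["y"]]
    print(pairwise_f(predicted, gold, beta=1), pairwise_f(predicted, gold, beta=5))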
Sources

GaloTyri. “Μόλις ψήφισα αυτή τη λύση Internet of Things, στο διαγωνισμό BUSINESS IT EXCELLENCE.”. 19 June 2015, 12:06. Tweet.

Claudine Moulin (ClaudineMoulin). “Demain #dhiha6 Keynote 18h @dhiparis “The collective dynamics of science-publish or perish; is it all that counts?” par David”. 10 June 2015, 17:35. Tweet.

HBS (HBS_Tweets). “Food and breuvages in Edmonton are ready to go, just waiting for the fans #FWWC2015 #bilingualism”. 6 June 2015, 23:29. Tweet.

katarzyne (wifeyriddim). “my dad comes back from poland with two crates of strawberries, żubrówka and adidas jackets omg”. 8 June 2015, 08:49. Tweet.

TheCodeswitcher. “Buna dabo naw (coffee is our bread).”. 9 June 2015, 02:12. Tweet.

References

Alex, Beatrice, Amit Dubey, and Frank Keller. 2007. Using Foreign Inclusion Detection to Improve Parsing Performance. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 151–160.

Beesley, Kenneth R. 1988. Language identifier: A computer program for automatic natural-language identification of on-line text. In Proceedings of the 29th Annual Conference of the American Translators Association, volume 47, page 54.

Dunning, Ted. 1994. Statistical Identification of Language. Computing Research Laboratory, New Mexico State University.

Jain, Naman and Riyaz Ahmad Bhat. 2014. Language Identification in Code-Switching Scenario. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 87–93.

King, Ben and Steven P. Abney. 2013. Labeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies, pages 1110–1119.

Lui, Marco, Jey Han Lau, and Timothy Baldwin. 2014. Automatic detection and language identification of multilingual documents. Transactions of the Association for Computational Linguistics, 2:27–40.

Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval, volume 1. Cambridge University Press.

Mendizabal, Iosu, Jeroni Carandell, and Daniel Horowitz. 2014. TweetSafa: Tweet language identification. TweetLID @ SEPLN.

Porta, Jordi. 2014. Twitter Language Identification using Rational Kernels and its potential application to Sociolinguistics. TweetLID @ SEPLN.

Solorio, Thamar, Elizabeth Blair, Suraj Maharjan, Steven Bethard, Mona Diab, Mahmoud Gohneim, Abdelati Hawwari, Fahad AlGhamdi, Julia Hirschberg, Alison Chang, et al. 2014. Overview for the First Shared Task on Language Identification in Code-Switched Data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 62–72.

Yamaguchi, Hiroshi and Kumiko Tanaka-Ishii. 2012. Text segmentation by language using minimum description length. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 969–978. Association for Computational Linguistics.

Zubiaga, Arkaitz, Iñaki San Vicente, Pablo Gamallo, José Ramom Pichel, Iñaki Alegria, Nora Aranberri, Aitzol Ezeiza, and Víctor Fresno. 2014. Overview of TweetLID: Tweet language identification at SEPLN 2014. TweetLID @ SEPLN.