Language Identification in Mixed Script Social Media Text

S. Nagesh Bhattu                        Vadlamani Ravi
CoE on Analytics, IDRBT                 CoE on Analytics, IDRBT
Castle Hills Road #1, Masab Tank        Castle Hills Road #1, Masab Tank
Hyderabad-500057, India                 Hyderabad-500057, India
nageshbs@idrbt.ac.in                    vravi@idrbt.ac.in

ABSTRACT
With the spurt in the usage of smart devices, large amounts of unstructured text are generated by numerous social media tools. This text is often filled with stylistic and linguistic variations that make text analytics with traditional machine learning tools less effective. One specific problem in the Indian context is dealing with the large number of languages used by social media users in their romanized form. As part of the FIRE-2015 shared task on mixed script information retrieval, we address the problem of word-level language identification. Our approach is a two-stage algorithm: the first stage classifies at the sentence level using character n-grams, and the second stage is a word-level character n-gram classifier. This approach effectively captures the linguistic mode of the author in a social texting environment. The overall weighted F-score for the run submitted to the FIRE shared task is 0.7692. The sentence-level classification algorithm used in achieving this result has an accuracy of 0.6887. We could improve the accuracy of the sentence-level classifier by a further 1.6% using additional social media text crawled from other sources. The Naive Bayes classifier showed the largest improvement in accuracy (5.5%) from the addition of these supplementary tuples. We also observed that with a semi-supervised learning algorithm, Expectation Maximization with Naive Bayes, the accuracy could be improved to 0.7977.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Information filtering

Keywords
Classification

1. INTRODUCTION
With the proliferation of social tools like Twitter, Facebook, etc., large volumes of text are generated on a daily basis. Traditional machine learning tools used for text analysis, such as named entity recognition (NER), part-of-speech tagging or parsing, depend on the premise that the text provided to them is in a pure form; they achieve their objective using co-occurrence patterns of features. Many studies have observed that social media text, when fed to such machine learning algorithms, is plagued by excessive out-of-vocabulary words (sparsity of features). The FIRE-2015 shared task addresses language identification as well as named entity recognition in the context of Indian social fora, where the number of languages used is more than 10 and all of them share vocabulary extensively.

To understand the complexity of the task, we posed the primary problem of word-level language identification as multi-class classification using word lists gathered for each language. These word lists are obtained using the method suggested in Gupta et al. [2012]; the number of words in each list is given in Table 1. We converted the words into an n-gram representation and built classifiers based on multi-class logistic regression (McCallum [2002]) and multi-class SVM (support vector machines; Crammer and Singer [2002]), taking the n-gram representation of each word of the training data as an instance (leaving out the NE words). This experiment yielded accuracies of 57% and 54% respectively.

Table 1: Number of words in each language
Lang       No. of words    Lang        No. of words
Bengali    2207            Hindi       2457
English    5115            Kannada     1302
Gujarati   1741            Malayalam   1862
Marathi    1265            Tamil       1886
Telugu     3462

The likelihood of the multi-class logistic model is defined as

    p(y|x) = \frac{\exp(\lambda_y \cdot F(x, y))}{\sum_{y'} \exp(\lambda_{y'} \cdot F(x, y'))}    (1)

Here y is the label associated with instance x, and x is expressed in some feature representation F(x, y); in the current work this is the character n-gram representation of words. The \lambda_y are class-specific parameters learnt during maximum-likelihood training.
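To make this concrete, the following is a minimal sketch of the word-level setup of Eq. (1). scikit-learn is used here as a stand-in for the MALLET toolkit employed in our experiments, and the word lists and labels are illustrative placeholders rather than the Table 1 data:

```python
# Minimal sketch of the word-level multi-class setup of Eq. (1).
# scikit-learn stands in for MALLET; words and labels are placeholders
# for the Gupta et al. [2012] word lists summarized in Table 1.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

words = ["vanakkam", "nandri", "namaste", "dhanyavaad", "hello", "thanks"]
labels = ["ta", "ta", "hi", "hi", "en", "en"]

# F(x, y) is the character n-gram representation of the word; the logistic
# regression below fits a regularized form of the softmax likelihood (1).
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 5)),
    LogisticRegression(max_iter=1000),
)
model.fit(words, labels)
print(model.predict(["namaskaram"]))  # classify an unseen romanized word
```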
As the text segments are typically social media posts, the number of languages within a text segment can safely be assumed to be two. Using this corpus-level prior knowledge, we built a two-stage classification algorithm. The first stage identifies the sentence-level language pair. We used character-level n-grams of each sentence as training data for the sentence-level classifier, taking 1-, 2-, 3-, 4- and 5-grams of all the words in the sentence as features. We divided the input training data into 80-20 splits using 5-fold cross-validation and built multi-class classifiers with softmax (MaxEnt), Naive Bayes, Naive Bayes EM and SVM algorithms. Among these, Naive Bayes EM is a semi-supervised learning algorithm which uses the EM algorithm to improve accuracy on the test data. In preparing the training data, we removed URLs and words tagged X or NE. The 5-fold cross-validation accuracies are reported in Table 3. We also tried varying the maximum n-gram length to 3 and 4, which depreciated the accuracy by 3-6%. The class-wise distribution of documents in the training data is given in Table 2.

Table 2: Class-wise distribution in the training data
bn-en   215     ml-en   131
en      679     mr-en   200
gu-en   149     ta-en   318
hi-en   383     te-en   525
kn-en   272

Table 3: Accuracy results of cross-validation on training data
Method            Accuracy
Naive Bayes       0.7419
MaxEnt            0.8436
Multi-Class SVM   0.8123
Naive Bayes EM    0.7454

We get 82% accuracy when we apply the multi-class logistic regression classifier trained on the above data; the full classification results on the training data are given in Table 3. We also experimented with a latent Dirichlet allocation based topic model with 100 topics, which did not provide accuracy levels beyond 60%.
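A hedged sketch of this first stage follows: character 1-5 gram features scored with 5-fold cross-validation, plus a crude hard-label EM (self-training) loop standing in for the soft-posterior Naive Bayes EM. Again scikit-learn replaces MALLET, and the sentences, labels and helper names are illustrative placeholders, not shared-task data:

```python
# Sketch of the sentence-level (first-stage) classifier: character 1-5
# grams, Naive Bayes vs. MaxEnt under 5-fold cross-validation, plus a
# hard-label EM (self-training) approximation of Naive Bayes EM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny placeholder corpus: posts with URLs and X/NE tokens already removed.
sentences = [
    "nuvvu ela unnavu my friend", "chala bagundi this movie",
    "repu kaluddam see you", "em chestunnavu now", "sare chala thanks",
    "tum kaise ho my friend", "bahut accha this movie",
    "kal milte hain see you", "kya kar rahe ho now", "thik hai thanks",
]
labels = ["te-en"] * 5 + ["hi-en"] * 5

def char_ngram_pipeline(estimator):
    # 1-5 character grams taken within word boundaries.
    return make_pipeline(
        CountVectorizer(analyzer="char_wb", ngram_range=(1, 5)), estimator)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, est in [("Naive Bayes", MultinomialNB()),
                  ("MaxEnt", LogisticRegression(max_iter=1000))]:
    scores = cross_val_score(char_ngram_pipeline(est), sentences, labels, cv=cv)
    print(f"{name}: mean accuracy {scores.mean():.4f}")

# Hard-label EM over unlabeled posts: label them with the current model,
# retrain on labeled + pseudo-labeled data, and repeat.
unlabeled = ["nenu kuda vastanu tomorrow", "mujhe nahi pata sorry"]
nb = char_ngram_pipeline(MultinomialNB()).fit(sentences, labels)
for _ in range(5):
    pseudo = list(nb.predict(unlabeled))
    nb.fit(sentences + unlabeled, labels + pseudo)
```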
1.1 Word-Level Classification
After identifying the language pair used in a particular posting, we further build binary classifiers for each of the language pairs, namely bn-en, gu-en, kn-en, hi-en, ml-en, mr-en, ta-en and te-en, using the words in Table 1. We used logistic regression based binary classifiers, which give 92-94% accuracy on the training data; character n-grams (with n set to 5) are used as features. The approach suggested in Täckström and McDonald [2011] similarly uses latent variable models over document-level sentiment ratings to infer a sentence-level classifier.

This approach of using a word-level binary classifier works well as long as the words are sufficiently long to capture the n-gram characteristics of the language of interest. But, as we observe, tweets often contain stylistic variations which shorten words significantly. When the length of a word is below 3, the word does not carry n-grams representative of the target language. To address this problem we use the words within a window of two on either side to make up for the sparsity of features of shorter words, a heuristic also used effectively in other work such as Han and Baldwin [2011].
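A small sketch of this windowing heuristic is given below; the function name, window parameter and length-3 threshold mirror the description above, but the code is our illustration, not the submitted implementation:

```python
# Windowing heuristic for short words: a word below 3 characters carries
# too few n-grams to identify its language, so the string fed to the
# character n-gram extractor is widened with up to two words on each side.
def ngram_context(tokens, i, window=2, min_len=3):
    """Return the text whose character n-grams represent tokens[i]."""
    if len(tokens[i]) >= min_len:
        return tokens[i]                  # long enough: use the word alone
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return " ".join(tokens[lo:hi])        # pad short words with neighbours

tokens = "nenu ok ra ippudu".split()
for i, tok in enumerate(tokens):
    print(f"{tok!r} -> {ngram_context(tokens, i)!r}")
```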
Named entity detection for short text is a much harder task, as word collocations and POS tagging do not work well with mixed script. Ritter et al. [2011] proposed a solution based on word clusters derived from a large collection of Twitter posts. For named-entity detection we use the tool provided by the authors of Ritter et al. [2011] for English tweets, and a small lexicon of named entities for all the other languages.

2. EXPERIMENTS
We used MALLET (McCallum [2002]) for multi-class classification. Table 4 contains the F1 scores for language identification and named entities; these are the results of the test run submitted for the FIRE workshop. We report the F1 score as it is a representative measure capturing both precision and recall. As we adopted a two-stage algorithm for word-level language identification, the classification accuracy of the first (sentence) level is the most important for the further processing. As the results show, there are languages, such as Gujarati, which the classifier misclassifies entirely, giving a zero F1 score. Named entity detection is typically addressed using sequence-level features, which are quite unreliable in the short-message context. Our test-run results are limited by what is present in the training data.

Table 4: Results as submitted for the test run
Tag                   F1 Score
MIX                   0.57
MIX-en-bn             0
MIX-en-kn             0
MIX-en-ml             0
MIX-en-te             0
NE                    0.387409201
NE-ml                 0
NE-L                  0.2791
NE-O                  0
NE-OA                 0
NE-P                  0.2187
NE-PA                 0
Others                0
X                     0.9555
bn                    0.7749
en                    0.831
gu                    0
hi                    0.6125
kn                    0.8215
ml                    0.8132
mr                    0.745
ta                    0.8582
te                    0.6148
tokensAccuracy        77.5231
tokensCorrect         9302
utterances            792
utterancesAccuracy    18.0556
utterancesCorrect     143
Average F-measure     0.6845007667
Weighted F-measure    0.769245171

2.1 Errors and Analysis
The error analysis would not be complete without attempting to make the classifier more accurate. To that end, we manually tagged the test data sentences as belonging to one of the languages of our interest mixed with English. The authors were confident in tagging six of these languages; we depended on other resources for distinguishing Malayalam and Tamil. As this shared task largely focuses on English as the mixing language, we had to classify the training data sentences into 9 classes. To improve the sentence-level language identification, we collected tweets in the 8 languages of our interest using seed words of each language, chosen so that the retrieved tweets belonged to the mixed-script category. We added at least 500 tweets for each class, bringing the total number of labeled sentences (sentence-level tags) to 7656.

Table 5 compares the accuracies of the various learning algorithms: the second column shows the sentence-level classification accuracy with the training data provided in the FIRE shared task, and the third column the accuracy with the expanded training dataset. The semi-supervised version of Naive Bayes (Naive Bayes EM) is superior among all the classifiers. The Naive Bayes classifier benefitted the most from the supplementary tuples (a 5.5% increase in accuracy), while Naive Bayes EM improved by approximately 3% and MaxEnt by 1.6%.

Table 5: Sentence classification accuracy on test data
Method           Training data    Expanded training data
Naive Bayes      0.7204           0.7751
MaxEnt           0.6887           0.7052
Naive Bayes EM   0.7684           0.7977

3. CONCLUSION
In the current study we addressed the language identification problem in mixed-script social media text at the word level, involving multiple Indian languages, namely Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Tamil and Telugu. Observing that social media mixed-script posts often involve English as the mixing language and, as the messages are quite short, can involve at most one other language, we used a two-stage classification approach: a sentence-level classifier for the language mode of the author, followed by binary classifiers distinguishing English from each of the specific languages listed above. The submitted test run gave an overall weighted F-measure of 0.7692. The sentence-level classification accuracy was 68.87%; we could further improve this to 79.77% using abundantly available social media tweets crawled with seed words of the specific languages.

Acknowledgement
We sincerely thank the members of CoE on Analytics, IDRBT, especially K. Sai Kiran and B. Shiva Krishna, for their active participation in the sentence-level and word-level labeling. We thank IDRBT for providing the research environment for executing this task.

References
Crammer, K. and Singer, Y. (2002). On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res., 2:265–292.

Gupta, K., Choudhury, M., and Bali, K. (2012). Mining Hindi-English transliteration pairs from online Hindi lyrics. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA).

Han, B. and Baldwin, T. (2011). Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 368–378, Stroudsburg, PA, USA. Association for Computational Linguistics.

McCallum, A. K. (2002). MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu.

Ritter, A., Clark, S., Mausam, and Etzioni, O. (2011). Named entity recognition in tweets: An experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 1524–1534, Stroudsburg, PA, USA. Association for Computational Linguistics.

Täckström, O. and McDonald, R. (2011). Semi-supervised latent variable models for sentence-level sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, pages 569–574. Association for Computational Linguistics.