Mangalore-University@INLI-FIRE-2017: Indian Native Language Identification using Support Vector Machines and Ensemble Approach

Hamada A. Nayel
Department of Computer Science, Mangalore University, Mangalore-574199, Karnataka, India
Benha University, Benha-13518, Egypt
hamada.ali@fci.bu.edu.eg

H. L. Shashirekha
Department of Computer Science, Mangalore University, Mangalore-574199, Karnataka, India
hlsrekha@gmail.com

ABSTRACT
This paper describes the systems submitted by our team for the Indian Native Language Identification (INLI) task held in conjunction with FIRE 2017. Native Language Identification (NLI) is an important task with applications in areas such as social-media analysis, authorship identification, second language acquisition and forensic investigation. We submitted two systems, one using a Support Vector Machine (SVM) and one using an ensemble classifier built from three different classifiers, both representing the comments (data) in a vector space model. The systems achieved accuracies of 47.60% and 47.30% respectively, and we secured second rank over all submissions for the task.

CCS CONCEPTS
• Information systems → Web and social media search; Multilingual and cross-lingual retrieval; • Computing methodologies → Language resources;

KEYWORDS
Support Vector Machines, Ensemble Learning, Native Language Identification, Word Vector Space

1 INTRODUCTION
Native Language Identification (NLI) aims at identifying the native language (L1) of users writing or speaking in another, later-learned language (L2). NLI is an important task that has many applications in different areas such as social-media analysis, authorship identification, second language acquisition and forensic investigation. In forensic analysis [7], NLI helps to glean information about the discriminant L1 cues in an anonymous text. Second Language Acquisition (SLA) [12] studies the transfer effects of the native language on a later-learned language. In education, automatic correction of grammatical errors is an important application of NLI [14]. NLI can also be used as a feature in the authorship identification task [6], which aims at assigning a text to one of a predefined list of authors; authorship identification is in turn used in the investigation of terrorist communications [1] and digital crime [4].

Ensemble classifiers were used for NLI by Tetreault et al. [15], and Bykh and Meurers [3] applied a tuned and optimized ensemble classifier to the NLI 2013 shared task dataset, achieving an accuracy of 84.82%.

2 TASK DESCRIPTION
Given a comment I = ⟨w1, w2, ..., wn⟩ of an individual social media user, where each wi, i = 1..n, is either an English word or a word of a native language written in English (i.e., transliterated into the Latin script), the objective of the task is to identify the native language of the user. A comment may include English words in addition to words of any one native language written in English. The task considers six Indian languages, namely Tamil (TA), Hindi (HI), Kannada (KA), Malayalam (MA), Bengali (BE) and Telugu (TE). Considering the languages as a set of classes C = {TA, HI, KA, MA, BE, TE} and the comments as individual instances I = {I1, I2, ..., In}, we formulated the task as a classification problem that assigns one of the six predefined classes of C to a new unlabelled instance Iu.

3 DATASET
The datasets provided for this task are a collection of comments gathered from the Facebook pages of different regional newspapers between April 2017 and July 2017. The training and test sets contain 1233 and 783 files respectively, and each file consists of a set of comments. Table 1 shows brief statistics of the training set.

Table 1: Training set statistics

Language   # of comments   Ratio
TA         207             16.79%
HI         211             17.11%
KA         203             16.46%
MA         200             16.22%
BE         202             16.38%
TE         210             17.03%
Total      1233            100%
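As a quick sanity check, the Ratio column of Table 1 can be recomputed from the per-language comment counts; this is plain arithmetic with no assumptions beyond the table itself:

```python
# Per-language comment counts taken from Table 1.
counts = {"TA": 207, "HI": 211, "KA": 203, "MA": 200, "BE": 202, "TE": 210}

total = sum(counts.values())  # total number of training files
ratios = {lang: round(100 * n / total, 2) for lang, n in counts.items()}

print(total)   # 1233
print(ratios)  # {'TA': 16.79, 'HI': 17.11, 'KA': 16.46, 'MA': 16.22, 'BE': 16.38, 'TE': 17.03}
```

The recomputed percentages match the published Ratio column exactly, confirming the table is internally consistent.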
Supervised approaches using machine learning algorithms have been used for NLI by many researchers. Jarvis et al. [9] used the SVM classification algorithm to create a model for NLI and reported an accuracy of 83.6%, using features such as word n-grams, Part-of-Speech (PoS) tags and lemmas. Combining multiple classifier systems to enhance the final output, as in an ensemble classifier, has also been applied to NLI.

4 SYSTEM DESCRIPTION
In this section, we describe the two systems we submitted for the Indian Native Language Identification (INLI) [10] task. The general framework of the classifier for both systems is shown in Figure 1. The first phase of our systems is data preprocessing, also known as corpus cleaning; this phase is important because it excludes non-informative tokens and phrases. The second phase constructs a vector space model for the comments (input data). These two phases are common to both systems. The final phase builds a model using a machine learning algorithm: a Support Vector Machine (SVM) for the first submission and ensemble learning for the second. Details of each phase are given below.

Figure 1: Framework of classifier

4.1 Pre-processing
In this phase, we tokenized each comment Ij into a set of words or tokens and removed uninformative tokens as follows to get a bag of tokens:

• Emoji removal: An emoji is a small image used as a visual presentation to express emotion. The first step in removing unrelated information is to remove emojis, as they are not important for the identification of the native language.
• Special characters and digits: Digits and special characters such as #, %, ... appear frequently in the comments of all the languages. As such characters do not contribute to the identification of the native language, they are removed.
• Modified stop words: Stop words are words which appear frequently and do not contribute to the identification of the native language. To remove them, we used the union of different stop word lists, namely:
  (1) the stop words list extracted from the nltk.corpus package (www.nltk.org/nltk_data/);
  (2) the stop words list extracted from the stop_words package (pypi.python.org/pypi/stop-words);
  (3) manually written stop words (the complete list is given in Appendix A).

4.2 Constructing Vector Space Model
After preprocessing, the comments are represented in a vector space model. If ⟨t1, t2, ..., tm⟩ are the unique tokens/terms in a comment Ij, the vector space model of Ij is represented as ⟨wj1, wj2, ..., wjm⟩, where wji is the weight of the token/term ti in comment Ij. For the term weights, we used Term Frequency/Inverse Document Frequency (TF/IDF), calculated as follows:

    wji = tfji × log((N + 1) / (dfi + 1))

where tfji is the number of occurrences of term ti in the comment Ij, dfi is the number of comments in which the token/term ti occurs, and N is the total number of comments.

4.3 Model Construction for First Submission using SVM
SVM is a binary classifier which creates a hyperplane that discriminates between two classes [5]. SVM can be extended to multi-class problems by creating several binary SVMs and combining them using a one-vs-rest or one-vs-one method [8]. We implemented a six-class SVM corresponding to the six classes TA, HI, KA, MA, BE and TE, as per the framework shown in Figure 1, using Stochastic Gradient Descent (SGD) to optimize the parameters of the SVM model. The SGD algorithm updates the value of the parameter θ of the objective function W(θ) as

    θ = θ − η ∇θ E[W(θ)]

where η is the step size and E[W(θ)] is the cost function.

4.4 Model Construction for Second Submission using Ensemble Approach
Ensemble learning is a classification technique which uses a set of heterogeneous and diverse classifiers as base classifiers and combines their outputs in different ways to obtain the final output [13].

Figure 2: Framework of Ensemble approach
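The two submissions can be sketched end to end as follows. This is an illustrative reconstruction, not the authors' code: the paper does not name its toolkit, so scikit-learn is assumed, the toy comments are invented placeholders, and the voting weights are not published in the paper and are set to 1 here.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-ins for preprocessed comments (invented, not from the INLI data).
comments = ["naanga nalla irukkom", "hum theek hain", "naavu chennagiddeve"]
labels = ["TA", "HI", "KA"]

# Submission 1: TF-IDF vector space model + linear SVM trained with SGD
# (hinge loss yields a linear SVM; multiclass is handled one-vs-rest).
svm_system = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", SGDClassifier(loss="hinge", random_state=0)),
])

# Submission 2: weighted (hard) voting over multinomial Naive Bayes,
# SVM and random forest base classifiers.
ensemble_system = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("vote", VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("svm", SGDClassifier(loss="hinge", random_state=0)),
            ("rf", RandomForestClassifier(random_state=0)),
        ],
        voting="hard",
        weights=[1, 1, 1],  # illustrative; the paper does not publish its weights
    )),
])

svm_pred = svm_system.fit(comments, labels).predict(["hum theek hain"])[0]
ens_pred = ensemble_system.fit(comments, labels).predict(["hum theek hain"])[0]
```

Both pipelines share the same TF-IDF front end, mirroring the common preprocessing and vector-space phases described above; only the final modelling stage differs between the two submissions.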
Ensemble techniques try to overcome the weaknesses of some classifiers using the strengths of others. Figure 2 shows the framework of ensemble learning. We used three base classifiers, namely multinomial Naive Bayes, SVM and random forest classifiers, and combined their results by weighted voting. The multinomial Naive Bayes classifier is an instance of the Naive Bayes classifier that captures word frequency information in documents [11]. The random forest classifier is a supervised classifier comprising multiple decision trees, where each tree depends on an independently sampled random vector [2]. The base classifiers are designed as per the framework shown in Figure 1.

5 PERFORMANCE EVALUATION
Performance in the INLI task is measured as the overall accuracy of the system, in addition to class-wise performance calculated using Precision (P), Recall (R) and the F1 measure (as implemented in NLTK: http://www.nltk.org/_modules/nltk/metrics/scores.html). For each class, P is the proportion of comments the system assigned to that class that are classified correctly, and R is the proportion of the actual comments of that class that are classified correctly. The F1 measure is the harmonic mean of P and R, calculated as follows:

    F1 = (2 × P × R) / (P + R)

6 RESULTS AND DISCUSSION
The class-wise performance of the first submission, using SVM with the SGD algorithm to determine the model parameters, is shown in Table 2 in terms of P, R and F1. The overall accuracy of this submission is 47.60%, which ranks second among all the submissions.

Table 2: Results of SVM classifier based submission

Class   P        R        F1
BE      54.00%   84.90%   66.00%
HI      60.00%   7.20%    12.80%
KA      40.40%   54.10%   46.20%
MA      42.70%   66.30%   51.90%
TA      58.00%   58.00%   58.00%
TE      32.50%   48.10%   38.80%
Overall Accuracy: 47.60%

Table 3 shows the performance of the second submission, where we used the ensemble approach to combine the outputs of different models. The overall accuracy of this submission is 47.30%, which ranks third among all the submissions.

We used 10-fold cross-validation while training both classifiers; the cross-validation accuracies of both submissions are given in Table 4.

Table 4: 10-fold cross-validation accuracy for both submissions

Fold   Submission 1   Submission 2
1      88.09%         87.30%
2      84.80%         84.80%
3      90.32%         90.32%
4      91.06%         91.06%
5      89.43%         86.18%
6      79.68%         80.49%
7      86.18%         90.24%
8      88.52%         89.34%
9      90.98%         90.16%
10     89.34%         91.80%
Mean   87.84%         88.17%
STD    3.32           3.33

The results of both submissions show that performance in identifying Hindi is the worst. The reason may be that natives of most of the other languages also have knowledge of Hindi, while our systems depend essentially on the words that are distinctive for each language.

7 CONCLUSION
In this work, SVM and an ensemble classifier have been used for INLI. SVM outperforms the ensemble classifier, which combines three different classifiers. Our Support Vector Machine (SVM) submission secured second rank over all submissions for the task.

A COMPLETE LIST OF MANUALLY WRITTEN STOP WORDS
The following is the full list of stop words used in our system:

{ a, about, above, across, after, afterwards, again, against, all, almost, alone, along, already, also, although, always, am, among, amongst, amoungst, amount, an, and, another, any, anyhow, anyone, anything, anyway, anywhere, are, around, as, at, back, be, became, because,
become, becomes, becoming, been, before, beforehand, behind, being, below, beside, besides, between, beyond, bill, both, bottom, but, by, call, can, cannot, cant, co, con, could, couldnt, cry, de, describe, detail, do, done, down, due, during, each, eg, eight, either, eleven, else, elsewhere, empty, enough, etc, even, ever, every, everyone, everything, everywhere, except, few, fifteen, fifty, fill, find, fire, first, five, for, former, formerly, forty, found, four, from, front, full, further, get, give, go, had, has, hasnt, have, he, hence, her, here, hereafter, hereby, herein, hereupon, hers, herself, him, himself, his, how, however, hundred, i, ie, if, in, inc, indeed, interest, into, is, it, its, itself, keep, last, latter, latterly, least, less, ltd, made, many, may, me, meanwhile, might, mill, mine, more, moreover, most, mostly, move, much, must, my, myself, name, namely, neither, never, nevertheless, next, nine, no, nobody, none, noone, nor, not, nothing, now, nowhere, of, off, often, on, once, one, only, onto, or, other, others, otherwise, our, ours, ourselves, out, over, own, part, per, perhaps, please, put, rather, re, same, see, seem, seemed, seeming, seems, serious, several, she, should, show, side, since, sincere, six, sixty, so, some, somehow, someone, something, sometime, sometimes, somewhere, still, such, system, take, ten, than, that, the, their, them, themselves, then, thence, there, thereafter, thereby, therefore, therein, thereupon, these, they, thick, thin, third, this, those, though, three, through, throughout, thru, thus, to, together, too, top, toward, towards, twelve, twenty, two, un, under, until, up, upon, us, very, via, was, we, well, were, what, whatever, when, whence, whenever, where, whereafter, whereas, whereby, wherein, whereupon, wherever, whether, which, while, whither, who, whoever, whole, whom, whose, why, will, with, within, without, would, yet, you, your, yours, yourself, yourselves }

Table 3: Results of Ensemble classifier based submission

Class   P        R        F1
BE      56.50%   79.50%   66.10%
HI      60.70%   6.80%    12.20%
KA      38.40%   58.10%   46.20%
MA      40.40%   70.70%   51.40%
TA      58.00%   58.00%   58.00%
TE      32.80%   49.40%   39.40%
Overall Accuracy: 47.30%

REFERENCES
[1] Ahmed Abbasi and Hsinchun Chen. 2005. Applying Authorship Analysis to Extremist-Group Web Forum Messages. IEEE Intelligent Systems 20, 5 (Sept. 2005), 67–75. https://doi.org/10.1109/MIS.2005.81
[2] Leo Breiman. 2001. Random Forests. Machine Learning 45, 1 (Oct. 2001), 5–32. https://doi.org/10.1023/A:1010933404324
[3] Serhiy Bykh and Detmar Meurers. 2014. Exploring Syntactic Features for Native Language Identification: A Variationist Perspective on Feature Encoding and Ensemble Optimization. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. Dublin City University and Association for Computational Linguistics, 1962–1973. http://aclanthology.coli.uni-saarland.de/pdf/C/C14/C14-1185.pdf
[4] Carole E. Chaski. 2005. Who's at the keyboard? Authorship attribution in digital evidence investigations. International Journal of Digital Evidence 4, 1 (2005), 1–13.
[5] Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning 20, 3 (1995), 273–297.
[6] Dominique Estival, Tanja Gaustad, Son Bao Pham, Will Radford, and Ben Hutchinson. 2007. Author profiling for English emails. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics. 263–272.
[7] John Gibbons. 2003. Forensic Linguistics: An Introduction to Language in the Justice System. Wiley-Blackwell.
[8] Chih-Wei Hsu and Chih-Jen Lin. 2002. A Comparison of Methods for Multiclass Support Vector Machines. IEEE Transactions on Neural Networks 13, 2 (March 2002), 415–425. https://doi.org/10.1109/72.991427
[9] Scott Jarvis, Yves Bestgen, and Steve Pepper. 2013. Maximizing Classification Accuracy in Native Language Identification. 111–118. http://aclanthology.coli.uni-saarland.de/pdf/W/W13/W13-1714.pdf
[10] Anand Kumar M, Barathi Ganesh HB, Shivkaran S, Soman K P, and Paolo Rosso. 2017. Overview of the INLI PAN at FIRE-2017 Track on Indian Native Language Identification. In Notebook Papers of FIRE 2017, Bangalore, India, December 8–10, CEUR Workshop Proceedings.
[11] Andrew McCallum and Kamal Nigam. 1998. A comparison of event models for Naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization. 41–48.
[12] Lourdes Ortega. 2009. Understanding Second Language Acquisition. Hodder Education, Oxford.
[13] R. Polikar. 2006. Ensemble based systems in decision making. IEEE Circuits and Systems Magazine 6, 3 (2006), 21–45. https://doi.org/10.1109/MCAS.2006.1688199
[14] Alla Rozovskaya and Dan Roth. 2011. Algorithm Selection and Model Adaptation for ESL Correction Tasks. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Portland, Oregon, USA, 924–933. http://www.aclweb.org/anthology/P11-1093
[15] Joel Tetreault, Daniel Blanchard, Aoife Cahill, and Martin Chodorow. 2012. Native Tongues, Lost and Found: Resources and Empirical Evaluations in Native Language Identification. In Proceedings of COLING 2012. The COLING 2012 Organizing Committee, 2585–2602. http://aclanthology.coli.uni-saarland.de/pdf/C/C12/C12-1158.pdf