BMSCE_ISE@INLI-FIRE-2017: A simple n-gram based approach for Native Language Identification

Sowmya Lakshmi B S, Dr. Shambhavi B R
Department of ISE, BMS College of Engineering, Bangalore, India
sowmyalakshmibs.ise@bmsce.ac.in, shambhavibr.ise@bmsce.ac.in

ABSTRACT
Native Language Identification (NLI) aims to identify the native language L1 of an author by analysing text written by him or her in another language L2. NLI is often implemented as a supervised classification problem. In this paper, we report an NLI system built on character tri-grams, word uni-grams and word bi-grams, using a linear classifier, Support Vector Machines (SVM). The system participated in the Indian Native Language Identification@FIRE 2017 shared task, achieving an overall accuracy of 0.27 on a corpus covering 6 native languages. In subsequent evaluations, the best accuracy obtained was 0.73 with 10-fold cross-validation on the training data; this score was reached by combining word uni-grams and bi-grams.

KEYWORDS
Language Identification; Supervised Classification; Feature Selection.

1 INTRODUCTION
Author profiling has recently gained importance for improving the performance of applications such as forensics, security and marketing. It aims to detect an author's attributes such as age, educational level and native language. Native Language Identification (NLI) is a sub-task of author profiling in which the native language L1 of a writer is automatically detected by analysing text written in a second language L2. NLI is usually implemented as a multiclass supervised classification task.

The applications of NLI fall into two categories: security-related applications and Second Language Acquisition (SLA)-related applications. Security-related applications include identifying phishing sites or spam e-mails, which often contain unusual sentences that may have been written by non-native speakers. SLA-related applications analyse the effect of L1 on later learned languages.

As shown by preceding work in this area, there exist quite a few linguistic cues that help in predicting the native language. Under the influence of their native language, authors tend to make characteristic mistakes in spelling, punctuation and grammar when writing in another language.

In this work, we examine the possibility of building native language classifiers while ignoring grammatical errors and semantic analysis of the text written in L2. A naive set of features based on n-grams of words and characters is explored to develop the NLI system.

2 PREVIOUS WORK
The system presented in this study participated in the Indian Native Language Identification@FIRE 2017 shared task. Several researchers have investigated NLI and related problems; an overview of a few common methods used for NLI prior to this shared task is provided here.

Most researchers have framed NLI as a supervised classification task in which classifiers are trained on data from different L1s. The most commonly used features are character n-grams, POS n-grams, content words, function words and spelling mistakes. SVM models trained on these features [1-3] obtained accuracies of 60%-80%.

More recently, word embeddings and document embeddings have gained attention alongside other features. Continuous Bag of Words (CBOW) and Skip-gram models were used to obtain word embedding vectors, and vector representations of documents were generated with the distributed bag-of-words architecture of the Doc2Vec tool. In [4], the authors developed a native language classifier using document and word embeddings, reaching an accuracy of 82% on essays and 42% on speech data.

LIBSVM, a variant of SVM, has proven effective for text classification. In [5], the authors developed an NLI system for Arabic using LIBSVM, combining production rules, function words and POS bi-grams, and obtained an accuracy of 45%.

The first NLI shared task was organised with the BEA workshop in 2013. A system that participated in the closed training task is presented in [6]. The model was trained on the 11 L1 languages of the TOEFL11 corpus, and cross-validation on unseen essays gave an accuracy of about 84.55%. The authors adopted features such as n-grams of words, characters and POS tags, together with spelling errors and TF-IDF weighting, to train an SVM model.

In [7], the authors report a system that participated in the essay track of the second NLI Shared Task 2017, held at the BEA-12 workshop. A novel two-level stacked sentence-document architecture was introduced, based on lexical and grammatical features of the text. A stack of two SVM classifiers was used: the first a sentence classifier and the second a document classifier. The first classifier predicted the native language of each sentence of a document, and these predictions were used as features by the document classifier. The final system predicted the native language of unseen documents with an F1-score of 0.88.

3 TASK DESCRIPTION AND DATA
NLI has drawn the attention of many researchers in recent years, and this growing interest led to the INLI@FIRE 2017 shared task [8]. The task focuses on identifying the native language of a writer based on his or her writing in another language; in this case, the second language was English. The task was to predict the native language of a writer from a given Text/XML file containing Facebook comments in English. Six Indian languages were considered: Tamil, Hindi, Kannada, Malayalam, Bengali and Telugu.

Dataset
The training dataset consisted of XML files, each containing a set of Facebook comments in English written by speakers of different native languages. The XML files were annotated as BE, HI, KA, MA, TE and TA for Bengali, Hindi, Kannada, Malayalam, Telugu and Tamil respectively. Table 1 shows the training data statistics for the task.

Table 1: Training data

Native Language    Files
Bengali            202
Hindi              211
Kannada            203
Telugu             207
Malayalam          200
Tamil              210

4 FEATURES
NLI has been formulated as a multiclass classification task. We used language-independent features, namely character tri-grams and word n-grams, as described in [1]. Previous work observed that character tri-grams are useful for NLI and suggested that this might be due to the influence of the author's native language. To reflect this, we compute character n-grams and word n-grams as features. For characters, we consider tri-grams. The features are generated over the entire training data, i.e., every tri-gram in the training dataset is used as a feature. Similarly, uni-grams and bi-grams of words were used as separate features.

5 APPROACH
The training dataset consisted of XML files containing Facebook comments in English written by speakers of different native languages, with each file annotated with the native language of the speaker. As a preprocessing step, these XML files were scraped to extract the Facebook comments, and comments belonging to the same native language were saved into a single text file. We then extracted features from the generated text files and developed two methods for NLI using Python, as explained below.
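Before describing the two methods, the following is a minimal Python sketch of this preprocessing step. The element name (comment) and the file-naming convention carrying the language tag are hypothetical, since the exact INLI@FIRE 2017 XML schema is not reproduced here; the sketch only illustrates the idea of pulling comment text out of each annotated file and appending it to one text file per language.

import os
import xml.etree.ElementTree as ET

def collect_comments(xml_dir, out_dir):
    """Group Facebook comments by native-language annotation.

    Assumes file names carry the language tag (e.g. HI_0012.xml) and that
    comments are stored as text inside <comment> elements; both assumptions
    are illustrative, not the actual shared-task schema.
    """
    os.makedirs(out_dir, exist_ok=True)
    for name in os.listdir(xml_dir):
        if not name.endswith(".xml"):
            continue
        lang = name.split("_")[0]              # e.g. "HI", "TA", ...
        tree = ET.parse(os.path.join(xml_dir, name))
        comments = [el.text.strip() for el in tree.iter("comment") if el.text]
        with open(os.path.join(out_dir, lang + ".txt"), "a", encoding="utf-8") as f:
            f.write("\n".join(comments) + "\n")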
Character tri-grams method
The tri-gram model reads the text files and extracts all character tri-grams (sequences of three bytes) together with their counts. Tri-gram frequencies are collected separately for every training language. For each language, the frequencies are normalised by dividing the individual tri-gram counts by the total number of tri-grams in that language's training corpus, and the tri-grams are sorted by this relative frequency (the probability of the tri-gram in the given corpus), yielding a language model for that language. A language model was created for each language in the provided corpus.

The relative frequencies of the tri-grams in a test document are then calculated and compared with the tri-grams in the language models. Intuitively, the tri-gram frequencies extracted from two different texts written by speakers of the same native language should be very similar. The absolute difference is calculated by subtracting the relative frequency of each tri-gram in the test document from the relative frequency of the corresponding tri-gram in each language model, and these absolute differences are summed up. For instance, if a test document is compared with 5 language models, we obtain 5 different sums of absolute differences; the minimum value indicates the best match for the test document. Algorithm 1 describes the character tri-gram approach.

Algorithm 1: Character tri-grams method
Input: Train Dataset for each language, Test Dataset
Output: Native language of each Test Dataset document
begin
    for each language in language set
        for each document in Train Dataset of language
            derive all possible tri-grams
        end for
        language model <- relative frequency of each tri-gram in Train Dataset of language
    end for
    for every document in Test Dataset
        derive all possible tri-grams
        calculate relative frequency of each tri-gram in the document
    end for
    for each language in language models
        for every tri-gram in the Test Dataset document
            absolute difference <- (relative frequency of tri-gram in Test Data)
                                   - (relative frequency of corresponding tri-gram in language model)
        end for
        sum up the calculated absolute differences for the language model
    end for
    best match <- the language model with the minimum sum of absolute differences
end
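As an illustration of Algorithm 1, the following is a minimal Python sketch of the character tri-gram method. The function names (trigram_frequencies, build_language_models, predict_language) are our own, not part of the shared-task code, and the sketch assumes the training comments of each language have already been concatenated into one string per language, as in the preprocessing described above.

from collections import Counter

def trigram_frequencies(text):
    """Relative frequency of every character tri-gram in a text."""
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}

def build_language_models(train_texts):
    """train_texts: dict mapping language label -> concatenated training text."""
    return {lang: trigram_frequencies(text) for lang, text in train_texts.items()}

def predict_language(test_text, language_models):
    """Return the language whose model gives the smallest summed absolute
    difference of tri-gram relative frequencies w.r.t. the test document."""
    test_freqs = trigram_frequencies(test_text)
    scores = {}
    for lang, model in language_models.items():
        scores[lang] = sum(abs(freq - model.get(gram, 0.0))
                           for gram, freq in test_freqs.items())
    return min(scores, key=scores.get)

# Example usage (toy data):
# models = build_language_models({"HI": "some english text by a hindi l1 writer ...",
#                                 "TA": "some english text by a tamil l1 writer ..."})
# print(predict_language("an unseen facebook comment", models))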
Word n-grams method
Frequencies of word uni-grams and bi-grams were collected for each language, irrespective of word meaning and word order within a document. We instantiated the CountVectorizer module in Python to obtain the word uni-grams and bi-grams. A document-term matrix X[i, j] was formed, where i is the document id, j is the dictionary index of each word n-gram, and X[i, j] is the frequency of occurrence of n-gram j in document i. Each uni-gram and bi-gram of the test data was compared with its frequency of occurrence in the documents of all languages.

In this experiment, we used the document-term matrix over these n-grams and applied the linear SVM from scikit-learn as the classification algorithm for NLI; a sketch of this pipeline is given below.
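The following is a minimal scikit-learn sketch of the word n-gram method. It assumes the preprocessed data are available as two parallel lists, documents and labels (one comment text and its language tag per entry); these names, and the classifier settings beyond the (1, 2) n-gram range, are illustrative rather than the exact shared-task configuration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def build_word_ngram_classifier():
    # Document-term matrix over word uni-grams and bi-grams, fed to a linear SVM.
    return make_pipeline(
        CountVectorizer(analyzer="word", ngram_range=(1, 2)),
        LinearSVC(),
    )

def evaluate(documents, labels):
    clf = build_word_ngram_classifier()
    # 10-fold cross-validation on the training data, as reported in Section 6.
    scores = cross_val_score(clf, documents, labels, cv=10)
    return scores.mean()

# Example usage (with the full training set loaded into documents/labels):
# print("10-fold CV accuracy:", evaluate(documents, labels))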
6 RESULT ANALYSIS
We submitted the output of the system on the provided test data to the INLI@FIRE 2017 shared task. A single run of each method for the six languages was submitted, and the results of native language classification are summarised in Table 2 and Table 3. The character tri-gram model achieved an overall accuracy of 22%, while the word n-grams model achieved an overall accuracy of 27%.

The combined word uni-gram and bi-gram features were also evaluated with 10-fold cross-validation on the training data; with these features an improved accuracy of 73% was achieved.

Table 2: Character tri-grams

Class    Prec.    Rec.     F1
BE       0.40     0.292    0.338
HI       0.50     0.080    0.160
KA       0.117    0.270    0.163
MA       0.173    0.641    0.272
TA       0.533    0.080    0.139
TE       0.267    0.383    0.315
Overall Accuracy           0.22

Table 3: Word n-grams

Class    Prec.    Rec.     F1
BE       0.389    0.551    0.456
HI       0.545    0.072    0.127
KA       0.190    0.446    0.266
MA       0.223    0.315    0.261
TA       0.218    0.260    0.237
TE       0.154    0.123    0.137
Overall Accuracy           0.27

7 CONCLUSION
In this paper, a supervised system for Indian Native Language Identification has been presented. We described character tri-grams, word uni-grams and word bi-grams, a subset of the features most frequently used for the NLI task, and reported the results of supervised classification with these features on a test set covering 6 languages as part of the INLI@FIRE 2017 shared task. Our future work lies in improving the performance of the NLI system by considering features that can discriminate between native languages more effectively.

REFERENCES
[1] Nicolai, Garrett, Md Asadul Islam, and Russ Greiner. "Native Language Identification using probabilistic graphical models." In 2013 International Conference on Electrical Information and Communication Technology (EICT). IEEE, 2014.
[2] Abu-Jbara, A., Jha, R., Morley, E., and Radev, D. R. "Experimental Results on the Native Language Identification Shared Task." In BEA@NAACL-HLT, pp. 82-88, 2013.
[3] Mizumoto, T., Hayashibe, Y., Sakaguchi, K., Komachi, M., and Matsumoto, Y. "NAIST at the NLI 2013 Shared Task." In BEA@NAACL-HLT, pp. 134-139, 2013.
[4] Vajjala, Sowmya, and Sagnik Banerjee. "A study of N-gram and Embedding Representations for Native Language Identification." In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 240-248, 2017.
[5] Mechti, Seifeddine, Lamia Hadrich Belguith, Ayoub Abbassi, Rim Faiz, and Carthage IHEC. "An empirical method using features combination for Arabic native language identification."
[6] Gebre, B. G., Zampieri, M., Wittenburg, P., and Heskes, T. "Improving native language identification with TF-IDF weighting." In Proceedings of the 8th NAACL Workshop on Innovative Use of NLP for Building Educational Applications (BEA8), pp. 216-223, 2013.
[7] Cimino, A., and Dell'Orletta, F. "Stacked Sentence-Document Classifier Approach for Improving Native Language Identification." In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 430-437, 2017.
[8] Anand Kumar M, Barathi Ganesh HB, Shivkaran S, Soman K P, and Paolo Rosso. "Overview of the INLI PAN at FIRE-2017 Track on Indian Native Language Identification." In Notebook Papers of FIRE 2017, Bangalore, India, December 8-10, CEUR Workshop Proceedings, 2017.