      DalTeam@INLI-FIRE-2017: Native Language Identification
                 using SVM with SGD Training
Dijana Kosmajac
Dalhousie University, Faculty of Computer Science
Halifax, Nova Scotia, Canada
dijana.kosmajac@dal.ca

Vlado Keselj
Dalhousie University, Faculty of Computer Science
Halifax, Nova Scotia, Canada
vlado@cs.dal.ca
ABSTRACT
Native Language Identification (NLI), a variant of the Language Identification task, focuses on determining an author's native language based on a writing sample in their non-native language. In recent years, the challenging nature of NLI has drawn much attention from the research community. Its applications are relevant in many fields, such as the personalization of language learning environments, personalized grammar correction, and authorship attribution in forensic linguistics. We participated in the INLI Shared Task 2017, held in conjunction with the FIRE 2017 conference. To implement a machine learning method for Native Language Identification, we used character and word n-grams with an SVM (Support Vector Machine) classifier trained with the SGD (Stochastic Gradient Descent) method. We achieved an F1 measure of 89.60% (using 10-fold cross-validation) on the provided social media dataset, and 48.80% was reported in the final testing done by the INLI workshop organisers.

CCS CONCEPTS
• Computing methodologies → Supervised learning by classification; Classification and regression trees; • Social and professional topics → Cultural characteristics;

KEYWORDS
Native Language Identification, Support Vector Machines, Stochastic Gradient Descent, N-Grams, Text Classification

1   INTRODUCTION
Since the 1950s there has been a discussion in the linguistic literature about whether and how native speakers of particular languages show characteristic patterns of sentence generation in their second language. This has been investigated in different domains and from different perspectives, including qualitative research in Second Language Acquisition (SLA) and, more recently, predictive computational models in NLP [7] and in forensic linguistics [16].

In addition, a speaker's native language can have an effect on the types of errors they make. A study by Flanagan et al. [3] investigates the characteristics of errors by native language. They identified the differences and similarities in error co-occurrence characteristics for the following native languages: Chinese, Japanese, Korean, Spanish, and Taiwanese. They showed that some languages differ more than others (Korean and Japanese learners, for instance, tend to make similar mistakes).

This has motivated research in Native Language Identification (NLI), which was first defined as a text classification task by Koppel et al. [9], using a classifier with a set of lexical features such as function words, character n-grams, and Part-of-Speech (PoS) n-grams. The task, in general, is to identify a speaker's native language from samples of text written in a second language.

One of the main challenges for this task is the lack of corpora of appropriate size, class balance and topic homogeneity. So far, only a couple of datasets have been used in past research. The International Corpus of Learner English (ICLE)1 is one of the first to appear in early studies. Released in 2002 and updated in 2009, it became commonly used in research on native language prediction of learner writing. Brooke and Hirst [1] suggested that ICLE has problems that can lead to a drop in performance when evaluated, and proposed additional corpora that might be useful for the task of native language prediction. Using data from Lang-8.com, a language learning SNS, they showed improved performance. Another corpus, TOEFL11, was presented in a shared task on Native Language Identification of learners [17]; it contains essays in English by learners from 11 different native languages.

The approach we present is based on a linear Support Vector Machine classifier trained using the Stochastic Gradient Descent method. As features, we used character and word n-grams. In addition, we used the tf-idf weighting technique with χ2 feature selection. We used a dataset provided by the Workshop organisers.

The rest of the paper is organised as follows: in Section 2 we present some of the research most relevant to our experiments. Section 3 gives a short description of the dataset, using the information provided by the organisers. In Section 4 we present the experimental setup, with details on data preprocessing, feature selection and weighting, and the classifier setup. Section 5 shows and discusses the results. In Section 6 we outline conclusions and further work.

2   RELATED WORK
Research in the NLI domain is fairly recent. We present some of the work most relevant to our experiments.

Kochmar [8] presented experiments on predicting the native languages of Indo-European learners through binary classification tasks using an SVM with a linear kernel. The native languages were divided into two main groups, Germanic and Romance, with an intergroup prediction accuracy of 68.4%. The features used for prediction were words and n-grams, as well as different error types that had been manually tagged within the corpus.

Wong et al. [19] analyzed learner writing with an extension of Adaptor Grammars for detecting collocations at the word level, as well as for POS and functional words. Classification was performed at the document level by parsing the individual sentences of the learner's writing to detect the native language, with the final prediction based on a majority score over the sentences. Some notable characteristic features of languages extracted by this method were also discussed.

1 https://uclouvain.be/en/research-institutes/ilc/cecl/corpora.html
Bykh and Meurers [2] discussed the use of recurring n-grams of variable lengths as features for training a native language classifier, also incorporating POS features. They claim that their approach outperformed previous work under a comparable data setup (the ICLE corpus), reaching 89.71% accuracy on a task with seven native languages.

Jarvis et al. [6] were the best performing participants in the shared task by Tetreault et al. [17] mentioned above. They analyzed a set of features including word n-grams, POS n-grams, character n-grams, and lemma n-grams, on top of which they used an SVM classifier. Prediction performance was evaluated on several different models with varying combinations of features.

Malmasi et al. [12–15] presented the first NLI experiments on Arabic2 (Arabic Learner Corpus, ALC), Chinese (Chinese Learner Corpus [18]), Finnish, and Norwegian, the last using a corpus of examination essays collected from learners of Norwegian. Given the differences between English and the aforementioned languages, the main objective was to determine whether NLI techniques previously applied to second language English can be effective for detecting native language transfer effects in other second languages.

3   DATASET
The dataset used in the experiment was provided by the organizers of the INLI Workshop [10]. The organizers identified the official Facebook pages of prominent regional language newspapers of each region and extracted the comments. The dataset consists of six classes: six languages of the Indian subcontinent originating from different Indian states. As shown in Table 1, the dataset is divided into classes named TA, MA, HI, BE, TE and KA. It has the following characteristics:

• It is balanced in terms of the number of samples per language;
• Native and mixed script text is removed from the comments;
• The comments relate to general news from all over India, in order to avoid topic bias.

Table 1: INLI training dataset statistics

Language          Number   Percentage
Hindi (HI)           211       17.11%
Telugu (TE)          210       17.03%
Tamil (TA)           207       16.79%
Kannada (KA)         203       16.46%
Bengali (BE)         202       16.38%
Malayalam (MA)       200       16.22%
Total               1233         100%

2 http://www.arabiclearnercorpus.com/

4   EXPERIMENTAL METHODOLOGY
This paper presents a supervised multi-class classification approach. The training data texts are labeled with classes according to the author's native language. Figure 1 shows a diagram of the classifier components.

Figure 1: Architecture of the system.

4.1   Data Preprocessing
4.1.1 Cleaning. Preparing and normalising the dataset are the first and necessary subtasks prior to feature selection and classification. This includes filtering and adjusting the raw texts to make them suitable as input for the next subtask. In general, social media user-generated texts are likely to be very noisy, containing textual elements irrelevant to the classification task at hand. Hence, some parts of the comments, including hashtags, mentions and links, were not considered part of the feature set.
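A minimal sketch of this filtering step in Python; the regular expressions below are our own illustration rather than the exact patterns of the pipeline:

import re

def clean_comment(text):
    # Strip links, then @-mentions and #-hashtags, then collapse whitespace.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"[@#]\w+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_comment("Good point! @editor #news http://example.com"))
# -> 'Good point!'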
4.1.2 Feature Extraction. Our model uses character n-grams of order 2–5. These n-grams capture small and localised syntactic patterns of language production within a word. Additionally, we used word n-grams of order 1–2. Our preliminary experiments showed that these n-gram lengths give the best accuracy (possibly because of data sparsity).
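A sketch of this feature extraction using scikit-learn, the library our implementation is based on; the vectorizer settings shown here are illustrative, and the sample comments are hypothetical:

import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

docs = ["example comment one", "example comment two"]  # hypothetical comments

# Character n-grams of order 2-5 and word n-grams of order 1-2.
char_ngrams = CountVectorizer(analyzer="char", ngram_range=(2, 5))
word_ngrams = CountVectorizer(analyzer="word", ngram_range=(1, 2))

# Concatenate the two sparse feature matrices column-wise.
X = sp.hstack([char_ngrams.fit_transform(docs),
               word_ngrams.fit_transform(docs)])

The raw counts are subsequently weighted with tf-idf (Section 4.2).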
4.1.3 χ2 feature selection. The formula for χ2 feature selection can be expressed as follows:

    \chi^2(M, t, c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}}    (1)

where M is the set of messages (Facebook comments), t is a feature, and c is a class. N is the observed frequency and E the expected frequency. The subscripts e_t and e_c take the values 0 or 1, indicating the presence of the feature and membership in the class, respectively; for example, N_{e_t=1, e_c=0} is the number of messages that contain feature t but are not in class c. We selected 50,000 features.
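In scikit-learn this selection step can be sketched as follows; the toy data only makes the snippet runnable, and k is capped at the size of the toy feature matrix:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["comment one", "comment two", "comment three"]  # hypothetical comments
y = ["HI", "TE", "HI"]                                   # native-language labels

X = CountVectorizer(analyzer="char", ngram_range=(2, 3)).fit_transform(docs)

# Keep the features with the highest chi-squared scores (50,000 in our runs).
selector = SelectKBest(chi2, k=min(50000, X.shape[1]))
X_sel = selector.fit_transform(X, y)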
4.2   TF-IDF Weighting
Tf-idf (term frequency, inverse document frequency) is one of the best-known weighting schemes. Several newer methods adapt tf-idf as part of their process, and many others rely on the same fundamental concept. Idf, the measure's key part, was introduced in a 1972 paper by Karen Spärck Jones. Following the study by Gebre et al. [5], we opted to use the tf-idf measure in our experiment.

Tf-idf is the product of two measures, term frequency and inverse document frequency, of which different variations can be found in the literature. In this work we used a normalized term frequency to reduce the bias caused by the varying lengths of the text samples:

    ntf(t, d) = \frac{f_{t,d}}{\max\{f_{t',d} : t' \in d\}}    (2)

    idf(t, d) = \log \frac{N_{comments}}{1 + \sum_{d' \in comments} ntf(t, d')}    (3)

The final weight is expressed as follows:

    weight(t, d) = ntf(t, d) \cdot idf(t, d)    (4)
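The following is a direct transcription of Eqs. (2)-(4) into Python, assuming the natural logarithm and pre-tokenised terms; it is meant to clarify the formulas, not to reproduce our scikit-learn based implementation:

import math
from collections import Counter

def ntf(t, doc):
    # Eq. (2): frequency of t normalised by the most frequent term in doc.
    counts = Counter(doc)
    return counts[t] / max(counts.values())

def idf(t, docs):
    # Eq. (3): log of the number of comments over one plus the summed
    # normalised frequencies of t across all comments.
    return math.log(len(docs) / (1 + sum(ntf(t, d) for d in docs)))

def weight(t, doc, docs):
    # Eq. (4): the final tf-idf weight of term t in document doc.
    return ntf(t, doc) * idf(t, docs)

docs = [["good", "news", "good"], ["bad", "news"], ["bad", "day"]]
print(weight("good", docs[0], docs))  # ~0.405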
4.3   Classifier
In the experiments we used a linear SVM (Support Vector Machine) to perform multi-class classification. SVM was chosen primarily because it has shown effectiveness for this particular task [17], which we confirmed in our preliminary experiments. The implementation is based on the Python library scikit-learn, where we used a linear SVM with SGD (Stochastic Gradient Descent) training.

The textual training samples x are represented as d-dimensional vectors. A vector x is classified by looking at the sign of a linear scoring function ⟨w, x⟩. The goal of learning is to estimate the d-dimensional parameter vector w so that the score is positive if the vector x belongs to the positive class and negative otherwise. The per-sample hinge loss is

    \ell_i(\langle w, x_i \rangle) = \max\{0, 1 - y_i \langle w, x_i \rangle\}    (5)

and the regularised learning objective is

    E(w) = \frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \max\{0, 1 - y_i \langle w, x_i \rangle\},    (6)

which can be rewritten as an average of per-sample terms:

    E(w) = \frac{1}{n} \sum_{i=1}^{n} E_i(w), \qquad E_i(w) = \frac{\lambda}{2} \|w\|^2 + \ell_i(\langle w, x_i \rangle).    (7)

SGD can be used to learn an SVM by minimizing E(w). SGD performs gradient steps by considering, at each iteration, one term E_i(w) selected at random from this average. Conceptually, the algorithm is:

   (1) Start with w_0 = 0;
   (2) For t = 1, 2, . . . , T:
      (a) Sample one index i in 1, . . . , n uniformly at random;
      (b) Compute a sub-gradient g_t of E_i(w) at w_t;
      (c) Compute the learning rate η_t;
      (d) Update w_{t+1} = w_t − η_t g_t.

We used a variable learning rate (scikit-learn's 'optimal' setting), computed as

    \eta_t = \frac{1}{\alpha (t + t_0)}    (8)

where α is a constant that multiplies the regularization term and enters the learning rate calculation. The goal of the SGD algorithm is to bring the primal suboptimality below a threshold ϵ_P:

    E(w_t) - E(w^*) \leq \epsilon_P    (9)
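A sketch of this classifier setup with scikit-learn's SGDClassifier: the hinge loss yields the linear SVM objective of Eqs. (5)-(7), and learning_rate='optimal' follows the schedule of Eq. (8). The alpha value and the toy data are placeholders, not our tuned settings:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Hinge loss = linear SVM; 'optimal' computes eta_t = 1 / (alpha * (t + t0)).
clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 5)),
    SGDClassifier(loss="hinge", penalty="l2", alpha=1e-4,
                  learning_rate="optimal", max_iter=1000),
)

docs = ["toy comment a", "toy comment b", "toy comment c", "toy comment d"]
y = ["HI", "TE", "HI", "TE"]  # hypothetical native-language labels
clf.fit(docs, y)
print(clf.predict(["toy comment e"]))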
4.4   Evaluation Measure
As suggested by the INLI 2017 organisers, we used the macro-averaged F1 score as the evaluation measure (Eq. 10):

    P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F_1 = \frac{2 \cdot P \cdot R}{P + R}    (10)

where TP is the number of true positive predictions, FP the number of false positives, and FN the number of false negatives; P represents precision and R recall.
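With scikit-learn, the macro-averaged F1 of Eq. (10) computes a per-class F1 and averages the classes with equal weight; a small sketch with made-up labels:

from sklearn.metrics import f1_score

y_true = ["HI", "TE", "TA", "HI", "TE"]
y_pred = ["HI", "TA", "TA", "TE", "TE"]

# F1 is computed per class, then averaged without class weighting.
print(f1_score(y_true, y_pred, average="macro"))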
5   RESULTS
The results of our final experiment in distinguishing the non-native Indian authors of the Facebook comments by native language are shown in the accumulated confusion matrix in Fig. 2. The results show that the features we used are useful for discriminating among the non-native comments, achieving an F1 measure of 89.60%. The result is the mean performance over 10-fold cross-validation; the per-fold scores are listed in Table 2.

Table 2: Stratified 10-fold cross-validation

Fold    F1
#1      0.896
#2      0.904
#3      0.896
#4      0.901
#5      0.869
#6      0.907
#7      0.913
#8      0.861
#9      0.918
#10     0.892
Mean    0.896
St.D.   0.018

Figure 2: Accumulated confusion matrix from 10-fold cross-validation on the INLI dataset.

The testing set from the organisers was separate from the dataset provided to the Workshop participants. The test results from the organisers, shown in Table 3, report a macro-averaged F1 measure of 48.80%. The best performing class is BE (Bengali), with an F1 of 67.10%; its recall is significantly higher than that of the other classes. The worst performing class is HI (Hindi), with an F1 of 23.80%, due to its very low recall of 14.30%. Compared to the results of 10-fold cross-validation, we can see that the HI class performed worst. Arguably due to the original dataset's size and topic bias, overall system performance dropped significantly on the new test set. Additional datasets should be considered in the future.

Table 3: Class-wise test results provided by the organisers

Class     Precision   Recall       F1
BE           56.20%   83.20%   67.10%
HI           69.20%   14.30%   23.80%
KA           40.50%   66.20%   50.30%
MA           46.70%   54.30%   50.30%
TA           51.10%   48.00%   49.50%
TE           33.30%   55.60%   41.70%
Overall                        48.80%
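The estimate in Table 2 comes from stratified 10-fold cross-validation; a sketch of how such per-fold macro-F1 scores can be obtained with scikit-learn, where the stand-in data, the shuffling, and the seed are assumptions made only to keep the example runnable:

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((60, 20))               # stand-in for the real feature matrix
y = np.repeat(["HI", "TE", "TA"], 20)  # stand-in class labels

# Stratified folds preserve the per-class proportions (Table 1) in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(SGDClassifier(loss="hinge"), X, y,
                         cv=cv, scoring="f1_macro")
print(scores.mean(), scores.std())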
6   CONCLUSION AND FURTHER WORK
In this paper, we experimented with the task of Native Language Identification (NLI). We used two types of features, character and word n-grams, in a machine learning setup with a Support Vector Machine (SVM) classifier trained by Stochastic Gradient Descent (SGD), on data from the INLI corpus, which consists of six different native languages of the Indian subcontinent.

There are a couple of directions for future work. The related literature contains some relevant NLI approaches that could be tested on the data explored in this paper, among them analyses of feature diversity and interaction [11] and common error analysis by language [4]. Due to the lack of corpora for the languages investigated in this study, the application of more linguistically sophisticated features is limited, but it remains to be explored in the future. For example, an English parser designed for social media texts3 could be used to study the overall structure of grammatical constructions as captured by context-free grammar production rules. Another possible improvement is the use of classifier ensembles to increase classification accuracy; this has previously been applied to English NLI [17] with good results.

3 http://www.cs.cmu.edu/~ark/TweetNLP/
A   LEAVE-ONE-OUT CLASSIFIER VALIDATION
In addition, we performed Leave-One-Out (LOO) cross-validation. This validation technique is appropriate, first, because the training dataset is relatively small (approximately 200 samples per class) and, second, because the training set used for the final classifier is approximately the same size as the training sets in LOO validation (all samples but one). Fig. 3 shows the accumulated confusion matrix from the 1233 validation runs.

The final F1 measure is 90.90%.

Figure 3: Accumulated confusion matrix from LOO validation on the INLI dataset.
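A sketch of this procedure with scikit-learn: the held-out predictions from the single-sample folds are accumulated first, and the macro F1 is then computed once over all of them, mirroring the accumulated confusion matrix of Fig. 3. The data here is a random stand-in for the real feature matrix:

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
X = rng.random((30, 10))               # stand-in feature matrix
y = np.repeat(["HI", "TE", "TA"], 10)  # stand-in labels

# One fit per held-out sample (1233 runs on the full INLI training set);
# each held-out prediction is accumulated before scoring.
preds = cross_val_predict(SGDClassifier(loss="hinge"), X, y, cv=LeaveOneOut())
print(f1_score(y, preds, average="macro"))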
REFERENCES
[1] Julian Brooke and Graeme Hirst. 2013. Native language detection with 'cheap' learner corpora. In Twenty Years of Learner Corpus Research. Looking Back, Moving Ahead: Proceedings of the First Learner Corpus Research Conference (LCR 2011), Vol. 1. Presses universitaires de Louvain, 37.
[2] Serhiy Bykh and Detmar Meurers. 2012. Native Language Identification using Recurring n-grams - Investigating Abstraction and Domain Dependence. In COLING. 425–440.
[3] Brendan Flanagan, Chengjiu Yin, Takahiko Suzuki, and Sachio Hirokawa. 2014. Classification and clustering English writing errors based on native language. In Advanced Applied Informatics (IIAI-AAI), 2014 IIAI 3rd International Conference on. IEEE, 318–323.
[4] Brendan Flanagan, Chengjiu Yin, Takahiko Suzuki, and Sachio Hirokawa. 2015. Prediction of Learner Native Language by Writing Error Pattern. Springer International Publishing, Cham, 87–96.
[5] Binyam Gebrekidan Gebre, Marcos Zampieri, Peter Wittenburg, and Tom Heskes. 2013. Improving native language identification with tf-idf weighting. In The 8th NAACL Workshop on Innovative Use of NLP for Building Educational Applications (BEA8). 216–223.
[6] Scott Jarvis, Yves Bestgen, and Steve Pepper. 2013. Maximizing Classification Accuracy in Native Language Identification. In BEA@NAACL-HLT. 111–118.
[7] Scott Jarvis and Scott A. Crossley. 2012. Approaching Language Transfer Through Text Classification: Explorations in the Detection-based Approach. Vol. 64. Multilingual Matters.
[8] Ekaterina Kochmar. 2011. Identification of a Writer's Native Language by Error Analysis. Master's thesis, University of Cambridge.
[9] Moshe Koppel, Jonathan Schler, and Kfir Zigdon. 2005. Determining an author's native language by mining a text for errors. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. ACM, 624–628.
[10] Anand Kumar M, Barathi Ganesh HB, Shivkaran S, Soman K P, and Paolo Rosso. 2017. Overview of the INLI PAN at FIRE-2017 Track on Indian Native Language Identification. In Notebook Papers of FIRE 2017. CEUR Workshop Proceedings, Bangalore, India.
[11] Shervin Malmasi and Aoife Cahill. 2015. Measuring Feature Diversity in Native Language Identification. In Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics, Denver, Colorado, 49–55. http://aclweb.org/anthology/W15-0606
[12] Shervin Malmasi and Mark Dras. 2014. Arabic Native Language Identification. In Proceedings of the Arabic Natural Language Processing Workshop (EMNLP 2014). Association for Computational Linguistics, Doha, Qatar, 180–186. http://aclweb.org/anthology/W14-3625
[13] Shervin Malmasi and Mark Dras. 2014. Chinese Native Language Identification. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL-14). Association for Computational Linguistics, Gothenburg, Sweden, 95–99. http://aclweb.org/anthology/E14-4019
[14] Shervin Malmasi and Mark Dras. 2014. Finnish Native Language Identification. In Proceedings of the Australasian Language Technology Workshop (ALTA). Melbourne, Australia, 139–144. http://www.aclweb.org/anthology/U14-1020
[15] Shervin Malmasi, Mark Dras, and Irina Temnikova. 2015. Norwegian Native Language Identification. In Proceedings of Recent Advances in Natural Language Processing (RANLP 2015). Association for Computational Linguistics, Hissar, Bulgaria, 404–412.
[16] Gerald R. McMenamin. 2002. Forensic Linguistics: Advances in Forensic Stylistics. CRC Press.
[17] Joel R. Tetreault, Daniel Blanchard, and Aoife Cahill. 2013. A Report on the First Native Language Identification Shared Task. In BEA@NAACL-HLT. 48–57.
[18] Maolin Wang, Qi Gong, Jie Kuang, and Ziyu Xiong. 2012. The development of a Chinese learner corpus. In Speech Database and Assessments (Oriental COCOSDA), 2012 International Conference on. IEEE, 1–6.
[19] Sze-Meng Jojo Wong, Mark Dras, and Mark Johnson. 2012. Exploring adaptor grammars for native language identification. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 699–709.