=Paper=
{{Paper
|id=Vol-2036/T4-6
|storemode=property
|title=DalTeam@INLI-FIRE-2017: Native Language Identification using SVM with SGD Training
|pdfUrl=https://ceur-ws.org/Vol-2036/T4-6.pdf
|volume=Vol-2036
|authors=Dijana Kosmajac,Vlado Keselj
|dblpUrl=https://dblp.org/rec/conf/fire/KosmajacK17
}}
==DalTeam@INLI-FIRE-2017: Native Language Identification using SVM with SGD Training==
Dijana Kosmajac, Vlado Keselj
Dalhousie University, Faculty of Computer Science, Halifax, Nova Scotia, Canada
dijana.kosmajac@dal.ca, vlado@cs.dal.ca

ABSTRACT

Native Language Identification (NLI), a variant of the Language Identification task, focuses on determining an author's native language based on a writing sample in their non-native language. In recent years, the challenging nature of NLI has drawn much attention from the research community. Its application and importance are relevant in many fields, such as personalization of a new language learning environment, personalized grammar correction, and authorship attribution in forensic linguistics. We participated in the INLI Shared Task 2017, held in conjunction with the FIRE 2017 conference. To implement a machine learning method for Native Language Identification, we used character and word n-grams with an SVM (Support Vector Machine) classifier trained with the SGD (Stochastic Gradient Descent) method. We achieved an F1 measure of 89.60% (using 10-fold cross-validation) on the provided social media dataset, and 48.80% was reported in the final testing done by the INLI workshop organisers.

CCS CONCEPTS

• Computing methodologies → Supervised learning by classification; Classification and regression trees; • Social and professional topics → Cultural characteristics;

KEYWORDS

Native Language Identification, Support Vector Machines, Stochastic Gradient Descent, N-Grams, Text Classification

1 INTRODUCTION

Since the 1950s there has been a discussion in the linguistic literature on whether and how native speakers of particular languages have characteristic patterns in sentence generation in their second language. This has been investigated in different domains and from different aspects, including qualitative research in Second Language Acquisition (SLA) and, more recently, through predictive computational models in NLP [7] and in linguistic forensics [16].

In addition, a speaker's native language can have an effect on the types of errors they make. A study by Flanagan et al. [3] investigates the characteristics of errors by native language. They identified the differences and similarities of error co-occurrence characteristics of the following native languages: Chinese, Japanese, Korean, Spanish, and Taiwanese. They have shown that some languages have greater differences than others (Korean and Japanese learners tend to make similar mistakes).

This has motivated research in Native Language Identification (NLI), which was first defined as a text classification task by Koppel et al. [9], using a classifier with a set of lexical features such as function words, character n-grams, and Part-of-Speech (PoS) n-grams. The task, in general, focuses on identifying a speaker's native language from samples of text written in a second language.

One of the main challenges for this task is the lack of corpora of appropriate size, class balance and topic homogeneity. So far, a couple of datasets have been used in past research. The International Corpus of Learner English (ICLE)¹ is one of the first, appearing in early studies. Released in 2002 and updated in 2009, it became commonly used in research into native language prediction of learner writing. Brooke and Hirst [1] suggested that ICLE has problems that can lead to a drop in performance when evaluated. They proposed additional corpora that might be useful in the task of native language prediction. They used data from a language learning SNS — Lang-8.com — and showed improved performance. Another corpus [17] was presented in a shared task on Native Language Identification of learners. The corpus, named TOEFL11, contains essays in English by learners from 11 different native languages.

¹ https://uclouvain.be/en/research-institutes/ilc/cecl/corpora.html

The approach we present is based on a linear Support Vector Machine classifier trained using the Stochastic Gradient Descent method. As features, we used character and word n-grams. In addition, we used the tf-idf weighting technique with χ² feature selection. We used a dataset provided by the Workshop organisers.

The rest of the paper is organised as follows: in Section 2 we present some of the research most recent and relevant to our experiments. Section 3 gives a short description of the dataset, using the information provided by the organisers. In Section 4 we present the experimental setup, with details on data preprocessing, feature selection and weighting, and classifier setup. Section 5 shows and discusses the results. In Section 6 we outline conclusions and further work.

2 RELATED WORK

Research in the NLI domain is fairly recent. We present some of the work most relevant to our experiments.

Kochmar [8] presented experiments on prediction of the native languages of Indo-European learners through binary classification tasks using an SVM with a linear kernel. They divided the native languages into two main groups, Germanic and Romance, with an intergroup prediction accuracy of 68.4%. The features used for prediction were words and n-grams, and different error types that had been manually tagged within the corpus.

Table 1: INLI training dataset statistics

Language        Number  Percentage
Hindi (HI)      211     17.11%
Telugu (TE)     210     17.03%
Tamil (TA)      207     16.79%
Kannada (KA)    203     16.46%
Bengali (BE)    202     16.38%
Malayalam (MA)  200     16.22%
Total           1233    100%

Wong et al. [19] analyzed learner writing with an extension of Adaptor Grammars for detecting collocations at the word level, as well as for POS and function words.
Classification was performed at the document level by parsing individual sentences of the learner's writing to detect the native language, with the final prediction based on a majority score over the sentences. Some notable characteristic features of languages extracted by this method were also discussed.

Bykh and Meurers [2] discussed the use of recurring n-grams of variable lengths as features for training a native language classifier. They also incorporated POS features. They claim that their approach outperformed previous work under a comparable data setup (ICLE corpus), reaching 89.71% accuracy for a task with seven native languages.

Jarvis et al. [6] was the best performing participant in the earlier mentioned shared task by Tetreault et al. [17]. They analyzed a set of features such as word n-grams, POS n-grams, character n-grams, and lemma n-grams. On top of these, they used an SVM classifier. The prediction performance was evaluated on several different models with varying combinations of features.

Malmasi et al. [12–15] presented the first NLI experiments on Arabic² (Arabic Learner Corpus, ALC), Chinese (Chinese Learner Corpus [18]), Finnish, and Norwegian, the latter using a corpus of examination essays collected from learners of Norwegian. Given the differences between English and the aforementioned languages, the main objective was to determine whether NLI techniques previously applied to second language English can be effective for detecting native language transfer effects in other second languages.

3 DATASET

The dataset used in the experiment was provided by the organizers of the INLI Workshop [10]. The organizers identified the official Facebook pages of prominent regional language newspapers of each region and extracted the comments. The dataset consists of six classes: six languages of the Indian subcontinent originating from different Indian states. As shown in Table 1, the dataset is divided into classes named TA, MA, HI, BE, TE and KA. The dataset has the following characteristics:

• It is balanced in terms of the number of samples for each language;
• Native and mixed script text is removed from the comments;
• The comments are related to general news from all over India, in order to avoid topic bias.

4 EXPERIMENTAL METHODOLOGY

This paper presents a supervised multi-class classification approach. The training data texts are labeled with classes according to the author's native language. Figure 1 shows a diagram of the classifier components.

Figure 1: Architecture of the system.

4.1 Data Preprocessing

4.1.1 Cleaning. Preparing and normalising the dataset are the first and necessary subtasks prior to selection and classification. This includes filtering and adjusting the raw texts to make them suitable as input for the next subtask. In general, social media user-generated texts are likely to be very noisy, containing textual elements irrelevant to the observed classification task. Hence, some parts of the comments, including hashtags, mentions and links, were not considered as part of the feature set.

4.1.2 Feature Extraction. Our model uses character n-grams of order 2–5. These n-grams capture small and localised syntactic patterns within a word of language production. Additionally, we used word n-grams of order 1–2. Our preliminary experiments showed that these n-gram lengths give the best accuracy (a possible reason is data sparsity).

4.1.3 χ² feature selection. The formula for χ² feature selection can be expressed as follows:

    \chi^2(M, t, c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}}    (1)

where M is a message (a Facebook comment), t is a feature and c is a class. N is the observed frequency in M and E the expected frequency. The subscripts e_t and e_c can take values 0 or 1. For example, N_{e_t=1, e_c=0} means that feature t is in N messages that are not in class c.
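The extraction and selection steps above, combined with the SGD-trained linear SVM used as the classifier, map directly onto scikit-learn components (the library the implementation is based on). The following is a minimal sketch, not the authors' code: the n-gram orders, the χ² selection, and the hinge-loss SGD classifier follow the text, while the toy `comments` and `labels`, the `char_wb` analyzer choice, and every parameter not stated in the paper are assumptions; scikit-learn's built-in tf-idf is used here rather than the paper's normalized-tf variant.

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import SGDClassifier

# Toy stand-ins for the INLI Facebook comments and native-language labels.
comments = [
    "great news coverage today",
    "this article is very helpful",
    "the match report was excellent",
    "please post more local updates",
]
labels = ["HI", "TE", "HI", "TE"]

pipeline = Pipeline([
    # Character 2-5-grams (within word boundaries) and word 1-2-grams,
    # both tf-idf weighted and concatenated into one sparse matrix.
    ("features", FeatureUnion([
        ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))),
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ])),
    # Chi-squared feature selection; the paper keeps 50,000 features,
    # "all" is used here only so the toy data does not underrun the budget.
    ("select", SelectKBest(chi2, k="all")),
    # Linear SVM (hinge loss) trained with SGD, 'optimal' learning-rate schedule.
    ("clf", SGDClassifier(loss="hinge", learning_rate="optimal",
                          alpha=1e-4, random_state=0)),
])

pipeline.fit(comments, labels)
print(pipeline.predict(["more excellent match coverage"]))
```

On real data, `k=50000` in `SelectKBest` would reproduce the feature budget used in the experiments.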
² http://www.arabiclearnercorpus.com/

We selected 50,000 features.

4.2 TF-IDF Weighting

Tf-idf (term frequency – inverse document frequency) is one of the best-known weighting algorithms. Several newer methods adapt tf-idf for use as part of their process, and many others rely on the same fundamental concept. Idf, the measure's key part, was introduced in a 1972 paper by Karen Spärck Jones. As suggested in the study by Gebre et al. [5], we opted for the tf-idf measure in our experiment.

Tf-idf is the product of two measures, term frequency and inverse document frequency. In the literature, different variations can be found. In this work we have used normalized term frequency, to reduce the bias caused by differing lengths of text samples:

    \mathrm{ntf}(t, d) = \frac{f_{t,d}}{\max\{ f_{t',d} : t' \in d \}}    (2)

    \mathrm{idf}(t, d) = \log \frac{N_{\mathrm{comments}}}{1 + \sum \mathrm{ntf}(t, d_{\mathrm{comments}})}    (3)

The final weight is expressed as follows:

    \mathrm{weight}(t, d) = \mathrm{ntf}(t, d) \cdot \mathrm{idf}(t, d)    (4)

4.3 Classifier

In the experiments we used a linear SVM (Support Vector Machine) to perform multi-class classification. SVM was chosen primarily because it has shown effectiveness for this particular task [17], which we confirmed in our preliminary experiments. The implementation is based on the Python library scikit-learn, where we used a linear SVM with SGD (Stochastic Gradient Descent) training.

The textual training samples x are represented as d-dimensional vectors. A vector x is classified by looking at the sign of a linear scoring function ⟨w, x⟩. The goal of learning is to estimate the d-dimensional parameter w so that the score is positive if the vector x belongs to the positive class and negative otherwise.

    \ell_i(\langle w, x \rangle) = \max\{0, 1 - y_i \langle w, x \rangle\}    (5)

    E(w) = \frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \max\{0, 1 - y_i \langle w, x \rangle\}    (6)

    E(w) = \frac{1}{n} \sum_{i=1}^{n} E_i(w), \quad E_i(w) = \frac{\lambda}{2} \|w\|^2 + \ell_i(\langle w, x \rangle)    (7)

SGD can be used to learn an SVM by minimizing E(w). SGD performs gradient steps by considering at each iteration one term E_i(w) selected at random from this average. Conceptually, the algorithm is:

(1) Start with w_0 = 0;
(2) For t = 1, 2, ..., T:
  (a) Sample one index i in 1, ..., n uniformly at random;
  (b) Compute a sub-gradient g_t of E_i(w) at w_t;
  (c) Compute the learning rate η_t;
  (d) Update w_{t+1} = w_t − η_t g_t.

We used a variable learning rate (in scikit-learn, 'optimal'), which is computed as follows:

    \eta_t = \frac{1}{\alpha (t + t_0)}    (8)

where α represents the constant that multiplies the regularization term, and is used in the learning rate calculation.

The goal of the SGD algorithm is to bring the primal suboptimality below a threshold ε_P:

    E(w_t) - E(w^*) \le \epsilon_P    (9)

4.4 Evaluation Measure

As suggested by the INLI 2017 organisers, we used the macro-averaged F1 score as the evaluation measure (Eq. 10).

    P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F_1 = \frac{2 \cdot P \cdot R}{P + R}    (10)

where TP are true positive predictions, FP are false positive predictions, FN are false negative predictions, P represents precision and R represents recall.

5 RESULTS

The results of our final experiment for distinguishing non-native Indian authors of the Facebook comments are shown in the accumulated confusion matrix in Fig. 2. The results show that the features we used are useful for discriminating among non-native comments, achieving an 89.60% F1 measure. The result is based on the mean performance of 10-fold cross-validation.

Table 2: Stratified 10-fold cross-validation

Fold   F1
#1     0.896
#2     0.904
#3     0.896
#4     0.901
#5     0.869
#6     0.907
#7     0.913
#8     0.861
#9     0.918
#10    0.892
Mean   0.896
St.D.  0.018

Figure 2: Accumulated confusion matrix from 10-fold cross-validation on the INLI dataset.

The testing set from the organisers was separate from the dataset which was provided to the Workshop participants. The test results from the organisers, shown in Table 3, report a macro-averaged F1 measure of 48.80%. The best performing class is BE (Bengali), with an F1 measure of 67.10%; the recall for this class is significantly higher than for the other classes. The worst performing class is HI (Hindi), with an F1 measure of 23.80%, due to its very low recall of 14.30%. Compared to the results of 10-fold cross-validation, we can see that the HI class performed worst there as well. However, arguably due to the original dataset size and topic bias, overall system performance dropped significantly on the new test set. Additional datasets should be considered in the future.

Table 3: Class-wise accuracy provided by the organisers

Class    Precision  Recall  F1
BE       56.20%     83.20%  67.10%
HI       69.20%     14.30%  23.80%
KA       40.50%     66.20%  50.30%
MA       46.70%     54.30%  50.30%
TA       51.10%     48.00%  49.50%
TE       33.30%     55.60%  41.70%
Overall                     48.80%

6 CONCLUSION AND FURTHER WORK

In this paper, we experimented on the task of Native Language Identification (NLI). We used two different types of features: character and word n-grams. We used these features in a machine learning setup with a Support Vector Machine (SVM) classifier trained with Stochastic Gradient Descent (SGD), on data from the INLI corpus, which covers six different native languages of the Indian subcontinent.

There are a couple of directions for future work. In the related literature there are some relevant NLI approaches that could be tested on the data explored in this paper, among them analyses of feature diversity and interaction [11] and common error analysis by language [4]. Due to the lack of corpora for the languages investigated in this study, the application of more linguistically sophisticated features is limited, but it remains to be explored in the future. For example, an English parser designed for social media texts³ could be used to study the overall structure of grammatical constructions as captured by context-free grammar production rules. Another possible improvement is the use of classifier ensembles to improve classification accuracy. This has previously been applied to English NLI [17] with good results.

³ http://www.cs.cmu.edu/~ark/TweetNLP/

A LEAVE-ONE-OUT CLASSIFIER VALIDATION

In addition, we performed Leave-One-Out (LOO) cross-validation. This validation technique is appropriate, first, because the training dataset is relatively small (approximately 200 samples per class). Second, the training set used for the final classifier is approximately equal in size to the training sets in LOO validation (all samples but one). Fig. 3 shows the accumulated confusion matrix from 1233 validation runs. The final F1 measure is 90.90%.

Figure 3: Accumulated confusion matrix from LOO validation on the INLI dataset.

REFERENCES

[1] Julian Brooke and Graeme Hirst. 2013. Native language detection with 'cheap' learner corpora. In Twenty Years of Learner Corpus Research. Looking Back, Moving Ahead: Proceedings of the First Learner Corpus Research Conference (LCR 2011), Vol. 1. Presses universitaires de Louvain, 37.
[2] Serhiy Bykh and Detmar Meurers. 2012. Native Language Identification using Recurring n-grams - Investigating Abstraction and Domain Dependence. In COLING. 425–440.
[3] Brendan Flanagan, Chengjiu Yin, Takahiko Suzuki, and Sachio Hirokawa. 2014. Classification and clustering English writing errors based on native language. In Advanced Applied Informatics (IIAI-AAI), 2014 IIAI 3rd International Conference on. IEEE, 318–323.
[4] Brendan Flanagan, Chengjiu Yin, Takahiko Suzuki, and Sachio Hirokawa. 2015. Prediction of Learner Native Language by Writing Error Pattern. Springer International Publishing, Cham, 87–96.
[5] Binyam Gebrekidan Gebre, Marcos Zampieri, Peter Wittenburg, and Tom Heskes. 2013. Improving native language identification with tf-idf weighting. In the 8th NAACL Workshop on Innovative Use of NLP for Building Educational Applications (BEA8). 216–223.
[6] Scott Jarvis, Yves Bestgen, and Steve Pepper. 2013. Maximizing Classification Accuracy in Native Language Identification. In BEA@NAACL-HLT. 111–118.
[7] Scott Jarvis and Scott A. Crossley. 2012. Approaching Language Transfer Through Text Classification: Explorations in the Detection-based Approach. Vol. 64. Multilingual Matters.
[8] Ekaterina Kochmar. 2011. Identification of a writer's native language by error analysis. Master's thesis, University of Cambridge.
[9] Moshe Koppel, Jonathan Schler, and Kfir Zigdon. 2005. Determining an author's native language by mining a text for errors. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. ACM, 624–628.
[10] Anand Kumar M, Barathi Ganesh HB, Shivkaran S, Soman K P, and Paolo Rosso. 2017. Overview of the INLI PAN at FIRE-2017 Track on Indian Native Language Identification. In Notebook Papers of FIRE 2017. CEUR Workshop Proceedings, Bangalore, India.
[11] Shervin Malmasi and Aoife Cahill. 2015. Measuring Feature Diversity in Native Language Identification. In Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics, Denver, Colorado, 49–55. http://aclweb.org/anthology/W15-0606
[12] Shervin Malmasi and Mark Dras. 2014. Arabic Native Language Identification. In Proceedings of the Arabic Natural Language Processing Workshop (EMNLP 2014). Association for Computational Linguistics, Doha, Qatar, 180–186. http://aclweb.org/anthology/W14-3625
[13] Shervin Malmasi and Mark Dras. 2014. Chinese Native Language Identification. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL-14). Association for Computational Linguistics, Gothenburg, Sweden, 95–99. http://aclweb.org/anthology/E14-4019
[14] Shervin Malmasi and Mark Dras. 2014. Finnish Native Language Identification. In Proceedings of the Australasian Language Technology Workshop (ALTA). Melbourne, Australia, 139–144. http://www.aclweb.org/anthology/U14-1020
[15] Shervin Malmasi, Mark Dras, and Irina Temnikova. 2015. Norwegian Native Language Identification. In Proceedings of Recent Advances in Natural Language Processing (RANLP 2015). Association for Computational Linguistics, Hissar, Bulgaria, 404–412.
[16] Gerald R. McMenamin. 2002. Forensic linguistics: Advances in forensic stylistics. CRC Press.
[17] Joel R. Tetreault, Daniel Blanchard, and Aoife Cahill. 2013. A Report on the First Native Language Identification Shared Task. In BEA@NAACL-HLT. 48–57.
[18] Maolin Wang, Qi Gong, Jie Kuang, and Ziyu Xiong. 2012. The development of a Chinese learner corpus. In Speech Database and Assessments (Oriental COCOSDA), 2012 International Conference on. IEEE, 1–6.
[19] Sze-Meng Jojo Wong, Mark Dras, and Mark Johnson. 2012. Exploring adaptor grammars for native language identification. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 699–709.