BMSCE_ISE@INLI-FIRE-2017: A simple n-gram based approach for Native Language Identification

Sowmya Lakshmi B S, Dr. Shambhavi B R
Department of ISE, BMS College of Engineering, Bangalore, India
sowmyalakshmibs.ise@bmsce.ac.in, shambhavibr.ise@bmsce.ac.in

ABSTRACT
Native Language Identification (NLI) aims to identify the native language L1 of an author by analysing text written by him or her in another language L2. NLI is often implemented as a supervised classification problem. In this paper, we report an NLI system built on character tri-grams, word uni-grams and word bi-grams, using a linear classifier, Support Vector Machines (SVM). The system participated in the Indian Native Language Identification@FIRE 2017 shared task, achieving an overall accuracy of 0.27 on a corpus covering 6 native languages. In subsequent evaluations, the best accuracy obtained was 0.73 with 10-fold cross-validation on the training data; this score was reached by combining word uni-grams and bi-grams.

KEYWORDS
Language Identification; Supervised Classification; Feature Selection.

1 INTRODUCTION
Author profiling has recently gained importance for improving the performance of applications such as forensics, security and marketing. It aims to detect an author's attributes such as age, educational level and native language. Native Language Identification (NLI) is a sub-task of author profiling in which the native language L1 of a writer is automatically detected by analysing text written in a second language L2. NLI is usually implemented as a multiclass supervised classification task.

The applications of NLI fall into two categories: security-related applications and Second Language Acquisition (SLA)-related applications. Security-related applications include identifying phishing sites or spam e-mails, which often contain unusual sentences that may have been written by non-native speakers. SLA-related applications analyse the effect of L1 on later learned languages.

As shown by preceding work in this area, there exist quite a few linguistic cues that help in predicting the native language. Under the influence of their native language, authors tend to make characteristic mistakes in spelling, punctuation and grammar when writing in another language.

In this work, we examine the possibility of building native language classifiers while ignoring grammatical errors and semantic analysis of the text written in L2. A naive set of features based on n-grams of words and characters is explored to develop the NLI system.

2 PREVIOUS WORK
The system presented in this study participated in the Indian Native Language Identification@FIRE 2017 shared task. Several researchers have investigated NLI and related problems; an overview of a few common methods used for NLI prior to this shared task is provided here.

Most researchers have framed NLI as a supervised classification task in which classifiers are trained on data from different L1s. The most commonly used features are character n-grams, POS n-grams, content words, function words and spelling mistakes. SVM models trained on these features [1-3] obtained accuracies of 60%-80%.

More recently, word embeddings and document embeddings have gained attention alongside other features. Continuous Bag of Words (CBOW) and Skip-gram models were used to obtain word embedding vectors, and vector representations of documents were generated with the distributed bag-of-words architecture of the Doc2Vec tool. In [4], the authors developed a native language classifier using document and word embeddings, reaching an accuracy of 82% on essays and 42% on speech data.

LIBSVM, a variant of SVM, has proven effective for text classification. In [5], the authors developed an NLI system for Arabic using LIBSVM, combining production rules, function words and POS bi-grams, and obtained an accuracy of 45%.

The first NLI shared task was organised with the BEA workshop in 2013. A system that participated in the closed training task is presented in [6]. The model was trained on the 11 L1 languages of the TOEFL11 corpus, and cross-validation on unseen essays gave an accuracy of about 84.55%. The authors adopted features such as n-grams of words, characters and POS tags, together with spelling errors and TF-IDF weighting, to train an SVM model.

In [7], the authors report a system that participated in the essay track of the second NLI Shared Task 2017, held at the BEA-12 workshop. A novel two-level stacked sentence-document architecture was introduced, based on lexical and grammatical features of the text. A stack of two SVM classifiers was used: the first a sentence classifier and the second a document classifier. The first classifier predicted the native language of each sentence of a document, and these predictions were used as features by the document classifier. The final system predicted the native language of unseen documents with an F1-score of 0.88.

3 TASK DESCRIPTION AND DATA
NLI has drawn the attention of many researchers in recent years, and this growing interest led to the INLI@FIRE 2017 shared task [8]. The task focuses on identifying the native language of a writer based on his or her writing in another language; in this case, the second language was English. The task was to predict the native language of a writer from a given Text/XML file containing Facebook comments in English. Six Indian languages were considered: Tamil, Hindi, Kannada, Malayalam, Bengali and Telugu.

Dataset
The training dataset consisted of XML files, each containing a set of Facebook comments in English written by speakers of different native languages. The XML files were annotated as BE, HI, KA, MA, TE and TA for Bengali, Hindi, Kannada, Malayalam, Telugu and Tamil respectively. Table 1 shows the training data statistics for the task.

Table 1: Training data

Native Language    Files
Bengali            202
Hindi              211
Kannada            203
Telugu             207
Malayalam          200
Tamil              210

4 FEATURES
NLI has been formulated as a multiclass classification task. We used language-independent features, namely character tri-grams and word n-grams, as described in [1]. Previous work observed that character tri-grams are useful for NLI and suggested that this might be due to the influence of the author's native language. To reflect this, we compute character n-grams and word n-grams as features. For characters, we consider tri-grams. The features are generated over the entire training data, i.e., every tri-gram in the training dataset is used as a feature. Similarly, uni-grams and bi-grams of words were used as separate features.

5 APPROACH
The training dataset consisted of XML files containing Facebook comments in English written by speakers of different native languages, with each file annotated with the native language of the speaker. As a preprocessing step, these XML files were scraped to extract the Facebook comments, and comments belonging to the same native language were saved into a single text file. We then extracted features from the generated text files and developed two methods for NLI using Python, as explained below.
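Before describing the two methods, the following is a minimal Python sketch of this preprocessing step. The element name (comment) and the file-naming convention carrying the language tag are hypothetical, since the exact INLI@FIRE 2017 XML schema is not reproduced here; the sketch only illustrates the idea of pulling comment text out of each annotated file and appending it to one text file per language.

import os
import xml.etree.ElementTree as ET

def collect_comments(xml_dir, out_dir):
    """Group Facebook comments by native-language annotation.

    Assumes file names carry the language tag (e.g. HI_0012.xml) and that
    comments are stored as text inside <comment> elements; both assumptions
    are illustrative, not the actual shared-task schema.
    """
    os.makedirs(out_dir, exist_ok=True)
    for name in os.listdir(xml_dir):
        if not name.endswith(".xml"):
            continue
        lang = name.split("_")[0]              # e.g. "HI", "TA", ...
        tree = ET.parse(os.path.join(xml_dir, name))
        comments = [el.text.strip() for el in tree.iter("comment") if el.text]
        with open(os.path.join(out_dir, lang + ".txt"), "a", encoding="utf-8") as f:
            f.write("\n".join(comments) + "\n")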
Character tri-grams method
The tri-gram model reads the text files and extracts all character tri-grams (sequences of three bytes) together with their counts. Tri-gram frequencies are collected separately for every training language. For each language, the frequencies are normalised by dividing the individual tri-gram counts by the total number of tri-grams in that language's training corpus, and the tri-grams are sorted by this relative frequency (the probability of the tri-gram in the given corpus), yielding a language model for that language. A language model was created for each language in the provided corpus.

The relative frequencies of the tri-grams in a test document are then calculated and compared with the tri-grams in the language models. Intuitively, the tri-gram frequencies extracted from two different texts written by speakers of the same native language should be very similar. The absolute difference is calculated by subtracting the relative frequency of each tri-gram in the test document from the relative frequency of the corresponding tri-gram in each language model, and these absolute differences are summed up. For instance, if a test document is compared with 5 language models, we obtain 5 different sums of absolute differences; the minimum value indicates the best match for the test document. Algorithm 1 describes the character tri-gram approach.

Algorithm 1: Character tri-grams method
Input: Train Dataset for each language, Test Dataset
Output: Native language of each Test Dataset document
begin
    for each language in language set
        for each document in Train Dataset of language
            derive all possible tri-grams
        end for
        language model <- relative frequency of each tri-gram in Train Dataset of language
    end for
    for every document in Test Dataset
        derive all possible tri-grams
        calculate relative frequency of each tri-gram in the document
    end for
    for each language in language models
        for every tri-gram in the Test Dataset document
            absolute difference <- (relative frequency of tri-gram in Test Data)
                                   - (relative frequency of corresponding tri-gram in language model)
        end for
        sum up the calculated absolute differences for the language model
    end for
    best match <- the language model with the minimum sum of absolute differences
end
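As an illustration of Algorithm 1, the following is a minimal Python sketch of the character tri-gram method. The function names (trigram_frequencies, build_language_models, predict_language) are our own, not part of the shared-task code, and the sketch assumes the training comments of each language have already been concatenated into one string per language, as in the preprocessing described above.

from collections import Counter

def trigram_frequencies(text):
    """Relative frequency of every character tri-gram in a text."""
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}

def build_language_models(train_texts):
    """train_texts: dict mapping language label -> concatenated training text."""
    return {lang: trigram_frequencies(text) for lang, text in train_texts.items()}

def predict_language(test_text, language_models):
    """Return the language whose model gives the smallest summed absolute
    difference of tri-gram relative frequencies w.r.t. the test document."""
    test_freqs = trigram_frequencies(test_text)
    scores = {}
    for lang, model in language_models.items():
        scores[lang] = sum(abs(freq - model.get(gram, 0.0))
                           for gram, freq in test_freqs.items())
    return min(scores, key=scores.get)

# Example usage (toy data):
# models = build_language_models({"HI": "some english text by a hindi l1 writer ...",
#                                 "TA": "some english text by a tamil l1 writer ..."})
# print(predict_language("an unseen facebook comment", models))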
Word n-grams method
Frequencies of word uni-grams and bi-grams were collected for each language, irrespective of word meaning and word order within a document. We instantiated the CountVectorizer module in Python to obtain the word uni-grams and bi-grams. A document-term matrix X[i, j] was formed, where i is the document id, j is the dictionary index of each word n-gram, and X[i, j] is the frequency of occurrence of n-gram j in document i. Each uni-gram and bi-gram of the test data was compared with its frequency of occurrence in the documents of all languages.

In this experiment, we used the document-term matrix over these n-grams and applied the linear SVM from scikit-learn as the classification algorithm for NLI; a sketch of this pipeline is given below.
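The following is a minimal scikit-learn sketch of the word n-gram method. It assumes the preprocessed data are available as two parallel lists, documents and labels (one comment text and its language tag per entry); these names, and the classifier settings beyond the (1, 2) n-gram range, are illustrative rather than the exact shared-task configuration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def build_word_ngram_classifier():
    # Document-term matrix over word uni-grams and bi-grams, fed to a linear SVM.
    return make_pipeline(
        CountVectorizer(analyzer="word", ngram_range=(1, 2)),
        LinearSVC(),
    )

def evaluate(documents, labels):
    clf = build_word_ngram_classifier()
    # 10-fold cross-validation on the training data, as reported in Section 6.
    scores = cross_val_score(clf, documents, labels, cv=10)
    return scores.mean()

# Example usage (with the full training set loaded into documents/labels):
# print("10-fold CV accuracy:", evaluate(documents, labels))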
6 RESULT ANALYSIS
We submitted the output of the system on the provided test data to the INLI@FIRE 2017 shared task. A single run of each method for the six languages was submitted, and the results of native language classification are summarised in Table 2 and Table 3. The character tri-gram model achieved an overall accuracy of 22%, while the word n-grams model achieved an overall accuracy of 27%.

The combined word uni-gram and bi-gram features were also evaluated with 10-fold cross-validation on the training data; with these features an improved accuracy of 73% was achieved.

Table 2: Character tri-grams

Class    Prec.    Rec.     F1
BE       0.40     0.292    0.338
HI       0.50     0.080    0.160
KA       0.117    0.270    0.163
MA       0.173    0.641    0.272
TA       0.533    0.080    0.139
TE       0.267    0.383    0.315
Overall Accuracy           0.22

Table 3: Word n-grams

Class    Prec.    Rec.     F1
BE       0.389    0.551    0.456
HI       0.545    0.072    0.127
KA       0.190    0.446    0.266
MA       0.223    0.315    0.261
TA       0.218    0.260    0.237
TE       0.154    0.123    0.137
Overall Accuracy           0.27

7 CONCLUSION
In this paper, a supervised system for Indian Native Language Identification has been presented. We described character tri-grams, word uni-grams and word bi-grams, a subset of the features most frequently used for the NLI task, and reported the results of supervised classification with these features on a test set covering 6 languages as part of the INLI@FIRE 2017 shared task. Our future work lies in improving the performance of the NLI system by considering features that can discriminate between native languages more effectively.

REFERENCES
[1] Nicolai, Garrett, Md Asadul Islam, and Russ Greiner. "Native Language Identification using probabilistic graphical models." In 2013 International Conference on Electrical Information and Communication Technology (EICT). IEEE, 2014.
[2] Abu-Jbara, A., Jha, R., Morley, E., and Radev, D. R. "Experimental Results on the Native Language Identification Shared Task." In BEA@NAACL-HLT, pp. 82-88, 2013.
[3] Mizumoto, T., Hayashibe, Y., Sakaguchi, K., Komachi, M., and Matsumoto, Y. "NAIST at the NLI 2013 Shared Task." In BEA@NAACL-HLT, pp. 134-139, 2013.
[4] Vajjala, Sowmya, and Sagnik Banerjee. "A study of N-gram and Embedding Representations for Native Language Identification." In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 240-248, 2017.
[5] Mechti, Seifeddine, Lamia Hadrich Belguith, Ayoub Abbassi, Rim Faiz, and Carthage IHEC. "An empirical method using features combination for Arabic native language identification."
[6] Gebre, B. G., Zampieri, M., Wittenburg, P., and Heskes, T. "Improving native language identification with TF-IDF weighting." In Proceedings of the 8th NAACL Workshop on Innovative Use of NLP for Building Educational Applications (BEA8), pp. 216-223, 2013.
[7] Cimino, A., and Dell'Orletta, F. "Stacked Sentence-Document Classifier Approach for Improving Native Language Identification." In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 430-437, 2017.
[8] Anand Kumar M, Barathi Ganesh HB, Shivkaran S, Soman K P, and Paolo Rosso. "Overview of the INLI PAN at FIRE-2017 Track on Indian Native Language Identification." In Notebook Papers of FIRE 2017, Bangalore, India, December 8-10, CEUR Workshop Proceedings, 2017.