Team WebArch at FIRE-2018 Track on Indian Native Language Identification

Aman Gupta
SRM University, Kattankulathur, Chennai, Tamil Nadu, India
aman304gupta@gmail.com

Abstract. Native Language Identification (NLI) is the task of identifying the native language (L1) of an individual based on his or her language production in a learned language (L2). It is essentially a classification task in which L1 is predicted from a fixed set of candidate languages. In this task the goal is to identify an individual's native language among the following six Indian languages: Bengali, Hindi, Kannada, Malayalam, Tamil, and Telugu, using Facebook comments written in English (L2). In this paper I propose to use machine learning classification models together with N-grams as features and TF-IDF as the vectorizer.

Keywords: Native Language Identification · Natural Language Processing · Classification

1 Introduction

Native Language Identification (NLI) is the task of classifying the native language (L1) of an individual from a given set of languages, based on his or her writing in another language (L2). NLI involves identifying language-use patterns that are common to groups of speakers who share the same native language. The native language of an individual influences both the choice of words and the errors that a person makes when writing in another language. The task is usually treated as a classification problem, in which a machine learning algorithm is trained in a supervised manner and then used to predict the native language behind unseen user text.

Predicting the native language of a writer has applications in several fields. It can be used for authorship identification, forensic analysis, tracing linguistic influence in potentially multi-author texts, and naturally to support Second Language Acquisition research. In the field of cyber security, NLI can be used to determine the native language of the author of a suspicious or threatening text. In education, NLI can support applications such as grammatical error correction systems that personalize their feedback and model behaviour to the native language of the user.

In this work I use classification models such as Logistic Regression, Linear SVC, and Naive Bayes, among others, together with N-gram features at both the character and word level, to find the best-performing model for classifying the L1 of a writer into the given set of languages.
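To make the approach concrete, the following is a minimal sketch of the kind of pipeline used throughout this paper: TF-IDF-weighted word n-grams feeding a linear classifier, built with scikit-learn. The toy comments and labels here are invented purely for illustration; the actual experiments use the FIRE-2018 dataset described in Section 3.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    # Toy data for illustration only; the real task uses Facebook comments
    # labelled with one of six Indian L1s (see Section 3).
    comments = ["this movie is super hit anna", "da machi semma scene",
                "bhai this song is too good", "chetta this picture is nice"]
    labels = ["Telugu", "Tamil", "Hindi", "Malayalam"]

    # TF-IDF over word unigrams, followed by a linear classifier.
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 1))),
        ("clf", LogisticRegression()),
    ])
    pipeline.fit(comments, labels)
    print(pipeline.predict(["machi that was a semma movie"]))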
2 Related Work

NLI research has mostly focused on texts, using both lexical and syntactic features. The resulting models try to extract patterns that speakers with different native languages exhibit, such as characteristic misspellings, mispronunciations, or usage frequencies of particular words. Some languages also carry distinctive stylistic tendencies; for example, Japanese tends to be much more formal in register, French and Spanish are often perceived as romantic because of their gentler vocabulary, and Russian and German can be perceived as harsh because of their stronger vocabulary.

Kumar et al. [1] published an overview of the FIRE-2017 track, which, like this one, addressed Native Language Identification using comments posted by individuals on social networking sites. Malmasi et al. [2] published a report on the 2017 Native Language Identification Shared Task, describing the various approaches taken by participants. Malmasi and Dras [4] tested a range of linear classifiers and observed that state-of-the-art results in 2017 were achieved by an ensemble model; the features they used were simple unigrams, bigrams, and character n-grams, and they also found that character-level features generally outperform word-level features for NLI. Tsur and Rappoport [6] (2007) achieved an accuracy of 66% using only character bigrams. Taking a rather different approach, Swanson and Charniak [7] used Tree Substitution Grammars (TSGs), and Wong and Dras [8] explored production rules from two parsers in 2011.

3 Task Description and Data

The dataset provided by the task organizers contains texts written in English by speakers of six different native Indian languages, namely Tamil, Telugu, Kannada, Malayalam, Bengali, and Hindi. The data was collected from the social networking site Facebook. The distribution of classes and training instances is shown in Table 1; the number of instances for each of the six languages is roughly the same.

    Language     Training Instances
    Bengali      202
    Hindi        211
    Kannada      203
    Malayalam    200
    Tamil        207
    Telugu       210

    Table 1: Distribution of training data

The task was to predict the L1 of a candidate based on comments posted by that individual on the social networking site, written in L2.

4 Proposed Technique

I modelled this task as a classification problem. The given dataset was divided into three parts: a training set (75%), a test set (12.5%), and a validation set (12.5%).

    Language     Training   Test   Validation
    Bengali      143        28     22
    Hindi        160        30     21
    Kannada      156        19     28
    Malayalam    149        16     29
    Tamil        152        29     26
    Telugu       153        30     27
    Total        913        152    153

    Table 2: Distribution of data into training, test, and validation sets

The validation set was used to determine several hyperparameter values. To decide whether or not to remove stop words, I used CountVectorizer as a sample vectorizer to compute token counts and Logistic Regression as a sample classification model. Two classifiers were trained, one "with stop words" and the other "without stop words", using Python NLTK's stop word list; both were trained on the training set, and accuracy was calculated on the validation set. The results from both models are plotted in Fig. 1(a), with validation set accuracy on the Y-axis and the number of features (the maximum vocabulary size) on the X-axis. Accuracy for the "without stop words" configuration was found to be higher when the number of features is large.

Similarly, the validation set was used to determine which n-grams give the best results, with unigrams, bigrams, and trigrams under consideration. I trained three sample Logistic Regression models, one per n-gram setting, on the training set; the results are plotted in Fig. 1(b), again with validation set accuracy on the Y-axis and the number of features on the X-axis. Accuracy for unigrams was found to be highest when a large number of features is considered, suggesting that unigrams best capture the inherent features of a language. A sketch of this validation procedure is given below.
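The following is a minimal sketch of the validation procedure just described, under the assumption that the Table 2 splits are held in lists train_texts/train_labels and val_texts/val_labels (the tiny placeholder lists below are hypothetical stand-ins). It sweeps the maximum vocabulary size of a CountVectorizer with and without NLTK stop word removal, training a Logistic Regression model each time and recording validation accuracy; the n-gram comparison works the same way, varying ngram_range instead.

    import nltk
    from nltk.corpus import stopwords
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    nltk.download("stopwords", quiet=True)
    stop_words = stopwords.words("english")

    # Placeholder splits; in the experiments these are the Table 2 splits.
    train_texts = ["bhai this song is awesome", "anna this movie is a blockbuster",
                   "machi semma acting da", "chetta superb climax"]
    train_labels = ["Hindi", "Telugu", "Tamil", "Malayalam"]
    val_texts = ["semma song machi"]
    val_labels = ["Tamil"]

    for max_feats in [100, 500, 1000, 5000, 10000]:
        for sw, name in [(None, "with stop words"), (stop_words, "without stop words")]:
            vec = CountVectorizer(max_features=max_feats, stop_words=sw)
            X_train = vec.fit_transform(train_texts)
            X_val = vec.transform(val_texts)
            clf = LogisticRegression().fit(X_train, train_labels)
            acc = accuracy_score(val_labels, clf.predict(X_val))
            print(f"{max_feats:>6} features, {name}: {acc:.3f}")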
Lastly, I determined which vectorizer to use, considering TfidfVectorizer and CountVectorizer as the two candidates. CountVectorizer converts a collection of text documents into a matrix of token counts, whereas TfidfVectorizer converts a collection of raw documents into a matrix of TF-IDF features. In simple terms, the TF-IDF weight of a term t in a document d is the product of its term frequency tf(t, d) (the number of times the term appears in that document) and its inverse document frequency idf(t) = log(N / df(t)), where N is the number of documents in the corpus and df(t) is the number of documents containing the term. In total, six sample Logistic Regression models were trained on the training set (one per vectorizer and n-gram setting), with accuracy calculated on the validation set; the results are plotted in Fig. 1(c). Accuracy was highest for the TF-IDF vectorizer with unigrams.

    Fig. 1: Validation set results: (a) with vs. without stop words, (b) n-grams, (c) vectorizers.

Having fixed these hyperparameter values, I considered several classification models. Each model was trained on the training set, and its accuracy was calculated on the validation set and compared with the null accuracy (the accuracy of a model that always predicts the label appearing most often in the training set). Table 3 summarizes the results; the second column reports each classifier's accuracy minus the null accuracy.

    Classifier                  Accuracy (%)   Accuracy - Null (%)   Time (s)
    Logistic Regression         79.74          62.21                 0.26
    Linear SVC                  82.35          64.83                 0.26
    Linear SVC (L1 selection)   74.51          56.99                 0.47
    Multinomial NB              77.12          59.60                 0.19
    Bernoulli NB                55.56          38.03                 0.19
    Ridge Classifier            82.35          64.83                 0.37
    SGD Classifier              77.78          60.25                 0.26
    AdaBoost                    49.67          32.15                 0.83
    Perceptron                  76.47          58.95                 0.22
    Passive-Aggressive          83.01          65.48                 0.26
    Nearest Centroid            79.08          61.56                 0.22

    Table 3: Accuracy of different classifiers

Based on validation set accuracy, the Passive-Aggressive classifier showed the best results. Its accuracy on the held-out test split was found to be 82.89%.
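As an illustration of how such a comparison can be run, the sketch below trains several scikit-learn classifiers on TF-IDF unigram features and reports validation accuracy alongside the gain over null accuracy. The model list mirrors a subset of Table 3, and train_texts/train_labels/val_texts/val_labels are the same hypothetical split variables as in the previous sketch.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import (LogisticRegression, PassiveAggressiveClassifier,
                                      RidgeClassifier, SGDClassifier)
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC
    from sklearn.metrics import accuracy_score

    vec = TfidfVectorizer(ngram_range=(1, 1))
    X_train = vec.fit_transform(train_texts)
    X_val = vec.transform(val_texts)

    # Null accuracy: always predict the most frequent training label.
    most_common = max(set(train_labels), key=train_labels.count)
    null_acc = accuracy_score(val_labels, [most_common] * len(val_labels))

    models = {
        "Logistic Regression": LogisticRegression(),
        "Linear SVC": LinearSVC(),
        "Multinomial NB": MultinomialNB(),
        "Ridge Classifier": RidgeClassifier(),
        "SGD Classifier": SGDClassifier(),
        "Passive-Aggressive": PassiveAggressiveClassifier(),
    }
    for name, clf in models.items():
        clf.fit(X_train, train_labels)
        acc = accuracy_score(val_labels, clf.predict(X_val))
        print(f"{name:<20} acc={acc:.3f}  over-null={acc - null_acc:+.3f}")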
5 Test and Results

The test dataset was released at a much later date, and the final model was evaluated on it. The test data provided by the organizers consisted of Facebook comments in English from speakers of the various Indian languages. On test set 1 the accuracy achieved was 41.4%, while on test set 2, released later still, the accuracy achieved was 31.9%. Accuracy is much lower than that achieved during training, owing to the lack of a large training dataset, because of which the model was not able to extract useful patterns of the different languages efficiently.

Acknowledgements

I would like to express my special thanks to Soumil Mandal for his guidance and constant supervision, as well as for providing necessary information. I would also like to extend my gratitude to Team WebArch for providing this opportunity.

References

1. Anand Kumar, M., Barathi Ganesh, H.B., Singh, S., Soman, K.P., Rosso, P.: Overview of the INLI PAN at FIRE-2017 track on Indian native language identification. CEUR Workshop Proceedings 2036, 99-105 (2017)
2. Malmasi, S., Evanini, K., Cahill, A., Tetreault, J., Pugh, R., Hamill, C., Napolitano, D., Qian, Y.: A report on the 2017 Native Language Identification Shared Task (2017)
3. Malmasi, S., et al.: Native language identification: explorations and applications (2016)
4. Malmasi, S., Dras, M.: Native Language Identification using stacked generalization (2017)
5. Jarvis, S., Bestgen, Y., Pepper, S.: Maximizing classification accuracy in Native Language Identification (2013)
6. Tsur, O., Rappoport, A.: Using classifier features for studying the effect of native language on the choice of written second language words (2007)
7. Swanson, B., Charniak, E.: Native language detection with Tree Substitution Grammars (2012)
8. Wong, S.-M.J., Dras, M.: Exploiting parse structures for Native Language Identification (2011)