=Paper=
{{Paper
|id=Vol-2036/T4-3
|storemode=property
|title=Bharathi SSN @ INLI-FIRE-2017:SVM based approach for Indian Native Language Identification
|pdfUrl=https://ceur-ws.org/Vol-2036/T4-3.pdf
|volume=Vol-2036
|authors=Bharathi B,Anirudh M,Bhuvana J
|dblpUrl=https://dblp.org/rec/conf/fire/BMJ17
}}
==Bharathi SSN @ INLI-FIRE-2017: SVM based approach for Indian Native Language Identification==
Bharathi_SSN@INLI-FIRE-2017: SVM based approach for Indian Native Language Identification

B. Bharathi, M. Anirudh, J. Bhuvana
SSN College of Engineering, Chennai, Tamil Nadu
bharathib@ssn.edu.in, anirudh15058@cse.ssn.edu.in, bhuvanaj@ssn.edu.in

ABSTRACT

Native Language Identification (NLI) is the task of identifying the native language of a writer or speaker by analysing their text. NLI matters for a number of applications. In forensic linguistics, native language is often used as an important feature for authorship profiling and identification. Nowadays, owing to the heavy use of social media sites and online interaction, receiving a violent threat is a common problem for users. If a comment or post poses any kind of threat, identifying the native language of its author is a significant step towards tracing the source. In this paper, we present our methodology for the task of identifying the native language of an Indian writer. We extract TF-IDF feature vectors from each document and use an SVM classifier to identify the native language of the documents provided by the shared task on Indian Native Language Identification@FIRE2017. Performance is measured in terms of accuracy, and we obtained an overall accuracy of 43.60%.

KEYWORDS

Indian Native Language Identification, Classification, Support Vector Machine, TF-IDF

1 INTRODUCTION

Native Language Identification (NLI) is the well-known task of identifying the native language of non-native speakers from their writing. In India, English is an important language with the status of an associate language. After Hindi, it is the most commonly spoken language in India and certainly the most read and written. The number of second-language speakers of English has constantly been on the increase, which has also contributed to its rich variation. English is blended with most of the Indian languages and is frequently used as a second or third language. Regional and educational differences distinguish language usage and produce stylistic variations in English. Spoken English shows great variation across the states of India, and it is relatively easy to identify a native speaker from their English accent. Finding the native language of a user from comments or posts written in English, however, remains a challenging task.

NLI has been applied in various applications and domains. In [2], experiments on language identification of web documents investigate which combination of tokenisation strategy and classification model achieves the best overall performance. Native language identification for the NLI Shared Task 2013 using features based on n-grams of characters and words, the Penn TreeBank and Universal parts-of-speech tagsets, and perplexity values of character n-grams to build four different models is presented in [3]; the four models are further combined into an ensemble, which achieved an accuracy of 75%. In [5], a Maximum Entropy classifier was used for NLI with features such as character and chunk n-grams, spelling and grammatical mistakes, and lexical preferences. In [1], normalized lexical, syntactic and dependency features with an SVM classifier were used to identify the native language in the NLI Shared Task 2013. For the NLI task, the features used in [4] are n-grams of words, parts of speech and lemmas; in addition to normalizing each text to unit length, the authors applied a log-entropy weighting scheme to the normalized values, which gives an accuracy of 83.6%. An L2-regularized SVM classifier was used to create this single-model system in [4].

Many existing NLI systems use lexical and syntactic features with different classifiers on documents from a particular domain written by different native speakers. In this work, we address the INLI@FIRE2017 shared task, which aims to identify the native language of an Indian user from their comments on social media [6]. The text used in the shared task is not restricted to any particular domain; the training documents provided by INLI@FIRE2017 are taken from social media. Our focus is to identify the native language using a machine learning approach with Term Frequency-Inverse Document Frequency (TF-IDF) feature vectors.

2 PROPOSED APPROACH

We implemented a supervised machine learning approach for the INLI task. The steps of the proposed approach are as follows (a brief code sketch follows the list and Fig. 1):

• Data preparation
• Extract TF-IDF features from the given text documents
• Train the SVM classifier using the features extracted from the training text corpus
• Predict the class label of each instance as one of the six languages, namely Tamil, Hindi, Kannada, Malayalam, Bengali or Telugu, using the trained SVM model

The steps involved in the experimented approach are depicted in Fig. 1.

[Figure 1: Experimented system architecture]
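As a rough, hedged illustration of these steps (the paper itself does not include code), the scikit-learn sketch below chains a TF-IDF vectorizer to an RBF-kernel SVM. The helper load_comments and the directory name are hypothetical placeholders; only the kernel, C and gamma values are taken from Sec. 2.3.

```python
# Minimal sketch of the proposed pipeline, assuming scikit-learn is available.
# `load_comments` is a hypothetical helper returning (texts, labels); the real
# system extracts Facebook comments from XML files as described in Sec. 2.1.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

LANGUAGES = ["TA", "HI", "KA", "MA", "BE", "TE"]  # Tamil, Hindi, Kannada, Malayalam, Bengali, Telugu

def build_model():
    # TF-IDF features followed by an RBF-kernel SVM; SVC handles the six
    # classes with a one-against-one scheme, as noted in Sec. 2.3.
    return Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("svm", SVC(kernel="rbf", C=1_000_000, gamma=0.1)),
    ])

# train_texts, train_labels = load_comments("training_xml_dir")  # hypothetical
# model = build_model().fit(train_texts, train_labels)
# predicted_languages = model.predict(test_texts)
```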
2.1 Data Preparation

The data used in our experiments are Facebook comments supplied as embedded XML files. The text therefore has to be extracted from these files, and special symbols and punctuation marks are removed before the data can be fed into a mathematical model for training. The XML files contain various tags, of which only the comment tag is of interest. The minidom library from the xml.dom package was used to parse the XML files and extract the text within the comment tags. The comments, encoded in UTF-8, were decoded and converted into Python lists together with their native-language labels. The numbers of Hindi, Bengali, Kannada, Telugu, Tamil and Malayalam comments used for training the model are 211, 202, 203, 210, 207 and 200 respectively. A minimal parsing sketch is given below.
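The sketch below illustrates this extraction step under the assumptions that the element of interest is literally named comment and that stripping punctuation with a regular expression approximates the cleaning described above; it is not the authors' original script, and the example file names are hypothetical.

```python
# Sketch of the data-preparation step (Sec. 2.1), not the authors' exact code.
# Assumes each training file is an XML document whose <comment> elements hold
# the Facebook comment text.
import re
from xml.dom import minidom

def extract_comments(xml_path):
    """Return the cleaned comment strings found in one XML file."""
    doc = minidom.parse(xml_path)
    comments = []
    for node in doc.getElementsByTagName("comment"):
        text = "".join(child.data for child in node.childNodes
                       if child.nodeType == child.TEXT_NODE)
        # Remove special symbols and punctuation, keeping words and spaces.
        text = re.sub(r"[^\w\s]", " ", text)
        comments.append(text.strip())
    return comments

# Example usage: build parallel lists of comments and native-language labels.
# texts, labels = [], []
# for lang, path in [("HI", "hindi.xml"), ("BE", "bengali.xml")]:  # hypothetical paths
#     for comment in extract_comments(path):
#         texts.append(comment)
#         labels.append(lang)
```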
2.2 Feature Extraction

The data used for training the model are essentially Facebook comments written by non-native speakers of English, so the grammar and diction cannot be assumed to be above par. This makes the data ill-suited to native language identification techniques that are commonly applied to well-formed text, such as the Prediction by Partial Matching (PPM) algorithm, word-length based methods, syntactic-structure analysis, error analysis and phonetic algorithms. Our model instead exploits the fact that an author's native language disposes them towards particular language production patterns in their second language. The same reasoning extends to the errors made by authors who share a native language: if a bag-of-words representation kept only correctly spelled English words, it would discard much of the signal needed to predict the native language. Hence the data are used as they are for training, keeping the writing errors and diction patterns of the different author groups intact. Feature extraction is done with the TF-IDF vectorizer from the scikit-learn library, which yielded the highest accuracy in our experiments. The vectorizer first builds the vocabulary of the documents and counts the word occurrences, and the data are then transformed with the TfidfVectorizer method before the model is trained.

2.3 Support Vector Machine

The Support Vector Machine (SVM) algorithm is used for classification, as it is well suited to text classification with large amounts of data and features. The SVM performs multi-class classification through a one-against-one scheme over the six classes. The Radial Basis Function (RBF) kernel is used in training, as it fits the patterns produced by the different author groups better than the polynomial and linear kernels. To classify the training examples correctly, we set the C parameter of the SVM to 10,00,000 (one million) and the gamma value to 0.1, which gives the trained model the freedom to select more samples as support vectors. We achieved a cross-validation accuracy of 84.61%.

2.4 Language Identification

The feature vectors for the test documents are derived in the same way as for the training data, using TF-IDF features. The trained multiclass SVM is then used to predict the language of each test document as one of the six languages, namely Tamil, Hindi, Kannada, Malayalam, Bengali or Telugu. A sketch of this training and prediction setup is given below.
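The sketch below shows one way the training, cross-validation and test-time prediction of Secs. 2.2-2.4 could be reproduced with scikit-learn. The 5-fold setting and the evaluate wrapper are assumptions; the paper does not state how the 84.61% cross-validation figure was computed.

```python
# Hedged sketch (not the authors' code): cross-validation and per-class
# evaluation for the TF-IDF + RBF-SVM model of Secs. 2.2-2.3.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

def evaluate(train_texts, train_labels, test_texts, test_labels):
    model = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("svm", SVC(kernel="rbf", C=1_000_000, gamma=0.1)),
    ])
    # Cross-validation accuracy on the training comments (fold count assumed).
    cv_accuracy = cross_val_score(model, train_texts, train_labels, cv=5).mean()
    # Fit on the full training set and predict the held-out test documents.
    model.fit(train_texts, train_labels)
    predictions = model.predict(test_texts)
    # Per-class precision/recall/F1, analogous to the results in Sec. 3.
    report = classification_report(test_labels, predictions)
    return cv_accuracy, report
```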
3 PERFORMANCE ANALYSIS

Our approach for Indian native language identification was evaluated using the metrics precision, recall and F1-measure for each language, together with an overall accuracy. The results obtained by our approach are given in Table 1. We obtained an overall accuracy of 43.60% using the multiclass SVM based approach for the Indian native language identification task.

Table 1: Performance analysis of INLI task

Class   Precision (%)   Recall (%)   F1-measure (%)
BE      50.30           80.50        62.00
HI      51.90            5.60        10.10
KA      33.30           64.90        44.00
MA      36.30           60.90        45.50
TA      48.60           51.00        49.80
TE      40.40           28.40        33.30

4 CONCLUSIONS

We have presented an approach to identify the native language of an Indian speaker from text posted on social media. In the experimented methodology, TF-IDF features are extracted from the text documents, and a multiclass Support Vector Machine is trained on the extracted feature vectors. The experimented system was evaluated on the test instances provided by the INLI@FIRE2017 shared task organizers for the six languages, and we obtained an overall accuracy of 43.60% with our multiclass SVM based approach. The system could be further improved by removing or replacing lexically incorrect terms such as plz, buzz, Y (why) and r (are) with their lexically correct forms in order to enhance the accuracy.

REFERENCES

[1] Amjad Abu-Jbara, Rahul Jha, Eric Morley, and Dragomir Radev. 2013. Experimental Results on the Native Language Identification Shared Task. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics, Atlanta, Georgia, 82–88. http://www.aclweb.org/anthology/W13-1710
[2] Timothy Baldwin and Marco Lui. 2010. Language Identification: The Long and the Short of the Matter. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT '10). Association for Computational Linguistics, Stroudsburg, PA, USA, 229–237. http://dl.acm.org/citation.cfm?id=1857999.1858026
[3] Binod Gyawali, Gabriela Ramirez, and Thamar Solorio. 2013. Native Language Identification: a Simple n-gram Based Approach. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications.
[4] Scott Jarvis, Yves Bestgen, and Steve Pepper. 2013. Maximizing Classification Accuracy in Native Language Identification. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics, Atlanta, Georgia, 111–118. http://www.aclweb.org/anthology/W13-1714
[5] Thomas Lavergne, Gabriel Illouz, Aurélien Max, and Ryo Nagata. 2013. LIMSI's participation to the 2013 shared task on Native Language Identification. In BEA@NAACL-HLT. 260–265.
[6] Anand Kumar M, Barathi Ganesh HB, Shivkaran S, Soman K P, and Paolo Rosso. 2017. Overview of the INLI PAN at FIRE-2017 Track on Indian Native Language Identification. In Notebook Papers of FIRE 2017, FIRE-2017, Bangalore, India, December 8-10. CEUR Workshop Proceedings.