=Paper= {{Paper |id=Vol-2036/T4-3 |storemode=property |title=Bharathi SSN @ INLI-FIRE-2017:SVM based approach for Indian Native Language Identification |pdfUrl=https://ceur-ws.org/Vol-2036/T4-3.pdf |volume=Vol-2036 |authors=Bharathi B,Anirudh M,Bhuvana J |dblpUrl=https://dblp.org/rec/conf/fire/BMJ17 }} ==Bharathi SSN @ INLI-FIRE-2017:SVM based approach for Indian Native Language Identification== https://ceur-ws.org/Vol-2036/T4-3.pdf
Bharathi_SSN@INLI-FIRE-2017: SVM based approach for Indian Native Language Identification

B. Bharathi, M. Anirudh, J. Bhuvana
SSN College of Engineering
Chennai, Tamil Nadu
bharathib@ssn.edu.in, anirudh15058@cse.ssn.edu.in, bhuvanaj@ssn.edu.in
ABSTRACT
Native Language Identification (NLI) is the task of identifying the native language of a writer or a speaker by analyzing their text. NLI is important for a number of applications. In forensic linguistics, native language is often used as an important feature for authorship profiling and identification. Nowadays, due to the heavy usage of social media sites and online interactions, receiving a violent threat is a common issue faced by commuters. If a comment or post poses any type of threat, then identifying the native language of its author is one of the significant steps in finding the source. In this paper, we present our methodology for the task of identifying the native language of an Indian writer. We extracted TF-IDF feature vectors from the given documents and used an SVM classifier to identify the native language of each document in the shared task on Indian Native Language Identification@FIRE2017. The performance is measured in terms of accuracy, and we obtained an overall accuracy of 43.60%.

KEYWORDS
Indian Native Language Identification, Classification, Support Vector Machine, TF-IDF

1 INTRODUCTION
Native Language Identification (NLI) is the well-known task of identifying the native language of non-native speakers. In India, English is the most important language and has the status of an associate language. After Hindi, it is the most commonly spoken language in India and certainly the most read and written. The number of second-language speakers of English has constantly been on the increase, and this has also contributed to its rich variation. English is blended with most of the Indian languages and is frequently used as a second or third language. Regional and educational differences distinguish language usage and produce stylistic variations in English. Spoken English shows great variation across the states of India, and it is relatively easy to identify a native speaker by their English accent. But finding the native language of a user from comments or posts written in English is a challenging task in the current scenario.

NLI has been used in various applications and domains. In [2], experiments on language identification of web documents focus on which combination of tokenisation strategy and classification model achieves the best overall performance. Native Language Identification for the NLI Shared Task 2013, using features based on n-grams of characters and words, the Penn TreeBank and Universal Parts of Speech tagsets, and perplexity values of character n-grams to build four different models, is presented in [3]. In [3], the above-mentioned four models are combined into an ensemble approach that achieved an accuracy of 75%. For NLI, [5] used a Maximum Entropy classifier with features such as character and chunk n-grams, spelling and grammatical mistakes, and lexical preferences. In [1], normalized lexical, syntactic and dependency features with an SVM classifier were used to identify the native language for the NLI shared task 2013. The features used for the NLI task in [4] are n-grams of words, parts of speech and lemmas. In addition to normalizing each text to unit length, the authors also applied a log-entropy weighting scheme to the normalized values, which gives an accuracy of 83.6%. An L2-regularized SVM classifier was used to create a single-model system in [4].

Many research works on NLI have used lexical and syntactic features with different classifiers on documents from a particular domain written by different native speakers. In this work, we experimented on the shared task of INLI@FIRE2017, which aims to identify the native language of an Indian user based on their comments on social media [6]. The text used in the shared task is not specific to any particular domain. The training documents given by INLI@FIRE2017 are taken from social media. Our focus is to identify the native language using a machine learning approach with Term Frequency-Inverse Document Frequency (TF-IDF) feature vectors.

2 PROPOSED APPROACH
We have implemented a supervised machine learning approach for this INLI task. The steps of the proposed approach are as follows:
• Data preparation
• Extract TF-IDF features from the given text documents
• Train the SVM classifier using the features extracted from the training text corpus
• Predict the class label for each instance as one of the six languages, namely Tamil, Hindi, Kannada, Malayalam, Bengali or Telugu, using the trained SVM model
The steps involved in the experimented approach are depicted in Fig. 1.
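The steps above can be sketched end to end with scikit-learn, the library used in our experiments. The toy comments and labels below are illustrative placeholders, not the shared-task data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Toy stand-ins for the INLI training comments and their native-language labels.
train_texts = [
    "vanakkam friends how r u",
    "namaste ji kya haal hai",
    "hello friends how are you ji",
    "super machi semma movie",
]
train_labels = ["TA", "HI", "HI", "TA"]

# Step 2: extract TF-IDF features from the raw text.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Step 3: train the SVM classifier on the extracted features.
clf = SVC(kernel="rbf", C=1_000_000, gamma=0.1)
clf.fit(X_train, train_labels)

# Step 4: predict the class label for an unseen comment.
X_test = vectorizer.transform(["semma da machi"])
predicted = clf.predict(X_test)
```

In the full system the labels cover all six languages; the same fit/transform/predict flow applies unchanged.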
2.1 Data Preparation
The data used for our research are Facebook comments, which are present in the form of embedded XML files. Hence, the data from these XML files have to be extracted, and special symbols and punctuation symbols removed, before they can be fed into the mathematical model for training. The XML files contained various tags, of which only the comment tag was of interest. Libraries such as minidom from the xml.dom package were used to parse the XML files and extract the text within the comment tags. The comments, which were encoded using the UTF-8 encoding scheme, were decoded and converted into Python lists along with their native language. The numbers of Hindi, Bengali, Kannada, Telugu, Tamil and Malayalam comments used for training the model are 211, 202, 203, 210, 207 and 200 respectively.

Figure 1: Experimented system architecture
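The data preparation step can be sketched with minidom from the standard-library xml.dom package. The comment tag name follows the description above; the inline XML string here is a stand-in for the shared-task files:

```python
from xml.dom import minidom

# Stand-in for one of the embedded XML files; real files hold many <comment> tags.
sample_xml = """<comments>
  <comment>hello frnds plz like my page!!!</comment>
  <comment>Y r u late, buzz me when free</comment>
</comments>"""

doc = minidom.parseString(sample_xml)

comments = []
for node in doc.getElementsByTagName("comment"):
    text = node.firstChild.data
    # Remove special symbols and punctuation before feeding the model.
    cleaned = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    comments.append(cleaned.strip())
```

For the shared-task data, minidom.parse would be called on each file instead of parseString on a string.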
                                                                                Our approach for Indian native language identification has been
2.2 Feature Extraction
The data used for training the model are essentially Facebook comments written by non-native speakers of English. As a result, the grammar and diction are not considered to be above par, which makes the data unfit for commonly applied native language identification algorithms such as the Prediction by Partial Matching (PPM) algorithm, the word-length algorithm, syntactic structure, error analysis and phonetic algorithms.

Our model exploits the fact that an author's native language will dispose them towards particular language production patterns in their second language. This theory also extends to the errors made by authors native to a particular language: if a bag-of-words feature were used to extract only proper English words, it would further lessen the probability of capturing the features needed to predict the native language. Hence, the data is used as is for training, keeping the writing errors and diction patterns of the different author groups intact. Feature extraction is done using the TfidfVectorizer method from the scikit-learn library, which yields the highest accuracy. This extraction tool first analyses and counts the common words in each document. The data is then transformed using the TfidfVectorizer method before training the model.
2.3 Support Vector Machine
The Support Vector Machine (SVM) algorithm is used here for classification, as it is well suited to text classification with colossal data and feature sets. The SVM performs multi-class classification through a one-against-one scheme on the six classes. The Radial Basis Function (RBF) kernel is used in training, as it fits the patterns produced by the authors in the different groups better than the polynomial and linear kernels. To classify the training examples correctly, we set the "C" parameter of the SVM to 10,00,000 (1,000,000) and the gamma value to 0.1, which gives the model the freedom to select more samples as support vectors. We achieved a cross-validation accuracy of 84.61%.
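The classifier configuration above can be sketched as follows; C and gamma are as stated in the text, while the feature matrix is random stand-in data rather than the actual TF-IDF vectors:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((60, 20))         # stand-in for the TF-IDF feature matrix
y = np.repeat(np.arange(6), 10)  # six balanced language classes

# RBF kernel; the large C penalises misclassified training samples heavily,
# and gamma controls the kernel width. SVC handles the six classes
# internally via one-against-one.
clf = SVC(kernel="rbf", C=1_000_000, gamma=0.1)
scores = cross_val_score(clf, X, y, cv=5)
mean_accuracy = scores.mean()
```

On random data the cross-validation accuracy is near chance; on the real TF-IDF features this setup is what produced the 84.61% figure reported above.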
2.4 Language Identification
The feature vectors for the test documents are derived in the same way as for the training data, using TF-IDF features. The trained multiclass SVM was used to predict the language of the test documents. Each test document was predicted as one of the six languages, namely Tamil, Hindi, Kannada, Malayalam, Bengali or Telugu.

3 PERFORMANCE ANALYSIS
Our approach to Indian native language identification has been evaluated using the metrics precision, recall and F1-measure for each language, together with an overall accuracy. The results reported for our approach are given in Table 1.

Table 1: Performance analysis of INLI task

Class   Precision (in %)   Recall (in %)   F1-measure (in %)
BE      50.30              80.50           62.00
HI      51.90               5.60           10.10
KA      33.30              64.90           44.00
MA      36.30              60.90           45.50
TA      48.60              51.00           49.80
TE      40.40              28.40           33.30

We obtained an overall accuracy of 43.60% using the multiclass SVM based approach for the Indian native language identification task.

4 CONCLUSIONS
We have presented an approach to identify the native language of an Indian speaker from text posted on social media. In the experimented methodology, TF-IDF features were extracted from the text documents. A multiclass Support Vector Machine was then trained using the extracted feature vectors. The experimented system was evaluated on the test instances given by the INLI@FIRE2017 shared task organizers for the six languages. We obtained an overall accuracy of 43.60% using our experimented multiclass SVM based approach. The system could be improved further by removing or replacing lexically incorrect terms such as plz, buzz, Y (why) and r (are) with lexically correct terms in order to enhance the accuracy.

REFERENCES
[1] Amjad Abu-Jbara, Rahul Jha, Eric Morley, and Dragomir Radev. 2013. Experimental Results on the Native Language Identification Shared Task. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics, Atlanta, Georgia, 82–88. http://www.aclweb.org/anthology/W13-1710
[2] Timothy Baldwin and Marco Lui. 2010. Language Identification: The Long and the Short of the Matter. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT '10). Association for Computational Linguistics, Stroudsburg, PA, USA, 229–237. http://dl.acm.org/citation.cfm?id=1857999.1858026
[3] Binod Gyawali, Gabriela Ramirez, and Thamar Solorio. 2013. Native Language Identification: a Simple n-gram Based Approach. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications.
[4] Scott Jarvis, Yves Bestgen, and Steve Pepper. 2013. Maximizing Classification Accuracy in Native Language Identification. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics, Atlanta, Georgia, 111–118. http://www.aclweb.org/anthology/W13-1714
[5] Thomas Lavergne, Gabriel Illouz, Aurélien Max, and Ryo Nagata. 2013. LIMSI's participation to the 2013 shared task on Native Language Identification. In BEA@NAACL-HLT. 260–265.
[6] Anand Kumar M, Barathi Ganesh HB, Shivkaran S, Soman K P, and Paolo Rosso. 2017. Overview of the INLI PAN at FIRE-2017 Track on Indian Native Language Identification. In Notebook Papers of FIRE 2017, FIRE-2017, Bangalore, India, December 8-10. CEUR Workshop Proceedings.



