<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Machine Learning Approach to Indian Native Language Identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>D. Thenmozhi</string-name>
          <email>d@ssn.edu.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>S. Kayalvizhi</string-name>
          <email>kayalvizhi1704@cse.ssn.edu.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chandrabose Aravindan</string-name>
          <email>aravindanc@ssn.edu.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of CSE, SSN College of Engineering</institution>
          ,
          <addr-line>Chennai</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>NLI (Native Language Identification) determines the native language of non-native users from their writings in a foreign language. It has several applications, namely forensic and security, author profiling and identification, and educational applications. English is the most common language used in social media by many non-English speakers around the world to share their thoughts and ideas, and they blend English with their native language in their posts and comments. Identifying the native language from short English texts is still a challenging task. In this paper, we present a language agnostic approach, without any language specific processing, and employ machine learning with and without feature selection to identify the native language of an Indian speaker from their comments and posts on social networks. Bag-of-words features are extracted from the text posted by the user, and the feature vectors are constructed using TF-IDF scores for the training data. We use a statistical feature selection methodology to select the features that contribute significantly to the NLI task. The classifier with the highest cross validation accuracy is used for predicting the native language of the user. Our approaches are evaluated using the INLI@FIRE2018 shared task data set.</p>
      </abstract>
      <kwd-group>
        <kwd>Indian Native Language Identification</kwd>
        <kwd>Language Recognition</kwd>
        <kwd>Author Profiling</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Feature Selection</kwd>
        <kwd>Text Mining</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        NLI (Native Language Identification) is the process of automatically
identifying the native language of speakers from their speech or writing in a different
language. It has several applications, namely forensic and security [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
authorship profiling and identification [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and educational applications [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Several
studies have been reported on text-based NLI [
        <xref ref-type="bibr" rid="ref11 ref15 ref16 ref20 ref5 ref8">20, 11, 5, 8, 15, 16</xref>
        ]. Currently,
people use social media such as YouTube, Facebook, blogs and Twitter to share
their thoughts, ideas and comments. English is the prominent language used
by many non-English speakers, who blend in their native languages in their social
media postings. In this line, Indians also use English predominantly in their
comments and postings. Indian Native Language Identification (INLI) focuses
on identifying the native language of Indians based on their English writings. Many
shared tasks have been conducted on NLI since 2013 to identify the native
language from English text. Recently, shared tasks on INLI have also been evolving since
2017 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Their focus is to research and develop techniques to identify the native
language, namely Tamil, Hindi, Kannada, Malayalam, Bengali or Telugu, from
sets of Facebook comments. Several methodologies have been reported on
INLI: an n-gram approach [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], machine learning approaches with support vector
machines [
        <xref ref-type="bibr" rid="ref12 ref17 ref3">17, 3, 12</xref>
        ], ensembling approaches [
        <xref ref-type="bibr" rid="ref17 ref9">17, 9</xref>
        ] and deep learning approaches
[
        <xref ref-type="bibr" rid="ref21 ref4">21, 4</xref>
        ] have been used to identify Indian native languages. In this research,
our focus is on the INLI@FIRE2018 shared task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which identifies the native
language (Tamil, Hindi, Kannada, Malayalam, Bengali or Telugu) of Indians
based on their comments posted in social media. INLI@FIRE2018 is a shared
task on Indian Native Language Identification (INLI) collocated with the Forum
for Information Retrieval Evaluation (FIRE), 2018.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Native language identification is an author profiling task. PAN 2017 [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] focuses
on language variety identification tasks. The shared tasks on INLI have also been
evolving since 2017 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This section describes the methodologies used for INLI tasks.
Nayel and Shashirekha [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] normalized the text by removing emojis,
special characters, digits, hashtags, mentions and links. They preprocessed the data
using stop word removal based on the NLTK stop words package, a manually
collected stop word list and other resources (the Python stop-words package).
They used TF-IDF scores to construct feature vectors and employed an SVM to
classify the native language of the user. Bharathi et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and Lakshmi et al.
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] also used TF-IDF for feature construction and an SVM for classification for this
task. In addition, Lakshmi et al. used character n-grams and word n-grams while
computing TF-IDF scores. However, they did not apply any preprocessing techniques.
Kosmajac and Keselj [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] normalized the text similarly to [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], used TF-IDF with
character n-grams and word n-grams for feature construction, and employed an SVM
for classification. Jain et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] considered non-English words and noun phrases
while computing TF-IDF scores, without applying any preprocessing techniques.
They used Logistic Regression, SVM, a Ridge Classifier and a Multi-Layer
Perceptron (MLP) as base classifiers and employed an ensemble approach for language
identification. Bhargava et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] used a deep learning approach based on
hierarchical attention with a bi-directional GRU architecture for this task. Thenmozhi et
al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] also employed a neural network approach with two hidden layers for this
task. They normalized the text similarly to [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and handled shortened words
as part of preprocessing. They considered only the nouns and adjectives
present in the text to extract the features. In this paper, we propose a language
agnostic approach in which we do not use any language specific (or linguistic)
processing to extract the features. Thus, we simply take a bag of words to
consider all the words in the text and apply statistical feature selection
to extract the most significant features for the language identification task.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Proposed Methodology</title>
      <p>We have used a supervised approach with three variations, namely a)
term-frequency (TF) without feature selection, b) TF-IDF (term-frequency
inverse-document-frequency) without feature selection and c) TF-IDF with statistical χ²
feature selection for the INLI task. The steps used in our approach are given
below.</p>
      <p>– Preprocess the data
– Extract bag of words (BOW) features from the training data
– Construct feature vectors using TF or TF-IDF, with and without χ² feature
selection
– Build the models using a classifier for the three variations
– Predict one of the six languages, namely Tamil (TA), Hindi (HI), Kannada
(KN), Malayalam (ML), Bengali (BE) or Telugu (TE), as the class label for each
instance using the model</p>
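      <p>The steps above can be sketched end to end with Scikit-learn. This is a minimal illustration, not the actual system: the toy comments and labels below are invented placeholders for the INLI data, and Multinomial Naive Bayes stands in for whichever classifier wins cross validation.</p>

```python
# Toy sketch of the pipeline: TF-IDF vectorization followed by a classifier.
# The comments and labels are invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = [
    "vanakkam semma movie",        # hypothetical Tamil-influenced comments
    "semma super padam",
    "namaskar bhai accha video",   # hypothetical Hindi-influenced comments
    "bhai bahut accha hai",
]
labels = ["TA", "TA", "HI", "HI"]

# Variation b): TF-IDF feature vectors (CountVectorizer would give variation a).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Build the model and predict a language label for an unseen comment.
clf = MultinomialNB().fit(X, labels)
pred = clf.predict(vectorizer.transform(["semma movie"]))[0]
print(pred)
```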
      <p>The steps are explained below in detail.</p>
      <p>Feature Extraction
The data for the INLI task is given as an XML file. The given text is preprocessed
by extracting only the textual part of the content present in the XML file. All
punctuation is removed before extracting the features. Since the texts are
collected from social network sites, many terms are in transliterated form and
many terms are in short-hand notations like pls, sry, tc, tks, etc. Hence, we
did not apply stop word removal and stemming as preprocessing steps. The
unique terms present in the text are considered as features in our first two
variations. The feature vectors for the training data are constructed using
term-frequency in the first variation. TF-IDF is used to construct feature vectors in the
second variation. However, the number of extracted features may be large. We
have employed χ² feature selection in our third variation to extract the useful
features that contribute to native language identification. The details of
feature selection are explained below.</p>
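      <p>A minimal sketch of the preprocessing just described, using only the standard library; the sample comment is invented for illustration.</p>

```python
# Strip punctuation and tokenize; no stop word removal or stemming, since
# the comments contain transliterated and shortened words (pls, sry, tc, ...).
import string

def preprocess(text: str) -> list[str]:
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table).lower().split()

tokens = preprocess("Super movie!! pls watch, tks :-)")
vocabulary = sorted(set(tokens))  # unique terms become the BOW features
print(vocabulary)
```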
      <p>
        Feature Selection
In our third variation, we have used χ² feature selection. The INLI task involves
six categories, namely "BE", "HI", "KN", "ML", "TA" and "TE". Hence, a 2 × 6
CHI table (Table 1), or contingency table [
        <xref ref-type="bibr" rid="ref10 ref14 ref22 ref23">14, 10, 22, 23</xref>
        ], is constructed for every
feature fx. Table 1 contains the observed frequency (O) of feature fx for every
category "BE", "HI", "KN", "ML", "TA" and "TE".
      </p>
      <p>The observed frequencies (O) are used to compute the expected frequencies
(E) for the feature fx using Equation 1.</p>
      <p>E(x, y) = [ Σ_{a ∈ {fx, ¬fx}} O(a, y) × Σ_{b ∈ {BE, HI, KN, ML, TA, TE}} O(x, b) ] / n   (1)
where n is the total number of training instances, x indicates whether the feature
fx is present (fx) or absent (¬fx), and y indicates to which of the six languages, namely
"BE", "HI", "KN", "ML", "TA" or "TE", the training instance belongs.</p>
      <p>The expected frequencies, namely E(fx, BE), E(fx, HI), E(fx, KN),
E(fx, ML), E(fx, TA), E(fx, TE), E(¬fx, BE), E(¬fx, HI), E(¬fx, KN),
E(¬fx, ML), E(¬fx, TA) and E(¬fx, TE), are calculated using Equation 1 for
language identification. Then, we calculate the χ² value for each feature
fx using Equation 2.</p>
      <p>χ²stat(fx) = Σ_{x ∈ {fx, ¬fx}} Σ_{y ∈ {BE, HI, KN, ML, TA, TE}} (O(x, y) − E(x, y))² / E(x, y)   (2)</p>
      <p>The set of features whose χ²stat value is greater than χ²crit(α = 0.10, df = 5) = 9.24
are considered to be significant features for language identification. These
selected features are used to build a model with a classifier in our third variation.</p>
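      <p>Equations 1 and 2 can be computed directly from the 2 × 6 contingency table. The observed counts below are invented for illustration; a feature concentrated in one language should clear the 9.24 threshold.</p>

```python
# Chi-square test for one feature fx: rows are (fx present, fx absent),
# columns are the six languages. Observed counts are invented placeholders.
LANGS = ["BE", "HI", "KN", "ML", "TA", "TE"]

def chi_square(observed):  # observed: {(row, lang): count}
    rows = ("fx", "not_fx")
    n = sum(observed.values())
    chi2 = 0.0
    for x in rows:
        for y in LANGS:
            row_total = sum(observed[(x, b)] for b in LANGS)  # over languages
            col_total = sum(observed[(a, y)] for a in rows)   # over presence/absence
            expected = row_total * col_total / n                   # Equation 1
            chi2 += (observed[(x, y)] - expected) ** 2 / expected  # Equation 2
    return chi2

# A feature appearing almost only in Tamil comments.
obs = {("fx", l): c for l, c in zip(LANGS, [1, 1, 1, 1, 30, 1])}
obs.update({("not_fx", l): c for l, c in zip(LANGS, [201, 210, 202, 199, 177, 209])})

CHI2_CRIT = 9.24  # alpha = 0.10, df = 5
print(chi_square(obs) > CHI2_CRIT)
```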
      <p>Model Building and Prediction
The models for the first two variations for language identification are built from
the training data using a Multi Layer Perceptron (MLP), and the model for the third
variation is built using a Multinomial Naive Bayes (MNB) classifier with the
selected features. The classifiers were chosen based on their cross validation
accuracies. The class label, one among the six languages namely "BE", "HI", "KN",
"ML", "TA" or "TE", is predicted for the test data instances by using the models.</p>
    </sec>
    <sec id="sec-4">
      <title>Implementation</title>
      <p>Our methodology was implemented in Python for this Shared Task on Indian
Native Language Identification (INLI). The number of training instances
is 202, 211, 203, 200, 207 and 210 for the languages Bengali, Hindi,
Kannada, Malayalam, Tamil and Telugu respectively. Two sets of test data were
given for the evaluations, consisting of 783 and 1185 instances for test-set-1
and test-set-2 respectively. The textual part of the data is extracted from the XML file
using the xml.etree library. Punctuation is removed and the BOW (bag of
words) features are extracted from the training instances. We obtained a
total of 21813 features from the training data. The Scikit-learn machine learning library
was used to vectorize the training instances, using CountVectorizer for the first
variation and TfidfVectorizer for the second variation. We implemented the χ²
feature selection algorithm to extract the significant features for native language
identification. We obtained a total of 1555 features by feature selection
with α = 0.10 and 5 degrees of freedom for the six classes.</p>
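      <p>The xml.etree extraction step can be sketched as follows. The tag names here are invented, since the actual INLI XML schema is not shown in the paper.</p>

```python
# Extract only the textual part of an XML file with the standard library.
# The <document>/<comment> structure is a hypothetical stand-in for the
# task's real schema.
import xml.etree.ElementTree as ET

sample = """<document>
  <comment>super movie pls watch</comment>
  <comment>semma padam tks</comment>
</document>"""

root = ET.fromstring(sample)
texts = [c.text for c in root.iter("comment")]
print(texts)
```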
      <p>We have employed several classifiers, namely Multinomial Naive Bayes,
Gaussian Naive Bayes (GNB), Random Forest (RF), Decision Tree (DT), Extra Trees
(ET), Ada Boost (AB), Stochastic Gradient Descent (SGD), Support Vector
Machines (SVM), and Multi Layer Perceptron, and measured 10-fold cross
validation accuracy to select the best classifier for all three variations of our approach.
Table 2 shows the cross validation results of the various classifiers for all three
variations. The table shows that MLP performs better for the first two
variations, which are without feature selection, and MNB performs better for the third
variation, which uses feature selection. MNB performs better with the smaller number of
features selected by our chi-square feature selection. However, MNB
was not able to perform well with all the features: with a huge feature set, the
likelihood mass is spread across many features, and the features may affect
each other's likelihood estimates, which reduces the performance. Hence, we chose
MLP to build the models for the first two variations and MNB to build the model for
the third variation. These models are used to predict the native language for
the two sets of test instances.</p>
      <p>We submitted our second variation (the best of the first two, without feature
selection) using the MLP classifier and our third variation (with feature selection) using
the MNB classifier as two runs for the shared task. The performance is measured in
terms of precision (P), recall (R) and F1-measure. The results obtained by our
approach for Run 1 on the two test sets are shown in Table 3. The results show that
our methodology using TF-IDF with the MLP classifier does not perform well
for the Hindi language. We obtained overall accuracies of 46.1% and 34.3% for
test-set-1 and test-set-2 respectively.</p>
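      <p>The classifier selection step can be sketched with Scikit-learn's cross_val_score; the synthetic data below is a stand-in for the INLI feature vectors, and only two of the nine classifiers are shown.</p>

```python
# Compare 10-fold cross validation scores and keep the best classifier.
# The data is synthetic; in the paper, 10-fold CV over the INLI training
# set selected MLP (variations 1-2) and MNB (variation 3).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=120, n_features=20, n_informative=10,
                           n_classes=6, random_state=0)
X = np.abs(X)  # MultinomialNB requires non-negative features

scores = {}
for name, clf in [("MNB", MultinomialNB()),
                  ("MLP", MLPClassifier(max_iter=300, random_state=0))]:
    scores[name] = cross_val_score(clf, X, y, cv=10).mean()

best = max(scores, key=scores.get)  # classifier with highest CV accuracy
print(best, scores[best])
```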
      <p>The results obtained by our approach for Run 2 on the two test sets are shown in
Table 4. The results show that our methodology using TF-IDF, χ² feature
selection and the MNB classifier improved the performance for the Hindi and Tamil
languages on test-set-2. However, this method did not improve the performance
for the other languages of test-set-2, nor for test-set-1. We obtained overall
accuracies of 32.4% and 19.7% for test-set-1 and test-set-2 respectively.</p>
      <p>It is observed from Table 5 that the maximum accuracy obtained for test-set-2
is 37%. This may be due to the small size of the data set used for training the models;
thus, we obtained a very low accuracy. The data set size may be increased
using Generative Adversarial Networks (GAN) to improve the performance.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>We have presented a machine learning approach for identifying the Indian native
language, namely Bengali, Hindi, Kannada, Malayalam, Tamil or Telugu, from
English comments posted in social media. We have presented three
variations of our approach, namely term-frequency without feature selection, TF-IDF
without feature selection, and TF-IDF with χ² feature selection, for the language
identification task. The data set of the INLI@FIRE2018 shared task is used to
evaluate our approach. We submitted our second and third variations to the
task, and obtained overall accuracies of 46.1% and 34.3% for our first
run on test-set-1 and test-set-2 respectively. We obtained overall accuracies
of 32.4% and 19.7% for our second run on test-set-1 and test-set-2 respectively.
Our feature selection improved the F-measure for Hindi and Tamil on
test-set-2. However, it did not improve the results for the other languages. Since our approach is
language agnostic, we have not included any character level features at present.
These features may be considered in future to improve the performance of the NLI
task. The performance may also be improved by incorporating word embedding
techniques in future. Due to the small size of the training data set, we obtained very low
accuracy. The data set size may be increased in future by using Generative Adversarial
Networks (GAN) to improve the performance.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Anand</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Barathi</given-names>
            <surname>Ganesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Soman</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.P.</surname>
          </string-name>
          :
          <article-title>Overview of the INLI@FIRE2018 track on Indian native language identification</article-title>
          .
          <source>In: In workshop proceedings of FIRE 2018. CEUR Workshop Proceedings</source>
          , Gandhinagar, India, December 6-
          <issue>9</issue>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Anand</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Barathi</given-names>
            <surname>Ganesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Shivkaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Soman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <surname>P.</surname>
          </string-name>
          :
          <article-title>Overview of the INLI PAN at FIRE-2017 track on Indian native language identification</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          <year>2036</year>
          ,
          <volume>99</volume>
          –
          <fpage>105</fpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bharathi</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anirudh</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhuvana</surname>
          </string-name>
          , J.:
          <string-name>
            <surname>Bharathi</surname>
            <given-names>SSN</given-names>
          </string-name>
          @
          <article-title>INLI-FIRE-2017: SVM based approach for Indian native language identification</article-title>
          .
          <source>In: FIRE-Working Notes</source>
          . pp.
          <volume>110</volume>
          –
          <issue>112</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bhargava</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arora</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sharma</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Bits pilani@ INLI-FIRE-2017: Indian native language identification using deep learning</article-title>
          .
          <source>In: FIRE-Working Notes</source>
          . pp.
          <volume>123</volume>
          –
          <issue>126</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Bykh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meurers</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Exploring syntactic features for native language identification: A variationist perspective on feature encoding and ensemble optimization</article-title>
          .
          <source>In: Proc. of COLING</source>
          <year>2014</year>
          ,
          <source>the 25th Int. Conf. on Computational Linguistics: Technical Papers</source>
          . pp.
          <year>1962</year>
          –
          <year>1973</year>
          . Dublin, Ireland (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Estival</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaustad</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hutchinson</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pham</surname>
            ,
            <given-names>S.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Author profiling for English emails</article-title>
          .
          <source>In: Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics</source>
          . pp.
          <volume>263</volume>
          –
          <fpage>272</fpage>
          . ACL,
          <string-name>
            <surname>Australia</surname>
          </string-name>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gibbons</surname>
          </string-name>
          , J.:
          <article-title>Forensic linguistics: An introduction to language in the justice system</article-title>
          . Wiley-Blackwell (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ionescu</surname>
          </string-name>
          , R.T.,
          <string-name>
            <surname>Popescu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cahill</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Can characters reveal your native language? a language-independent approach to native language identification</article-title>
          .
          <source>In: Proc. of the 2014 Conf. on Empirical Methods in NLP (EMNLP)</source>
          . pp.
          <volume>1363</volume>
          –
          <fpage>1373</fpage>
          .
          <string-name>
            <surname>ACL</surname>
          </string-name>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duppada</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hiray</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Seernet@ INLI-FIRE-2017: Hierarchical ensemble for Indian native language identification</article-title>
          .
          <source>In: FIRE-Working Notes</source>
          . pp.
          <volume>127</volume>
          –
          <issue>129</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Janaki</given-names>
            <surname>Meena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Chandran</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.</surname>
          </string-name>
          :
          <article-title>Naive bayes text classification with positive features selected by statistical method</article-title>
          .
          <source>In: Int. Conf. on Autonomic Computing and Communications</source>
          ,
          <string-name>
            <surname>ICAC</surname>
          </string-name>
          <year>2009</year>
          . pp.
          <volume>28</volume>
          –
          <fpage>33</fpage>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Jarvis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bestgen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pepper</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Maximizing classification accuracy in native language identification</article-title>
          .
          <source>In: Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications</source>
          . pp.
          <volume>111</volume>
          –
          <fpage>118</fpage>
          . ACL, Atlanta, Georgia (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Kosmajac</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keselj</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Dalteam@ INLI-FIRE-2017: Native language identification using SVM with SGD training</article-title>
          .
          <source>In: FIRE-Working Notes</source>
          . pp.
          <volume>118</volume>
          –
          <issue>122</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Lakshmi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shambhavi</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>BMSCE ISE@ INLI-FIRE-2017: A simple n-gram based approach for native language identification</article-title>
          .
          <source>In: FIRE-Working Notes</source>
          . pp.
          <volume>115</volume>
          –
          <issue>117</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Li</given-names>
            <surname>Yanjun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.L.</given-names>
            ,
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.M.:</surname>
          </string-name>
          <article-title>Text clustering with feature selection by using statistical data</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>20</volume>
          (
          <issue>5</issue>
          ),
          <volume>641</volume>
          –
          <fpage>652</fpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dras</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Native language identification using stacked generalization</article-title>
          .
          <source>arXiv preprint arXiv:1703.06541</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Mohammadi</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Veisi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amini</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Native language identification using a mixture of character and word n-grams</article-title>
          .
          <source>In: Proc. of the 12th Workshop on Innovative Use of NLP for Building Educational Applications</source>
          . pp.
          <fpage>210</fpage>
          –
          <lpage>216</lpage>
          . ACL, Copenhagen, Denmark (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Nayel</surname>
            ,
            <given-names>H.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shashirekha</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Mangalore-University@INLI-FIRE-2017: Indian native language identification using support vector machines and ensemble approach</article-title>
          .
          <source>In: FIRE-Working Notes</source>
          . pp.
          <fpage>106</fpage>
          –
          <lpage>109</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the 5th author profiling task at PAN 2017: Gender and language variety identification in Twitter</article-title>
          .
          <source>Working Notes Papers of the CLEF</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Rozovskaya</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Algorithm selection and model adaptation for ESL correction tasks</article-title>
          .
          <source>In: Proceedings of the 49th Annual Meeting of the ACL: Human Language Technologies-Volume</source>
          <volume>1</volume>
          . pp.
          <fpage>924</fpage>
          –
          <lpage>933</lpage>
          . ACL, Portland, Oregon, USA (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Tetreault</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blanchard</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cahill</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chodorow</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Native tongues, lost and found: Resources and empirical evaluations in native language identification</article-title>
          .
          <source>Proceedings of COLING 2012</source>
          pp.
          <fpage>2585</fpage>
          –
          <lpage>2602</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Thenmozhi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kannan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aravindan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>SSN NLP@INLI-FIRE-2017: A neural network approach to Indian native language identification</article-title>
          .
          <source>In: FIRE-Working Notes</source>
          . pp.
          <fpage>113</fpage>
          –
          <lpage>114</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Thenmozhi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mirunalini</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aravindan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Decision tree approach for consumer health information search</article-title>
          .
          <source>In: FIRE-Working Notes</source>
          . pp.
          <fpage>221</fpage>
          –
          <lpage>225</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Thenmozhi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mirunalini</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aravindan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Feature engineering and characterization of classifiers for consumer health information search</article-title>
          .
          <source>In: Forum for Information Retrieval Evaluation</source>
          . pp.
          <fpage>182</fpage>
          –
          <lpage>196</lpage>
          . Springer (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>