=Paper=
{{Paper
|id=Vol-2036/T4-4
|storemode=property
|title=SSN_NLP@INLI-FIRE-2017: A Neural Network Approach to Indian Native Language Identification
|pdfUrl=https://ceur-ws.org/Vol-2036/T4-4.pdf
|volume=Vol-2036
|authors=D. Thenmozhi,Kawshik Kannan,Chandrabose Aravindan
|dblpUrl=https://dblp.org/rec/conf/fire/ThenmoziKA17
}}
==SSN_NLP@INLI-FIRE-2017: A Neural Network Approach to Indian Native Language Identification==
D. Thenmozhi, Kawshik Kannan and Chandrabose Aravindan
SSN College of Engineering, Chennai, Tamilnadu
theni_d@ssn.edu.in, kawshik98@gmail.com, aravindanc@ssn.edu.in

===ABSTRACT===
Native Language Identification (NLI) is the process of identifying the native language of non-native speakers based on their speech or writing. It has several applications, namely authorship profiling and identification, forensic analysis, second language identification, and educational applications. English is one of the prominent languages used by most non-English people in the world. The native language of non-English speakers may be easily identified based on their English accents. However, identifying the native language based on the user's posts and comments written in English is a challenging task. In this paper, we present a neural network approach to identify the native language of an Indian speaker based on the English comments posted in microblogs. Lexical features are extracted from the text posted by the user and are used to build a neural network classifier to identify the native language of the user. We have evaluated our approach using the data set given by the INLI@FIRE2017 shared task.

===KEYWORDS===
Neural Network, Machine Learning, Language Recognition, Indian Native Language Identification

===1 INTRODUCTION===
Native Language Identification (NLI) is the process of automatically identifying the native language of a person based on her/his speech or writing in another language. It has several applications, namely authorship profiling and identification [2], forensic analysis [3], second language identification [8] and educational applications [11]. Several research works have been reported on NLI based on the speaker's text [13], [5], [1], [4], [7], [9] and their speech [10], [12]. English is one of the languages commonly used by many people in the world, and several shared tasks on NLI have been conducted since 2013 to identify the native language based on English text and speech. In this work, we have focused on the shared task of INLI@FIRE2017 (co-located with the Forum for Information Retrieval Evaluation (FIRE), 2017), which aims to identify the native language of Indians based on their comments posted in social media in English [6]. The focus of the task is to develop techniques for identifying the native languages, namely Tamil, Hindi, Kannada, Malayalam, Bengali or Telugu, from a set of Facebook comments.

===2 PROPOSED APPROACH===
We have implemented a supervised approach for this INLI task. The steps used in our approach are given below.
* Preprocess the given text
* Extract linguistic features from the training data
* Build a neural network model from the features of the training data
* Predict the class label of each instance as one of the six languages, namely Tamil, Hindi, Kannada, Malayalam, Bengali or Telugu, using the model

We have implemented our methodology in Python for the INLI task. The data set used to evaluate the task consists of a set of training data for six Indian languages and test data. The numbers of training instances are 207, 211, 203, 200, 202 and 210 for the languages Tamil, Hindi, Kannada, Malayalam, Bengali and Telugu respectively, and the number of test instances is 783. The steps used in our approach are explained in detail in the following subsections.

====2.1 Feature Extraction====
As a preprocessing step, all the 'xml' tags are removed from the given text and only the body part of the given input is considered for further processing. Punctuation marks such as quotation marks, hyphens, underscores and apostrophes are removed from the text, and terms such as n't, &, 'm and 'll are replaced with 'not', 'and', 'am' and 'will' respectively before extracting the features. Each term of the text is annotated with parts of speech (POS) information such as noun, verb, adjective, adverb, and determiner. In general, nouns present in the text can be used as features. However, adjectives may also be helpful in identifying the native language. For example, from the post 'I attended my kutty brother Rams birthday party', the adjective 'kutty' may be used to identify the language as Tamil. So, in our approach, we have considered nouns and adjectives as features. All forms of nouns (NN*), namely NN, NNS and NNP, and all forms of adjectives (JJ*), namely JJ, JJR and JJS, are extracted from the text. The feature set is constructed by lemmatizing each extracted term and by eliminating all the duplicate terms. We have obtained the bag of words (BOW) by processing all the text of the given training data.

We have used the NLTK toolkit (http://www.nltk.org/) to preprocess the given data and to annotate the text with POS information. The WordNet Lemmatizer was used to lemmatize the terms that are extracted from the POS information. We have obtained a total of 12067 features from the training data. We have used the boolean model to construct the feature vectors for the instances of the training data.
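The paper does not include source code; the following is a minimal sketch of the feature-extraction pipeline described above, assuming NLTK's default English tokenizer, perceptron POS tagger and WordNet lemmatizer. The helper names and the exact cleaning rules are illustrative, not taken from the original implementation.

<pre>
# Minimal sketch of the described feature extraction (illustrative, not the authors' code).
# Assumes the NLTK data packages 'punkt', 'averaged_perceptron_tagger' and 'wordnet'
# have already been downloaded via nltk.download(...).
import re
import nltk
from nltk.stem import WordNetLemmatizer

REPLACEMENTS = {"n't": " not", "&": " and", "'m": " am", "'ll": " will"}
lemmatizer = WordNetLemmatizer()

def clean(text):
    """Strip xml tags and listed punctuation, expand the contractions named in the paper."""
    text = re.sub(r"<[^>]+>", " ", text)
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return re.sub(r"[\"',_\-`]", " ", text)

def extract_terms(text):
    """Return lemmatized nouns (NN*) and adjectives (JJ*) from one comment."""
    terms = []
    for word, tag in nltk.pos_tag(nltk.word_tokenize(clean(text))):
        if tag.startswith("NN"):
            terms.append(lemmatizer.lemmatize(word.lower(), pos="n"))
        elif tag.startswith("JJ"):
            terms.append(lemmatizer.lemmatize(word.lower(), pos="a"))
    return terms

def build_vocabulary(train_texts):
    """Bag of words: union of all extracted terms with duplicates removed."""
    vocab = sorted({t for text in train_texts for t in extract_terms(text)})
    return {term: i for i, term in enumerate(vocab)}

def boolean_vector(text, vocab):
    """Boolean feature vector: 1 if a vocabulary term occurs in the text, else 0."""
    vec = [0] * len(vocab)
    for term in extract_terms(text):
        if term in vocab:
            vec[vocab[term]] = 1
    return vec
</pre>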
====2.2 Language Identification====
We have applied a neural network approach to identify the native language of the user. The set of BOW features, along with the class labels Tamil, Hindi, Kannada, Malayalam, Bengali and Telugu from the training data, is used to build a model using a simple neural network with two hidden layers. Features are extracted for each instance of the test data (with unknown class label '?') in the same way as for the training data, using the features identified from the training data. One of the given labels, namely Tamil, Hindi, Kannada, Malayalam, Bengali or Telugu, is then predicted for each test instance using the built model.

We have used the Keras framework (https://keras.io/) with the Tensorflow backend to implement the neural network classifier for this problem. The number of BOW features (12067) constitutes the number of neurons in the input layer of the network. We have used the sequential model of Keras to construct our neural network. We have added two hidden layers with 64 and 32 neurons respectively, both with the 'RELU' activation function. The output layer has 6 neurons (to classify an instance into one of the 6 languages) with the 'SOFTMAX' activation function. We used the 'sparse_categorical_crossentropy' loss function with the 'SGD' optimizer to compile the model. We trained the model with a batch_size of 10 for 100 epochs and obtained a training accuracy of 98.1%.
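Given these settings, the network can be approximated with the Keras Sequential API as sketched below. The hyper-parameters are the ones reported above; the variable names and the commented training call are illustrative assumptions rather than the authors' code.

<pre>
# Sketch of the described classifier (Keras with the Tensorflow backend).
from keras.models import Sequential
from keras.layers import Dense

NUM_FEATURES = 12067   # size of the boolean BOW feature vector
NUM_CLASSES = 6        # Tamil, Hindi, Kannada, Malayalam, Bengali, Telugu

model = Sequential()
model.add(Dense(64, activation="relu", input_dim=NUM_FEATURES))  # first hidden layer
model.add(Dense(32, activation="relu"))                          # second hidden layer
model.add(Dense(NUM_CLASSES, activation="softmax"))              # one neuron per language

model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd",
              metrics=["accuracy"])

# X_train: boolean BOW vectors of shape (n_samples, 12067); y_train: integer labels 0..5
# model.fit(X_train, y_train, batch_size=10, epochs=100)
# predicted = model.predict(X_test).argmax(axis=1)
</pre>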
===3 RESULTS AND DISCUSSIONS===
Our approach for native language identification has been evaluated based on precision, recall and F1 measure for each language, as well as overall accuracy. The results obtained by our approach are presented in Table 1. A comparative study of the results of all the participants of INLI@FIRE2017 is available in [6].

{| class="wikitable"
|+ Table 1: Performance on Test Data
! Class !! Precision !! Recall !! F1-measure
|-
| BE || 46.20 || 76.20 || 57.60
|-
| HI || 49.40 || 16.30 || 24.60
|-
| KA || 39.60 || 48.60 || 43.60
|-
| MA || 31.70 || 21.70 || 25.80
|-
| TA || 27.50 || 49.00 || 35.30
|-
| TE || 27.00 || 21.00 || 23.60
|}

We have obtained an overall accuracy of 38.80% using our neural network approach for the Indian native language identification task. This is very poor compared to the training accuracy of 98.1% and is an indication of over-fitting. We need to explore regularization techniques such as dropout during training to avoid this.
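As an illustration of this suggestion, dropout layers could be inserted between the dense layers of the network described in Section 2.2. The sketch below is ours, and the dropout rate of 0.5 is an assumed value, not one reported in the paper.

<pre>
# Hedged sketch of a dropout-regularized variant of the model (rates are assumptions).
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(64, activation="relu", input_dim=12067))
model.add(Dropout(0.5))   # randomly zero half of the activations during training
model.add(Dense(32, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(6, activation="softmax"))
model.compile(loss="sparse_categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])
</pre>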
===4 CONCLUSION===
We have presented a system that uses a neural network model for identifying the native language, namely Tamil, Hindi, Kannada, Malayalam, Bengali or Telugu, of Indians from the English comments posted by them in social media. We have extracted linguistic features from the training data to build a neural network model with two hidden layers. The data set given by the INLI@FIRE2017 shared task has been used to evaluate our methodology. We have obtained an overall accuracy of 38.80%. This is very poor compared to the training accuracy and indicates over-fitting. Regularization techniques such as dropout may be used to improve generalization. A lexical database may be used to correct terms such as pls, sry, fyi, etc., present in social media text to improve the performance of the system. The performance may also improve if we select only the significant features using χ2 feature selection [14].

===ACKNOWLEDGMENTS===
We would like to thank the management of SSN Institutions for funding the High Performance Computing (HPC) lab where this work is being carried out.

===REFERENCES===
[1] Serhiy Bykh and Detmar Meurers. 2014. Exploring Syntactic Features for Native Language Identification: A Variationist Perspective on Feature Encoding and Ensemble Optimization. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. Ireland, 1962–1973.
[2] Dominique Estival, Tanja Gaustad, Ben Hutchinson, Son Bao Pham, and Will Radford. 2007. Author profiling for English emails. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics. ACL, Australia, 263–272.
[3] John Gibbons. 2003. Forensic linguistics: An introduction to language in the justice system. Wiley-Blackwell.
[4] Radu-Tudor Ionescu, Marius Popescu, and Aoife Cahill. 2014. Can characters reveal your native language? A language-independent approach to native language identification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL, 1363–1373.
[5] Scott Jarvis, Yves Bestgen, and Steve Pepper. 2013. Maximizing Classification Accuracy in Native Language Identification. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications. ACL, Atlanta, Georgia, 111–118.
[6] Anand Kumar M, Barathi Ganesh HB, Shivkaran S, Soman K P, and Paolo Rosso. 2017. Overview of the INLI PAN at FIRE-2017 Track on Indian Native Language Identification. In Notebook Papers of FIRE 2017. CEUR Workshop Proceedings, Bangalore, India.
[7] Shervin Malmasi and Mark Dras. 2017. Native Language Identification using Stacked Generalization. arXiv preprint arXiv:1703.06541 (2017).
[8] Shervin Malmasi, Mark Dras, et al. 2014. Language Transfer Hypotheses with Linear SVM Weights. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL, Qatar, 1385–1390.
[9] Elham Mohammadi, Hadi Veisi, and Hessam Amini. 2017. Native Language Identification Using a Mixture of Character and Word N-grams. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications. ACL, Copenhagen, Denmark, 210–216.
[10] Taraka Rama and Çağrı Çöltekin. 2017. Fewer features perform well at Native Language Identification task. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications. 255–260.
[11] Alla Rozovskaya and Dan Roth. 2011. Algorithm selection and model adaptation for ESL correction tasks. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. ACL, Portland, Oregon, USA, 924–933.
[12] Charese Smiley and Sandra Kübler. 2017. Native Language Identification using Phonetic Algorithms. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications. 405–412.
[13] Joel Tetreault, Daniel Blanchard, Aoife Cahill, and Martin Chodorow. 2012. Native tongues, lost and found: Resources and empirical evaluations in native language identification. Proceedings of COLING 2012 (2012), 2585–2602.
[14] D. Thenmozhi, P. Mirunalini, and Chandrabose Aravindan. 2016. Decision Tree Approach for Consumer Health Information Search. In FIRE (Working Notes). CEUR, Kolkata, India, 221–225.