NLPRL@INLI-2018: Hybrid gated LSTM-CNN model for Indian native language identification

Rajesh Kumar Mundotiya1, Manish Singh2, and Anil Kumar Singh1
1 Department of Computer Science and Engineering, IIT (BHU), Varanasi
{rajeshkm.rs.cse16, aksingh.cse}@iitbhu.ac.in
2 Department of Linguistics, BHU, Varanasi
maneeshhsingh100@gmail.com

Abstract. Native language identification (NLI) focuses on determining the native language of an author based on his or her writing style in English. Indian native language identification, based on users' comments and posts on social media, is a challenging task. To solve this problem, we present a hybrid gated LSTM-CNN model. The final vector of a sentence is generated at the hybrid gate by joining two distinct vectors of the sentence; the gate seeks the optimum mixture of the LSTM-level and CNN-level outputs. The input words for the LSTM and the CNN are projected into a high-dimensional space by an embedding technique. We obtained 88.50% accuracy during training on the provided social media dataset, while 17.10% was reported in the final testing done by the Indian native language identification (INLI) workshop organizers.

Keywords: Bi-LSTM · CNN · GloVe

1 Introduction

Native language identification is the process of automatically identifying the native language of an author from his or her writing, or speaking accent, in another language that was acquired as a second language [1]. It can identify the writing structure based on the author's linguistic background. It can be used for several applications, namely authorship profiling and identification, forensic analysis, second language identification and educational applications. English is one of the best-known and most commonly used languages among humans. In this shared task, the goal is to identify the Indian native language of an author writing on social media, as a post or a comment, in English. The Indian native languages include Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu.
The assumption behind this dataset collection is that only native language speakers read native-language newspapers [1], [12]. We have tackled this problem by supervised learning, as a classification problem, but the main challenge here is the insufficient dataset size. There are a couple of datasets used in past research which are freely available. The International Corpus of Learner English (ICLE)3 corpus is one of the first, appearing in the early studies. It was publicly used for predicting the native language of a learner based on his/her writing style. It was released in 2002 and updated in 2009.

3 https://uclouvain.be/en/research-institutes/ilc/cecl/corpora.html

In the following sections, we review related work in Section 2, describe the proposed model and training procedure in Section 3, present the results and analysis in Section 4 and, finally, draw conclusions in Section 6.

2 Related Work

Native language identification is a new and significant problem. Language learners are prone to make similar kinds of mistakes; if machines can learn the same tendencies of making mistakes, this may help in developing systems for the educational domain. Several researchers have worked on this problem and on similar problems such as second language acquisition. In one of the earliest works, Tomokiyo and Jones (2001) tried to discriminate non-native statements from native statements written in English using Naïve Bayes [2].

Kochmar et al. (2011) performed experiments on predicting the native languages of Indo-European learners. They treated this problem as binary classification and used a linear-kernel SVM. The features used for prediction were n-grams and words, and the errors were tagged manually within the corpus [3]. Besides this, some other works [4], [5] also used SVMs with different features. In the recent past, word embeddings and document embeddings have gained much attention along with other features.
Vector representations for documents were generated with the distributed bag-of-words architecture of the Doc2Vec tool. The authors developed a native language classifier using document and word embeddings, with an accuracy of 82% on essays and 42% on speech data [6]. Yang et al. (2016) proposed a hierarchical attention network for the classification problem; their approach requires a large corpus to attend to the significant words and sentences through attention mechanisms [10]. Kim (2014) used a convolutional neural network and obtained state-of-the-art accuracy, but such a model can only hold contextual information up to the window size [11]. In 2017, another shared task on NLI was organized. The corpus included essays and transcripts of utterances. According to Malmasi et al. (2017), ensemble methods and meta-classifiers with syntactic or lexical features were the most effective systems [7].

3 Model Description

The model architecture is shown in Figure 1. Each document includes m sentences, and each sentence within a document consists of n words. The word-level input w_{i,n} is projected into a high-dimensional vector space with the help of pretrained GloVe English word embeddings, w_{i,n} ∈
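The hybrid gate that combines the LSTM-level and CNN-level sentence vectors can be sketched as below. The exact parameterisation of the gate is not given in this excerpt, so the elementwise formulation g = σ(W[h_lstm; h_cnn] + b), h = g ⊙ h_lstm + (1 − g) ⊙ h_cnn, as well as the names `hybrid_gate`, `h_lstm`, and `h_cnn`, are illustrative assumptions, not the authors' exact method:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hybrid_gate(h_lstm, h_cnn, W, b):
    """Blend two sentence vectors with a learned elementwise gate.

    g = sigmoid(W @ [h_lstm; h_cnn] + b)   # gate values in (0, 1)
    h = g * h_lstm + (1 - g) * h_cnn       # convex combination per dimension
    """
    g = sigmoid(W @ np.concatenate([h_lstm, h_cnn]) + b)
    return g * h_lstm + (1.0 - g) * h_cnn

# Toy example with hidden dimension d = 4 (random stand-ins for the
# LSTM and CNN outputs; in the real model these come from trained layers).
rng = np.random.default_rng(0)
d = 4
h_lstm = rng.standard_normal(d)
h_cnn = rng.standard_normal(d)
W = rng.standard_normal((d, 2 * d)) * 0.1  # gate weights
b = np.zeros(d)                            # gate bias

h = hybrid_gate(h_lstm, h_cnn, W, b)
assert h.shape == (d,)
```

Because the gate output lies in (0, 1), each dimension of the fused vector is a convex combination of the corresponding LSTM and CNN dimensions, which is one way to realise the "optimum mixture" described in the abstract.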