=Paper= {{Paper |id=Vol-2266/T2-8 |storemode=property |title=NLPRL@INLI-2018: Hybrid gated LSTM-CNN model for Indian Native Language Identification |pdfUrl=https://ceur-ws.org/Vol-2266/T2-8.pdf |volume=Vol-2266 |authors=Rajesh Kumar Mundotiya,Manish Singh,Anil Kumar Singh |dblpUrl=https://dblp.org/rec/conf/fire/MundotiyaSS18 }} ==NLPRL@INLI-2018: Hybrid gated LSTM-CNN model for Indian Native Language Identification== https://ceur-ws.org/Vol-2266/T2-8.pdf
 NLPRL@INLI-2018: Hybrid gated LSTM-CNN
 model for Indian native language identification

       Rajesh Kumar Mundotiya1 , Manish Singh2 , and Anil Kumar Singh1
       1
           Department of Computer Science and Engineering, IIT(BHU), Varanasi
                  {rajeshkm.rs.cse16, aksingh.cse}@iitbhu.ac.in
                      2
                        Department of Linguistics, BHU, Varanasi
                            maneeshhsingh100@gmail.com



       Abstract. Native language identification (NLI) aims to determine the
       native language of an author from his or her writing style in English.
       Identifying Indian native languages from users' comments and posts on
       social media is a challenging task. To solve this problem, we present
       a hybrid gated LSTM-CNN model. The final vector of a sentence is
       generated at the hybrid gate by joining two distinct vectors of the
       sentence. The gate seeks the optimum mixture of the LSTM-level and
       CNN-level outputs. The input words for the LSTM and CNN are projected
       into a high-dimensional space by an embedding technique. We obtained
       88.50% accuracy during training on the provided social media dataset,
       and 17.10% accuracy was reported in the final testing done by the
       Indian Native Language Identification (INLI) workshop organizers.

       Keywords: Bi-LSTM · CNN · GloVe.


1     Introduction
Native Language Identification is the process of automatically identifying the
native language of an author from his or her writing style or speaking accent
in another language that was acquired as a second language [1]. It can identify
the writing structure based on the author's linguistic background. It has
several applications, namely authorship profiling and identification, forensic
analysis, second language identification and educational applications. English
is one of the most well-known and commonly used languages. In this shared
task, the goal is to identify the Indian native language of posts or comments
written in English on social media. The Indian native languages include
Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. The assumption behind
the dataset collection is that only native language speakers will read
native-language newspapers [1] [12].
We have tackled this problem with supervised learning as a classification
problem, but the main challenge is the insufficient dataset size. A couple of
datasets used in past research are freely available. The International Corpus
of Learner English (ICLE)3 is one of the first, appearing in the early studies.
3
    https://uclouvain.be/en/research-institutes/ilc/cecl/corpora.html

It was publicly used for predicting the native language of a learner based on
his/her writing style. It was released in 2002 and updated in 2009.

   In the following sections, we discuss related work in Section 2, describe
the proposed model and training procedure in Section 3, present the results
and analysis in Section 4 and, finally, draw conclusions in Section 6.


2   Related Work
Native language identification is a relatively new and significant problem.
Language learners are prone to make similar mistakes; if machines can learn
the same tendencies of making mistakes, this may help in developing systems
for the educational domain. Several researchers have worked on this problem
and on similar problems such as second language acquisition. In one of the
earliest works, Tomokiyo and Jones (2001) tried to discriminate non-native
statements from native statements written in English using Naïve Bayes [2].
Kochmar et al. (2011) performed experiments on predicting the native languages
of Indo-European learners. They treated this problem as binary classification
and used a linear-kernel SVM. The features used for prediction were n-grams
and words, and the errors were tagged manually within the corpus [3]. Besides
this, some other works [4], [5] also used SVMs with different features.
In the recent past, word embeddings and document embeddings have gained much
attention along with other features. Vector representations of documents were
generated with the distributed bag-of-words architecture using the Doc2Vec
tool. The authors developed a native language classifier using document and
word embeddings, with an accuracy of 82% on essays and 42% on speech data [6].
Yang et al. (2016) proposed a hierarchical attention network for classification
problems. Their model requires a large corpus to attend to the significant
words and sentences through attention mechanisms [10]. Kim (2014) used a
convolutional neural network and achieved state-of-the-art accuracy, but it
can only hold contextual information up to the window size [11].
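Kim's limitation noted above follows from how a text CNN works: each filter only sees a fixed window of consecutive words, and max-over-time pooling keeps the strongest response. The following is a minimal numpy sketch of one such filter (the window size, embedding dimension and random values are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def conv_max_pool(X, F):
    """Apply one convolutional filter over word embeddings.

    X: (n_words, emb_dim) sentence matrix of word embeddings.
    F: (k, emb_dim) filter spanning a window of k consecutive words.
    Slides the window over the sentence, applies ReLU, then
    max-pools over time, yielding a single scalar feature.
    """
    n, d = X.shape
    k = F.shape[0]
    # One activation per window position; context never exceeds k words.
    feats = np.array([np.sum(X[i:i + k] * F) for i in range(n - k + 1)])
    return np.max(np.maximum(feats, 0.0))

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 8))   # 10 words, 8-dim embeddings (illustrative)
F = rng.standard_normal((3, 8))    # a trigram (k = 3) filter
feat = conv_max_pool(X, F)
```

Because each filter spans only k words, any dependency longer than the window is invisible to the convolution, which is exactly the contextual limitation cited above.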
In 2017, another shared task on NLI was organized. The corpus included essays
and transcripts of utterances. According to Malmasi et al. (2017), ensemble
methods and meta-classifiers with syntactic or lexical features were the most
effective systems [7].


3   Model Description
The model architecture is shown in Figure 1.

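The hybrid gating described in the abstract, which joins the LSTM and CNN sentence vectors, can be sketched as follows. The exact gate formulation is not given in this excerpt, so the sigmoid-gated convex combination below is an assumed common form, with illustrative dimensions and random weights:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hybrid_gate(h_lstm, h_cnn, W, b):
    """Combine LSTM and CNN sentence vectors with a learned gate.

    Assumed form (not from the paper):
        g   = sigmoid(W @ [h_lstm; h_cnn] + b)
        out = g * h_lstm + (1 - g) * h_cnn
    so the gate picks an element-wise mixture of the two outputs.
    """
    concat = np.concatenate([h_lstm, h_cnn])
    g = sigmoid(W @ concat + b)          # gate values in (0, 1)
    return g * h_lstm + (1.0 - g) * h_cnn

rng = np.random.default_rng(0)
d = 4                                    # illustrative sentence-vector size
h_lstm = rng.standard_normal(d)          # stand-in LSTM sentence vector
h_cnn = rng.standard_normal(d)           # stand-in CNN sentence vector
W = rng.standard_normal((d, 2 * d))      # hypothetical gate weights
b = np.zeros(d)
s = hybrid_gate(h_lstm, h_cnn, W, b)
```

Since the gate values lie in (0, 1), each element of the final sentence vector lies between the corresponding LSTM and CNN outputs, matching the "optimum mixture" role described in the abstract.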
   Each document includes m sentences and each sentence within a document
consists of n words. The word-level input wi∗n is projected into a high-dimensional
vector space with the help of pretrained GloVe English word embeddings, wi∗n ∈