<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>INLI@FIRE-2018: A Native Language Identification System using Convolutional Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Professor, CUSAT</institution>
          ,
          <addr-line>Cochin 682022</addr-line>
          ,
          <country country="IN">INDIA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Research Scholar, CUSAT</institution>
          ,
          <addr-line>Cochin 682022</addr-line>
          ,
          <country country="IN">INDIA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>Native Language Identification is the problem of identifying the first language of speakers based on their writings in another language. The proposed approach is a deep learning based methodology using convolutional neural networks. Convolutional neural networks are a class of neural networks that have proven very effective in areas such as pattern recognition and classification. They are able to capture the local texture within text and can be used to find the representative patterns in a text document. The proposed system consists of a language identification model, which is trained on a corpus of 1233 documents. The experiments were conducted using the dataset provided for INLI@FIRE-2018. The results indicate that the system is capable of giving performance comparable to methods employing more sophisticated approaches.</p>
      </abstract>
      <kwd-group>
        <kwd>Convolutional Neural Networks</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Native Language Identification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Native Language Identification is the process of distinguishing the native language of a writer from his/her writings in a second language (English) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It is a well-known task that finds important applications in fields such as forensics and educational settings. Native language has long been used as an essential feature for authorship profiling and identification. Nowadays, owing to the enormous usage of social media sites and online interactions, receiving threats is a common problem for users. If a comment or post conveys any type of threat, then recognizing the native language of the commenter (the one who commented or posted it) is one of the crucial steps in finding the source. Speakers of different languages make different types of errors when learning a new language [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Hence Native Language Identification also finds applications in educational environments, supplying targeted feedback to language students about their errors.
      </p>
      <p>
        Hindi is by far the most widely spoken language in India. Even though roughly 40% of the population speak Hindi, people use English as their major second language. English is spoken natively by around 375 million people across the globe. It is the second official language of India and is used for business, teaching, learning, and trade on a day-to-day basis. Around 10% of India's population speak English and use it in their day-to-day activities, but it is a first language for only 0.019% of people in the country, while being a second language for around 125 million people all over the world [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This 10% of the population comes from different parts of the country and has various native languages. Identification of the native language of such speakers is a challenging task that finds important applications in the social media world.
      </p>
      <p>The structure of this paper is as follows. Section 2 briefly reviews similar work in this area. Section 3 discusses the task description and details of the dataset. Section 4 explains the methodology, and Section 5 demonstrates the results and evaluation metrics. Section 6 concludes the article along with some routes for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related works</title>
      <p>
        Native Language Identification is of considerable importance in different areas of Natural Language Processing [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Most work on NLI takes English as the second language, treats NLI as a supervised classification task, and uses statistical models trained on data from various languages. The first work in the field was reported by Koppel et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], who explored a multitude of features for NLI, including average sentence length, average word length, word n-grams, character n-grams, POS n-grams, content words, function words, spelling errors, and grammatical errors. An SVM was used to train these features on the International Corpus of Learner English (ICLEv2) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Unigrams and bigrams are the most explored n-grams in previous works.
      </p>
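      <p>As an illustration of this family of feature-based approaches (not part of the proposed system), the character n-gram frequencies used in these early studies can be extracted in a few lines of Python; the classifier that would consume them, e.g. an SVM, is omitted.</p>
      <preformat>
```python
# Sketch: character n-gram frequency features, as used in early NLI work.
# The classifier that would consume these features (e.g. an SVM) is omitted.
from collections import Counter

def char_ngrams(text, n):
    """Return a frequency table of the character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

feats = char_ngrams("the theme", 3)
# 'the' occurs twice: once as the word and once inside 'theme'
```
      </preformat>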
      <p>
        Syntactic features of the text have also been the focus of recent works. Wong and Dras [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] used production rules from different parsers as features for a language identification system. Similarly, Swanson and Charniak [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] investigated the benefit of Tree Substitution Grammars for NLI. Tetreault [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] experimented with Tree Substitution Grammars along with dependency features extracted from the Stanford parser. Tree fragments returned by a Tree Substitution Grammar proved beneficial for distinguishing native from non-native English writers by capturing their syntactic structures. Similarly, CFG rules augmented with grandparent nodes were found to outperform simple CFG rules in authorship attribution tasks [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        Semantic features are the least explored for NLI. Gamon extracted semantic features from semantic dependency graphs [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. These include binary semantic features and semantic modification relations, used together as a feature set for classification. The semantic features capture number and gender information of nouns and pronouns as well as tense and aspectual features of verbs, while the semantic modification relations extract the semantic relations between a node and all its descendants within a semantic graph. Experiments showed that semantic features combined with syntactic features improved accuracy on authorship classification tasks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Throughout the literature, we found that none of the existing works utilizes deep learning based methodologies for language identification; hence we decided on an approach that uses a CNN for the above-mentioned problem.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Task Description and Dataset Details</title>
      <p>The task focuses on identifying the first language of an author from a given text/XML file containing a set of Facebook comments in English. Six Indian languages are considered for this study: Tamil, Hindi, Kannada, Malayalam, Bengali, and Telugu. Spoken English shows significant variations across the different states of India, and it is relatively easy to recognize the native language of a speaker from his English accent; finding the first language of a writer from his comments or posts in English is a far more difficult task.</p>
      <p>
        The shared dataset contains data from six different Indian languages. The training data is a set of files in XML format. Each language has around 200 files of Facebook comments, and each file contains around 150 words of comments. Sentence segmentation is carried out using regular expressions. Statistics of the training data are shown in Table 1. The testing data comes in two folders, test1 and test2: test1 consists of 783 files and test2 contains 1185 files from the above-mentioned languages.
      </p>
      <p>
        The proposed system is a CNN-based language identification model which predicts the native language of a writer from his scripts. CNNs are responsible for important breakthroughs in image classification and are at the core of most computer vision systems today, but they are less common in text analytics. CNNs have proved successful in various text classification problems in recent years [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. They have the important property of preserving 2D spatial orientation in computer vision problems; in text, these orientations have a one-dimensional structure. A generalized overview of convolutional neural networks is shown in Fig. 1.
      </p>
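      <p>The regex-based sentence segmentation mentioned above can be sketched as follows; the exact pattern used in our system is not specified, so this simple punctuation split is an assumption.</p>
      <preformat>
```python
# Sketch of regex-based sentence segmentation; the simple punctuation
# split below is an assumption, not the system's exact pattern.
import re

def segment(text):
    # Split on runs of ., ! or ? followed by whitespace; drop empty pieces.
    parts = re.split(r"[.!?]+\s+", text.strip())
    return [p for p in parts if p]

sentences = segment("This is fine. Is it? Yes!")
```
      </preformat>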
      <p>The problem is framed as a text classification task with language names as labels (classes); the number of classes equals the number of languages considered for the study. Text from each language in the training data is sent to a sentence segmentation module, where the raw text is converted into a set of sentences using regular expressions. Raw word sequences are meaningless to the network, so the words are converted into numeric values using dictionaries. For this we create a vocabulary of words, an array which stores each word in the training data exactly once, together with two dictionaries that map each word to its index and back. Two special words, 'ZERO' and 'UNKNOWN', are added to the vocabulary: 'ZERO' is used to pad all sequences to a uniform length and 'UNKNOWN' stands in for out-of-vocabulary words. The sequences of strings are then converted into sequences of numbers using these dictionaries. Sentences may have different lengths, but CNN training requires sequences of uniform length, so shorter sentences are padded with 'ZERO's (zero padding); that is why the word ZERO is in the vocabulary. Each sentence in the training data is labeled with the corresponding language, so our final training data consists of sentences and their labels; identifying the patterns within these sentences is our ultimate goal.</p>
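      <p>The preprocessing described above can be sketched in plain Python (the toy sentences and the sequence length are illustrative placeholders):</p>
      <preformat>
```python
# Minimal sketch of the described preprocessing: build a vocabulary with the
# special tokens 'ZERO' and 'UNKNOWN', map words to indices, and zero-pad
# every sequence to a common length.
def build_vocab(sentences):
    vocab = ["ZERO", "UNKNOWN"]
    for sent in sentences:
        for word in sent.split():
            if word not in vocab:
                vocab.append(word)
    word2idx = {w: i for i, w in enumerate(vocab)}
    idx2word = {i: w for i, w in enumerate(vocab)}
    return word2idx, idx2word

def encode(sentence, word2idx, maxlen):
    # Unknown words map to 'UNKNOWN'; sequences are cut or padded to maxlen.
    ids = [word2idx.get(w, word2idx["UNKNOWN"]) for w in sentence.split()]
    ids = ids[:maxlen]
    return ids + [word2idx["ZERO"]] * (maxlen - len(ids))

w2i, i2w = build_vocab(["the cat sat", "the dog ran"])
seq = encode("the cat barked", w2i, 5)  # 'barked' is out of vocabulary
```
      </preformat>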
      <p>
        The Sequential model of Keras is used for the implementation [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The network is designed with four convolutional layers, two max-pooling layers, two dense layers, and an embedding layer. The first layer is an embedding layer which performs the word embeddings; the embedding size is fixed at 100. The second is a convolutional layer, chosen for its ability to capture local context. The following layers alternate max-pooling and convolutional layers to acquire the patterns within a sentence. We used ReLU as the activation function to introduce nonlinearity. The number of filters in every convolutional layer is 256, and the kernel size is fixed at 7 for the first two convolutional layers and 3 for the remaining ones. The final dense layer uses softmax activation units. During training, the filters slide over full rows of the matrix (words), and the CNN automatically learns the values of its filters based on the task assigned to it. The architecture of the proposed network is shown in Table 2.
      </p>
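      <p>A sketch of this architecture in Keras follows. The exact interleaving of the four convolutional and two max-pooling layers and the width of the first dense layer are our assumptions, and <monospace>vocab_size</monospace>, <monospace>seq_len</monospace>, and <monospace>num_classes</monospace> are placeholders.</p>
      <preformat>
```python
# Sketch of the described network in Keras. Layer interleaving and the
# first dense layer's width are assumptions; sizes are placeholders.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Input, Embedding, Conv1D, MaxPooling1D,
                                     Flatten, Dense, Dropout)

vocab_size, seq_len, num_classes = 20000, 50, 6

model = Sequential([
    Input(shape=(seq_len,)),
    Embedding(vocab_size, 100),          # embedding size fixed at 100
    Conv1D(256, 7, activation="relu"),   # kernel size 7 for the first two
    MaxPooling1D(2),
    Conv1D(256, 7, activation="relu"),
    MaxPooling1D(2),
    Conv1D(256, 3, activation="relu"),   # kernel size 3 for the rest
    Conv1D(256, 3, activation="relu"),
    Flatten(),
    Dense(128, activation="relu"),
    Dropout(0.5),
    Dense(num_classes, activation="softmax"),  # one unit per language
])
```
      </preformat>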
      <p>
        Different configurations of the network were attempted, with experiments conducted on both deep and shallow convolutional neural networks. The performance of different CNN architectures on the test data is given in Table 3; the best results are given by the architecture described above. In our experiments, we selected the first 90% of the data for training and the remaining 10% for testing. The batch size is fixed at 64, categorical cross-entropy is used as the loss function, and we used Adam, an efficient gradient descent algorithm, as the optimizer. Dropout is used to prevent overfitting [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The model is compiled with TensorFlow as the backend, trained for 10 epochs, and saved for the testing phase.
      </p>
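      <p>The data split and label encoding implied by this setup can be sketched in plain Python (the pair list below is an illustrative placeholder):</p>
      <preformat>
```python
# Sketch of the evaluation setup described above: the first 90% of the
# sentence/label pairs train the model, the rest test it, and labels are
# one-hot encoded for the categorical cross-entropy loss.
def split_90_10(pairs):
    cut = int(len(pairs) * 0.9)
    return pairs[:cut], pairs[cut:]

def one_hot(label, num_classes):
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

pairs = [("sentence %d" % i, i % 6) for i in range(100)]  # placeholder data
train, test = split_90_10(pairs)
y0 = one_hot(train[0][1], 6)
```
      </preformat>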
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>Experiments are also conducted to measure the effect of training data size on the system performance. It is observed that the performance of the system increases with the increase in training data size. Figure 2 shows the effect of training data size on our best-performing CNN architecture. Hence it is better to have a larger training corpus when dealing with deep learning based classification methodologies.</p>
      <table-wrap id="tab3">
        <label>Table 3</label>
        <caption>
          <p>Performance of different CNN architectures on the test data.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Name</th><th>Network Configuration</th><th>Accuracy</th></tr>
          </thead>
          <tbody>
            <tr><td>CNN1</td><td>1 Conv, 1 Maxpool, 1 Dense, 1 Dropout</td><td>17.5%</td></tr>
            <tr><td>CNN2</td><td>2 Conv, 2 Maxpool, 2 Dense, 1 Dropout</td><td>21.2%</td></tr>
            <tr><td>CNN3</td><td>3 Conv, 2 Maxpool, 2 Dense, 2 Dropout</td><td>22.2%</td></tr>
            <tr><td>CNN4</td><td>4 Conv, 3 Maxpool, 2 Dense, 2 Dropout</td><td>25.7%</td></tr>
            <tr><td>CNN5</td><td>4 Conv, 4 Maxpool, 2 Dense, 2 Dropout</td><td>25.3%</td></tr>
            <tr><td>CNN6</td><td>5 Conv, 5 Maxpool, 2 Dense, 2 Dropout</td><td>24.7%</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>We used accuracy to quantify the performance of our model.
Accuracy computes the degree to which a prediction conforms to the true value. The proposed system was tested with the test datasets provided by the task organizers. Our system predicts a tag for each sentence in a post (comment), but the goal is to predict a tag for each XML file (post), so we labeled each post with the majority label among its sentence-level predictions. Table 4 demonstrates the results of our experiments on both datasets; it is clear from the table that the test2 results are far better than the test1 results. Different runs correspond to different architectures of the proposed network.
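The post-level voting can be sketched as follows (the label strings are illustrative placeholders):
      <preformat>
```python
# Sketch: each post is labeled with the most frequent label among the
# predictions for its individual sentences.
from collections import Counter

def label_post(sentence_predictions):
    return Counter(sentence_predictions).most_common(1)[0][0]

post_label = label_post(["HI", "TA", "HI", "HI", "BE"])
```
      </preformat>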
</p>
      <p>In this article, we have discussed a deep learning based native language identification system. The distinctive feature of our approach is the use of a convolutional neural network for this task. The main reason we preferred a CNN over traditional feature-based methods is its ability to capture local texture in a sequence. We found that the accuracy of the system increases with the size of the training data, so a larger training corpus should yield improved performance. Accuracy could also be improved by using pretrained word embeddings, which we could not attempt due to insufficient computing resources. Beyond NLI, convolutional neural networks can be applied efficiently to various language processing problems, and we hope to apply CNN-based methods to applications such as text classification and sentiment analysis.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. The top 10 most spoken languages in India. https://www.listenandlearnusa.com/blog/the-top-10-most-spoken-languagesin-india, accessed: 2018-08-03
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Anand Kumar</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>B.G.H.</surname>
          </string-name>
          , P,
          <string-name>
            <surname>S.K.</surname>
          </string-name>
          :
          <article-title>Overview of the INLI@FIRE-2018 track on Indian native language identification</article-title>
          .
          <source>In: workshop proceedings of FIRE</source>
          <year>2018</year>
          , FIRE-2018, Gandhinagar, India, December 6-9, CEUR Workshop Proceedings (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Anand Kumar</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barathi Ganesh</surname>
            <given-names>HB</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>S.K.P.</given-names>
            ,
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <surname>P.</surname>
          </string-name>
          :
          <article-title>Overview of the INLI PAN at FIRE-2017 track on Indian native language identification</article-title>
          .
          <source>In: Notebook Papers of FIRE</source>
          <year>2017</year>
          , FIRE-2017, Bangalore, India, December 8-10, CEUR Workshop Proceedings (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Chollet</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , et al.: Keras. https://github.com/fchollet/keras (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Banerjee</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Choi</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Syntactic stylometry for deception detection</article-title>
          .
          <source>In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume</source>
          <volume>2</volume>
          . pp.
          <fpage>171</fpage>
          -
          <lpage>175</lpage>
          .
          Association for Computational Linguistics (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Gamon</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Linguistic correlates of style: authorship classification with deep linguistic analysis features</article-title>
          .
          <source>In: Proceedings of the 20th international conference on Computational Linguistics</source>
          . p.
          <fpage>611</fpage>
          .
          Association for Computational Linguistics (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Granger</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dagneaux</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meunier</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paquot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>International corpus of learner english (</article-title>
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Convolutional neural networks for sentence classification</article-title>
          .
          <source>arXiv preprint arXiv:1408.5882</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zigdon</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Determining an author's native language by mining a text for errors</article-title>
          .
          <source>In: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining</source>
          . pp.
          <fpage>624</fpage>
          -
          <lpage>628</lpage>
          . ACM
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Learner English: A teacher's guide to interference and other problems</article-title>
          . Ernst Klett Sprachen (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
          </string-name>
          , R.:
          <article-title>Dropout: A simple way to prevent neural networks from overfitting</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          <volume>15</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1929</fpage>
          -
          <lpage>1958</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Swanson</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Charniak</surname>
          </string-name>
          , E.:
          <article-title>Extracting the native language signal for second language acquisition</article-title>
          .
          <source>In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          . pp.
          <fpage>85</fpage>
          -
          <lpage>94</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Tetreault</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blanchard</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cahill</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chodorow</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Native tongues, lost and found: Resources and empirical evaluations in native language identification</article-title>
          .
          <source>Proceedings of COLING 2012</source>
          pp.
          <fpage>2585</fpage>
          -
          <lpage>2602</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>S.M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dras</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Exploiting parse structures for native language identification</article-title>
          .
          <source>In: Proceedings of the Conference on Empirical Methods in Natural Language Processing</source>
          . pp.
          <fpage>1600</fpage>
          -
          <lpage>1610</lpage>
          . Association for Computational Linguistics (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>