<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Mangalore University INLI@FIRE2018: Artificial Neural Network and Ensemble Based Models for INLI</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hamada A. Nayel</string-name>
          <email>hamada.ali@fci.bu.edu.eg</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>H. L. Shashirekha</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Faculty of Computers and Informatics, Benha University</institution>
          ,
          <addr-line>Benha</addr-line>
          ,
          <country country="EG">Egypt</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, Mangalore University</institution>
          ,
          <addr-line>Mangalore</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, the systems submitted by the Mangalore University team for the Indian Native Language Identification (INLI) task are described. Native Language Identification (NLI) has different applications such as social media analysis, authorship identification, second language acquisition and forensic investigation. We submitted three systems using an Artificial Neural Network (ANN) model and an Ensemble approach. All three submitted systems achieved the same accuracy of 35.30% and secured the second rank over all submissions for the task.</p>
      </abstract>
      <kwd-group>
        <kwd>Artificial Neural Network</kwd>
        <kwd>Native Language Identification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Native Language Identification (NLI) aims at identifying the native language
(L1) of users from text or speech produced in another, later-learned language (L2). NLI is
an important task that has many applications in different areas such as
social media analysis, authorship identification, second language acquisition and
forensic investigation. In forensic analysis [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], NLI helps to glean information about
the discriminant L1 cues in an anonymous text. Second Language Acquisition
(SLA) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] studies the transfer effects of native languages on a later-learned
language. In academics, automatic correction of grammatical errors is an
important application of NLI [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. NLI can be used as a feature in the authorship
identification task [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which aims at assigning a text to one of a predefined
list of authors. Authorship identification is used in the investigation of terrorist
communications [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and in digital crime investigation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
In this era, social media is overwhelming our lives. The majority of people
communicate and discuss different topics using different social media platforms
such as Google+, Facebook and Twitter. While communicating with each
other, Indians prefer to use English because their native languages are different. In
addition, most software and keyboards do not support input using Indian
language characters. So, people use a standard English keyboard to write their
own words as transliterated words.
      </p>
      <p>
        The task [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] aims at identifying the native language of the writer from the given
Facebook comment written in the English language. Six Indian languages - Tamil,
Hindi, Kannada, Malayalam, Bengali and Telugu are considered for this shared
task.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Many researchers have explored the task of NLI for various applications. Jarvis
et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] used SVM to create a model for NLI and reported an accuracy of 83.6%.
N-grams, PoS tags and lemmas have been used to create a feature space model
for training the classifier. They tested the performance of their system using the
TOEFL11 dataset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The TOEFL11 is a collection of essays written by
learners from the following native languages backgrounds: Arabic, Chinese, French,
German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish. In this
work, the feature set was not sufficient to cover the characteristics of different
languages. Tetreault et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] used an ensemble approach to build a classifier to
improve the performance of base classifiers. A wide range of features was used to
build an ensemble of logistic regression learners. Such features include word and
character n-gram, PoS, function words, writing quality markers and spelling
errors. In addition, a set of syntactic features such as Tree Substitution Grammars
and dependency features extracted using the Stanford parser (http://nlp.stanford.edu:8080/parser/) have been used.
The system, evaluated using the TOEFL11 and International Corpus of Learner
English (ICLE) datasets, achieved state-of-the-art accuracies of 90.1% and
80.9% respectively.
      </p>
      <p>
        Nayel and Shashirekha [
        <xref ref-type="bibr" rid="ref12 ref9">9, 12</xref>
        ] used SVM and an ensemble approach for the first
version of INLI and achieved accuracies of 47.60% and 47.30% respectively.
      </p>
      <sec id="sec-2-1">
        <title>Approaches</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Artificial Neural Networks</title>
      <p>Artificial Neural Networks (ANNs) are inspired by the mechanism of brain
computation, which consists of computational units called neurons. The connections
between ANNs and the brain are in fact rather slim. In the metaphor, a
neuron has scalar inputs with associated weights and an output. The neuron multiplies
each input by its weight, sums the products and transforms the sum into an output by
applying a non-linear function called the activation function. Table 1 shows examples
of activation functions. The structure of the biological neuron and an example
of an artificial neuron model with n inputs and one output are shown in Figures
1(a) and 1(b) respectively. In this example, a neuron receives simultaneous inputs
X = (x1, x2, ..., xn) associated with weights W = (w1, w2, ..., wn) and a bias b, and
calculates the output as:</p>
      <p>y = f(W · X + b)    (1)
where f is the activation function.
An ANN comprises a large number of neurons arranged in different layers. An ANN
model basically consists of three layers: an input layer, a number of hidden layers
and an output layer. The input layer contains a set of neurons called input nodes,
which receive the raw inputs directly. The hidden layers receive the data from the
input nodes and are responsible for processing these data by calculating the
weights of neurons at each layer. These weights are called connection weights
and are passed from one node to another. The number of nodes in the hidden layers
influences the number of connections. During the training phase, connection weights
are adjusted so as to predict the correct class label of the input. The output
layer receives the processed data and uses its activation function to generate the final
output. This kind of ANN, in which information flows in one direction, is called a
feed-forward ANN. Figure 2 shows an example of a feed-forward ANN with two hidden
layers. An ANN is called fully connected if each node in a layer is connected to
all nodes in the subsequent layer.
Fig. 1. (a) The structure of the biological neuron; (b) a simple artificial neuron model.</p>
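        <p>As a minimal sketch (not the authors' implementation), the single-neuron computation of Eq. (1) can be written in Python, here assuming the logistic function from Table 1 as the activation:</p>
        <p>
```python
import math

def logistic(z):
    # logistic (sigmoid) activation: 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b, f=logistic):
    # y = f(W . X + b): weighted sum of the inputs plus a bias, passed through f
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return f(z)

# with these weights the weighted sum is 0.5*1.0 + (-0.25)*2.0 + 0 = 0
y = neuron(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0)  # logistic(0) = 0.5
```
        </p>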
      <p>
        Most classification tasks use a single classifier. However, for some data
one classifier may give good results while another may not perform well.
Further, there is no generic rule that helps to choose a classifier for a particular
application and data. So, instead of experimenting with single classifiers one by
one in search of good results, it is beneficial to pool several such classifiers
and then take a collective decision, similar to a decision taken by a
committee rather than by an individual. This technique, which overcomes the weakness of
some classifiers using the strength of other classifiers, is termed an "ensemble".
The ensemble approach has been applied to different tasks such as BioNER [
        <xref ref-type="bibr" rid="ref11 ref13">11, 13</xref>
        ],
word segmentation [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and word sense disambiguation [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
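      <p>The majority-voting idea behind an ensemble can be sketched as follows (an illustrative fragment, not the authors' code): each base classifier predicts a label for an instance, and the most frequent label becomes the ensemble's decision.</p>
      <p>
```python
from collections import Counter

def majority_vote(predictions):
    # predictions: labels output by the base classifiers for one instance;
    # the most common label is the ensemble's decision (ties break arbitrarily)
    return Counter(predictions).most_common(1)[0][0]

# three of four hypothetical base classifiers vote "HI"
decision = majority_vote(["HI", "TA", "HI", "HI"])  # "HI"
```
      </p>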
      <p>Fig. 2. A Simple Feed-Forward ANN Structure</p>
      <p>The six languages are Kannada (KA), Tamil (TA),
Hindi (HI), Telugu (TE), Bengali (BE) and Malayalam (MA). Considering the
languages as a set of classes L = {KA, TA, HI, TE, BE, MA} and the comments
as individual instances, the task of identifying the native language can be
considered as a classification problem that assigns one of the predefined languages
of L to a new unlabelled comment.</p>
      <p>
        The general framework of our system is as described in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. A vector space model
using Term Frequency/Inverse Document Frequency (TF/IDF) has been used
to represent the comments. An ANN based classifier is designed for the first and second
submissions. The hidden layer of the first submission contains 70 neurons and the
activation function is the logistic function. The hidden layer of the second submission
contains 80 neurons and the activation function is the identity function. An
ensemble approach using the majority voting technique has been used for designing the
third submission. Four ANN based models with different parameters (shown in
Table 2) have been used as base classifiers to build the ensemble classifier.
      </p>
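      <p>The TF/IDF representation of comments can be sketched in pure Python as below. This is one common weighting scheme; the paper does not specify its exact variant, so the formula here is an assumption for illustration:</p>
      <p>
```python
import math
from collections import Counter

def tfidf(docs):
    # docs: list of tokenized comments; returns one {term: weight} dict per document
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # weight = term frequency * inverse document frequency
        vectors.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return vectors

vecs = tfidf([["good", "movie", "good"], ["movie", "bad"]])
# "movie" occurs in every document, so its idf (and hence its weight) is 0
```
      </p>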
      <p>Table 2 lists the activation functions of the four base classifiers: Logistic, Logistic, Tanh and Identity.</p>
      <sec id="sec-3-1">
        <title>Results and Discussion</title>
        <p>
          Accuracy and class-wise Precision (P), Recall (R) and F-measure have been used
for evaluating the submitted systems [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Cross-Validation (CV) technique has
been used while building the systems. Table 3 shows the 10-fold CV accuracy
for the three systems.
Performance evaluations of the first, second and third submissions are shown
in Tables 4, 5 and 6 respectively. The accuracy of each of the submitted systems
is 35.30% and all of them rank second among all the submissions.
In all the three submissions, the lowest and the best performance were reported
for the Hindi and Bengali languages respectively. Most
native speakers of Indian languages have some knowledge of Hindi, which affects
their comments when writing in English.
        </p>
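        <p>For reference, the evaluation measures used above can be computed as in the following self-contained sketch (toy labels, not the official task scorer):</p>
        <p>
```python
def accuracy(gold, pred):
    # fraction of instances whose predicted label matches the gold label
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def precision_recall_f1(gold, pred, label):
    # class-wise Precision, Recall and F-measure for one language label
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```
        </p>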
      </sec>
      <sec id="sec-3-2">
        <title>Conclusion</title>
        <p>In this work, ANN and Ensemble based classifiers have been used to design
systems for INLI 2018. All the designed classifiers reported the same accuracy and
achieved the second rank over all submissions for the task. This work can be
improved using different structures of ANN and deep learning models. In
addition, improving the input representation will improve the performance of the
systems.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abbasi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
          </string-name>
          , H.:
          <article-title>Applying Authorship Analysis to Extremist-Group Web Forum Messages</article-title>
          .
          <source>IEEE Intelligent Systems</source>
          <volume>20</volume>
          (
          <issue>5</issue>
          ),
          <fpage>67</fpage>
          –
          <lpage>75</lpage>
          (Sep
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Blanchard</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tetreault</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Higgins</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cahill</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chodorow</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <source>TOEFL11</source>
          :
          <article-title>A corpus of non-native English</article-title>
          .
          <source>ETS Research Report Series 2013(2)</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Chaski</surname>
            ,
            <given-names>C.E.</given-names>
          </string-name>
          :
          <article-title>Who's at the keyboard? Authorship attribution in digital evidence investigations</article-title>
          .
          <source>International Journal of Digital Evidence</source>
          <volume>4</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          –
          <lpage>13</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Estival</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaustad</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pham</surname>
            ,
            <given-names>S.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hutchinson</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Author profiling for English emails</article-title>
          .
          <source>In: "Proceedings of the 10th Conference of the Paci c Association for Computational Linguistics"</source>
          . pp.
          <fpage>263</fpage>
          –
          <lpage>272</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Gibbons</surname>
          </string-name>
          , J.:
          <article-title>Forensic linguistics: An introduction to language in the justice system</article-title>
          . Wiley-Blackwell (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Jarvis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bestgen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pepper</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Maximizing classification accuracy in native language identification</article-title>
          . pp.
          <fpage>111</fpage>
          –
          <lpage>118</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ilhan</surname>
            ,
            <given-names>H.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kamvar</surname>
            ,
            <given-names>S.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
          </string-name>
          , C.D.:
          <article-title>Combining heterogeneous classifiers for word-sense disambiguation</article-title>
          .
          <source>In: Proceedings of the ACL-02 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions - Volume 8</source>
          . pp.
          <fpage>74</fpage>
          –
          <lpage>80</lpage>
          . WSD '
          <volume>02</volume>
          ,
          <string-name>
            <surname>Stroudsburg</surname>
          </string-name>
          , PA, USA (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ganesh</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , P,
          <string-name>
            <surname>S.K.</surname>
          </string-name>
          :
          <article-title>Overview of the INLI@FIRE-2018 Track on Indian Native Language Identification</article-title>
          .
          <source>In: "workshop proceedings of FIRE</source>
          <year>2018</year>
          ,
          <article-title>FIRE2018"</article-title>
          . Gandhinagar, India, December 6-9,
          <string-name>
            <given-names>CEUR</given-names>
            <surname>Workshop Proceedings</surname>
          </string-name>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ganesh</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shivkaran</surname>
            , P,
            <given-names>S.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Overview of the INLI PAN at FIRE-2017 Track on Indian Native Language Identification</article-title>
          .
          <source>In: "Notebook Papers of FIRE</source>
          <year>2017</year>
          ,
          <article-title>FIRE-2017"</article-title>
          . Bangalore, India, December 8-10, CEUR Workshop Proceedings (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Min</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>BosonNLP: "An Ensemble Approach for Word Segmentation and POS Tagging"</article-title>
          .
          <source>In: Natural Language Processing and Chinese Computing</source>
          . pp.
          <fpage>520</fpage>
          –
          <lpage>526</lpage>
          . Springer International Publishing (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Nayel</surname>
            ,
            <given-names>H.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shashirekha</surname>
            ,
            <given-names>H.L.</given-names>
          </string-name>
          :
          <article-title>Improving NER for Clinical Texts by Ensemble Approach using Segment Representations</article-title>
          .
          <source>In: "Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)"</source>
          . pp.
          <fpage>197</fpage>
          –
          <lpage>204</lpage>
          . NLP Association of India, Kolkata,
          <source>India (December</source>
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Nayel</surname>
            ,
            <given-names>H.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shashirekha</surname>
            ,
            <given-names>H.L.</given-names>
          </string-name>
          :
          <string-name>
            <surname>Mangalore-University@</surname>
          </string-name>
          INLI-FIRE-
          <year>2017</year>
          :
          <article-title>Indian Native Language Identification using Support Vector Machines and Ensemble approach</article-title>
          .
          <source>In: Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation</source>
          , Bangalore, India, December 8-
          <issue>10</issue>
          ,
          <year>2017</year>
          . pp.
          <fpage>106</fpage>
          –
          <lpage>109</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Nayel</surname>
            ,
            <given-names>H.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shashirekha</surname>
            ,
            <given-names>H.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shindo</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matsumoto</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Improving Multi-Word Entity Recognition for Biomedical Texts</article-title>
          .
          <source>International Journal of Pure and Applied Mathematics</source>
          <volume>118</volume>
          (
          <issue>16</issue>
          ),
          <fpage>301</fpage>
          –
          <lpage>3019</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Ortega</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Understanding Second Language Acquisition</article-title>
          .
          <source>Hodder Education</source>
          , Oxford (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Rozovskaya</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Algorithm Selection and Model Adaptation for ESL Correction Tasks</article-title>
          .
          <source>In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies</source>
          . pp.
          <fpage>924</fpage>
          –
          <lpage>933</lpage>
          .
          <string-name>
            <surname>Portland</surname>
          </string-name>
          , Oregon, USA (
          <year>June 2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Tetreault</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blanchard</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cahill</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chodorow</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Native Tongues, Lost and Found: Resources and Empirical Evaluations in Native Language Identification</article-title>
          .
          <source>In: "Proceedings of COLING 2012"</source>
          . pp.
          <fpage>2585</fpage>
          –
          <lpage>2602</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>