=Paper=
{{Paper
|id=Vol-2266/T2-10
|storemode=property
|title=Mangalore University INLI@FIRE2018: Artificial Neural Network and Ensemble based Models for INLI
|pdfUrl=https://ceur-ws.org/Vol-2266/T2-10.pdf
|volume=Vol-2266
|authors=Hamada A. Nayel,H. L. Shashirekha
|dblpUrl=https://dblp.org/rec/conf/fire/NayelS18
}}
==Mangalore University INLI@FIRE2018: Artificial Neural Network and Ensemble based Models for INLI==
Hamada A. Nayel (Department of Computer Science, Faculty of Computers and Informatics, Benha University, Benha, Egypt; hamada.ali@fci.bu.edu.eg)
H. L. Shashirekha (Department of Computer Science, Mangalore University, Mangalore, India; hlsrekha@gmail.com)

Abstract. This paper describes the systems submitted by the Mangalore University team for the Indian Native Language Identification (INLI) task. Native Language Identification (NLI) has applications in areas such as social media analysis, authorship identification, second language acquisition and forensic investigation. We submitted three systems based on an Artificial Neural Network (ANN) model and an ensemble approach. All three submitted systems achieved the same accuracy of 35.30% and secured the second rank over all submissions for the task.

Keywords: Artificial Neural Network · Ensemble Learning · Native Language Identification

1 Introduction

Native Language Identification (NLI) aims at identifying the native language (L1) of a user from text or speech produced in a second, later-learned language (L2). NLI is an important task with applications in areas such as social media analysis, authorship identification, second language acquisition and forensic investigation. In forensic analysis [5], NLI helps to glean information about the discriminative L1 cues in an anonymous text. Second Language Acquisition (SLA) [14] studies the transfer effects of the native language on a later-learned language. In academics, automatic correction of grammatical errors is an important application of NLI [15]. NLI can also be used as a feature in authorship identification [4], which aims at assigning a text to one of a predefined list of authors. Authorship identification is in turn used in the investigation of terrorist communications [1] and digital crimes [3].

2 Task Description

Social media now pervades everyday life, and most people communicate and discuss topics on platforms such as Google+, Facebook and Twitter. When communicating with each other, Indians often prefer English because their native languages differ. In addition, most software and keyboards do not support input of Indian-language characters, so people use a standard English keyboard and write their words as transliterations. The task [8] aims at identifying the native language of the writer of a given Facebook comment written in English. Six Indian languages are considered in this shared task: Tamil, Hindi, Kannada, Malayalam, Bengali and Telugu.

2.1 Related Work

Many researchers have explored NLI for various applications. Jarvis et al. [6] used SVM to create a model for NLI and reported an accuracy of 83.6%. N-grams, PoS tags and lemmas were used to create the feature-space model for training the classifier. They tested the performance of their system on the TOEFL11 dataset [2], a collection of essays written by learners from the following native-language backgrounds: Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu and Turkish. In this work, the feature set was not sufficient to cover the characteristics of the different languages.

Tetreault et al. [16] used an ensemble approach to improve on the performance of base classifiers. A wide range of features was used to build an ensemble of logistic regression learners, including word and character n-grams, PoS tags, function words, writing-quality markers and spelling errors. In addition, syntactic features such as Tree Substitution Grammars and dependency features extracted with the Stanford parser (http://nlp.stanford.edu:8080/parser/) were used. Evaluated on the TOEFL11 and International Corpus of Learner English (ICLE) datasets, the system achieved state-of-the-art accuracies of 90.1% and 80.9% respectively.

Nayel and Shashirekha [9, 12] used SVM and an ensemble approach for the first version of INLI and achieved accuracies of 47.60% and 47.30% respectively.

3 Approaches

3.1 Artificial Neural Networks

Artificial Neural Networks (ANNs) are inspired by the mechanism of brain computation, which consists of computational units called neurons, although the connection between ANNs and the brain is in fact rather slim. In the metaphor, a neuron has scalar inputs with associated weights and an output: it multiplies each input by its weight, sums the results and transforms the sum into an output by applying a non-linear function called the activation function. Table 1 shows examples of activation functions. The structure of the biological neuron and an example of an artificial neuron model with n inputs and one output are shown in Figures 1(a) and 1(b) respectively. In this example, a neuron receives simultaneous inputs X = (x1, x2, ..., xn) associated with weights W = (w1, w2, ..., wn) and a bias b, and calculates the output as:

    y = f(W · X + b)    (1)

where f is the activation function.

Table 1. Examples of activation functions
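To make Equation (1) concrete, here is a minimal sketch of a single neuron forward pass with a logistic activation. NumPy is our choice of tooling, and the input values are arbitrary illustrations, not data from the paper:

```python
import numpy as np

def logistic(z):
    # Logistic (sigmoid) activation: squashes z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, b):
    # Equation (1): y = f(W . X + b)
    return logistic(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs x1..xn (illustrative values)
w = np.array([0.4, 0.1, -0.7])   # weights w1..wn
b = 0.2                          # bias
print(neuron_output(x, w, b))
```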
An ANN comprises a large number of neurons arranged in layers. A basic ANN model consists of three kinds of layer: an input layer, a number of hidden layers and an output layer. The input layer contains a set of neurons called input nodes, which receive the raw inputs directly. The hidden layers receive the data from the input nodes and process them by computing, at each layer, weights on the connections between neurons. These connection weights are passed from one node to another, and the number of nodes in the hidden layers determines the number of connections. During the training phase, the connection weights are adjusted so that the network can predict the correct class label of the input. The output layer receives the processed data and applies its activation function to generate the final output. An ANN in which information flows in one direction in this way is called a feed-forward ANN; Figure 2 shows an example of a feed-forward ANN with two hidden layers. An ANN is called fully connected if each node in a layer is connected to all nodes in the subsequent layer.

Fig. 1. (a) The structure of the biological neuron; (b) a simple artificial neuron example

3.2 Ensemble Approach

Most classification tasks use a single classifier. However, on a given dataset one classifier may give good results while another performs poorly, and there is no generic rule for choosing a classifier for a particular application and dataset. So, instead of trying single classifiers one by one in search of good results, it can be beneficial to pool several classifiers and take a collective decision, much as a committee decides rather than an individual. This technique, which compensates for the weaknesses of some classifiers with the strengths of others, is termed an "ensemble". The ensemble approach has been applied to tasks such as biomedical named entity recognition [11, 13], word segmentation [10] and word sense disambiguation [7].

Fig. 2. A simple feed-forward ANN structure

INLI considers a set of Indian languages, namely Kannada (KA), Tamil (TA), Hindi (HI), Telugu (TE), Bengali (BE) and Malayalam (MA). Treating the languages as a set of classes L = {KA, TA, HI, TE, BE, MA} and the comments as individual instances, the task of identifying the native language can be cast as a classification problem that assigns one of the predefined languages in L to a new, unlabelled comment. The general framework of our system is as described in [12]. A vector space model with Term Frequency/Inverse Document Frequency (TF/IDF) weighting is used to represent the comments.

An ANN-based classifier was designed for the first and second submissions. The hidden layer of the first submission contains 70 neurons with the logistic activation function; the hidden layer of the second submission contains 80 neurons with the identity activation function.
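The paper does not name an implementation toolkit. The following sketch shows how the first two submissions could plausibly be assembled, assuming scikit-learn and hypothetical `train_comments`/`train_labels`/`test_comments` variables holding the task data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Submission 1: one hidden layer of 70 neurons, logistic activation
sub1 = make_pipeline(
    TfidfVectorizer(),
    MLPClassifier(hidden_layer_sizes=(70,), activation="logistic", max_iter=500),
)

# Submission 2: one hidden layer of 80 neurons, identity activation
sub2 = make_pipeline(
    TfidfVectorizer(),
    MLPClassifier(hidden_layer_sizes=(80,), activation="identity", max_iter=500),
)

# train_comments: list of Facebook comments (hypothetical variable names);
# train_labels: native-language labels in {KA, TA, HI, TE, BE, MA}
sub1.fit(train_comments, train_labels)
predictions = sub1.predict(test_comments)
```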
An ensemble approach using the majority-voting technique was used for the third submission. Four ANN-based models with different parameters (shown in Table 2) were used as base classifiers to build the ensemble classifier.

Table 2. Parameters of different base models

  Model   Neurons in hidden layer   Activation function
  1       70                        Logistic
  2       80                        Logistic
  3       80                        Tanh
  4       80                        Identity
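A hedged sketch of the third submission follows, again assuming scikit-learn: `VotingClassifier` with hard voting is our stand-in for the majority-voting step, and the four base models use the parameters of Table 2:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Four ANN base models with the parameters listed in Table 2
base_models = [
    ("m1", MLPClassifier(hidden_layer_sizes=(70,), activation="logistic", max_iter=500)),
    ("m2", MLPClassifier(hidden_layer_sizes=(80,), activation="logistic", max_iter=500)),
    ("m3", MLPClassifier(hidden_layer_sizes=(80,), activation="tanh", max_iter=500)),
    ("m4", MLPClassifier(hidden_layer_sizes=(80,), activation="identity", max_iter=500)),
]

# voting="hard" takes a majority vote over the base models' predicted labels
sub3 = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(estimators=base_models, voting="hard"),
)
sub3.fit(train_comments, train_labels)  # same hypothetical data as above
```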
4 Results and Discussion

Accuracy and class-wise Precision (P), Recall (R) and F-measure were used to evaluate the submitted systems [9]. A Cross-Validation (CV) technique was used while building the systems; Table 3 shows the 10-fold CV accuracy for the three systems.

Table 3. Cross-validation accuracies (%) for the three submitted systems

  Fold   Submission 1   Submission 2   Submission 3
  1      89.68          90.48          89.68
  2      85.60          84.80          85.60
  3      87.10          87.90          87.90
  4      91.87          90.24          91.06
  5      91.87          92.68          92.68
  6      84.55          82.93          82.93
  7      88.62          89.43          89.43
  8      90.16          90.16          90.98
  9      86.88          85.25          86.07
  10     88.52          86.88          86.88
  Mean   88.49          88.08          88.32

The performance of the first, second and third submissions is shown in Tables 4, 5 and 6 respectively. Each submitted system has an accuracy of 35.30%, and all of them rank second among all the submissions. In all three submissions, the lowest performance was reported for Hindi and the best for Bengali. Most native speakers of other Indian languages also have knowledge of Hindi, which influences their English comments and makes Hindi harder to discriminate.

Table 4. Performance evaluation of the first system

                 Confusion matrix             Class-wise results
  Class   BE   HI   KA    MA   TA   TE      P      R      F-measure
  BE      79   24   17    28   43   16    47.00  38.20  42.10
  HI      19   14   12    42   24   27    13.90  10.10  11.70
  KA      16   20   106   26   47   35    37.20  42.40  39.60
  MA      19   19   36    87   26   13    36.60  43.50  39.70
  TA      10   12   31    24   61    2    26.60  43.60  33.10
  TE      25   12   83    31   28   71    43.30  28.40  34.30
  Overall accuracy: 35.30%

Table 5. Performance evaluation of the second system

                 Confusion matrix             Class-wise results
  Class   BE   HI   KA    MA   TA   TE      P      R      F-measure
  BE      80   20   18    29   43   17    47.60  38.60  42.70
  HI      19   12   11    44   24   28    12.60   8.70  10.30
  KA      13   21   112   28   41   35    38.10  44.80  41.20
  MA      23   18   36    86   23   14    36.60  43.00  39.50
  TA       8   15   30    23   57    7    25.80  40.70  31.60
  TE      25    9   87    25   33   71    41.30  28.40  33.60
  Overall accuracy: 35.30%

Table 6. Performance evaluation of the third system

                 Confusion matrix             Class-wise results
  Class   BE   HI   KA    MA   TA   TE      P      R      F-measure
  BE      79   24   17    28   43   16    47.00  38.20  42.10
  HI      19   14   12    42   24   27    13.90  10.10  11.70
  KA      16   20   106   26   47   35    37.20  42.40  39.60
  MA      19   19   36    87   26   13    36.60  43.50  39.70
  TA      10   12   31    24   61    2    26.60  43.60  33.10
  TE      25   12   83    31   28   71    43.30  28.40  34.30
  Overall accuracy: 35.30%
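For completeness, a sketch of how numbers of this kind could be reproduced under the same scikit-learn assumption (`test_comments`/`test_labels` are hypothetical held-out data; the paper does not publish its evaluation script):

```python
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score

# 10-fold cross-validation accuracy on the training data, as in Table 3
scores = cross_val_score(sub1, train_comments, train_labels, cv=10)
print("Mean 10-fold CV accuracy: %.2f%%" % (100 * scores.mean()))

# Confusion matrix and class-wise P/R/F on the test data, as in Tables 4-6
sub1.fit(train_comments, train_labels)
predictions = sub1.predict(test_comments)
print(confusion_matrix(test_labels, predictions))
print(classification_report(test_labels, predictions))
```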
5 Conclusion

In this work, ANN and ensemble based classifiers were used to design systems for INLI 2018. All the designed classifiers reported the same accuracy and achieved the second rank over all submissions for the task. This work can be improved by using different ANN structures and deep learning models; in addition, improving the input representation should improve the performance of the systems.

References

1. Abbasi, A., Chen, H.: Applying Authorship Analysis to Extremist-Group Web Forum Messages. IEEE Intelligent Systems 20(5), 67–75 (Sep 2005)
2. Blanchard, D., Tetreault, J., Higgins, D., Cahill, A., Chodorow, M.: TOEFL11: A corpus of non-native English. ETS Research Report Series 2013(2) (2013)
3. Chaski, C.E.: Who's at the keyboard? Authorship attribution in digital evidence investigations. International Journal of Digital Evidence 4(1), 1–13 (2005)
4. Estival, D., Gaustad, T., Pham, S.B., Radford, W., Hutchinson, B.: Author profiling for English emails. In: Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics. pp. 263–272 (2007)
5. Gibbons, J.: Forensic Linguistics: An Introduction to Language in the Justice System. Wiley-Blackwell (2003)
6. Jarvis, S., Bestgen, Y., Pepper, S.: Maximizing classification accuracy in native language identification. pp. 111–118 (2013)
7. Klein, D., Toutanova, K., Ilhan, H.T., Kamvar, S.D., Manning, C.D.: Combining heterogeneous classifiers for word-sense disambiguation. In: Proceedings of the ACL-02 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions. pp. 74–80. WSD '02, Stroudsburg, PA, USA (2002)
8. Kumar, A., Ganesh, B., Soman, K.P.: Overview of the INLI@FIRE-2018 Track on Indian Native Language Identification. In: Workshop Proceedings of FIRE 2018, Gandhinagar, India, December 6-9, CEUR Workshop Proceedings (2018)
9. Kumar, A., Ganesh, B., Shivkaran, S., Soman, K.P., Rosso, P.: Overview of the INLI PAN at FIRE-2017 Track on Indian Native Language Identification. In: Notebook Papers of FIRE 2017, Bangalore, India, December 8-10, CEUR Workshop Proceedings (2017)
10. Min, K., Ma, C., Zhao, T., Li, H.: BosonNLP: An Ensemble Approach for Word Segmentation and POS Tagging. In: Natural Language Processing and Chinese Computing. pp. 520–526. Springer International Publishing (2015)
11. Nayel, H.A., Shashirekha, H.L.: Improving NER for Clinical Texts by Ensemble Approach using Segment Representations. In: Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017). pp. 197–204. NLP Association of India, Kolkata, India (December 2017)
12. Nayel, H.A., Shashirekha, H.L.: Mangalore-University@INLI-FIRE-2017: Indian Native Language Identification using Support Vector Machines and Ensemble Approach. In: Working Notes of FIRE 2017 - Forum for Information Retrieval Evaluation, Bangalore, India, December 8-10, 2017. pp. 106–109 (2017)
13. Nayel, H.A., Shashirekha, H.L., Shindo, H., Matsumoto, Y.: Improving Multi-Word Entity Recognition for Biomedical Texts. International Journal of Pure and Applied Mathematics 118(16), 301–3019 (2017)
14. Ortega, L.: Understanding Second Language Acquisition. Hodder Education, Oxford (2009)
15. Rozovskaya, A., Roth, D.: Algorithm Selection and Model Adaptation for ESL Correction Tasks. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. pp. 924–933. Portland, Oregon, USA (June 2011)
16. Tetreault, J., Blanchard, D., Cahill, A., Chodorow, M.: Native Tongues, Lost and Found: Resources and Empirical Evaluations in Native Language Identification. In: Proceedings of COLING 2012. pp. 2585–2602 (2012)