<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the INLI PAN at FIRE-2017 Track on Indian Native Language Identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anand Kumar M</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Barathi Ganesh HB</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shivkaran Singh</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Soman KP</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Rosso</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>PRHLT Research Center, Universitat Politècnica de València</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Center for Computational Engineering and Networking (CEN), Amrita School of Engineering</institution>
          ,
          <addr-line>Coimbatore, Amrita Vishwa Vidyapeetham</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This overview paper describes the first shared task on Indian Native Language Identification (INLI) that was organized at FIRE 2017. Given a corpus of English comments from the Facebook pages of various newspapers, the objective of the task is to identify the native language of the comment authors among the following six Indian languages: Bengali, Hindi, Kannada, Malayalam, Tamil, and Telugu. Altogether, 26 approaches of 13 different teams are evaluated. In this paper, we give an overview of the approaches and discuss the results that they have obtained.</p>
      </abstract>
      <kwd-group>
        <kwd>Author Profiling</kwd>
        <kwd>Indian Languages</kwd>
        <kwd>Native Language Identification</kwd>
        <kwd>Social Media</kwd>
        <kwd>Text Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Computing methodologies → Natural language
processing; Language resources; Feature selection;</p>
    </sec>
    <sec id="sec-2">
      <title>1 INTRODUCTION</title>
      <p>
        Native Language Identification (NLI) is a fascinating and rapidly
growing sub-field in Natural Language Processing. In the
framework of the author profiling shared tasks that have been organized
at PAN1, language variety identification was addressed in 2017 at
CLEF [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. NLI, instead, requires automatically identifying the native
language (L1) of an author on the basis of the way she writes in
another language (L2) that she has learned. Just as her accent may help in
identifying whether or not she is a native speaker of a language,
the way she uses the language when she writes
may unveil patterns that can help in identifying her native language
[
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. From a cybersecurity viewpoint, NLI can help to determine
the native language of the author of a suspicious or threatening text.
      </p>
      <p>
        The native language influences the usage of words as well as the
errors that a person makes when writing in another language [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
NLI systems can identify the writing patterns that are based on
the author’s linguistic background. NLI has many applications and
studying the language transfer from a forensic linguistics viewpoint
is certainly one of the most important. The first shared task on
native language identification was organized in 2013 [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. The
organizers made available a large text corpus for this task. Other
works also approach the problem of native language identification using
speech transcripts [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. In the Indian languages context, this
is the first NLI shared task. In India there are currently 22 official
languages, with English as an additional official language. In this
shared task, we focus on identifying the native language of Indian
authors writing comments in English. We considered six languages,
namely, Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu for
the shared task.
      </p>
      <p>Since comments over the internet are usually written in social
media, the corpora used for the shared task was acquired from
Facebook. English comments from Facebook pages of famous regional
language newspapers were crawled. These comments were further
preprocessed in order to remove code-mixed and mixed scripts
comments from the corpus. In the following sections we present some
related work (Section 2), we describe the corpus collection (Section
3), we give an overview of the submitted approaches (Section 4),
and we show the results that were obtained (Section 5). Finally,
in Section 6 we draw some conclusions.</p>
    </sec>
    <sec id="sec-3">
      <title>2 RELATED WORK</title>
      <p>
        As said in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], one of the earliest works on identifying native
language was by Tomokiyo and Jones (2001) [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] where the authors
used Naive Bayes to discriminate non-native from native
statements in English. Koppel et al. (2005) [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] approached the problem
by using stylistic, syntactic and lexical features. They also noticed
that the use of character n-grams, part-of-speech bi-grams and
function words allowed them to obtain better results. Tsur and Rappoport
(2007) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] achieved an accuracy of about 66% by using only
character bi-grams. They assumed that the native language phonology
influences the choice of words while writing in a second language.
      </p>
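      <p>To make the character bi-gram feature concrete, here is a minimal, purely illustrative sketch of bi-gram extraction in the spirit of Tsur and Rappoport (the example string is invented):</p>

```python
from collections import Counter

def char_bigrams(text):
    """Count character bi-grams of a lowercased comment.

    Counts like these, aggregated over a document, are the only features
    Tsur and Rappoport needed to reach about 66% accuracy.
    """
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))

counts = char_bigrams("the then")
print(counts["th"])  # "th" occurs twice
```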
      <p>
        Estival et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] used English emails of authors with different
native languages. They achieved an accuracy of 84% using a
Random Forest classifier with character, lexical, and structural features.
Wong and Dras [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] pointed out that mistakes made by authors
writing in a second language are influenced by their native language.
They proposed the use of syntactic features such as subject-verb
disagreement, noun-number disagreement, and improper use of
determiners to help in determining the native language of a writer.
In their later work [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], they also investigated the usefulness of
parse structures for identifying the native language. Brooke and
Hirst [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] used word-to-word translation from L1 to L2 to create
mappings which are the result of language transfer. They used this
information in their unsupervised approach.
      </p>
      <p>
        Torney et. al [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] used psycho-linguistic features for NLI.
Syntactic features were also shown to play a significant role in determining the
native language. Other interesting studies in the NLI field are [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In 2013 a shared task was organized on NLI [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. The
organizers provided a large corpus which allowed comparison among
different approaches. In 2014 a related shared task was organized
on Discriminating between Similar Languages (DSL2) [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. The
organizers provided six groups of 13 different languages, with each
group containing similar languages. In 2017 another shared task on
NLI was organized. The corpus was composed of essays and
transcripts of utterances. Ensemble methods and meta-classifiers
with syntactic/lexical features were the most effective systems [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <!-- Table 1 and Table 2 (corpus statistics: language, #XML docs, #sentences, #words, #unique words) are not recoverable here. -->
    </sec>
    <sec id="sec-4">
      <title>3 INLI-2017 CORPUS</title>
      <p>
        Many corpora have been created from social media (Facebook,
Twitter and WhatsApp) for performing language modeling [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
information retrieval tasks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and code-mixed sentiment analysis
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. A monolingual corpus based on the TOEFL3 data is available
for performing the NLI task for Indian languages such as Hindi
and Telugu [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The INLI-2017 corpus includes English comments
of Facebook users, whose native language is one among the
following: Bengali (BE), Hindi (HI), Kannada (KA), Malayalam (MA),
Tamil (TA) and Telugu (TE). The dataset collection is based on the
assumption that only native speakers will read native language
newspapers. To the best of our knowledge, this is the first corpus
for native language identification for Indian languages. The detailed
corpus statistics are given in Table 1 and Table 2.
      </p>
      <p>The texts for this corpus have been collected from the users
comments in the regional newspapers and news channel Facebook
pages. Around 50 Facebook pages were selected and comments
written in English were extracted from these pages. The training
data were collected in the period from April 2017 to July 2017.
The test data were collected later on. It was expected that
participants would focus on native language-based stylistic features. As
a result, we removed code-mixed comments and comments related
to regional topics (regional leaders and comments mentioning
the names of regional places). Comments with common keywords
discussed across the regions were kept in order to avoid topic bias.
These common keywords included Modi, note-ban, different
sports personalities, the army, national issues, government policies, etc.
Finally, the collected dataset was randomized and written to XML
files to avoid user bias.</p>
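      <p>The removal of mixed-script comments can be illustrated with a short, hypothetical filter that keeps only comments whose alphabetic characters all belong to the Latin script (romanized code-mixing requires additional word-level language identification and is not handled by this sketch):</p>

```python
import unicodedata

def latin_script_only(comment):
    """Return True if every alphabetic character belongs to the Latin
    script, i.e. the comment contains no Devanagari, Tamil, Telugu,
    etc. characters."""
    return all(
        unicodedata.name(ch, "").startswith("LATIN")
        for ch in comment
        if ch.isalpha()
    )

# Invented examples: the second mixes Telugu script into an English comment.
comments = ["good move by the government", "this is చాలా బాగుంది"]
english_only = [c for c in comments if latin_script_only(c)]
print(english_only)
```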
      <p>From Table 1 and Table 2, it can be observed that except for
BE and MA, the remaining languages have nearly the same ratio
of average words per sentence. It is also visible that the test data
was properly normalized so that the average number of words per
sentence and the average number of unique words per sentence are
comparable to the training data. The variance
between the average of words per sentence and the average of unique
words per sentence for the training and the test data is shown
in Figure 1 and Figure 2, respectively. This corpus will be made
available after the FIRE 2017 conference on the web page of our
NLP group website4.</p>
    </sec>
    <sec id="sec-5">
      <title>4 OVERVIEW OF THE SUBMITTED APPROACHES</title>
      <p>Initially, 56 teams registered for the INLI shared task at FIRE, and
finally 13 of them submitted a total of 26 runs. Moreover, 8 of them
submitted their system description working notes5. We analysed
their approaches from three perspectives: preprocessing, features
to represent the author’s texts, and classification approaches.</p>
    </sec>
    <sec id="sec-7">
      <title>4.1 Preprocessing</title>
      <p>
        Most of the participants did not perform any preprocessing [
        <xref ref-type="bibr" rid="ref13 ref18 ref2 ref26 ref7">2, 7, 13,
18, 26</xref>
        ]. Others normalised the text by removing emoji, special
characters, digits, hashtags, mentions and links [
        <xref ref-type="bibr" rid="ref1 ref12 ref22">1, 12, 22</xref>
        ]. Stop words
were removed using the NLTK stop words package6, other resources7
and a manually collected stop word list [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Whitespace-based
tokenization was carried out by all participants except [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The
participants in [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] handled shortened words (terms such as n’t, &amp;,
’m and ’ll were replaced with ’not’, ’and’, ’am’ and ’will’, respectively).
      </p>
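      <p>The reported preprocessing steps can be combined into one small, hypothetical normaliser (the regular expressions and the replacement table below are illustrative assumptions, not taken from any participant's system):</p>

```python
import re

# Shortened forms handled in [22]; this particular mapping is an assumption.
SHORT_FORMS = {"n't": " not", "'ll": " will", "'m": " am", "&": " and "}

def normalise(comment):
    """Expand short forms, strip links/mentions/hashtags, digits and
    special characters, then tokenize on whitespace."""
    for short, full in SHORT_FORMS.items():
        comment = comment.replace(short, full)
    comment = re.sub(r"https?://\S+|[@#]\w+", " ", comment)  # links, mentions, hashtags
    comment = re.sub(r"[^A-Za-z\s]", " ", comment)           # digits, emoji, punctuation
    return comment.lower().split()

print(normalise("I'll don't agree @user #tag http://x.co 123"))
```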
    </sec>
    <sec id="sec-8">
      <title>4.2 Features</title>
      <p>
        Two of the participants directly used Term Frequency-Inverse
Document Frequency (TF-IDF) weights as their features [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ],
non-English words and noun chunks were taken as features while
computing TF-IDF [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], and character n-grams of order 2-5 and word
n-grams of order 1-2 were used as features while computing
the TF-IDF vocabulary [
        <xref ref-type="bibr" rid="ref12 ref13 ref7">7, 12, 13</xref>
        ]. Only the non-English word counts
have been taken as features in [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. Nouns and adjectives
have been taken as features in [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. Part-of-speech n-grams and average
word and sentence length have been used as features in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
Distributional representations of words (pre-trained word vectors)
have been used in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
4http://nlp.amrita.edu:8080/nlpcorpus.html
5The ClassyPy team did not submit any working notes, although a brief description of the approach was sent by email.
6http://www.nltk.org/book/ch02.html
7pypi.python.org/pypi/stop-words
      </p>
    </sec>
    <sec id="sec-9">
      <title>4.3 Classification Approaches</title>
      <p>
        A Support Vector Machine (SVM) has been used as a classifier by most
of the participants [
        <xref ref-type="bibr" rid="ref1 ref12 ref13 ref2 ref7">1, 2, 7, 12, 13</xref>
        ]. Two of the participants followed
an ensemble-based classification, with Multinomial Naive Bayes, SVM and
Random Forest as the base classifiers in [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] and Logistic
Regression, SVM, Ridge Classifier and Multi-Layer Perceptron (MLP)
as the base classifiers in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Other than this, the authors in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] used
Logistic Regression, the authors in [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] used Naive Bayes, the authors
in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] used a hierarchical attention architecture with bidirectional
Gated Recurrent Unit (GRU) cells, and the authors in [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] employed a
neural network classifier with 2 hidden layers, Rectified Linear Units
(ReLU) as the activation function and Stochastic Gradient Descent
(SGD) as the optimizer.
      </p>
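      <p>The recipe shared by most submissions, character 2-5 and word 1-2 n-gram TF-IDF features fed to a linear SVM, can be sketched as follows (the four comments and their labels are invented toy data; the actual systems were trained on the INLI-2017 corpus):</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.svm import LinearSVC

# Toy training data: one invented comment per native language.
comments = [
    "do the needful and revert",
    "kindly adjust maadi",
    "super machan vera level",
    "chalo theek hai bhai",
]
labels = ["TE", "KA", "TA", "HI"]  # hypothetical L1 labels

# Character and word n-gram TF-IDF features, concatenated.
features = FeatureUnion([
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))),
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
])
model = make_pipeline(features, LinearSVC()).fit(comments, labels)
print(model.predict(["kindly adjust maadi"]))
```

Several of the top systems wrapped a pipeline of this kind in an ensemble, e.g. with Naive Bayes and Random Forest as additional base classifiers.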
    </sec>
    <sec id="sec-10">
        <title>5 EXPERIMENTS AND RESULTS</title>
      <!-- The per-language ranking tables referenced below (Tables 5-8) are not recoverable here. -->
      <sec id="sec-10-8">
        <p>The maximum F-measure scored is 50.3%, which is 2.3% greater than the baseline. The lowest F-measure
scored for this language is 15.4% and this is 32.6% less than the
baseline.</p>
        <p>The ranking of the systems submitted for Malayalam (MA) is
given in Table 6. The maximum F-measure scored for this language
is 51.9%, which is 0.9% greater than the baseline. Among all
the other languages, this is the lowest variation with respect to the
baseline. The lowest F-measure scored for this language is 1.8% and
this is 49.2% less than the baseline.</p>
        <p>The ranking of the submitted systems for Tamil (TA) is given in
Table 7. The maximum F-measure scored for this language is 58.0%,
which is 12.0% greater than the baseline. The lowest F-measure
scored for this language is 13.2% and this is 32.8% less than the
baseline.</p>
        <p>The ranking of the systems submitted for Telugu (TE) is given
in Table 8. The maximum F-measure scored for this language is
50.5%, which is 8.5% greater than the baseline system. The lowest
F-measure scored for this language is 2.4% and this is 39.6% less
than baseline.</p>
        <p>The rank of the results per language is given in Table 9. The team_CEC
system did not identify any language apart from Hindi. The overall
ranking of the submitted systems is given in Table 10. The maximum
accuracy scored is 48.8%, which is 5.3% greater than the baseline.</p>
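        <p>The per-language F-measures reported in the tables follow the usual definition; a minimal sketch with invented gold and predicted labels (not the official evaluation script):</p>

```python
def f_measure(gold, pred, language):
    """Harmonic mean of precision and recall for one language."""
    tp = sum(g == p == language for g, p in zip(gold, pred))
    fp = sum(p == language and g != language for g, p in zip(gold, pred))
    fn = sum(g == language and p != language for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = ["HI", "TA", "HI", "MA"]
pred = ["HI", "HI", "HI", "MA"]
print(f_measure(gold, pred, "HI"))  # precision 2/3, recall 1.0, F-measure 0.8
```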
        <p>The lowest accuracy scored is 17.8% and this is 25.2% less than the
baseline.</p>
        <!-- Tables 9 and 10 (per-language and overall rankings) are not recoverable here. -->
      </sec>
    </sec>
    <sec id="sec-11">
      <title>6 CONCLUSION</title>
      <p>In this paper we presented the INLI-2017 corpus, we briefly described
the approaches of the 13 teams that participated in the Indian
Native Language Identification task at FIRE 2017, and the results that
they obtained. The participants had to identify the native language
of the authors of English comments collected from the Facebook pages
of various newspapers and television channels. Six native
languages have been addressed: Bengali, Hindi,
Kannada, Malayalam, Tamil and Telugu. Code-mixed comments and
comments related to regional topics were removed from the
corpus, and comments with common keywords discussed across
the regions were kept in order to avoid possible topic biases.</p>
      <p>The participants used different feature sets to address the
problem: content-based (among others: bag of words, character n-grams,
word n-grams, term vectors, word embeddings, non-English words)
and style-based (among others: word frequencies, POS n-grams,
noun and adjective POS tag counts). From the field of deep learning,
two-layer neural networks with document vectors built from TF-IDF
and Recurrent Neural Networks (RNN) with word embeddings
were used. However, the deep learning approaches
obtained lower accuracy than the baseline.</p>
      <p>Overall, the best performing system obtained an accuracy of
48.8%, which is 5.8% greater than the baseline. Overall, four of the
systems performed better than the baseline. These systems
used the following features: character and word n-grams,
non-English words, and noun chunks. It is notable that all these systems
used TF-IDF for representing the features. The lowest overall
accuracy was 17.8%, which is 25.2% less than the baseline. Among
the top performing systems, two of them used an ensemble method
and all of them employed SVM. As future work, we believe
that native language identification should also be addressed taking into
account socio-linguistic features to improve further.</p>
    </sec>
    <sec id="sec-12">
      <title>ACKNOWLEDGEMENT</title>
      <p>Our special thanks go to F. Rangel, all of INLI’s participants, and the
students of the Computational Engineering and Networking
Department for their efforts and time in developing the INLI-2017 corpus. The
work of the last author was carried out in the framework of the SomEMBED
TIN2015-71147-C2-1-P MINECO research project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Hamada A.</given-names>
            <surname>Nayel</surname>
          </string-name>
          and
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Shashirekha</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Indian Native Language Identification using Support Vector Machines and Ensemble Approach.</article-title>
          .
          <source>In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation</source>
          , Bangalore, India,
          <fpage>8th</fpage>
          -
          <lpage>10th</lpage>
          December.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Bharathi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Anirudh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Bhuvana</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>SVM based approach for Indian native language identification</article-title>
          ..
          <source>In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation</source>
          , Bangalore, India,
          <fpage>8th</fpage>
          -
          <lpage>10th</lpage>
          December.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Rupal</given-names>
            <surname>Bhargava</surname>
          </string-name>
          , Jaspreet Singh,
          <string-name>
            <given-names>Shivangi</given-names>
            <surname>Arora</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yashvardhan</given-names>
            <surname>Sharma</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Indian Native Language Identification using Deep Learning.</article-title>
          .
          <source>In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation</source>
          , Bangalore, India,
          <fpage>8th</fpage>
          -
          <lpage>10th</lpage>
          December.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Julian</given-names>
            <surname>Brooke</surname>
          </string-name>
          and
          <string-name>
            <given-names>Graeme</given-names>
            <surname>Hirst</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Measuring Interlanguage: Native Language Identification with L1-influence Metrics.</article-title>
          .
          <source>In LREC</source>
          .
          <fpage>779</fpage>
          -
          <lpage>784</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Serhiy</given-names>
            <surname>Bykh</surname>
          </string-name>
          and
          <string-name>
            <given-names>Detmar</given-names>
            <surname>Meurers</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Exploring Syntactic Features for Native Language Identification: A Variationist Perspective on Feature Encoding and Ensemble Optimization.</article-title>
          .
          <source>In COLING</source>
          .
          <fpage>1962</fpage>
          -
          <lpage>1973</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Kunal</given-names>
            <surname>Chakma and Amitava Das</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Cmir: A corpus for evaluation of code mixed information retrieval of hindi-english tweets</article-title>
          .
          <source>Computación y Sistemas</source>
          <volume>20</volume>
          ,
          <issue>3</issue>
          (
          <year>2016</year>
          ),
          <fpage>425</fpage>
          -
          <lpage>434</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>[7] Christel and Mike</source>
          .
          <year>2016</year>
          .
          <article-title>Participation at the Indian Native language Identification task</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Dominique</given-names>
            <surname>Estival</surname>
          </string-name>
          , Tanja Gaustad, Son Bao Pham, Will Radford, and
          <string-name>
            <given-names>Ben</given-names>
            <surname>Hutchinson</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Author profiling for English emails</article-title>
          .
          <source>In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics</source>
          .
          <fpage>263</fpage>
          -
          <lpage>272</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Anupam</given-names>
            <surname>Jamatia</surname>
          </string-name>
          , Björn Gambäck, and
          <string-name>
            <surname>Amitava Das</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Collecting and Annotating Indian Social Media Code-Mixed Corpora</article-title>
          .
          <source>In the 17th International Conference on Intelligent Text Processing and Computational Linguistics</source>
          .
          <fpage>3</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Aditya</given-names>
            <surname>Joshi</surname>
          </string-name>
          , Ameya Prabhu, Manish Shrivastava, and
          <string-name>
            <given-names>Vasudeva</given-names>
            <surname>Varma</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Towards Sub-Word Level Compositions for Sentiment Analysis of Hindi-English Code Mixed Text.</article-title>
          .
          <source>In COLING</source>
          .
          <fpage>2482</fpage>
          -
          <lpage>2491</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Moshe</given-names>
            <surname>Koppel</surname>
          </string-name>
          , Jonathan Schler, and
          <string-name>
            <given-names>Kfir</given-names>
            <surname>Zigdon</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Automatically determining an anonymous author's native language</article-title>
          .
          <source>Intelligence and Security Informatics</source>
          (
          <year>2005</year>
          ),
          <fpage>41</fpage>
          -
          <lpage>76</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Dijana</given-names>
            <surname>Kosmajac</surname>
          </string-name>
          and
          <string-name>
            <given-names>Vlado</given-names>
            <surname>Keselj</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Native Language Identification using SVM with SGD Training</article-title>
          .
          <source>In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation</source>
          , Bangalore, India,
          8th-10th December.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Sowmya Lakshmi</surname>
            <given-names>B S</given-names>
          </string-name>
          and
          <string-name>
            <surname>Shambhavi</surname>
            <given-names>B R</given-names>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A simple n-gram based approach for Native Language Identification: FIRE NLI shared task 2017</article-title>
          .
          <source>In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation</source>
          , Bangalore
          , India,
          8th-10th December.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Shervin</given-names>
            <surname>Malmasi</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Native language identification: explorations and applications</article-title>
          . Sydney, Australia: Macquarie University (
          <year>2016</year>
          ). http://hdl.handle.net/1959.14/1110919
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Shervin</given-names>
            <surname>Malmasi</surname>
          </string-name>
          , Keelan Evanini, Aoife Cahill, Joel Tetreault, Robert Pugh, Christopher Hamill, Diane Napolitano, and
          <string-name>
            <given-names>Yao</given-names>
            <surname>Qian</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A Report on the 2017 Native Language Identification Shared Task</article-title>
          .
          <source>In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications</source>
          .
          <fpage>62</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Sergiu</given-names>
            <surname>Nisioi</surname>
          </string-name>
          , Ella Rabinovich, Liviu P Dinu, and
          <string-name>
            <given-names>Shuly</given-names>
            <surname>Wintner</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>A Corpus of Native, Non-native and Translated Texts</article-title>
          .
          <source>In LREC.</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Francisco</given-names>
            <surname>Rangel</surname>
          </string-name>
          , Paolo Rosso,
          <string-name>
            <given-names>Martin</given-names>
            <surname>Potthast</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Benno</given-names>
            <surname>Stein</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Overview of the 5th author profiling task at PAN 2017: Gender and language variety identification in Twitter</article-title>
          .
          <source>Working Notes Papers of the CLEF</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Venkatesh</given-names>
            <surname>Duppada</surname>
          </string-name>
          , Royal Jain,
          and
          <string-name>
            <given-names>Sushant</given-names>
            <surname>Hiray</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Hierarchical Ensemble for Indian Native Language Identification</article-title>
          .
          <source>In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation</source>
          , Bangalore, India,
          8th-10th December.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Bernard</given-names>
            <surname>Smith</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Learner English: A teacher's guide to interference and other problems</article-title>
          . Ernst Klett Sprachen.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Joel</given-names>
            <surname>Tetreault</surname>
          </string-name>
          , Daniel Blanchard, Aoife Cahill, and
          <string-name>
            <given-names>Martin</given-names>
            <surname>Chodorow</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Native tongues, lost and found: Resources and empirical evaluations in native language identification</article-title>
          .
          <source>Proceedings of COLING</source>
          <year>2012</year>
          (
          <year>2012</year>
          ),
          <fpage>2585</fpage>
          -
          <lpage>2602</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Joel R</given-names>
            <surname>Tetreault</surname>
          </string-name>
          , Daniel Blanchard, and
          <string-name>
            <given-names>Aoife</given-names>
            <surname>Cahill</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>A Report on the First Native Language Identification Shared Task</article-title>
          .
          <source>In BEA@ NAACL-HLT</source>
          .
          <fpage>48</fpage>
          -
          <lpage>57</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>D.</given-names>
            <surname>Thenmozhi</surname>
          </string-name>
          , Kawshik Kannan, and
          <string-name>
            <given-names>Chandrabose</given-names>
            <surname>Aravindan</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A Neural Network Approach to Indian Native Language Identification</article-title>
          .
          <source>In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation</source>
          , Bangalore, India,
          8th-10th December.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Laura</given-names>
            <surname>Mayfield</surname>
          </string-name>
          Tomokiyo and
          <string-name>
            <given-names>Rosie</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>You're not from 'round here, are you?: Naive Bayes detection of non-native utterance text</article-title>
          .
          <source>In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies. Association for Computational Linguistics</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Rosemary</given-names>
            <surname>Torney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Vamplew</surname>
          </string-name>
          , and
          <string-name>
            <given-names>John</given-names>
            <surname>Yearwood</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Using psycholinguistic features for profiling first language of authors</article-title>
          .
          <source>Journal of the Association for Information Science and Technology 63</source>
          ,
          <issue>6</issue>
          (
          <year>2012</year>
          ),
          <fpage>1256</fpage>
          -
          <lpage>1269</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Oren</given-names>
            <surname>Tsur</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ari</given-names>
            <surname>Rappoport</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Using classifier features for studying the effect of native language on the choice of written second language words</article-title>
          .
          <source>In Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition. Association for Computational Linguistics</source>
          ,
          <fpage>9</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Ajay P</given-names>
            <surname>Victor</surname>
          </string-name>
          and
          <string-name>
            <given-names>K</given-names>
            <surname>Manju</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Indian Native Language Identification</article-title>
          .
          <source>In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation</source>
          , Bangalore, India,
          8th-10th December.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Sze-Meng Jojo</given-names>
            <surname>Wong</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Dras</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Contrastive analysis and native language identification</article-title>
          .
          <source>In Proceedings of the Australasian Language Technology Association Workshop</source>
          .
          <fpage>53</fpage>
          -
          <lpage>61</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Sze-Meng Jojo</given-names>
            <surname>Wong</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Dras</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Exploiting parse structures for native language identification</article-title>
          .
          <source>In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics</source>
          ,
          <fpage>1600</fpage>
          -
          <lpage>1610</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Sze-Meng Jojo</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mark</given-names>
            <surname>Dras</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Johnson</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Exploring adaptor grammars for native language identification</article-title>
          .
          <source>In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics</source>
          ,
          <fpage>699</fpage>
          -
          <lpage>709</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Marcos</given-names>
            <surname>Zampieri</surname>
          </string-name>
          , Alina Maria Ciobanu, and Liviu P Dinu.
          <year>2017</year>
          .
          <article-title>Native Language Identification on Text and Speech</article-title>
          .
          <source>arXiv preprint arXiv:1707.07182</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Marcos</given-names>
            <surname>Zampieri</surname>
          </string-name>
          , Liling Tan, Nikola Ljubešić, and
          <string-name>
            <given-names>Jörg</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>A report on the DSL shared task 2014</article-title>
          .
          <source>In Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects (VarDial)</source>
          .
          <fpage>58</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>