<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>CIC-IPN@INLI2018: Indian Native Language Identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ilia Markov</string-name>
          <email>ilia.markov@inria.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Grigori Sidorov</string-name>
          <email>sidorov@cic.ipn.mx</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>INRIA Paris</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Instituto Politecnico Nacional (IPN), Center for Computing Research (CIC)</institution>
          ,
          <addr-line>Mexico City</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper, we describe the CIC-IPN submissions to the shared task on Indian Native Language Identification (INLI 2018). We use the Support Vector Machines algorithm trained on numerous feature types: word, character, part-of-speech tag, and punctuation mark n-grams, as well as character n-grams from misspelled words and emotion-based features. The features are weighted using the log-entropy scheme. Our team achieved 41.8% accuracy on test set 1 and 34.5% accuracy on test set 2, ranking 3rd in the official INLI shared task scoring.</p>
      </abstract>
      <kwd-group>
<kwd>Native Language Identification</kwd>
        <kwd>media</kwd>
        <kwd>feature engineering</kwd>
        <kwd>machine learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        The task of Native Language Identification (NLI) consists in identifying the
native language of a person based on their text production in the second language.
The underlying hypothesis is that the learner's native language (L1) influences
their second language (L2) production as a result of the language transfer effect
(native language interference) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], which is thoroughly studied in the field of
second language acquisition (SLA).
      </p>
      <p>The possible applications of the task include marketing and security, as NLI is
viewed as a subtask of author profiling, as well as education, where pedagogical
material can be tailored to learners' native languages, for example, by taking into
account the most common errors made by learners with a specific background
and adapting the materials to tackle such errors in more detail.</p>
      <p>
        Previous studies on identifying the native language from L2 writing – most
of which approached the task from a machine-learning perspective – explored a
wide range of L1 phenomena that appear in L2 production, e.g., lexical choices
made by learners, grammatical patterns used, the influence of cognates and
general etymology, spelling errors, punctuation, and emotions, among others, and
used corresponding features to capture these phenomena. Most NLI studies have
focused on English as a second language; however, NLI methods have also been
examined on other L2s with promising results [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        The interest in NLI has led to the organization of several NLI competitions,
including the first edition of the shared task on identifying Indian languages [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
which was held in 2017 and attracted a large number of participating teams. The
winning approach consisted in training a Support Vector Machines (SVM)
classifier with the SGD (Stochastic Gradient Descent) method on word n-gram and
character n-gram features [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Other approaches included various pre-processing
steps (e.g., removing digits, emoji, and stop words), classification algorithms (e.g.,
SVM, Logistic Regression, Naive Bayes), and features (e.g., non-English word
counts, adjectives and nouns as features, and average sentence and word length,
among others) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        In this paper, we present the CIC-IPN submissions to the 2018 INLI shared
task [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We use the SVM algorithm trained on word n-grams, traditional
(untyped) and typed character n-grams, part-of-speech (POS) tag n-grams,
punctuation mark n-grams, character n-grams from misspelled words, and
emotion-based features. In what follows, we describe in detail the features used and the
configuration of our runs.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Data</title>
      <p>
        The training dataset released by the organizers consists of Facebook comments in
the English language extracted from regional language newspapers. This dataset
was also used in the 2017 edition of the INLI competition [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The dataset
statistics in terms of the L1s covered, the number (No.) of documents per L1, and the
corresponding ratio are provided in Table 1. It can be seen that the 1,233 training
documents are nearly evenly balanced across the represented L1s.
      </p>
      <p>The submitted systems were evaluated on two test sets: test set 1 (also
used in INLI 2017; 783 documents) and test set 2 (the official test set of
INLI 2018; 1,185 documents).</p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>In this section, we describe the features incorporated in our runs
and the configuration of our system: the weighting scheme, the frequency threshold, and
the machine-learning classifier.</p>
      <sec id="sec-3-1">
        <title>Features</title>
        <p>
          Word n-grams capture lexical choices of the learner in L2 production and are
considered one of the most indicative individual feature types for the task of NLI [
          <xref ref-type="bibr" rid="ref4 ref9">4,
9</xref>
          ]. Word n-gram features were also incorporated in the winning approach to the
previous INLI shared task [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. In runs 1 and 3, we use word unigrams and
2-grams, while in run 2 we use word 1–3-grams. We lowercase the word-based
features and replace digits by a placeholder (e.g., 12345 → 0).
        </p>
        <p>
          Untyped character n-grams are considered very indicative features for NLI and
for other related tasks [
          <xref ref-type="bibr" rid="ref16 ref3">3, 16</xref>
          ]. In NLI, these features are hypothesized to capture
the phoneme transfer from the learner's L1 [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], among other L1 peculiarities.
They were also incorporated into the winning approach to INLI 2017 [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. We
use character n-grams with n = 2.
        </p>
        <p>
          Typed character n-grams – character n-grams categorized into ten different
categories – have been successfully applied to NLI [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. We conducted an ablation
study in order to identify the most indicative typed character n-gram categories.
We found that the middle-punctuation and the whole-word categories did not
contribute to the result, and they were therefore discarded. We use typed character
4-grams; 3-grams are used for the suffix category.
        </p>
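<p>Two of the retained typed categories (word prefixes and suffixes) can be sketched as follows; the function name and the simplified category definitions are ours, a reduction of the full ten-category scheme:</p>

```python
def affix_ngrams(word, n=4):
    # Simplified sketch of two typed character n-gram categories:
    # "prefix": the first n characters of a word longer than n characters;
    # "suffix": the last 3 characters (the paper uses 3-grams for suffixes)
    feats = []
    if len(word) > n:
        feats.append(("prefix", word[:n]))
    if len(word) > 3:
        feats.append(("suffix", word[-3:]))
    return feats

print(affix_ngrams("learning"))
# [('prefix', 'lear'), ('suffix', 'ing')]
```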
        <p>
          POS tag n-grams capture morpho-syntactic aspects of the native language in
NLI. They encode word order and grammatical properties of the native
language, capturing the use or misuse of grammatical structures. POS tag n-grams
have proved to be useful features for NLI, especially when combined with other
feature types [
          <xref ref-type="bibr" rid="ref4 ref9">4, 9</xref>
          ]. We use POS tag 3-grams; the POS tags are obtained with the
TreeTagger package [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
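<p>Once a text has been tagged, building POS tag 3-grams reduces to a sliding window over the tag sequence (a sketch; the tag sequence below is illustrative, while the paper obtains tags with TreeTagger):</p>

```python
def pos_ngrams(tags, n=3):
    # Slide a window of size n over the POS tag sequence
    return ["_".join(tags[i:i + n]) for i in range(len(tags) - n + 1)]

# Penn Treebank-style tags for "she writes the essay" (illustrative)
tags = ["PRP", "VBZ", "DT", "NN"]
print(pos_ngrams(tags))
# ['PRP_VBZ_DT', 'VBZ_DT_NN']
```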
      <p>
          Punctuation mark n-grams. The impact of punctuation marks (PMs) on NLI
was evaluated in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. The authors report that punctuation usage is a strong
indicator of the author's L1. We use punctuation mark n-grams (n = 3).
        </p>
        <p>
          Character n-grams from misspelled words were introduced by Chen et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
These features have been successfully used to tackle the NLI task in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. We
extract 8,937 misspelled words from the training dataset using the PyEnchant
package (https://pypi.org/project/pyenchant/) and build character 4-grams from them.
        </p>
        <p>
          Emotion polarity features. Emotion-based features for NLI were proposed in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
We use emotion polarity (emoP) features similar to [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]: we replace each word in the
text with the information from the NRC emotion lexicon [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], e.g., excellent →
"0000101001".
        </p>
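<p>The emotion polarity encoding can be sketched with a toy stand-in for the NRC lexicon (the dictionary entries below are illustrative, not real lexicon values, and we assume out-of-lexicon words are simply dropped):</p>

```python
# Toy stand-in for the NRC emotion lexicon: each word maps to a binary
# string over the lexicon's emotion/polarity categories (illustrative values)
toy_lexicon = {
    "excellent": "0000101001",
    "terrible": "1010000110",
}

def emotion_encode(text):
    # Replace each word with its lexicon bit string; skip unknown words
    return " ".join(toy_lexicon[w] for w in text.lower().split() if w in toy_lexicon)

print(emotion_encode("An excellent but terrible idea"))
# 0000101001 1010000110
```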
      </sec>
      <sec id="sec-3-2">
        <title>Weighting scheme and threshold</title>
        <p>
          We use the log-entropy (le) weighting scheme, which measures the importance of a
feature across the entire corpus. le is considered one of the best weighting schemes
for the NLI task [
          <xref ref-type="bibr" rid="ref2 ref4 ref9">2, 4, 9</xref>
          ]. In our experiments under 10-fold cross-validation, le
outperformed the other weighting schemes we examined (tf-idf, tf, and binary).
The accuracy improvement over the second best-performing weighting scheme (tf-idf)
was 3.2%–3.6%, depending on the run.
        </p>
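<p>In the standard le formulation, the local weight is log2(1 + tf) and the global weight is one plus the feature's normalized entropy over documents; a minimal numpy sketch of this formulation (ours, not the shared-task code):</p>

```python
import numpy as np

def log_entropy(counts):
    # counts: documents x features matrix of raw term counts
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    gf = counts.sum(axis=0)                              # global feature frequency
    p = np.divide(counts, gf, out=np.zeros_like(counts), where=gf > 0)
    # entropy term, with 0 * log(0) treated as 0
    plogp = np.where(p > 0, p * np.log2(np.where(p > 0, p, 1.0)), 0.0)
    g = 1.0 + plogp.sum(axis=0) / np.log2(n_docs)        # global entropy weight
    return np.log2(1.0 + counts) * g                     # local weight times global

# A feature spread evenly over all documents gets global weight 0,
# while a feature concentrated in one document keeps full weight
print(log_entropy([[2, 0], [2, 4]]).round(3))
```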
        <p>
          Tuning the size of the feature set (selecting optimal frequency threshold
values) is an effective strategy for NLP tasks in general [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and for NLI in
particular [
          <xref ref-type="bibr" rid="ref4 ref9">4, 9</xref>
          ]. In all our runs, we include only the features that appear in at least two
documents (min_df = 2). In run 3, we additionally set the frequency threshold value
to 3 (we include only the features that appear at least three times in the entire corpus).
        </p>
      </sec>
      <sec id="sec-3-3">
<title>Classifier</title>
        <p>
          We use the linear SVM algorithm, whose effectiveness has been proved by
numerous studies on NLI [
          <xref ref-type="bibr" rid="ref4 ref9">4, 9</xref>
          ]. SVM was also the most popular algorithm in the 2017
edition of the INLI shared task [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. We use SVM with the OvR (one vs. the rest) multi-class
strategy, as implemented in the scikit-learn package [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>Evaluation</title>
      <p>For the evaluation of our system, we conducted experiments under 10-fold
cross-validation, measuring the results in terms of classification accuracy on the
training corpus.</p>
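<p>The classifier and evaluation setup can be sketched as follows (a toy corpus with made-up texts and labels; scikit-learn's LinearSVC applies the one-vs-rest strategy by default, and cross_val_score runs the 10-fold evaluation):</p>

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny illustrative corpus (texts and L1 labels are made up); it is
# repeated so that every 10-fold split contains both classes
texts = ["good morning ji", "namaste friends", "vanakkam all",
         "good evening ji", "namaste everyone", "vanakkam friends"] * 5
labels = ["HI", "HI", "TA", "HI", "HI", "TA"] * 5

# Linear SVM with the default one-vs-rest multi-class strategy
clf = make_pipeline(CountVectorizer(), LinearSVC())
scores = cross_val_score(clf, texts, labels, cv=10)  # 10-fold cross-validation
print(round(scores.mean(), 2))
```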
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results and Discussion</title>
      <p>Run 3 showed the highest accuracy on the official test set due to its higher
frequency threshold value. The confusion matrix for this run on the training data
is shown in Figure 1; the class-wise accuracy results provided by the organizers
on test sets 1 and 2 are presented in Tables 3 and 4, respectively. The highest
10-fold cross-validation result was achieved for the Hindi language, while on
both test sets it was the hardest language to identify.</p>
      <p>We described the three runs submitted by the CIC-IPN team to the
2018 INLI shared task. Our approach uses the SVM algorithm trained on word,
character, POS tag, and punctuation mark n-grams, character n-grams from
misspelled words, and emotion-based features. The features are weighted using
the log-entropy weighting scheme. Our team achieved 41.8% accuracy on test
set 1 (run 1) and 34.5% accuracy on the official test set 2 (run 3), placing our
team 3rd (out of 12 participating teams) in the competition.</p>
      <p>In future work, we will evaluate the performance of our system without word
and character n-grams in order to investigate their impact on the accuracy drop
suffered by the system when evaluated on the test sets. We will also focus on
more abstract features that perform well in situations where topic bias may
occur.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Brooke</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hirst</surname>
          </string-name>
          , G.:
<article-title>Native language detection with 'cheap' learner corpora</article-title>
          .
          <source>In: Proceedings of the Conference of Learner Corpus Research</source>
          . pp.
          <fpage>37</fpage>
          –
          <lpage>47</lpage>
          . Presses universitaires de Louvain, Louvain-la-Neuve, Belgium (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strapparava</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nastase</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
<article-title>Improving native language identification by using spelling errors</article-title>
          .
          <source>In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics</source>
          . pp.
          <fpage>542</fpage>
          –
          <lpage>546</lpage>
          . ACL, Vancouver, Canada (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Gomez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baptista</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinto</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
<article-title>Discriminating between similar languages using a combination of typed and untyped character n-grams and words</article-title>
          .
          <source>In: Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects</source>
          . pp.
          <fpage>137</fpage>
          –
          <lpage>145</lpage>
          . ACL, Valencia, Spain (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Jarvis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bestgen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pepper</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
<article-title>Maximizing classification accuracy in native language identification</article-title>
          .
          <source>In: Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications</source>
          . pp.
          <fpage>111</fpage>
          –
          <lpage>118</lpage>
          . ACL, Atlanta, GA, USA (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Kosmajac</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keselj</surname>
          </string-name>
          , V.:
<article-title>DalTeam@INLI-FIRE-2017: Native language identification using SVM with SGD training</article-title>
          .
          <source>In: Working notes of FIRE</source>
          <year>2017</year>
          <article-title>- Forum for Information Retrieval Evaluation</article-title>
          . CEUR, Bangalore, India (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ganesh</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soman</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
<article-title>Overview of the INLI PAN at FIRE-2017 track on Indian native language identification</article-title>
          .
          <source>In: Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation</source>
          . vol.
          <volume>2036</volume>
          , pp.
          <fpage>99</fpage>
          –
          <lpage>105</lpage>
          . CEUR Workshop Proceedings, Bangalore, India (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ganesh</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soman</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
<article-title>Overview of the INLI@FIRE-2018 track on Indian native language identification</article-title>
          .
          <source>In: Workshop proceedings of FIRE 2018. CEUR Workshop Proceedings</source>
          , Gandhinagar, India (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dras</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
<article-title>Multilingual native language identification</article-title>
          .
          <source>Natural Language Engineering</source>
          <volume>23</volume>
          (
          <issue>2</issue>
          ),
          <fpage>163</fpage>
          –
          <lpage>215</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strapparava</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
<article-title>CIC-FBK approach to native language identification</article-title>
          .
          <source>In: Proceedings of the 12th Workshop on Building Educational Applications Using NLP</source>
          . pp.
          <fpage>374</fpage>
          –
          <lpage>381</lpage>
          . ACL, Copenhagen, Denmark (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gelbukh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
<article-title>The winning approach to cross-genre gender identification in Russian at RUSProfiling 2017</article-title>
          .
          <source>In: FIRE 2017 Working Notes</source>
          . vol.
          <volume>2036</volume>
          , pp.
          <fpage>20</fpage>
          –
          <lpage>24</lpage>
          . CEUR-WS.org, Bangalore, India (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nastase</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strapparava</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Punctuation as native language interference</article-title>
          .
          <source>In: Proceedings of the 27th International Conference on Computational Linguistics</source>
. pp.
          <fpage>3456</fpage>
          –
          <lpage>3466</lpage>
          . The COLING 2018 Organizing Committee
          , Santa Fe, New Mexico, USA (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nastase</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strapparava</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
<article-title>The role of emotions in native language identification</article-title>
          .
          <source>In: Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment &amp; Social Media Analysis</source>
          . ACL
          , Brussels, Belgium (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Mohammad</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Turney</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Crowdsourcing a word-emotion association lexicon</article-title>
          .
          <source>Computational Intelligence</source>
          <volume>29</volume>
          ,
<fpage>436</fpage>
          –
          <lpage>465</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Odlin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
<article-title>Language Transfer: cross-linguistic influence in language learning</article-title>
          . Cambridge University Press, Cambridge, UK (
          <year>1989</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanderplas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cournapeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brucher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perrot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duchesnay</surname>
          </string-name>
          , E.:
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          ,
<fpage>2825</fpage>
          –
          <lpage>2830</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Sanchez-Perez</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
          </string-name>
          , G.:
<article-title>Comparison of character n-grams and lexical features on author, gender, and language variety identification on the same Spanish news corpus</article-title>
          .
          <source>In: Experimental IR Meets Multilinguality, Multimodality, and Interaction</source>
          . vol.
          <volume>10456</volume>
          , pp.
<fpage>145</fpage>
          –
          <lpage>151</lpage>
          . Springer, Dublin, Ireland (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Schmid</surname>
          </string-name>
          , H.:
          <article-title>Improvements In Part-of-Speech Tagging With an Application to German</article-title>
          , pp.
<fpage>13</fpage>
          –
          <lpage>25</lpage>
          . Springer (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Tsur</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rappoport</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
<article-title>Using classifier features for studying the effect of native language on the choice of written second language words</article-title>
          .
          <source>In: Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition</source>
          . pp.
<fpage>9</fpage>
          –
          <lpage>16</lpage>
          . ACL, Stroudsburg, PA, USA (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>