=Paper=
{{Paper
|id=Vol-2036/T4-8
|storemode=property
|title=SeerNet@INLI-FIRE-2017: Hierarchical Ensemble for Indian Native Language Identification
|pdfUrl=https://ceur-ws.org/Vol-2036/T4-8.pdf
|volume=Vol-2036
|authors=Royal Jain,Venkatesh Duppada,Sushant Hiray
|dblpUrl=https://dblp.org/rec/conf/fire/JainDH17
}}
==SeerNet@INLI-FIRE-2017: Hierarchical Ensemble for Indian Native Language Identification==
Royal Jain, Venkatesh Duppada, Sushant Hiray
royal.jain@seernet.io, venkatesh.duppada@seernet.io, sushant.hiray@seernet.io
Seernet Technologies, LLC, Milpitas, CA, USA

ABSTRACT
Native Language Identification has played an important role in forensics, primarily for author profiling and identification. In this work, we discuss our approach to the shared task on Indian Native Language Identification. The task is to identify the native language of a writer from a given XML file containing a set of Facebook comments written in English. We propose a hierarchical ensemble approach which combines various machine learning techniques with language-agnostic feature extraction to perform the final classification. Our hierarchical ensemble improves the TF-IDF based baseline accuracy by 3.9%. The proposed system stood 3rd across unique team submissions.

CCS CONCEPTS
• Computing methodologies → Classification and regression trees; Support vector machines; Neural networks; Bagging; Feature selection;

KEYWORDS
Native Language Identification, Text Classification, Ensemble

1 INTRODUCTION
Native Language Identification (NLI) is the task of automatically identifying the native language of an individual based on their writing or speech in another language. The underlying assumption is that an author's native language (mother tongue) will often influence the way they express themselves in another language. Identifying such common patterns across a group of people can be used to determine their native language.

Identifying the native language of an author has various applications, primarily in forensics, where author profiling and identification using the native language is an important feature [1]. Identifying the native language can also be used to provide personalised training for learning new languages [6]. Recent work by [3] focuses on tracing linguistic influences in multi-author texts. Researchers have experimented with a range of machine learning algorithms, with Support Vector Machines having found the most success. However, some of the most successful approaches have made use of classifier ensemble methods to further improve performance on this task.

In this shared task [2] we focus on identifying the native language of users from their comments on various Facebook news posts. From a Natural Language Processing (NLP) perspective, NLI is framed as a multiclass supervised classification task. The shared task at hand is specific to identifying six Indian native languages: Tamil, Hindi, Kannada, Malayalam, Bengali and Telugu.

As we explore in the next section, prior work has primarily dealt with statistical machine learning algorithms such as SVMs and representations such as TF-IDF. Our approach combines these state-of-the-art algorithms in a hierarchical ensemble. We have also experimented with two different feature extraction strategies, explored further in Section 3.1.

2 RELATED WORK
Most of the related NLI work can be categorized into two domains: text based and speech based.

2.1 Text NLI
The 2013 Native Language Identification Shared Task [8] created increased interest in the problem by providing a large labelled dataset. [9] exploited differences in parse structure in the texts of different native language speakers to reduce classification error. Very recently, the 2017 shared task on Native Language Identification [4] provided additional contributions to the field.

2.2 Speech NLI
[10] demonstrates that acoustic features, along with various features computed on the transcripts, can provide increased accuracy in dialect identification. [7] achieved good results with i-vector and GloVe vector features with a GRU deep learning model. Starting from the shared task in 2013, quite a few approaches have used ensembling techniques to combine multiple base classifiers and improve performance.

3 SYSTEM DESCRIPTION

3.1 Feature Extraction
We observe from the dataset that people often use words and phrases belonging to their native language transliterated into English; common examples are "Jai ho", "vadi koduthu", etc. We also expect that people who share a native language will have some topics/concerns which are not shared by people with a different native language. For example, an issue which revolves around Tamil Nadu would resonate more with Tamil speakers than with others.

For our classification system we created two different feature sets from the data. In the first feature set we take raw sentences as inputs. The sentences are tokenized to create a vocabulary of tokens, which is then used to compute term frequency-inverse document frequency (TF-IDF) features for each sample point; these are used as the feature input in the classification step. One benefit of TF-IDF over a simple bag-of-words approach is that it mitigates the effect of common words, making inputs easier to discriminate. We refrained from using higher n-gram features due to the limited amount of data.
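As an illustration, the following is a minimal sketch of how this first feature set could be built with scikit-learn's TfidfVectorizer. The `comments` list is a hypothetical stand-in for the comments parsed from the task's XML files; the exact tokenization settings are not specified in the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the comments parsed from the shared-task XML files.
comments = [
    "Jai ho, what a great win for the team!",
    "vadi koduthu semma scene",
]

# Unigram TF-IDF over the token vocabulary; the paper avoids higher
# n-grams because the training data is small.
vectorizer = TfidfVectorizer(ngram_range=(1, 1))
X = vectorizer.fit_transform(comments)  # sparse matrix: samples x vocabulary terms
```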
In the second feature set we leverage the observations stated above to filter the relevant information. First, for each sentence we collect the words which do not belong to the English vocabulary. The sentences were tokenized using the tweetokenize package (https://www.github.com/jaredks/tweetokenize), and we check whether a word belongs to the English vocabulary using the English dictionary provided in enchant (https://pypi.python.org/pypi/pyenchant/). These words are extracted to capture the usage of native language in the inputs.

We then tested our hypothesis that speakers of a common native language would have topics/concerns which are not shared as strongly by others. To this end we collected all the documents of the native speakers of each language and extracted topics from them using Latent Dirichlet Allocation. We observed a good number of topics which were specific to speakers of a common native language; we think this is a result of the regional and cultural proximity between such speakers. Most of these topics were expressed in noun forms, so to extract this information we collected the noun chunks present in the sentences, extracted using spacy (https://spacy.io/). We then follow a procedure similar to the first feature set: we collect these two features for each sentence, create a vocabulary from them, and use this vocabulary to compute TF-IDF features, which are then used as inputs for classification.
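A rough sketch of this second feature set is below, under stated assumptions: a plain whitespace tokenizer stands in for the tweetokenize package, and the spaCy model name `en_core_web_sm` is our assumption, as the paper names no specific model.

```python
import enchant  # PyEnchant English dictionary
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_sm")  # assumed model; the paper only says spaCy
english = enchant.Dict("en_US")

def feature_set_2(comment):
    """Non-English tokens plus noun chunks for one comment, joined as a pseudo-document."""
    # Whitespace tokenizer as a stand-in for the tweetokenize package used in the paper.
    tokens = comment.split()
    non_english = [t for t in tokens if t.isalpha() and not english.check(t)]
    noun_chunks = [chunk.text for chunk in nlp(comment).noun_chunks]
    return " ".join(non_english + noun_chunks)

# Hypothetical comments, as in the earlier sketch; TF-IDF is then
# computed over the extracted pseudo-documents.
comments = ["Jai ho, what a great win for the team!", "vadi koduthu semma scene"]
docs = [feature_set_2(c) for c in comments]
X2 = TfidfVectorizer().fit_transform(docs)
```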
3.2 Classification
We perform classification separately for both feature sets described above. The training dataset in the competition was small; hence, instead of creating separate training and development sets, we performed 10-fold cross-validation. On each fold, a model was trained and its predictions were collected on the held-out data. We calculated the mean accuracy over the 10 folds for each type of classifier. We also observed the performance of each classifier on the points which were harder to classify, i.e., those points for which the decisions were incorrect for the majority of classifiers. After this evaluation, four classifiers, namely LogisticRegression, MLPClassifier, LinearSVC and RidgeClassifier from sklearn [5], were selected for ensemble creation. These classifiers were chosen based on their cross-validation performance and on their complementary performance on hard-to-predict data points. The performance of these classifiers on cross-validation is shown in Table 1.

Table 1: 10-fold cross-validation mean accuracy on the two feature sets

Classifier          Feature Set 1     Feature Set 2
LogisticRegression  0.887959046018    0.912401033115
LinearSVC           0.894482995578    0.914853592818
RidgeClassifier     0.894476444138    0.913260049508
MLPClassifier       0.878357282545    0.902736002619

3.3 Ensemble
We created a hierarchical ensemble model for this task, consisting of two layers of ensembles. The first layer consists of two ensembles. The first one consists of the four classifiers selected in the previous section, trained on feature set 1 (TF-IDF features on raw input sentences). The second ensemble consists of the same four classifiers, but trained on feature set 2, which has TF-IDF features computed over the noun chunks and non-English words extracted from each sentence. Each ensemble predicts the output using a majority vote; we limited the decision to majority voting since complex weighted voting would have caused overfitting. The final classification is predicted by combining the two ensembles described above: if they output the same class, we present that class as the prediction; if they differ, we calculate the confidence of each ensemble as the count of classifiers in the ensemble which support its decision, and choose the more confident one. Figure 1 depicts our system.

[Figure 1: System Design]
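The decision rule of Section 3.3 can be sketched in a few lines. Here `preds_1` and `preds_2` are hypothetical per-sample prediction lists from the two ensembles' four base classifiers, and the behavior when both ensembles are equally confident is our assumption, since the paper does not specify a tie-break.

```python
from collections import Counter

def ensemble_vote(predictions):
    """Majority vote over one ensemble's base classifiers.
    Returns the winning class and the number of classifiers supporting it."""
    label, support = Counter(predictions).most_common(1)[0]
    return label, support

def final_prediction(preds_1, preds_2):
    """Combine the two feature-set ensembles as described in Section 3.3."""
    label_1, conf_1 = ensemble_vote(preds_1)
    label_2, conf_2 = ensemble_vote(preds_2)
    if label_1 == label_2:
        return label_1  # both ensembles agree
    # Otherwise pick the ensemble with more supporting classifiers.
    # The paper does not specify a tie-break; preferring ensemble 1 is an assumption.
    return label_1 if conf_1 >= conf_2 else label_2

# Example with four base classifiers per ensemble:
print(final_prediction(["TA", "TA", "HI", "TA"], ["HI", "HI", "TA", "KA"]))  # -> TA
```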
4 RESULTS
We can see from Table 1 that all four classifiers perform quite well on both extracted feature sets, especially considering that the classification problem involves six classes. This suggests that the training data points are relatively easy to discriminate. We further see that accuracy increases significantly on feature set 2, suggesting that features such as native language words and regional/local topics are important for identifying the native language.

We presented three submissions. Submission 1 is the output of the final classifier (see Figure 1). Submission 2 is the output of Ensemble 1, which was trained on raw sentences. Submission 3 was generated using Ensemble 2, trained on feature set 2 (non-English phrases and noun chunks). Test accuracies are given in Table 2. We can see that Submission 3 outperforms the other two, strengthening our belief in the importance of native language phrases and shared topics for identifying the native language of a speaker.

Table 2: Accuracy on test data

Class     Submission 1   Submission 2   Submission 3
BE        64.40          64.80          67.10
HI        16.10          14.30          15.70
KA        49.80          46.50          48.10
MA        46.80          50.00          45.40
TA        54.40          52.10          52.20
TE        44.40          43.70          44.90
Overall   46.60          46.40          46.90

5 FUTURE WORK AND CONCLUSION
This paper studies two approaches for the identification of native language. The first approach measures the power of TF-IDF features for the purpose of classification. The second approach identifies certain features which separate speakers of different native languages from each other and utilizes them for better overall accuracy. We have seen an improvement in accuracy due to the identification of discriminating features; however, extending this procedure is time consuming and requires language expertise. Recent studies have shown that deep neural networks can be a possible alternative to manually creating hand-crafted features and can provide better performance.

ACKNOWLEDGMENTS
We would like to thank the organisers of the FIRE-2017 Shared Task on Native Language Identification for providing the data, the guidelines and timely support.

REFERENCES
[1] John Gibbons. 2003. Forensic Linguistics: An Introduction to Language in the Justice System. Wiley-Blackwell.
[2] Anand Kumar M, Barathi Ganesh HB, Shivkaran S, Soman K P, and Paolo Rosso. 2017. Overview of the INLI PAN at FIRE-2017 Track on Indian Native Language Identification. In Notebook Papers of FIRE 2017.
[3] Shervin Malmasi and Mark Dras. 2017. Native Language Identification using Stacked Generalization. arXiv preprint arXiv:1703.06541 (2017).
[4] Shervin Malmasi, Keelan Evanini, Aoife Cahill, Joel Tetreault, Robert Pugh, Christopher Hamill, Diane Napolitano, and Yao Qian. 2017. A Report on the 2017 Native Language Identification Shared Task. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications. 62–75.
[5] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[6] Alla Rozovskaya and Dan Roth. 2011. Algorithm Selection and Model Adaptation for ESL Correction Tasks. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1. Association for Computational Linguistics, 924–933.
[7] Ishan Somshekar, Bogac Kerem Goksel, and Huyen Nguyen. [n. d.]. Native Language Identification. ([n. d.]).
[8] Joel R. Tetreault, Daniel Blanchard, and Aoife Cahill. 2013. A Report on the First Native Language Identification Shared Task. In BEA@NAACL-HLT. 48–57.
[9] Sze-Meng Jojo Wong and Mark Dras. 2011. Exploiting Parse Structures for Native Language Identification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 1600–1610. http://dl.acm.org/citation.cfm?id=2145432.2145603
[10] Marcos Zampieri, Shervin Malmasi, Nikola Ljubešić, Preslav Nakov, Ahmed Ali, Jörg Tiedemann, Yves Scherrer, and Noëmi Aepli. 2017. Findings of the VarDial Evaluation Campaign 2017. (2017).