=Paper= {{Paper |id=Vol-2036/T4-8 |storemode=property |title=SeerNet@INLI-FIRE-2017: Hierarchical Ensemble for Indian Native Language Identification |pdfUrl=https://ceur-ws.org/Vol-2036/T4-8.pdf |volume=Vol-2036 |authors=Royal Jain,Venkatesh Duppada,Sushant Hiray |dblpUrl=https://dblp.org/rec/conf/fire/JainDH17 }} ==SeerNet@INLI-FIRE-2017: Hierarchical Ensemble for Indian Native Language Identification== https://ceur-ws.org/Vol-2036/T4-8.pdf
SeerNet@INLI-FIRE-2017: Hierarchical Ensemble for Indian Native Language Identification

Royal Jain
Venkatesh Duppada
Sushant Hiray

royal.jain@seernet.io
venkatesh.duppada@seernet.io
sushant.hiray@seernet.io

Seernet Technologies, LLC
Milpitas, CA, USA

ABSTRACT

Native Language Identification has played an important role in forensics, primarily for author profiling and identification. In this work, we discuss our approach to the shared task of Indian Native Language Identification. The task is to identify the native language of a writer from a given XML file containing a set of Facebook comments written in English. We propose a hierarchical ensemble approach which combines various machine learning techniques with language-agnostic feature extraction to perform the final classification. Our hierarchical ensemble improves on the TF-IDF based baseline accuracy by 3.9%. The proposed system stood 3rd across unique team submissions.

CCS CONCEPTS

• Computing methodologies → Classification and regression trees; Support vector machines; Neural networks; Bagging; Feature selection;

KEYWORDS

Native Language Identification, Text Classification, Ensemble

1 INTRODUCTION

Native Language Identification (NLI) is the task of automatically identifying the native language of an individual based on their writing or speech in another language. The underlying assumption is that an author's native language (mother tongue) will often influence the way they express themselves in another language. Identifying such common patterns across a group of people can be used to determine their native language.

Identifying the native language of an author has various applications, primarily in forensics, where author profiling and identification using native language is an important feature [1]. Identifying the native language can also be used to provide personalised training for learning new languages [6]. Recent work by [3] focuses on using it to trace linguistic influences in multi-author texts.

Researchers have experimented with a range of machine learning algorithms, with Support Vector Machines having found the most success. However, some of the most successful approaches have made use of classifier ensemble methods to further improve performance on this task.

In this shared task [2] we focus on identifying the native language of users from their comments on various Facebook news posts. From a Natural Language Processing (NLP) perspective, NLI is framed as a multiclass supervised classification task. The shared task at hand is specific to identifying six Indian native languages: Tamil, Hindi, Kannada, Malayalam, Bengali and Telugu.

As we explore in the next section, prior work has primarily dealt with statistical machine learning algorithms such as SVMs and representation methods such as tf-idf. Our approach combines these various state-of-the-art algorithms using a hierarchical ensemble. We have also experimented with two different types of feature extraction strategies, explored further in Section 3.1.

2 RELATED WORK

Most of the related NLI work can be categorized into two domains: text based and speech based.

2.1 Text NLI

The 2013 Native Language Identification Shared Task [8] created increased interest in the problem by providing a large labelled dataset. [9] exploited differences in parse structure in the texts of different native language speakers to reduce classification error. Very recently, the 2017 shared task on Native Language Identification [4] provided additional contributions to the field.

2.2 Speech NLI

[10] demonstrates that acoustic features, along with various features computed on the transcripts, can provide increased accuracy in dialect identification. [7] achieved good results with i-vector and GloVe vector features with a GRU deep learning model.

Starting from the shared task in 2013, quite a few approaches have used ensembling techniques to combine multiple base classifiers to improve performance.

3 SYSTEM DESCRIPTION

3.1 Feature Extraction

We observe from the dataset that people often use words and phrases which belong to their native language transliterated into English. Some common examples are "Jai ho", "vadi koduthu", etc. We also expect that people who have the same native language would have some topics/concerns which would not be shared by
people who have a different native language. For example, an issue which revolves around Tamil Nadu would resonate more with Tamil-speaking people than with others.

For our classification system we created two different feature sets from our data. In the first feature set we take raw sentences as inputs. The sentences are tokenized to create a vocabulary of tokens. This vocabulary is then used to create term frequency-inverse document frequency (tf-idf) features for each sample point, which are then used as the feature input in the classification step. One benefit of tf-idf over a simple bag-of-words approach is that it mitigates the effect of common words, making inputs easier to discriminate. We refrained from using higher n-gram features due to the limited amount of data.

In the second feature set we leverage the observations stated above to filter the relevant information. First, for each sentence we collect words which do not belong to the English vocabulary. The sentences were tokenized using the tweetokenize package1 and we check whether a word belongs to the English vocabulary using the English dictionary provided by enchant2. These words are extracted to capture usage of the native language in the inputs. We then tested our hypothesis that speakers of a common native language would have topics/concerns which are not shared as strongly by others. To this end we collected all the documents of native speakers of each language and extracted topics from them using Latent Dirichlet Allocation. We observed a good number of topics which were specific to speakers of a common native language. We think this is a result of regional and cultural proximity between speakers of a common native language. Most of these topics were expressed in noun forms, and hence to extract this information we collected the noun chunks present in the sentences. Noun chunks are extracted using spacy3. We then follow a procedure similar to the first feature set: we collect these two features for each sentence and create a vocabulary for them. This vocabulary is then used to create term frequency-inverse document frequency features which are used as inputs for classification.

Figure 1: System Design

3.2 Classification

We perform the classification separately for both feature sets described above. The training data set in the competition was small; hence, instead of creating separate train and development sets, we performed 10-fold cross validation. On each fold, a model was trained and predictions were collected on the held-out fold. We calculated the mean accuracy over the 10 folds for each type of classifier. We also observed the performance of each classifier on points which were harder to classify, i.e. those points for which the decisions were incorrect for a majority of classifiers. After evaluation, four classifiers, namely LogisticRegression, MLPClassifier, LinearSVC and RidgeClassifier from sklearn [5], were selected for ensemble creation. These classifiers were chosen based on their performance on cross-validation and on the basis of their complementary performance on hard-to-predict data points. The performance of these classifiers on cross validation is shown in Table 1.
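To make the tf-idf weighting of Section 3.1 concrete, here is a minimal pure-Python sketch. This is our illustration, not the authors' code: the toy comments are invented, and the smoothed idf formula is one common formulation the paper does not specify.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into tf-idf vectors.

    Down-weighting terms that occur in many documents is the benefit
    over plain bag-of-words noted in Section 3.1.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n = len(docs)
    # document frequency of each term
    df = {t: sum(1 for doc in docs if t in doc) for t in vocab}
    # smoothed idf (one common formulation; an assumption on our part)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] / len(doc) * idf[t] for t in vocab])
    return vocab, vectors

# Invented toy comments, echoing transliterated phrases like "jai ho".
docs = ["jai ho what a win".split(), "what a match what a day".split()]
vocab, vecs = tfidf_vectors(docs)
```

Because "what" occurs in both documents it receives a lower idf than a document-specific token such as "jai", so the shared word contributes less weight; this is exactly the discriminative advantage over bag-of-words described in the text.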

1 https://www.github.com/jaredks/tweetokenize
2 https://pypi.python.org/pypi/pyenchant/
3 https://spacy.io/


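The evaluation protocol of Section 3.2 (10-fold cross validation with mean accuracy per classifier, as reported in Table 1) can be sketched in plain Python. Here `fit_predict` is a hypothetical stand-in for training and applying one of the sklearn classifiers; the fold-splitting scheme is a generic one, not necessarily the exact implementation used by the authors.

```python
from statistics import mean

def k_fold_splits(n, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        held = set(test)
        train = [i for i in range(n) if i not in held]
        yield train, test
        start += size

def cross_val_accuracy(X, y, fit_predict, k=10):
    """Mean accuracy over k folds: train on k-1 folds, score on the held-out fold."""
    scores = []
    for train, test in k_fold_splits(len(X), k):
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in test])
        scores.append(sum(p == y[i] for p, i in zip(preds, test)) / len(test))
    return mean(scores)
```

Each data point is held out exactly once across the k folds, so the mean fold accuracy uses every labelled sample for evaluation without a separate development set.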
Table 1: 10-fold Cross Validation Mean Accuracy on feature sets

      Classifier           Feature Set 1    Feature Set 2
      LogisticRegression   0.887959         0.912401
      LinearSVC            0.894483         0.914854
      RidgeClassifier      0.894476         0.913260
      MLPClassifier        0.878357         0.902736

Table 2: Accuracy on test data

      Class      Submission 1    Submission 2    Submission 3
      BE         64.40           64.80           67.10
      HI         16.10           14.30           15.70
      KA         49.80           46.50           48.10
      MA         46.80           50.00           45.40
      TA         54.40           52.10           52.20
      TE         44.40           43.70           44.90
      Overall    46.60           46.40           46.90

3.3 Ensemble

We created a hierarchical ensemble model for this task, consisting of two layers of ensembles. The first layer consists of two ensembles. The first comprises the four classifiers selected in the previous section, trained on feature set 1 (term frequency-inverse document frequency features on raw input sentences). The second ensemble comprises the same four classifiers, but trained on feature set 2, which had term frequency-inverse document frequency features computed from the noun chunks and non-English words extracted from each sentence. Each ensemble predicts its output using a majority vote. We limited the decision to a majority vote because more complex weighted voting would have caused overfitting. The final classification is predicted using a combination of the two ensembles described above. If they output the same class, we present that class as the prediction. If they differ, we calculate the confidence of each ensemble as the count of classifiers in the ensemble which support its decision. Fig. 1 depicts our system.

4 RESULTS

We can see from Table 1 that all four classifiers perform quite well on both extracted feature sets, especially considering that the classification problem involves six classes. This suggests that the dataset points are easy to discriminate. We further see that the accuracy increases significantly on feature set 2, suggesting that features such as native language words and regional/local topics are important for identification of native language.

We presented three submissions. Submission 1 is the output of the final classifier (see Fig. 1). Submission 2 is the output of Ensemble 1, which was trained on raw sentences. Submission 3 was generated using Ensemble 2, trained on feature set 2 (non-English phrases and noun chunks). We can see that Submission 3 outperforms the other two, strengthening our belief in the importance of native language phrases and shared topics in identifying the native language of a speaker.

5 FUTURE WORK AND CONCLUSION

This paper studies a couple of approaches for identification of native language. The first approach measures the power of tf-idf features for the purpose of classification. The second approach identifies certain features which separate different native language speakers from each other and utilizes them for better accuracy of the overall system. We have seen an improvement in accuracy due to the identification of discriminating features; however, extending this procedure is time consuming and requires language expertise. Recent studies have shown that the use of deep neural networks can be a possible alternative to creating manually hand-crafted features and can provide better performance.

ACKNOWLEDGMENTS

We would like to thank the organisers of the FIRE-2017 Shared Task on Native Language Identification for providing the data, the guidelines and timely support.

REFERENCES

 [1] John Gibbons. 2003. Forensic Linguistics: An Introduction to Language in the Justice System. Wiley-Blackwell.
 [2] Anand Kumar M, Barathi Ganesh HB, Shivkaran S, Soman K P, and Paolo Rosso. 2017. Overview of the INLI PAN at FIRE-2017 Track on Indian Native Language Identification. In Notebook Papers of FIRE 2017.
 [3] Shervin Malmasi and Mark Dras. 2017. Native Language Identification using Stacked Generalization. arXiv preprint arXiv:1703.06541 (2017).
 [4] Shervin Malmasi, Keelan Evanini, Aoife Cahill, Joel Tetreault, Robert Pugh, Christopher Hamill, Diane Napolitano, and Yao Qian. 2017. A Report on the 2017 Native Language Identification Shared Task. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications. 62–75.
 [5] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
 [6] Alla Rozovskaya and Dan Roth. 2011. Algorithm selection and model adaptation for ESL correction tasks. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1. Association for Computational Linguistics, 924–933.
 [7] Ishan Somshekar, Bogac Kerem Goksel, and Huyen Nguyen. [n. d.]. Native Language Identification. ([n. d.]).
 [8] Joel R Tetreault, Daniel Blanchard, and Aoife Cahill. 2013. A Report on the First Native Language Identification Shared Task. In BEA@NAACL-HLT. 48–57.
 [9] Sze-Meng Jojo Wong and Mark Dras. 2011. Exploiting Parse Structures for Native Language Identification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 1600–1610. http://dl.acm.org/citation.cfm?id=2145432.2145603
[10] Marcos Zampieri, Shervin Malmasi, Nikola Ljubešić, Preslav Nakov, Ahmed Ali, Jörg Tiedemann, Yves Scherrer, and Noëmi Aepli. 2017. Findings of the VarDial Evaluation Campaign 2017. (2017).
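The two-layer decision rule of Section 3.3 can be sketched as follows. This is an illustrative reconstruction: the four base classifiers are omitted (their per-class votes are taken as input), and breaking a support tie in favour of the first ensemble is our assumption, since the paper does not specify tie handling.

```python
from collections import Counter

def ensemble_vote(predictions):
    """Majority vote within one ensemble (layer 1).

    Returns (class, support), where support is the number of member
    classifiers that voted for the winning class."""
    cls, support = Counter(predictions).most_common(1)[0]
    return cls, support

def hierarchical_predict(ensemble1_preds, ensemble2_preds):
    """Layer 2: if the two ensembles agree, output that class; otherwise
    back the ensemble whose winning class had more supporting classifiers.
    Tie-breaking toward ensemble 1 is our assumption."""
    c1, s1 = ensemble_vote(ensemble1_preds)
    c2, s2 = ensemble_vote(ensemble2_preds)
    if c1 == c2:
        return c1
    return c1 if s1 >= s2 else c2
```

For example, if ensemble 1 votes HI with three of four classifiers while ensemble 2 votes TA with only two, the second layer outputs HI, since the first ensemble's decision carries more internal support.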