<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Machine Learning Approach to Indian Native Language Identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>D. Thenmozhi</string-name>
          <email>d@ssn.edu.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>S. Kayalvizhi</string-name>
          <email>kayalvizhi1704@cse.ssn.edu.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chandrabose Aravindan</string-name>
          <email>aravindanc@ssn.edu.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of CSE, SSN College of Engineering</institution>
          ,
          <addr-line>Chennai</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>NLI (Native Language Identification) determines the native language of non-native users from their writings in a foreign language. It has several applications, namely forensic and security, author profiling and identification, and educational applications. English is the most common language used in social media by many non-English speakers around the world to share their thoughts and ideas, and they blend English with their native language in their posts and comments. Identifying the native language from short English texts is still a challenging task. In this paper, we present a language agnostic approach, without any language specific processing, and employ machine learning with and without feature selection to identify the native language of an Indian speaker from their comments and posts on social networks. Bag-of-words features are extracted from the text posted by the user, and the feature vectors are constructed using TF-IDF scores for the training data. We use a statistical feature selection methodology to select the features that contribute significantly to the NLI task. The classifier with the highest cross validation accuracy is used for predicting the native language of the user. Our approaches are evaluated using the INLI@FIRE2018 shared task data set.</p>
      </abstract>
      <kwd-group>
        <kwd>Indian Native Language Identification</kwd>
        <kwd>Language Recognition</kwd>
        <kwd>Author Profiling</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Feature Selection</kwd>
        <kwd>Text Mining</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        NLI (Native Language Identification) is the process of automatically
identifying the native language of speakers from their speech or writing in a different
language. It has several applications, namely forensic and security [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
authorship profiling and identification [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and educational applications [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Several
studies have been reported on text-based NLI [
        <xref ref-type="bibr" rid="ref11 ref15 ref16 ref20 ref5 ref8">20, 11, 5, 8, 15, 16</xref>
        ]. Currently,
people use social media such as YouTube, Facebook, blogs and Twitter to share
their thoughts, ideas and comments. English is the prominent language used
by many non-English speakers, who blend in their native languages in their social
media postings. In this line, Indians also use English predominantly in their
comments and postings. Indian Native Language Identification (INLI) focuses
on identifying the native language of Indians based on their English writings. Many
shared tasks have been conducted on NLI since 2013 to identify the native
language from English text. Recently, shared tasks on INLI have also been evolving since
2017 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Their focus is to research and develop techniques to identify the native
language, namely Tamil, Hindi, Kannada, Malayalam, Bengali or Telugu, from
sets of Facebook comments. Several methodologies have been reported on
INLI: an n-gram approach [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], machine learning approaches with support vector
machines [
        <xref ref-type="bibr" rid="ref12 ref17 ref3">17, 3, 12</xref>
        ], ensembling approaches [
        <xref ref-type="bibr" rid="ref17 ref9">17, 9</xref>
        ] and deep learning approaches
[
        <xref ref-type="bibr" rid="ref21 ref4">21, 4</xref>
        ] have been used to identify Indian native languages. In this research,
our focus is on the INLI@FIRE2018 shared task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which identifies the native
language (Tamil, Hindi, Kannada, Malayalam, Bengali or Telugu) of Indians
based on their comments posted in social media. INLI@FIRE2018 is a shared
task on Indian Native Language Identification (INLI) collocated with the Forum
for Information Retrieval Evaluation (FIRE), 2018.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Native language identification is an author profiling task. PAN 2017 [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] focuses
on language variety identification tasks. The shared tasks on INLI have also been
evolving since 2017 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This section describes the methodologies used for INLI tasks.
Nayel and Shashirekha [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] normalized the text by removing emojis,
special characters, digits, hashtags, mentions and links. They preprocessed the data
using stop word removal based on the NLTK stop words package, a manually
collected stop word list and other resources (the Python stop-words package).
They used TF-IDF scores to construct feature vectors and employed an SVM to
classify the native language of the user. Bharathi et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and Lakshmi et al.
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] also used TF-IDF for feature construction and an SVM for classification for this
task. In addition, Lakshmi et al. used character n-grams and word n-grams while
computing TF-IDF scores. However, they did not apply any preprocessing techniques.
Kosmajac and Keselj [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] normalized the text similarly to [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], used TF-IDF with
character n-grams and word n-grams for feature construction, and employed an SVM
for classification. Jain et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] considered non-English words and noun phrases
while computing TF-IDF scores, without applying any preprocessing techniques.
They used Logistic Regression, SVM, a Ridge Classifier and a Multi-Layer
Perceptron (MLP) as base classifiers and employed an ensemble approach for language
identification. Bhargava et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] used a deep learning approach based on
hierarchical attention with a bi-directional GRU architecture for this task. Thenmozhi et
al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] also employed a neural network approach with two hidden layers for this
task. They normalized the text similarly to [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and handled shortened words
as part of preprocessing. They considered only the nouns and adjectives
present in the text to extract the features. In this paper, we propose a language
agnostic approach in which we do not use any language specific (or linguistic)
processing to extract the features. Thus, we simply take a bag of words to
consider all the words in the text and apply statistical feature selection
to extract the most significant features for the language identification task.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Proposed Methodology</title>
      <p>We have used a supervised approach with three variations, namely a)
term-frequency (TF) without feature selection, b) TF-IDF (term-frequency
inverse-document-frequency) without feature selection and c) TF-IDF with statistical χ²
feature selection for the INLI task. The steps used in our approach are given
below.</p>
      <p>– Preprocess the data
– Extract bag of words (BOW) features from the training data
– Construct feature vectors using TF or TF-IDF, with and without χ² feature
selection
– Build the models using a classifier for the three variations
– Predict one of the six languages, namely Tamil (TA), Hindi (HI), Kannada
(KN), Malayalam (ML), Bengali (BE) or Telugu (TE), as the class label for each
instance using the model</p>
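      <p>The steps above can be sketched end to end with Scikit-learn. This is a minimal illustration, not the actual system: the toy comments and labels below are invented placeholders for the INLI data, and Multinomial Naive Bayes stands in for whichever classifier wins cross validation.</p>

```python
# Toy sketch of the pipeline: TF-IDF vectorization followed by a classifier.
# The comments and labels are invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = [
    "vanakkam semma movie",        # hypothetical Tamil-influenced comments
    "semma super padam",
    "namaskar bhai accha video",   # hypothetical Hindi-influenced comments
    "bhai bahut accha hai",
]
labels = ["TA", "TA", "HI", "HI"]

# Variation b): TF-IDF feature vectors (CountVectorizer would give variation a).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Build the model and predict a language label for an unseen comment.
clf = MultinomialNB().fit(X, labels)
pred = clf.predict(vectorizer.transform(["semma movie"]))[0]
print(pred)
```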
      <p>The steps are explained below in detail.</p>
      <p>Feature Extraction
The data for the INLI task is given as an XML file. The given text is preprocessed
by extracting only the textual part of the content present in the XML file. All
punctuation is removed before extracting the features. Since the texts are
collected from social network sites, many terms are in transliterated form and
many terms are in short-hand notations like pls, sry, tc, tks, etc. Hence, we
did not apply stop word removal and stemming as preprocessing steps. The
unique terms present in the text are considered as features in our first two
variations. The feature vectors for the training data are constructed using
term-frequency in the first variation. TF-IDF is used to construct feature vectors in the
second variation. However, the number of extracted features may be large. We
have employed χ² feature selection in our third variation to extract the useful
features that contribute to native language identification. The details of
feature selection are explained below.</p>
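      <p>A minimal sketch of the preprocessing just described, using only the standard library; the sample comment is invented for illustration.</p>

```python
# Strip punctuation and tokenize; no stop word removal or stemming, since
# the comments contain transliterated and shortened words (pls, sry, tc, ...).
import string

def preprocess(text: str) -> list[str]:
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table).lower().split()

tokens = preprocess("Super movie!! pls watch, tks :-)")
vocabulary = sorted(set(tokens))  # unique terms become the BOW features
print(vocabulary)
```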
      <p>
        Feature Selection
In our third variation, we have used χ² feature selection. The INLI task involves
six categories, namely "BE", "HI", "KN", "ML", "TA" and "TE". Hence, a 2 × 6
CHI table (Table 1), or contingency table [
        <xref ref-type="bibr" rid="ref10 ref14 ref22 ref23">14, 10, 22, 23</xref>
        ], is constructed for every
feature fx. Table 1 contains the observed frequency (O) of feature fx for every
category "BE", "HI", "KN", "ML", "TA" and "TE".
      </p>
      <p>The observed frequencies (O) are used to compute the expected frequencies
(E) for the feature fx using Equation 1.</p>
      <p>E(x, y) = [ Σ_{a ∈ {fx, ¬fx}} O(a, y) × Σ_{b ∈ {BE, HI, KN, ML, TA, TE}} O(x, b) ] / n   (1)
where n is the total number of training instances, x indicates whether the feature
fx is present (fx) or absent (¬fx), and y indicates to which of the six languages, namely
"BE", "HI", "KN", "ML", "TA" or "TE", the training instance belongs.</p>
      <p>The expected frequencies, namely E(fx, BE), E(fx, HI), E(fx, KN),
E(fx, ML), E(fx, TA), E(fx, TE), E(¬fx, BE), E(¬fx, HI), E(¬fx, KN),
E(¬fx, ML), E(¬fx, TA) and E(¬fx, TE), are calculated using Equation 1 for
language identification. Then, we calculate the χ² value for each feature
fx using Equation 2.</p>
      <p>χ²stat(fx) = Σ_{x ∈ {fx, ¬fx}} Σ_{y ∈ {BE, HI, KN, ML, TA, TE}} (O(x, y) − E(x, y))² / E(x, y)   (2)</p>
      <p>The set of features whose χ²stat value is greater than χ²crit(α = 0.10, df = 5) = 9.24
are considered to be significant features for language identification. These
selected features are used to build a model with a classifier in our third variation.</p>
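      <p>Equations 1 and 2 can be computed directly from the 2 × 6 contingency table. The observed counts below are invented for illustration; a feature concentrated in one language should clear the 9.24 threshold.</p>

```python
# Chi-square test for one feature fx: rows are (fx present, fx absent),
# columns are the six languages. Observed counts are invented placeholders.
LANGS = ["BE", "HI", "KN", "ML", "TA", "TE"]

def chi_square(observed):  # observed: {(row, lang): count}
    rows = ("fx", "not_fx")
    n = sum(observed.values())
    chi2 = 0.0
    for x in rows:
        for y in LANGS:
            row_total = sum(observed[(x, b)] for b in LANGS)  # over languages
            col_total = sum(observed[(a, y)] for a in rows)   # over presence/absence
            expected = row_total * col_total / n                   # Equation 1
            chi2 += (observed[(x, y)] - expected) ** 2 / expected  # Equation 2
    return chi2

# A feature appearing almost only in Tamil comments.
obs = {("fx", l): c for l, c in zip(LANGS, [1, 1, 1, 1, 30, 1])}
obs.update({("not_fx", l): c for l, c in zip(LANGS, [201, 210, 202, 199, 177, 209])})

CHI2_CRIT = 9.24  # alpha = 0.10, df = 5
print(chi_square(obs) > CHI2_CRIT)
```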
      <p>Model Building and Prediction
The models for the first two variations for language identification are built from
the training data using a Multi Layer Perceptron (MLP), and the model for the third
variation is built using a Multinomial Naive Bayes (MNB) classifier with the
selected features. The classifiers were chosen based on their cross validation
accuracies. The class label, one among the six languages namely "BE", "HI", "KN",
"ML", "TA" or "TE", is predicted for the test data instances by using the models.</p>
    </sec>
    <sec id="sec-4">
      <title>Implementation</title>
      <p>Our methodology was implemented in Python for this Shared Task on Indian
Native Language Identification (INLI). The number of training instances
is 202, 211, 203, 200, 207 and 210 for the languages Bengali, Hindi,
Kannada, Malayalam, Tamil and Telugu respectively. Two sets of test data were
given for the evaluations, consisting of 783 and 1185 instances for test-set-1
and test-set-2 respectively. The textual part of the data is extracted from the XML file
using the xml.etree library. Punctuation is removed and the BOW (bag of
words) features are extracted from the training instances. We obtained a
total of 21813 features from the training data. The Scikit-learn machine learning library
was used to vectorize the training instances, using CountVectorizer for the first
variation and TfidfVectorizer for the second variation. We implemented the χ²
feature selection algorithm to extract the significant features for native language
identification. We obtained a total of 1555 features by feature selection
with α = 0.10 and 5 degrees of freedom for the six classes.</p>
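      <p>The xml.etree extraction step can be sketched as follows. The tag names here are invented, since the actual INLI XML schema is not shown in the paper.</p>

```python
# Extract only the textual part of an XML file with the standard library.
# The <document>/<comment> structure is a hypothetical stand-in for the
# task's real schema.
import xml.etree.ElementTree as ET

sample = """<document>
  <comment>super movie pls watch</comment>
  <comment>semma padam tks</comment>
</document>"""

root = ET.fromstring(sample)
texts = [c.text for c in root.iter("comment")]
print(texts)
```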
      <p>We have employed several classifiers, namely Multinomial Naive Bayes,
Gaussian Naive Bayes (GNB), Random Forest (RF), Decision Tree (DT), Extra Trees
(ET), Ada Boost (AB), Stochastic Gradient Descent (SGD), Support Vector
Machines (SVM), and Multi Layer Perceptron, and measured 10-fold cross
validation accuracy to select the best classifier for all three variations of our approach.
Table 2 shows the cross validation results of the various classifiers for all three
variations. The table shows that MLP performs better for the first two
variations, which are without feature selection, and MNB performs better for the third
variation, which uses feature selection. MNB performs better with the smaller number of
features selected by our chi-square feature selection. However, MNB
was not able to perform well with all the features: with a huge feature set, the
likelihood mass is spread across many features, and the features may affect
each other's likelihood estimates, which reduces the performance. Hence, we chose
MLP to build the models for the first two variations and MNB to build the model for
the third variation. These models are used to predict the native language for
the two sets of test instances.</p>
      <p>We submitted our second variation (the best of the first two, without feature
selection) using the MLP classifier and our third variation (with feature selection) using
the MNB classifier as two runs for the shared task. The performance is measured in
terms of precision (P), recall (R) and F1-measure. The results obtained by our
approach for Run 1 on the two test sets are shown in Table 3. The results show that
our methodology using TF-IDF with the MLP classifier does not perform well
for the Hindi language. We obtained overall accuracies of 46.1% and 34.3% for
test-set-1 and test-set-2 respectively.</p>
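      <p>The classifier selection step can be sketched with Scikit-learn's cross_val_score; the synthetic data below is a stand-in for the INLI feature vectors, and only two of the nine classifiers are shown.</p>

```python
# Compare 10-fold cross validation scores and keep the best classifier.
# The data is synthetic; in the paper, 10-fold CV over the INLI training
# set selected MLP (variations 1-2) and MNB (variation 3).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=120, n_features=20, n_informative=10,
                           n_classes=6, random_state=0)
X = np.abs(X)  # MultinomialNB requires non-negative features

scores = {}
for name, clf in [("MNB", MultinomialNB()),
                  ("MLP", MLPClassifier(max_iter=300, random_state=0))]:
    scores[name] = cross_val_score(clf, X, y, cv=10).mean()

best = max(scores, key=scores.get)  # classifier with highest CV accuracy
print(best, scores[best])
```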
      <p>The results obtained by our approach for Run 2 on the two test sets are shown in
Table 4. The results show that our methodology using TF-IDF, χ² feature
selection and the MNB classifier improved the performance for the Hindi and Tamil
languages on test-set-2. However, this method did not improve the performance
for the other languages of test-set-2, nor for test-set-1. We obtained overall
accuracies of 32.4% and 19.7% for test-set-1 and test-set-2 respectively.</p>
      <p>It is observed from Table 5 that the maximum accuracy obtained for test-set-2
is 37%. This may be due to the small size of the data set used for training the models;
thus, we obtained a very low accuracy. The data set size may be increased
using Generative Adversarial Networks (GAN) to improve the performance.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>We have presented a machine learning approach for identifying the Indian native
language, namely Bengali, Hindi, Kannada, Malayalam, Tamil or Telugu, from
English comments posted in social media. We have presented three
variations of our approach, namely term-frequency without feature selection, TF-IDF
without feature selection, and TF-IDF with χ² feature selection, for the language
identification task. The data set of the INLI@FIRE2018 shared task is used to
evaluate our approach. We submitted our second and third variations to the
task, and obtained overall accuracies of 46.1% and 34.3% for our first
run on test-set-1 and test-set-2 respectively. We obtained overall accuracies
of 32.4% and 19.7% for our second run on test-set-1 and test-set-2 respectively.
Our feature selection improved the F-measure for Hindi and Tamil on
test-set-2. However, it did not improve the results for the other languages. Since our approach is
language agnostic, we have not included any character level features at present.
These features may be considered in future to improve the performance of the NLI
task. The performance may also be improved by incorporating word embedding
techniques in future. Due to the small size of the training data set, we obtained very low
accuracy. The data set size may be increased in future by using Generative Adversarial
Networks (GAN) to improve the performance.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Anand</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Barathi</given-names>
            <surname>Ganesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Soman</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.P.</surname>
          </string-name>
          :
          <article-title>Overview of the INLI@FIRE2018 track on Indian native language identification</article-title>
          .
          <source>In: In workshop proceedings of FIRE 2018. CEUR Workshop Proceedings</source>
          , Gandhinagar, India, December 6-
          <issue>9</issue>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Anand</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Barathi</given-names>
            <surname>Ganesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Shivkaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Soman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <surname>P.</surname>
          </string-name>
          :
          <article-title>Overview of the INLI PAN at FIRE-2017 track on Indian native language identification</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          <year>2036</year>
          ,
          <volume>99</volume>
          –
          <fpage>105</fpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bharathi</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anirudh</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhuvana</surname>
          </string-name>
          , J.:
          <string-name>
            <surname>Bharathi</surname>
            <given-names>SSN</given-names>
          </string-name>
          @
          <article-title>INLI-FIRE-2017: SVM based approach for Indian native language identification</article-title>
          .
          <source>In: FIRE-Working Notes</source>
          . pp.
          <volume>110</volume>
          –
          <issue>112</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bhargava</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arora</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sharma</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Bits pilani@ INLI-FIRE-2017: Indian native language identification using deep learning</article-title>
          .
          <source>In: FIRE-Working Notes</source>
          . pp.
          <volume>123</volume>
          –
          <issue>126</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Bykh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meurers</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Exploring syntactic features for native language identification: A variationist perspective on feature encoding and ensemble optimization</article-title>
          .
          <source>In: Proc. of COLING</source>
          <year>2014</year>
          ,
          <source>the 25th Int. Conf. on Computational Linguistics: Technical Papers</source>
          . pp.
          <year>1962</year>
          –
          <year>1973</year>
          . Dublin, Ireland (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Estival</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaustad</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hutchinson</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pham</surname>
            ,
            <given-names>S.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Author profiling for English emails</article-title>
          .
          <source>In: Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics</source>
          . pp.
          <volume>263</volume>
          –
          <fpage>272</fpage>
          . ACL,
          <string-name>
            <surname>Australia</surname>
          </string-name>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gibbons</surname>
          </string-name>
          , J.:
          <article-title>Forensic linguistics: An introduction to language in the justice system</article-title>
          . Wiley-Blackwell (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ionescu</surname>
          </string-name>
          , R.T.,
          <string-name>
            <surname>Popescu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cahill</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Can characters reveal your native language? a language-independent approach to native language identification</article-title>
          .
          <source>In: Proc. of the 2014 Conf. on Empirical Methods in NLP (EMNLP)</source>
          . pp.
          <volume>1363</volume>
          –
          <fpage>1373</fpage>
          .
          <string-name>
            <surname>ACL</surname>
          </string-name>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duppada</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hiray</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Seernet@ INLI-FIRE-2017: Hierarchical ensemble for Indian native language identification</article-title>
          .
          <source>In: FIRE-Working Notes</source>
          . pp.
          <volume>127</volume>
          –
          <issue>129</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Janaki</given-names>
            <surname>Meena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Chandran</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.</surname>
          </string-name>
          :
          <article-title>Naive bayes text classification with positive features selected by statistical method</article-title>
          .
          <source>In: Int. Conf. on Autonomic Computing and Communications</source>
          ,
          <string-name>
            <surname>ICAC</surname>
          </string-name>
          <year>2009</year>
          . pp.
          <volume>28</volume>
          –
          <fpage>33</fpage>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Jarvis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bestgen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pepper</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Maximizing classification accuracy in native language identification</article-title>
          .
          <source>In: Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications</source>
          . pp.
          <volume>111</volume>
          –
          <fpage>118</fpage>
          . ACL, Atlanta, Georgia (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Kosmajac</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keselj</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Dalteam@ INLI-FIRE-2017: Native language identification using SVM with SGD training</article-title>
          .
          <source>In: FIRE-Working Notes</source>
          . pp.
          <volume>118</volume>
          –
          <issue>122</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Lakshmi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shambhavi</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>BMSCE ISE@ INLI-FIRE-2017: A simple n-gram based approach for native language identification</article-title>
          .
          <source>In: FIRE-Working Notes</source>
          . pp.
          <volume>115</volume>
          –
          <issue>117</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Li</given-names>
            <surname>Yanjun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.L.</given-names>
            ,
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.M.:</surname>
          </string-name>
          <article-title>Text clustering with feature selection by using statistical data</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>20</volume>
          (
          <issue>5</issue>
          ),
          <volume>641</volume>
          –
          <fpage>652</fpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dras</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Native language identification using stacked generalization</article-title>
          .
          <source>arXiv preprint arXiv:1703.06541</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Mohammadi</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Veisi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amini</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Native language identification using a mixture of character and word n-grams</article-title>
          .
          <source>In: Proc. of the 12th Workshop on Innovative Use of NLP for Building Educational Applications</source>
          . pp.
          <fpage>210</fpage>
          –
          <lpage>216</lpage>
          . ACL, Copenhagen, Denmark (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Nayel</surname>
            ,
            <given-names>H.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shashirekha</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Mangalore-University@INLI-FIRE-2017: Indian native language identification using support vector machines and ensemble approach</article-title>
          .
          <source>In: FIRE-Working Notes</source>
          . pp.
          <fpage>106</fpage>
          –
          <lpage>109</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the 5th author profiling task at PAN 2017: Gender and language variety identification in Twitter</article-title>
          .
          <source>Working Notes Papers of the CLEF</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Rozovskaya</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Algorithm selection and model adaptation for ESL correction tasks</article-title>
          .
          <source>In: Proceedings of the 49th Annual Meeting of the ACL: Human Language Technologies-Volume</source>
          <volume>1</volume>
          . pp.
          <fpage>924</fpage>
          –
          <lpage>933</lpage>
          . ACL, Portland, Oregon, USA (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Tetreault</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blanchard</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cahill</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chodorow</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Native tongues, lost and found: Resources and empirical evaluations in native language identification</article-title>
          .
          <source>Proceedings of COLING 2012</source>
          pp.
          <fpage>2585</fpage>
          –
          <lpage>2602</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Thenmozhi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kannan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aravindan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>SSN NLP@INLI-FIRE-2017: A neural network approach to Indian native language identification</article-title>
          .
          <source>In: FIRE-Working Notes</source>
          . pp.
          <fpage>113</fpage>
          –
          <lpage>114</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Thenmozhi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mirunalini</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aravindan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Decision tree approach for consumer health information search</article-title>
          .
          <source>In: FIRE-Working Notes</source>
          . pp.
          <fpage>221</fpage>
          –
          <lpage>225</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Thenmozhi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mirunalini</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aravindan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Feature engineering and characterization of classifiers for consumer health information search</article-title>
          .
          <source>In: Forum for Information Retrieval Evaluation</source>
          . pp.
          <fpage>182</fpage>
          –
          <lpage>196</lpage>
          . Springer (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>