<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the INLI PAN at FIRE-2017 Track on Indian Native Language Identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anand Kumar M</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Barathi Ganesh HB</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shivkaran Singh</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Soman KP</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Rosso</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>PRHLT Research Center, Universitat Politècnica de València</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Center for Computational Engineering and Networking (CEN), Amrita School of Engineering</institution>
          ,
          <addr-line>Coimbatore, Amrita Vishwa Vidyapeetham</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This overview paper describes the first shared task on Indian Native Language Identification (INLI) that was organized at FIRE 2017. Given a corpus of English comments from the Facebook pages of various newspapers, the objective of the task is to identify the native language of the comment authors among the following six Indian languages: Bengali, Hindi, Kannada, Malayalam, Tamil, and Telugu. Altogether, 26 approaches of 13 different teams are evaluated. In this paper, we give an overview of the approaches and discuss the results that they have obtained.</p>
      </abstract>
      <kwd-group>
        <kwd>Author Profiling</kwd>
        <kwd>Indian Languages</kwd>
        <kwd>Native Language Identification</kwd>
        <kwd>Social Media</kwd>
        <kwd>Text Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Computing methodologies → Natural language
processing; Language resources; Feature selection;</p>
    </sec>
    <sec id="sec-2">
      <title>1 INTRODUCTION</title>
      <p>
        Native Language Identification (NLI) is a fascinating and rapidly
growing sub-field in Natural Language Processing. In the
framework of the author profiling shared tasks that have been organized
at PAN1, language variety identification was addressed in 2017 at
CLEF [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. NLI, instead, requires automatically identifying the native
language (L1) of an author on the basis of the way she writes in
another language (L2) that she has learned. Just as her accent may help in
identifying whether or not she is a native speaker of a language,
the way she uses the language when she writes
may unveil patterns that can help in identifying her native language
[
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. From a cybersecurity viewpoint, NLI can help to determine
the native language of the author of a suspicious or threatening text.
      </p>
      <p>
        The native language influences the usage of words as well as the
errors that a person makes when writing in another language [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
NLI systems can identify the writing patterns that are based on
the author’s linguistic background. NLI has many applications and
studying the language transfer from a forensic linguistics viewpoint
is certainly one of the most important. The first shared task on
native language identification was organized in 2013 [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. The
organizers made available a large text corpus for this task. Other
works also approach the problem of native language identification using
speech transcripts [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. In the Indian languages context, this
is the first NLI shared task. In India there are currently 22 official
languages, with English as an additional official language. In this
shared task, we focus on identifying the native language of Indian
authors writing comments in English. We considered six languages,
namely, Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu for
the shared task.
      </p>
      <p>Since comments over the internet are usually written in social
media, the corpora used for the shared task was acquired from
Facebook. English comments from Facebook pages of famous regional
language newspapers were crawled. These comments were further
preprocessed in order to remove code-mixed and mixed scripts
comments from the corpus. In the following sections we present some
related work (Section 2), we describe the corpus collection (Section
3), we give an overview of the submitted approaches (Section 4),
and we show the results that were obtained (Section 5). Finally,
in Section 6 we draw some conclusions.</p>
    </sec>
    <sec id="sec-3">
      <title>2 RELATED WORK</title>
      <p>
        As said in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], one of the earliest works on identifying native
language was by Tomokiyo and Jones (2001) [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] where the authors
used Naive Bayes to discriminate non-native from native
statements in English. Koppel et al. (2005) [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] approached the problem
by using stylistic, syntactic and lexical features. They also noticed
that the use of character n-grams, part-of-speech bi-grams and
function words allowed them to obtain better results. Tsur and Rappoport
(2007) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] achieved an accuracy of about 66% by using only
character bi-grams. They assumed that the native language phonology
influences the choice of words while writing in a second language.
      </p>
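      <p>To make the character bi-gram feature concrete, here is a minimal, purely illustrative sketch of bi-gram extraction in the spirit of Tsur and Rappoport (the example string is invented):</p>

```python
from collections import Counter

def char_bigrams(text):
    """Count character bi-grams of a lowercased comment.

    Counts like these, aggregated over a document, are the only features
    Tsur and Rappoport needed to reach about 66% accuracy.
    """
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))

counts = char_bigrams("the then")
print(counts["th"])  # "th" occurs twice
```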
      <p>
        Estival et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] used English emails of authors with different
native languages. They achieved an accuracy of 84% using a
Random Forest classifier with character, lexical, and structural features.
Wong and Dras [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] pointed out that mistakes made by authors
writing in a second language are influenced by their native language.
They proposed the use of syntactic features such as subject-verb
disagreement, noun-number disagreement, and improper use of
determiners to help in determining the native language of a writer.
In their later work [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], they also investigated the usefulness of
parse structures for identifying the native language. Brooke and
Hirst [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] used word-to-word translation from L1 to L2 to create
mappings which are the result of language transfer. They used this
information in their unsupervised approach.
      </p>
      <p>
        Torney et. al [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] used psycho-linguistic features for NLI.
Syntactic features were also shown to play a significant role in determining the
native language. Other interesting studies in the NLI field are [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In 2013 a shared task was organized on NLI [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. The
organizers provided a large corpus which allowed comparison among
different approaches. In 2014 a related shared task was organized
on Discriminating between Similar Languages (DSL2) [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. The
organizers provided six groups of 13 different languages, with each
group containing similar languages. In 2017 another shared task on
NLI was organized. The corpus was composed of essays and
transcripts of utterances. Ensemble methods and meta-classifiers
with syntactic/lexical features were the most effective systems [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <!-- Table 1 and Table 2 (corpus statistics: language, #XML docs, #sentences, #words, #unique words) are not recoverable here. -->
    </sec>
    <sec id="sec-4">
      <title>3 INLI-2017 CORPUS</title>
      <p>
        Many corpora have been created from social media (Facebook,
Twitter and WhatsApp) for performing language modeling [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
information retrieval tasks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and code-mixed sentiment analysis
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. A monolingual corpus based on the TOEFL3 data is available
for performing the NLI task for Indian languages such as Hindi
and Telugu [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The INLI-2017 corpus includes English comments
of Facebook users, whose native language is one among the
following: Bengali (BE), Hindi (HI), Kannada (KA), Malayalam (MA),
Tamil (TA) and Telugu (TE). The dataset collection is based on the
assumption that only native speakers will read native language
newspapers. To the best of our knowledge, this is the first corpus
for native language identification for Indian languages. The detailed
corpus statistics are given in Table 1 and Table 2.
      </p>
      <p>The texts for this corpus have been collected from the users
comments in the regional newspapers and news channel Facebook
pages. Around 50 Facebook pages were selected and comments
written in English were extracted from these pages. The training
data were collected in the period from April 2017 to July 2017.
The test data were collected later on. It was expected that
participants would focus on native language-based stylistic features. As
a result, we removed code-mixed comments and comments related
to regional topics (regional leaders and comments mentioning
the names of regional places). Comments with common keywords
discussed across the regions were kept in order to avoid topic bias.
These common keywords included Modi, note-ban, different
sports personalities, the army, national issues, government policies, etc.
Finally, the collected dataset was randomized and written to XML
files to avoid user bias.</p>
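      <p>The removal of mixed-script comments can be illustrated with a short, hypothetical filter that keeps only comments whose alphabetic characters all belong to the Latin script (romanized code-mixing requires additional word-level language identification and is not handled by this sketch):</p>

```python
import unicodedata

def latin_script_only(comment):
    """Return True if every alphabetic character belongs to the Latin
    script, i.e. the comment contains no Devanagari, Tamil, Telugu,
    etc. characters."""
    return all(
        unicodedata.name(ch, "").startswith("LATIN")
        for ch in comment
        if ch.isalpha()
    )

# Invented examples: the second mixes Telugu script into an English comment.
comments = ["good move by the government", "this is చాలా బాగుంది"]
english_only = [c for c in comments if latin_script_only(c)]
print(english_only)
```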
      <p>From Table 1 and Table 2, it can be observed that except for
BE and MA, the remaining languages have nearly the same ratio
of average words per sentence. It is also visible that the test data
was properly normalized so that the average number of words per
sentence and the average number of unique words per sentence are
comparable to the training data. The variance
between the average of words per sentence and the average of unique
words per sentence for the training and the test data is shown
in Figure 1 and Figure 2, respectively. This corpus will be made
available after the FIRE 2017 conference on the web page of our
NLP group website4.</p>
    </sec>
    <sec id="sec-5">
      <title>4 OVERVIEW OF THE SUBMITTED APPROACHES</title>
      <p>Initially, 56 teams registered for the INLI shared task at FIRE, and
finally 13 of them submitted a total of 26 runs. Moreover, 8 of them
submitted their system description working notes5. We analysed
their approaches from three perspectives: preprocessing, features
to represent the author’s texts, and classification approaches.</p>
    </sec>
    <sec id="sec-7">
      <title>4.1 Preprocessing</title>
      <p>
        Most of the participants did not perform any preprocessing [
        <xref ref-type="bibr" rid="ref13 ref18 ref2 ref26 ref7">2, 7, 13,
18, 26</xref>
        ]. Others normalised the text by removing emoji, special
characters, digits, hashtags, mentions and links [
        <xref ref-type="bibr" rid="ref1 ref12 ref22">1, 12, 22</xref>
        ]. Stop words
were removed using the NLTK stop words package6, other resources7
and a manually collected stop word list [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Whitespace-based
tokenization was carried out by all participants except [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The
participants in [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] handled shortened words (terms such as n’t, &amp;,
’m and ’ll were replaced with ’not’, ’and’, ’am’ and ’will’, respectively).
      </p>
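      <p>The reported preprocessing steps can be combined into one small, hypothetical normaliser (the regular expressions and the replacement table below are illustrative assumptions, not taken from any participant's system):</p>

```python
import re

# Shortened forms handled in [22]; this particular mapping is an assumption.
SHORT_FORMS = {"n't": " not", "'ll": " will", "'m": " am", "&": " and "}

def normalise(comment):
    """Expand short forms, strip links/mentions/hashtags, digits and
    special characters, then tokenize on whitespace."""
    for short, full in SHORT_FORMS.items():
        comment = comment.replace(short, full)
    comment = re.sub(r"https?://\S+|[@#]\w+", " ", comment)  # links, mentions, hashtags
    comment = re.sub(r"[^A-Za-z\s]", " ", comment)           # digits, emoji, punctuation
    return comment.lower().split()

print(normalise("I'll don't agree @user #tag http://x.co 123"))
```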
    </sec>
    <sec id="sec-8">
      <title>4.2 Features</title>
      <p>
        Two of the participants directly used Term Frequency-Inverse
Document Frequency (TF-IDF) weights as their features [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ],
non-English words and noun chunks were taken as features while
computing TF-IDF [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], and character n-grams of order 2-5 and word
n-grams of order 1-2 were used as features while computing
the TF-IDF vocabulary [
        <xref ref-type="bibr" rid="ref12 ref13 ref7">7, 12, 13</xref>
        ]. Only the non-English word counts
have been taken as features in [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. Nouns and adjectives
have been taken as features in [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. Part-of-speech n-grams and average
word and sentence length have been used as features in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
Distributional representations of words (pre-trained word vectors)
have been used in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
4http://nlp.amrita.edu:8080/nlpcorpus.html
5The ClassyPy team did not submit any working notes, although a brief description of the approach was sent by email.
6http://www.nltk.org/book/ch02.html
7pypi.python.org/pypi/stop-words
      </p>
    </sec>
    <sec id="sec-9">
      <title>4.3 Classification Approaches</title>
      <p>
        A Support Vector Machine (SVM) has been used as a classifier by most
of the participants [
        <xref ref-type="bibr" rid="ref1 ref12 ref13 ref2 ref7">1, 2, 7, 12, 13</xref>
        ]. Two of the participants followed
an ensemble-based classification, with Multinomial Naive Bayes, SVM and
Random Forest as the base classifiers in [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] and Logistic
Regression, SVM, Ridge Classifier and Multi-Layer Perceptron (MLP)
as the base classifiers in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Other than this, the authors in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] used
Logistic Regression, the authors in [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] used Naive Bayes, the authors
in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] used a hierarchical attention architecture with bidirectional
Gated Recurrent Unit (GRU) cells, and the authors in [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] employed a
neural network classifier with 2 hidden layers, Rectified Linear Units
(ReLU) as the activation function and Stochastic Gradient Descent
(SGD) as the optimizer.
      </p>
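      <p>The recipe shared by most submissions, character 2-5 and word 1-2 n-gram TF-IDF features fed to a linear SVM, can be sketched as follows (the four comments and their labels are invented toy data; the actual systems were trained on the INLI-2017 corpus):</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.svm import LinearSVC

# Toy training data: one invented comment per native language.
comments = [
    "do the needful and revert",
    "kindly adjust maadi",
    "super machan vera level",
    "chalo theek hai bhai",
]
labels = ["TE", "KA", "TA", "HI"]  # hypothetical L1 labels

# Character and word n-gram TF-IDF features, concatenated.
features = FeatureUnion([
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))),
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
])
model = make_pipeline(features, LinearSVC()).fit(comments, labels)
print(model.predict(["kindly adjust maadi"]))
```

Several of the top systems wrapped a pipeline of this kind in an ensemble, e.g. with Naive Bayes and Random Forest as additional base classifiers.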
    </sec>
    <sec id="sec-10">
        <title>5 EXPERIMENTS AND RESULTS</title>
      <!-- The per-language ranking tables referenced below (Tables 5-8) are not recoverable here. -->
      <sec id="sec-10-8">
        <p>The maximum F-measure scored is 50.3%, which is 2.3% greater than the baseline. The lowest F-measure
scored for this language is 15.4% and this is 32.6% less than the
baseline.</p>
        <p>The ranking of the systems submitted for Malayalam (MA) is
given in Table 6. The maximum F-measure scored for this language
is 51.9%, which is 0.9% greater than the baseline. Among all
the other languages, this is the lowest variation with respect to the
baseline. The lowest F-measure scored for this language is 1.8% and
this is 49.2% less than the baseline.</p>
        <p>The ranking of the submitted systems for Tamil (TA) is given in
Table 7. The maximum F-measure scored for this language is 58.0%,
which is 12.0% greater than the baseline. The lowest F-measure
scored for this language is 13.2% and this is 32.8% less than the
baseline.</p>
        <p>The ranking of the systems submitted for Telugu (TE) is given
in Table 8. The maximum F-measure scored for this language is
50.5%, which is 8.5% greater than the baseline system. The lowest
F-measure scored for this language is 2.4% and this is 39.6% less
than baseline.</p>
        <p>The rank of the results per language is given in Table 9. The team_CEC
system did not identify any language apart from Hindi. The overall
ranking of the submitted systems is given in Table 10. The maximum
accuracy scored is 48.8%, which is 5.3% greater than the baseline.</p>
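        <p>The per-language F-measures reported in the tables follow the usual definition; a minimal sketch with invented gold and predicted labels (not the official evaluation script):</p>

```python
def f_measure(gold, pred, language):
    """Harmonic mean of precision and recall for one language."""
    tp = sum(g == p == language for g, p in zip(gold, pred))
    fp = sum(p == language and g != language for g, p in zip(gold, pred))
    fn = sum(g == language and p != language for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = ["HI", "TA", "HI", "MA"]
pred = ["HI", "HI", "HI", "MA"]
print(f_measure(gold, pred, "HI"))  # precision 2/3, recall 1.0, F-measure 0.8
```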
        <p>The lowest accuracy scored is 17.8% and this is 25.2% less than the
baseline.</p>
        <!-- Tables 9 and 10 (per-language and overall rankings) are not recoverable here. -->
      </sec>
    </sec>
    <sec id="sec-11">
      <title>6 CONCLUSION</title>
      <p>In this paper we presented the INLI-2017 corpus, we briefly described
the approaches of the 13 teams that participated in the Indian
Native Language Identification task at FIRE 2017, and the results that
they obtained. The participants had to identify the native language
of the authors of English comments collected from the Facebook pages
of various newspapers and television channels. Six native
languages have been addressed: Bengali, Hindi,
Kannada, Malayalam, Tamil and Telugu. Code-mixed comments and
comments related to regional topics were removed from the
corpus, and comments with common keywords discussed across
the regions were kept in order to avoid possible topic biases.</p>
      <p>The participants used different feature sets to address the
problem: content-based (among others: bag of words, character n-grams,
word n-grams, term vectors, word embeddings, non-English words)
and style-based (among others: word frequencies, POS n-grams,
noun and adjective POS tag counts). From the field of deep learning,
two-layer neural networks with document vectors built from TF-IDF
and Recurrent Neural Networks (RNN) with word embeddings
were used. However, the deep learning approaches
obtained lower accuracy than the baseline.</p>
      <p>Overall, the best performing system obtained an accuracy of
48.8%, which is 5.8% greater than the baseline. Overall, four of the
systems performed better than the baseline. These systems
used the following features: character and word n-grams,
non-English words, and noun chunks. It is notable that all these systems
used TF-IDF for representing the features. The lowest overall
accuracy was 17.8%, which is 25.2% less than the baseline. Among
the top performing systems, two of them used an ensemble method
and all of them employed SVM. As future work, we believe
that native language identification should also be addressed taking into
account socio-linguistic features to improve further.</p>
    </sec>
    <sec id="sec-12">
      <title>ACKNOWLEDGEMENT</title>
      <p>Our special thanks go to F. Rangel, all of INLI’s participants, and the
students of the Computational Engineering and Networking
Department for their efforts and time in developing the INLI-2017 corpus. The
work of the last author was carried out in the framework of the SomEMBED
TIN2015-71147-C2-1-P MINECO research project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Hamada A.</given-names>
            <surname>Nayel</surname>
          </string-name>
          and
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Shashirekha</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Indian Native Language Identification using Support Vector Machines and Ensemble Approach.</article-title>
          .
          <source>In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation</source>
          , Bangalore, India,
          <fpage>8th</fpage>
          -
          <lpage>10th</lpage>
          December.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Bharathi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Anirudh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Bhuvana</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>SVM based approach for Indian native language identification</article-title>
          ..
          <source>In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation</source>
          , Bangalore, India,
          <fpage>8th</fpage>
          -
          <lpage>10th</lpage>
          December.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Rupal</given-names>
            <surname>Bhargava</surname>
          </string-name>
          , Jaspreet Singh,
          <string-name>
            <given-names>Shivangi</given-names>
            <surname>Arora</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yashvardhan</given-names>
            <surname>Sharma</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Indian Native Language Identification using Deep Learning.</article-title>
          .
          <source>In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation</source>
          , Bangalore, India,
          <fpage>8th</fpage>
          -
          <lpage>10th</lpage>
          December.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Julian</given-names>
            <surname>Brooke</surname>
          </string-name>
          and
          <string-name>
            <given-names>Graeme</given-names>
            <surname>Hirst</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Measuring Interlanguage: Native Language Identification with L1-influence Metrics.</article-title>
          .
          <source>In LREC</source>
          .
          <fpage>779</fpage>
          -
          <lpage>784</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Serhiy</given-names>
            <surname>Bykh</surname>
          </string-name>
          and
          <string-name>
            <given-names>Detmar</given-names>
            <surname>Meurers</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Exploring Syntactic Features for Native Language Identification: A Variationist Perspective on Feature Encoding and Ensemble Optimization.</article-title>
          .
          <source>In COLING</source>
          .
          <fpage>1962</fpage>
          -
          <lpage>1973</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Kunal</given-names>
            <surname>Chakma and Amitava Das</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Cmir: A corpus for evaluation of code mixed information retrieval of hindi-english tweets</article-title>
          .
          <source>Computación y Sistemas</source>
          <volume>20</volume>
          ,
          <issue>3</issue>
          (
          <year>2016</year>
          ),
          <fpage>425</fpage>
          -
          <lpage>434</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>[7] Christel and Mike</source>
          .
          <year>2016</year>
          .
          <article-title>Participation at the Indian Native language Identification task</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Dominique</given-names>
            <surname>Estival</surname>
          </string-name>
          , Tanja Gaustad, Son Bao Pham, Will Radford, and
          <string-name>
            <given-names>Ben</given-names>
            <surname>Hutchinson</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Author profiling for English emails</article-title>
          .
          <source>In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics</source>
          .
          <fpage>263</fpage>
          -
          <lpage>272</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Anupam</given-names>
            <surname>Jamatia</surname>
          </string-name>
          , Björn Gambäck, and
          <string-name>
            <surname>Amitava Das</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Collecting and Annotating Indian Social Media Code-Mixed Corpora</article-title>
          .
          <source>In the 17th International Conference on Intelligent Text Processing and Computational Linguistics</source>
          .
          <fpage>3</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Aditya</given-names>
            <surname>Joshi</surname>
          </string-name>
          , Ameya Prabhu, Manish Shrivastava, and
          <string-name>
            <given-names>Vasudeva</given-names>
            <surname>Varma</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Towards Sub-Word Level Compositions for Sentiment Analysis of Hindi-English Code Mixed Text.</article-title>
          .
          <source>In COLING</source>
          .
          <fpage>2482</fpage>
          -
          <lpage>2491</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Moshe</given-names>
            <surname>Koppel</surname>
          </string-name>
          , Jonathan Schler, and
          <string-name>
            <given-names>Kfir</given-names>
            <surname>Zigdon</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Automatically determining an anonymous author's native language</article-title>
          .
          <source>Intelligence and Security Informatics</source>
          (
          <year>2005</year>
          ),
          <fpage>41</fpage>
          -
          <lpage>76</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Dijana</given-names>
            <surname>Kosmajac</surname>
          </string-name>
          and
          <string-name>
            <given-names>Vlado</given-names>
            <surname>Keselj</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Native Language Identification using SVM with SGD Training</article-title>
          .
          <source>In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation</source>
          , Bangalore, India,
          8th-10th December.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Sowmya Lakshmi</surname>
            <given-names>B S</given-names>
          </string-name>
          and
          <string-name>
            <surname>Shambhavi</surname>
            <given-names>B R</given-names>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A simple n-gram based approach for Native Language Identification: FIRE NLI shared task 2017</article-title>
          .
          <source>In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation</source>
          , Bangalore
          , India,
          8th-10th December.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Shervin</given-names>
            <surname>Malmasi</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Native language identification: explorations and applications</article-title>
          . Sydney, Australia: Macquarie University (
          <year>2016</year>
          ). http://hdl.handle.net/1959.14/1110919
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Shervin</given-names>
            <surname>Malmasi</surname>
          </string-name>
          , Keelan Evanini, Aoife Cahill, Joel Tetreault, Robert Pugh, Christopher Hamill, Diane Napolitano, and
          <string-name>
            <given-names>Yao</given-names>
            <surname>Qian</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A Report on the 2017 Native Language Identification Shared Task</article-title>
          .
          <source>In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications</source>
          .
          <fpage>62</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Sergiu</given-names>
            <surname>Nisioi</surname>
          </string-name>
          , Ella Rabinovich, Liviu P Dinu, and
          <string-name>
            <given-names>Shuly</given-names>
            <surname>Wintner</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>A Corpus of Native, Non-native and Translated Texts</article-title>
          .
          <source>In LREC.</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Francisco</given-names>
            <surname>Rangel</surname>
          </string-name>
          , Paolo Rosso,
          <string-name>
            <given-names>Martin</given-names>
            <surname>Potthast</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Benno</given-names>
            <surname>Stein</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Overview of the 5th author profiling task at PAN 2017: Gender and language variety identification in Twitter</article-title>
          .
          <source>Working Notes Papers of the CLEF</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Venkatesh</given-names>
            <surname>Duppada</surname>
          </string-name>
          , Royal Jain,
          and
          <string-name>
            <given-names>Sushant</given-names>
            <surname>Hiray</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Hierarchical Ensemble for Indian Native Language Identification</article-title>
          .
          <source>In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation</source>
          , Bangalore, India,
          8th-10th December.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Bernard</given-names>
            <surname>Smith</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Learner English: A teacher's guide to interference and other problems</article-title>
          . Ernst Klett Sprachen.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Joel</given-names>
            <surname>Tetreault</surname>
          </string-name>
          , Daniel Blanchard, Aoife Cahill, and
          <string-name>
            <given-names>Martin</given-names>
            <surname>Chodorow</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Native tongues, lost and found: Resources and empirical evaluations in native language identification</article-title>
          .
          <source>Proceedings of COLING</source>
          <year>2012</year>
          (
          <year>2012</year>
          ),
          <fpage>2585</fpage>
          -
          <lpage>2602</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Joel R</given-names>
            <surname>Tetreault</surname>
          </string-name>
          , Daniel Blanchard, and
          <string-name>
            <given-names>Aoife</given-names>
            <surname>Cahill</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>A Report on the First Native Language Identification Shared Task</article-title>
          .
          <source>In BEA@ NAACL-HLT</source>
          .
          <fpage>48</fpage>
          -
          <lpage>57</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>D.</given-names>
            <surname>Thenmozhi</surname>
          </string-name>
          , Kawshik Kannan, and
          <string-name>
            <given-names>Chandrabose</given-names>
            <surname>Aravindan</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A Neural Network Approach to Indian Native Language Identification</article-title>
          .
          <source>In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation</source>
          , Bangalore, India,
          8th-10th December.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Laura</given-names>
            <surname>Mayfield</surname>
          </string-name>
          Tomokiyo and
          <string-name>
            <given-names>Rosie</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>You're not from 'round here, are you?: Naive Bayes detection of non-native utterance text</article-title>
          .
          <source>In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies. Association for Computational Linguistics</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Rosemary</given-names>
            <surname>Torney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Vamplew</surname>
          </string-name>
          , and
          <string-name>
            <given-names>John</given-names>
            <surname>Yearwood</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Using psycholinguistic features for profiling first language of authors</article-title>
          .
          <source>Journal of the Association for Information Science and Technology 63</source>
          ,
          <issue>6</issue>
          (
          <year>2012</year>
          ),
          <fpage>1256</fpage>
          -
          <lpage>1269</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Oren</given-names>
            <surname>Tsur</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ari</given-names>
            <surname>Rappoport</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Using classifier features for studying the effect of native language on the choice of written second language words</article-title>
          .
          <source>In Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition. Association for Computational Linguistics</source>
          ,
          <fpage>9</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Ajay P</given-names>
            <surname>Victor</surname>
          </string-name>
          and
          <string-name>
            <given-names>K</given-names>
            <surname>Manju</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Indian Native Language Identification</article-title>
          .
          <source>In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation</source>
          , Bangalore, India,
          8th-10th December.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Sze-Meng Jojo</given-names>
            <surname>Wong</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Dras</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Contrastive analysis and native language identification</article-title>
          .
          <source>In Proceedings of the Australasian Language Technology Association Workshop</source>
          .
          <fpage>53</fpage>
          -
          <lpage>61</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Sze-Meng Jojo</given-names>
            <surname>Wong</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Dras</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Exploiting parse structures for native language identification</article-title>
          .
          <source>In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics</source>
          ,
          <fpage>1600</fpage>
          -
          <lpage>1610</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Sze-Meng Jojo</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mark</given-names>
            <surname>Dras</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Johnson</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Exploring adaptor grammars for native language identification</article-title>
          .
          <source>In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics</source>
          ,
          <fpage>699</fpage>
          -
          <lpage>709</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Marcos</given-names>
            <surname>Zampieri</surname>
          </string-name>
          , Alina Maria Ciobanu, and Liviu P Dinu.
          <year>2017</year>
          .
          <article-title>Native Language Identification on Text and Speech</article-title>
          .
          <source>arXiv preprint arXiv:1707.07182</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Marcos</given-names>
            <surname>Zampieri</surname>
          </string-name>
          , Liling Tan, Nikola Ljubešić, and
          <string-name>
            <given-names>Jörg</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>A report on the DSL shared task 2014</article-title>
          .
          <source>In Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects (VarDial)</source>
          .
          <fpage>58</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>