Overview of the INLI PAN at FIRE-2017 Track on Indian Native Language Identification

Anand Kumar M, Barathi Ganesh HB, Shivkaran Singh and Soman KP
Center for Computational Engineering and Networking (CEN)
Amrita School of Engineering, Coimbatore
Amrita Vishwa Vidyapeetham, India

Paolo Rosso
PRHLT Research Center
Universitat Politècnica de València, Spain

ABSTRACT

This overview paper describes the first shared task on Indian Native Language Identification (INLI) that was organized at FIRE 2017. Given a corpus with comments in English from the Facebook pages of various newspapers, the objective of the task is to identify the native language of the comment authors among the following six Indian languages: Bengali, Hindi, Kannada, Malayalam, Tamil, and Telugu. Altogether, 26 approaches of 13 different teams are evaluated. In this paper, we give an overview of the approaches and discuss the results that they have obtained.

CCS CONCEPTS

• Computing methodologies → Natural language processing; Language resources; Feature selection;

KEYWORDS

Author Profiling, Indian Languages, Native Language Identification, Social Media, Text Classification

1 INTRODUCTION

Native Language Identification (NLI) is a fascinating and rapidly growing sub-field of Natural Language Processing. In the framework of the author profiling shared tasks that have been organized at PAN (http://pan.webis.de), language variety identification was addressed in 2017 at CLEF [17]. NLI instead requires automatically identifying the native language (L1) of an author on the basis of the way she writes in another language (L2) that she learned. Just as her accent may help in identifying whether or not she is a native speaker of a language, the way she uses the language when she writes may unveil patterns that can help in identifying her native language [19]. From a cybersecurity viewpoint, NLI can help to determine the native language of the author of a suspicious or threatening text.

The native language influences the usage of words as well as the errors that a person makes when writing in another language [19]. NLI systems can identify writing patterns that are based on the author's linguistic background. NLI has many applications; studying language transfer from a forensic linguistics viewpoint is certainly one of the most important. The first shared task on native language identification was organized in 2013 [21], and its organizers made available a large text corpus for the task. Other works approach the problem of native language identification using speech transcripts as well [30]. In the Indian languages context, this is the first NLI shared task. India currently has 22 official languages, with English as an additional official language. In this shared task, we focus on identifying the native language of Indian authors writing comments in English. We considered six languages for the shared task, namely Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu.

Since comments over the internet are usually written on social media, the corpus used for the shared task was acquired from Facebook: English comments were crawled from the Facebook pages of famous regional-language newspapers. These comments were further preprocessed in order to remove code-mixed and mixed-script comments from the corpus. In the following sections we present some related work (Section 2), describe the corpus collection (Section 3), give an overview of the submitted approaches (Section 4), and present the results that were obtained (Section 5). Finally, in Section 6 we draw some conclusions.

2 RELATED WORK

As noted in [14], one of the earliest works on identifying native language was by Tomokiyo and Jones (2001) [23], where the authors used Naive Bayes to discriminate non-native from native statements in English. Koppel et al. (2005) [11] approached the problem by using stylistic, syntactic and lexical features. They also noticed that the use of character n-grams, part-of-speech bi-grams and function words led to better results. Tsur and Rappoport (2007) [25] achieved an accuracy of about 66% by using only character bi-grams; they assumed that the phonology of the native language influences the choice of words when writing in a second language. Estival et al. [8] used English emails of authors with different native languages.
They achieved an accuracy of 84% using a Random Forest classifier with character, lexical, and structural features. Wong and Dras [27] pointed out that the mistakes made by authors writing in a second language are influenced by their native language. They proposed the use of syntactic features such as subject-verb disagreement, noun-number disagreement, and improper use of determiners to help in determining the native language of a writer. In their later work [28], they also investigated the usefulness of parse structures for identifying the native language. Brooke and Hirst [4] used word-to-word translation of L1 to L2 to create mappings which are the result of language transfer, and used this information in their unsupervised approach.

Torney et al. [24] used psycho-linguistic features for NLI. Syntactic features have also been shown to play a significant role in determining the native language. Other interesting studies in the NLI field are [29], [20] and [5]. In 2013 a shared task was organized on NLI [21]; the organizers provided a large corpus which allowed comparison among different approaches. In 2014 a related shared task was organized on Discriminating between Similar Languages (DSL, http://corporavm.uni-koeln.de/vardial/sharedtask.html) [31]. The organizers provided 13 different languages organized into six groups of similar languages. In 2017 another shared task on NLI was organized; its corpus was composed of essays and transcripts of utterances. Ensemble methods and meta-classifiers with syntactic/lexical features were the most effective systems [15].

| Language | # XML docs | # Sentences | # Words | # Unique words | Avg. # words per doc | Avg. # words per sentence | Avg. # unique words per doc | Avg. # unique words per sentence |
|---|---|---|---|---|---|---|---|---|
| BE | 202 | 1616 | 37623 | 8180 | 186.3 | 23.3 | 40.5 | 5.1 |
| HI | 211 | 1688 | 28983 | 6285 | 137.4 | 17.2 | 29.9 | 3.7 |
| KA | 203 | 1624 | 45738 | 8740 | 225.3 | 28.2 | 43.1 | 5.4 |
| MA | 200 | 1600 | 47167 | 8854 | 235.8 | 29.5 | 44.3 | 5.5 |
| TA | 207 | 1656 | 34606 | 6716 | 167.2 | 20.9 | 32.4 | 4.1 |
| TE | 210 | 1680 | 49176 | 8483 | 234.1 | 29.3 | 40.4 | 5.0 |

Table 1: Training data statistics

| Language | # XML docs | # Sentences | # Words | # Unique words | Avg. # words per doc | Avg. # words per sentence | Avg. # unique words per doc | Avg. # unique words per sentence |
|---|---|---|---|---|---|---|---|---|
| BE | 185 | 1480 | 26653 | 5647 | 144.1 | 18.0 | 30.5 | 3.8 |
| HI | 251 | 2008 | 37232 | 6616 | 148.3 | 18.5 | 26.4 | 3.3 |
| KA | 74 | 592 | 12225 | 3477 | 165.2 | 20.7 | 46.9 | 5.9 |
| MA | 92 | 736 | 16805 | 4658 | 182.7 | 22.8 | 50.6 | 6.3 |
| TA | 100 | 800 | 14780 | 4192 | 147.8 | 18.5 | 41.9 | 5.2 |
| TE | 81 | 648 | 14692 | 3989 | 181.4 | 22.7 | 49.2 | 6.2 |

Table 2: Test data statistics
3 INLI-2017 CORPUS

Many corpora have been created from social media (Facebook, Twitter and WhatsApp) for performing language modeling [9], information retrieval tasks [6], and code-mixed sentiment analysis [10]. A monolingual corpus based on TOEFL (https://www.ets.org/toefl) data is available for performing the NLI task for Indian languages such as Hindi and Telugu [16]. The INLI-2017 corpus includes English comments of Facebook users whose native language is one of the following: Bengali (BE), Hindi (HI), Kannada (KA), Malayalam (MA), Tamil (TA) and Telugu (TE). The dataset collection is based on the assumption that only native speakers will read native-language newspapers. To the best of our knowledge, this is the first corpus for native language identification for Indian languages. The detailed corpus statistics are given in Table 1 and Table 2.

The texts for this corpus have been collected from user comments on the Facebook pages of regional newspapers and news channels. Around 50 Facebook pages were selected, and the comments written in English were extracted from these pages. The training data were collected in the period from April 2017 to July 2017; the test data were collected later. Since participants were expected to focus on native-language-related stylistic features, we removed code-mixed comments and comments related to regional topics (regional leaders, and comments mentioning the names of regional places). Comments with common keywords discussed across the regions were kept in order to avoid topic bias; the common keywords observed were Modi, note-ban, different sports personalities, the army, national issues, government policies, etc. Finally, the collected dataset was randomized and written to the XML files in random order to avoid user bias.

From Table 1 and Table 2 it can be observed that, except for BE and MA, the languages have nearly the same average number of words per sentence. It is also visible that the test data were properly normalized with respect to the average number of words per sentence and the average number of unique words per sentence. The variance between the training and the test data in the average words per sentence and in the average unique words per sentence is shown in Figure 1 and Figure 2, respectively. This corpus will be made available after the FIRE 2017 conference on the web page of our NLP group (http://nlp.amrita.edu:8080/nlpcorpus.html).

[Figure 1: Variance between training and test corpus (average words per sentence)]
[Figure 2: Variance between training and test corpus (average unique words per sentence)]
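As an illustration of how the per-language statistics in Tables 1 and 2 can be derived, the following is a minimal Python sketch. The directory layout, the `<document>` tag name and the sentence-splitting rule are our assumptions for illustration; the exact corpus schema is not specified here.

```python
import glob
import re
import xml.etree.ElementTree as ET

def corpus_stats(xml_dir):
    """Compute Table 1/2-style statistics for one language's XML files."""
    n_docs = n_sents = n_words = 0
    vocab = set()
    for path in glob.glob(f"{xml_dir}/*.xml"):
        n_docs += 1
        # Assumption: each comment is stored in a <document> element.
        text = " ".join(el.text or "" for el in ET.parse(path).iter("document"))
        # Crude sentence split on terminal punctuation (also an assumption).
        n_sents += len([s for s in re.split(r"[.!?]+", text) if s.strip()])
        tokens = text.lower().split()
        n_words += len(tokens)
        vocab.update(tokens)
    return {
        "docs": n_docs,
        "sentences": n_sents,
        "words": n_words,
        "unique_words": len(vocab),
        "avg_words_per_doc": n_words / n_docs,
        "avg_words_per_sentence": n_words / n_sents,
    }
```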
4 OVERVIEW OF THE SUBMITTED APPROACHES

Initially, 56 teams registered for the INLI shared task at FIRE, and finally 13 of them submitted a total of 26 runs. Moreover, 8 of them submitted system description working notes (the ClassyPy team did not submit any working notes, although a brief description of their approach was sent by email). We analysed the submitted approaches from three perspectives: preprocessing, the features used to represent the authors' texts, and the classification approaches.

4.1 Preprocessing

Most of the participants did not perform any preprocessing [2, 7, 13, 18, 26]. Others normalised the text by removing emoji, special characters, digits, hashtags, mentions and links [1, 12, 22]. Stop words were removed using the NLTK stop words package (http://www.nltk.org/book/ch02.html), other resources (pypi.python.org/pypi/stop-words) and a manually collected stop word list [1]. Whitespace-based tokenization was carried out by all participants except [7]. The participant [22] handled shortened words: terms such as n't, &, 'm and 'll were replaced with 'not', 'and', 'am' and 'will', respectively.

4.2 Features

Two of the participants directly used Term Frequency-Inverse Document Frequency (TF-IDF) weights as their features [1, 2]. Non-English words and noun chunks were taken as the features when computing TF-IDF in [18], while character n-grams of order 2-5 and word n-grams of order 1-2 were used as the TF-IDF vocabulary in [7, 12, 13]. Only the non-English word counts were taken as features in [26]. Noun and adjective words were taken as features in [22]. Part-of-speech n-grams and the average word and sentence lengths were used as features in [7]. Distributional representations of words (pre-trained word vectors) were also used in [7].

4.3 Classification Approaches

Support Vector Machines (SVM) were used as the classifier by most of the participants [1, 2, 7, 12, 13]. Two of the participants followed ensemble-based classification, with Multinomial Naive Bayes, SVM and Random Forest as the base classifiers in [22], and Logistic Regression, SVM, a Ridge Classifier and a Multi-Layer Perceptron (MLP) as the base classifiers in [18]. Apart from these, the authors of [7] used Logistic Regression, the authors of [26] used Naive Bayes, the authors of [3] used a hierarchical attention architecture with bidirectional Gated Recurrent Unit (GRU) cells, and the authors of [22] also employed a neural network classifier with two hidden layers, Rectified Linear Unit (ReLU) activations and Stochastic Gradient Descent (SGD) as the optimizer.
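To make the preprocessing of Section 4.1 concrete, below is a minimal Python sketch of the normalisation steps reported by the participants (emoji/special-character/digit/hashtag/mention/link removal, NLTK stop word removal, and the expansion of shortened forms described in [22]). The exact regular expressions are our assumptions, not the participants' code.

```python
import re
from nltk.corpus import stopwords  # requires: nltk.download('stopwords')

# Expansions of shortened forms, as described in [22].
SHORT_FORMS = {"n't": " not", "&": " and ", "'m": " am", "'ll": " will"}

def normalise(comment: str) -> str:
    text = comment.lower()
    for short, full in SHORT_FORMS.items():
        text = text.replace(short, full)
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # links
    text = re.sub(r"[@#]\w+", " ", text)                # mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)               # digits, emoji, special characters
    stops = set(stopwords.words("english"))
    return " ".join(tok for tok in text.split() if tok not in stops)
```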
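The most common feature/classifier combination across Sections 4.2 and 4.3 was TF-IDF over character n-grams of order 2-5 and word n-grams of order 1-2, fed to an SVM [7, 12, 13]. A minimal sketch with scikit-learn follows; `train_texts`, `train_labels` and `test_texts` are placeholder names for the comments and their language codes, not names from any participant's system.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

clf = Pipeline([
    ("tfidf", FeatureUnion([
        ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))),  # character 2-5 grams
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),     # word uni/bi-grams
    ])),
    ("svm", LinearSVC()),  # linear SVM, as used by most participants
])

# Usage (placeholders): labels are 'BE', 'HI', 'KA', 'MA', 'TA', 'TE'.
# clf.fit(train_texts, train_labels)
# predictions = clf.predict(test_texts)
```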
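The ensemble of [22] (Multinomial Naive Bayes, SVM and Random Forest as base classifiers) can be sketched as a voting ensemble over TF-IDF features. The hard-voting scheme and all hyper-parameters below are our assumptions, since the working notes are only summarised here.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

ensemble = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("vote", VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("svm", LinearSVC()),
            ("rf", RandomForestClassifier(n_estimators=100)),
        ],
        voting="hard",  # majority vote; LinearSVC exposes no predict_proba for soft voting
    )),
])
```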
5 EXPERIMENTS AND RESULTS

Accuracy was used as the measure to evaluate the overall performance of the systems (http://www.nltk.org/_modules/nltk/metrics/scores.html). The baseline system used Term Frequency-Inverse Document Frequency (TF-IDF) features with an SVM with linear kernel and default parameters.

Each team was allowed to submit up to three systems, and for the final ranking the best performing system of each team is considered. We have evaluated 26 submissions from 13 participants. The submissions are ranked per language, and the final ranking is based on the overall accuracy of the systems across all the languages.
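A minimal reconstruction of this baseline with scikit-learn is given below. The organizers do not state their implementation, so the library choice, the placeholder argument names and the reporting code are our assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

LABELS = ["BE", "HI", "KA", "MA", "TA", "TE"]

def run_baseline(train_texts, train_labels, test_texts, test_labels):
    """TF-IDF features + linear-kernel SVM with default parameters."""
    baseline = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("svm", SVC(kernel="linear")),
    ])
    baseline.fit(train_texts, train_labels)
    predictions = baseline.predict(test_texts)
    print("overall accuracy:", accuracy_score(test_labels, predictions))
    # Per-language precision/recall/F-measure, as reported in Tables 3-8.
    print(classification_report(test_labels, predictions, labels=LABELS))
```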
The ranking of the submitted systems for Bengali (BE) is given in Table 3. The maximum F-measure scored for this language is 71.9%, which is 4.9% above the baseline system. The lowest F-measure scored for this language is 7.6%, which is 59.4% below the baseline; among all the languages, this is the highest variation with respect to the baseline.

| Team | Run | P | R | F | Rank |
|---|---|---|---|---|---|
| IDRBT | 1 | 96.4 | 57.3 | 71.9 | 1 |
| MANGALORE | 1 | 56.5 | 79.5 | 66.1 | 2 |
| MANGALORE | 2 | 54.0 | 84.9 | 66.0 | 2 |
| MANGALORE | 3 | 59.2 | 78.4 | 67.4 | 2 |
| DalTeam | 1 | 56.2 | 83.2 | 67.1 | 3 |
| SEERNET | 1 | 59.4 | 70.3 | 64.4 | 3 |
| SEERNET | 2 | 57.6 | 74.1 | 64.8 | 3 |
| SEERNET | 3 | 60.7 | 75.1 | 67.1 | 3 |
| Baseline | - | 58.0 | 79.0 | 67.0 | - |
| Bharathi_SSN | 1 | 50.3 | 80.5 | 62.0 | 4 |
| SSN_NLP | 1 | 46.2 | 76.2 | 57.6 | 5 |
| Anuj | 1 | 56.6 | 50.8 | 53.6 | 6 |
| Anuj | 2 | 56.5 | 47.0 | 51.3 | 6 |
| Anuj | 3 | 45.5 | 18.9 | 26.7 | 6 |
| ClassyPy | 1 | 67.9 | 40.0 | 50.3 | 7 |
| ClassyPy | 2 | 66.7 | 40.0 | 50.0 | 7 |
| ClassyPy | 3 | 40.6 | 22.2 | 28.7 | 7 |
| DIG (IIT-Hyd) | 1 | 55.2 | 45.9 | 50.1 | 7 |
| DIG (IIT-Hyd) | 2 | 55.6 | 45.9 | 50.3 | 7 |
| DIG (IIT-Hyd) | 3 | 45.5 | 10.8 | 17.5 | 7 |
| Bits_Pilani | 1 | 39.7 | 15.7 | 22.5 | 8 |
| Bits_Pilani | 2 | 56.3 | 38.4 | 45.7 | 8 |
| Bits_Pilani | 3 | 39.4 | 23.2 | 29.3 | 8 |
| BMSCE_ISE | 1 | 40.0 | 29.2 | 33.8 | 9 |
| BMSCE_ISE | 2 | 38.9 | 55.1 | 45.6 | 9 |
| JUNLP | 1 | 8.3 | 7.0 | 7.6 | 10 |
| team_CEC | 1 | 0.0 | 0.0 | 0.0 | 11 |

Table 3: BE-NLI results (precision, recall and F-measure in %)

The ranking of the systems submitted for Hindi (HI) is given in Table 4. The maximum F-measure scored for this language is 48.6%, which is 27.6% above the baseline. The lowest F-measure scored for this language is 0.8%, which is 20.2% below the baseline; this is the lowest F-measure across all the languages.

| Team | Run | P | R | F | Rank |
|---|---|---|---|---|---|
| team_CEC | 1 | 32.1 | 100.0 | 48.6 | 1 |
| Anuj | 1 | 59.8 | 19.5 | 29.4 | 2 |
| Anuj | 2 | 52.4 | 17.5 | 26.3 | 2 |
| Anuj | 3 | 41.8 | 27.5 | 33.2 | 2 |
| JUNLP | 1 | 26.1 | 37.8 | 30.9 | 3 |
| DIG (IIT-Hyd) | 1 | 49.5 | 19.1 | 27.6 | 4 |
| DIG (IIT-Hyd) | 2 | 50.0 | 19.5 | 28.1 | 4 |
| DIG (IIT-Hyd) | 3 | 34.1 | 11.6 | 17.3 | 4 |
| SSN_NLP | 1 | 49.4 | 16.3 | 24.6 | 5 |
| ClassyPy | 1 | 50.6 | 15.5 | 23.8 | 6 |
| ClassyPy | 2 | 43.7 | 15.1 | 22.5 | 6 |
| ClassyPy | 3 | 30.2 | 11.6 | 16.7 | 6 |
| DalTeam | 1 | 69.2 | 14.3 | 23.8 | 6 |
| Bits_Pilani | 1 | 24.0 | 19.5 | 21.5 | 7 |
| Bits_Pilani | 2 | 23.9 | 6.8 | 10.6 | 7 |
| Bits_Pilani | 3 | 19.0 | 8.8 | 12.0 | 7 |
| Baseline | - | 57.0 | 13.0 | 21.0 | - |
| SEERNET | 1 | 50.0 | 9.6 | 16.1 | 8 |
| SEERNET | 2 | 50.0 | 8.4 | 14.3 | 8 |
| SEERNET | 3 | 54.8 | 9.2 | 15.7 | 8 |
| MANGALORE | 1 | 60.7 | 6.8 | 12.2 | 9 |
| MANGALORE | 2 | 60.0 | 7.2 | 12.8 | 9 |
| MANGALORE | 3 | 66.7 | 4.8 | 8.9 | 9 |
| BMSCE_ISE | 1 | 50.0 | 0.8 | 1.6 | 10 |
| BMSCE_ISE | 2 | 54.5 | 7.2 | 12.7 | 10 |
| Bharathi_SSN | 1 | 51.9 | 5.6 | 10.1 | 11 |
| IDRBT | 1 | 25.0 | 0.4 | 0.8 | 12 |

Table 4: HI-NLI results (precision, recall and F-measure in %)

The ranking of the submitted systems for Kannada (KA) is given in Table 5. The maximum F-measure scored for this language is 50.3%, which is 2.3% above the baseline. The lowest F-measure scored for this language is 15.4%, which is 32.6% below the baseline.

| Team | Run | P | R | F | Rank |
|---|---|---|---|---|---|
| DalTeam | 1 | 40.5 | 66.2 | 50.3 | 1 |
| SEERNET | 1 | 38.1 | 71.6 | 49.8 | 2 |
| SEERNET | 2 | 37.1 | 62.2 | 46.5 | 2 |
| SEERNET | 3 | 37.0 | 68.9 | 48.1 | 2 |
| Baseline | - | 39.0 | 64.0 | 48.0 | - |
| IDRBT | 1 | 40.0 | 59.5 | 47.8 | 3 |
| MANGALORE | 1 | 38.4 | 58.1 | 46.2 | 4 |
| MANGALORE | 2 | 40.4 | 54.1 | 46.2 | 4 |
| MANGALORE | 3 | 34.8 | 64.9 | 45.3 | 4 |
| Bharathi_SSN | 1 | 33.3 | 64.9 | 44.0 | 5 |
| SSN_NLP | 1 | 39.6 | 48.6 | 43.6 | 6 |
| Bits_Pilani | 1 | 30.4 | 45.9 | 36.6 | 7 |
| Bits_Pilani | 2 | 26.0 | 45.9 | 33.2 | 7 |
| Bits_Pilani | 3 | 20.8 | 59.5 | 30.8 | 7 |
| ClassyPy | 1 | 22.2 | 77.0 | 34.4 | 8 |
| ClassyPy | 2 | 23.7 | 77.0 | 36.2 | 8 |
| ClassyPy | 3 | 19.7 | 60.8 | 29.7 | 8 |
| DIG (IIT-Hyd) | 1 | 21.8 | 59.5 | 31.9 | 9 |
| DIG (IIT-Hyd) | 2 | 21.7 | 59.5 | 31.8 | 9 |
| DIG (IIT-Hyd) | 3 | 21.1 | 40.5 | 27.8 | 9 |
| Anuj | 1 | 19.4 | 40.5 | 26.2 | 10 |
| Anuj | 2 | 20.3 | 41.9 | 27.3 | 10 |
| Anuj | 3 | 27.5 | 14.9 | 19.3 | 10 |
| BMSCE_ISE | 1 | 11.7 | 27.0 | 16.3 | 11 |
| BMSCE_ISE | 2 | 19.0 | 44.6 | 26.6 | 11 |
| JUNLP | 1 | 17.9 | 13.5 | 15.4 | 12 |
| team_CEC | 1 | 0.0 | 0.0 | 0.0 | 13 |

Table 5: KA-NLI results (precision, recall and F-measure in %)

The ranking of the systems submitted for Malayalam (MA) is given in Table 6. The maximum F-measure scored for this language is 51.9%, which is 0.9% above the baseline; among all the languages, this is the lowest variation with respect to the baseline. The lowest F-measure scored for this language is 1.8%, which is 49.2% below the baseline.

| Team | Run | P | R | F | Rank |
|---|---|---|---|---|---|
| MANGALORE | 1 | 40.4 | 70.7 | 51.4 | 1 |
| MANGALORE | 2 | 42.7 | 66.3 | 51.9 | 1 |
| MANGALORE | 3 | 32.6 | 78.3 | 46.0 | 1 |
| Baseline | - | 42.0 | 65.0 | 51.0 | - |
| DalTeam | 1 | 46.7 | 54.3 | 50.3 | 2 |
| SEERNET | 1 | 38.5 | 59.8 | 46.8 | 3 |
| SEERNET | 2 | 41.0 | 64.1 | 50.0 | 3 |
| SEERNET | 3 | 39.5 | 53.3 | 45.4 | 3 |
| Bharathi_SSN | 1 | 36.4 | 60.9 | 45.5 | 4 |
| ClassyPy | 1 | 34.3 | 53.3 | 41.7 | 5 |
| ClassyPy | 2 | 34.5 | 52.2 | 41.6 | 5 |
| ClassyPy | 3 | 33.7 | 31.5 | 32.6 | 5 |
| Anuj | 1 | 48.4 | 33.7 | 39.7 | 6 |
| Anuj | 2 | 51.7 | 32.6 | 40.0 | 6 |
| Anuj | 3 | 26.7 | 21.7 | 24.0 | 6 |
| DIG (IIT-Hyd) | 1 | 37.9 | 39.1 | 38.5 | 7 |
| DIG (IIT-Hyd) | 2 | 37.5 | 39.1 | 38.3 | 7 |
| DIG (IIT-Hyd) | 3 | 21.4 | 19.6 | 20.5 | 7 |
| Bits_Pilani | 1 | 20.0 | 28.3 | 23.4 | 8 |
| Bits_Pilani | 2 | 15.5 | 31.5 | 20.8 | 8 |
| Bits_Pilani | 3 | 39.4 | 34.6 | 36.8 | 8 |
| IDRBT | 1 | 18.1 | 84.8 | 29.9 | 9 |
| BMSCE_ISE | 1 | 17.3 | 64.1 | 27.2 | 10 |
| BMSCE_ISE | 2 | 22.3 | 31.5 | 26.1 | 10 |
| SSN_NLP | 1 | 31.7 | 21.7 | 25.8 | 11 |
| team_CEC | 1 | 100.0 | 1.1 | 2.2 | 12 |
| JUNLP | 1 | 5.3 | 1.1 | 1.8 | 13 |

Table 6: MA-NLI results (precision, recall and F-measure in %)

The ranking of the submitted systems for Tamil (TA) is given in Table 7. The maximum F-measure scored for this language is 58.0%, which is 12.0% above the baseline. The lowest F-measure scored for this language is 13.2%, which is 32.8% below the baseline.

| Team | Run | P | R | F | Rank |
|---|---|---|---|---|---|
| MANGALORE | 1 | 58.0 | 58.0 | 58.0 | 1 |
| MANGALORE | 2 | 58.0 | 58.0 | 58.0 | 1 |
| MANGALORE | 3 | 54.4 | 49.0 | 51.6 | 1 |
| SEERNET | 1 | 50.4 | 59.0 | 54.4 | 2 |
| SEERNET | 2 | 47.9 | 57.0 | 52.1 | 2 |
| SEERNET | 3 | 46.8 | 59.0 | 52.2 | 2 |
| Bharathi_SSN | 1 | 48.6 | 51.0 | 49.8 | 3 |
| DalTeam | 1 | 51.1 | 48.0 | 49.5 | 4 |
| Baseline | - | 42.0 | 50.0 | 46.0 | - |
| ClassyPy | 1 | 41.0 | 41.0 | 41.0 | 5 |
| ClassyPy | 2 | 38.7 | 43.0 | 40.8 | 5 |
| ClassyPy | 3 | 30.4 | 41.0 | 34.9 | 5 |
| Anuj | 1 | 28.3 | 63.0 | 39.0 | 6 |
| Anuj | 2 | 27.3 | 66.0 | 38.6 | 6 |
| Anuj | 3 | 14.3 | 57.0 | 22.9 | 6 |
| DIG (IIT-Hyd) | 1 | 33.3 | 45.0 | 38.3 | 7 |
| DIG (IIT-Hyd) | 2 | 32.8 | 44.0 | 37.6 | 7 |
| DIG (IIT-Hyd) | 3 | 17.6 | 74.0 | 28.4 | 7 |
| SSN_NLP | 1 | 27.5 | 49.0 | 35.3 | 8 |
| Bits_Pilani | 1 | 26.6 | 37.0 | 31.0 | 9 |
| Bits_Pilani | 2 | 22.5 | 39.0 | 28.6 | 9 |
| Bits_Pilani | 3 | 21.5 | 40.0 | 28.0 | 9 |
| BMSCE_ISE | 1 | 53.3 | 8.0 | 13.9 | 10 |
| BMSCE_ISE | 2 | 21.8 | 26.0 | 23.7 | 10 |
| IDRBT | 1 | 81.2 | 13.0 | 22.4 | 11 |
| JUNLP | 1 | 10.2 | 19.0 | 13.2 | 12 |
| team_CEC | 1 | 0.0 | 0.0 | 0.0 | 13 |

Table 7: TA-NLI results (precision, recall and F-measure in %)

The ranking of the systems submitted for Telugu (TE) is given in Table 8. The maximum F-measure scored for this language is 50.5%, which is 8.5% above the baseline system. The lowest F-measure scored for this language is 2.4%, which is 39.6% below the baseline.

| Team | Run | P | R | F | Rank |
|---|---|---|---|---|---|
| IDRBT | 1 | 43.4 | 60.5 | 50.5 | 1 |
| SEERNET | 1 | 37.6 | 54.3 | 44.4 | 2 |
| SEERNET | 2 | 37.1 | 53.1 | 43.7 | 2 |
| SEERNET | 3 | 37.1 | 56.8 | 44.9 | 2 |
| ClassyPy | 1 | 40.2 | 48.1 | 43.8 | 3 |
| ClassyPy | 2 | 39.4 | 45.7 | 42.3 | 3 |
| ClassyPy | 3 | 30.1 | 50.6 | 37.8 | 3 |
| Baseline | - | 40.0 | 44.0 | 42.0 | - |
| DalTeam | 1 | 33.3 | 55.6 | 41.7 | 4 |
| MANGALORE | 1 | 32.8 | 49.4 | 39.4 | 5 |
| MANGALORE | 2 | 32.5 | 48.1 | 38.8 | 5 |
| MANGALORE | 3 | 39.4 | 34.6 | 36.8 | 5 |
| Anuj | 1 | 34.4 | 39.5 | 36.8 | 6 |
| Anuj | 2 | 32.6 | 37.0 | 34.7 | 6 |
| Anuj | 3 | 28.6 | 9.9 | 14.7 | 6 |
| Bits_Pilani | 1 | 28.8 | 44.4 | 35.0 | 7 |
| Bits_Pilani | 2 | 30.5 | 35.8 | 33.0 | 7 |
| Bits_Pilani | 3 | 43.6 | 29.6 | 35.3 | 7 |
| Bharathi_SSN | 1 | 40.4 | 28.4 | 33.3 | 8 |
| DIG (IIT-Hyd) | 1 | 29.0 | 35.8 | 32.0 | 9 |
| DIG (IIT-Hyd) | 2 | 29.3 | 35.8 | 32.2 | 9 |
| DIG (IIT-Hyd) | 3 | 57.1 | 4.9 | 9.1 | 9 |
| BMSCE_ISE | 1 | 26.7 | 38.3 | 31.5 | 10 |
| BMSCE_ISE | 2 | 15.4 | 12.3 | 13.7 | 10 |
| SSN_NLP | 1 | 27.0 | 21.0 | 23.6 | 11 |
| JUNLP | 1 | 100.0 | 1.2 | 2.4 | 12 |
| team_CEC | 1 | 0.0 | 0.0 | 0.0 | 13 |

Table 8: TE-NLI results (precision, recall and F-measure in %)

The per-language ranks of the teams are given in Table 9; the team_CEC system did not identify any language apart from Hindi. The overall ranking of the submitted systems is given in Table 10.

| Team | BE | HI | KA | MA | TA | TE |
|---|---|---|---|---|---|---|
| DalTeam | 3 | 6 | 1 | 2 | 4 | 4 |
| MANGALORE | 2 | 9 | 4 | 1 | 1 | 5 |
| SEERNET | 3 | 8 | 2 | 3 | 2 | 2 |
| Bharathi_SSN | 4 | 11 | 5 | 4 | 3 | 8 |
| SSN_NLP | 5 | 5 | 6 | 11 | 8 | 11 |
| ClassyPy | 7 | 6 | 8 | 5 | 5 | 3 |
| Anuj | 6 | 2 | 11 | 6 | 6 | 6 |
| IDRBT | 1 | 12 | 3 | 9 | 11 | 1 |
| DIG (IIT-Hyd) | 7 | 4 | 9 | 7 | 7 | 9 |
| team_CEC | 11 | 1 | 13 | 12 | 13 | 13 |
| Bits_Pilani | 8 | 7 | 7 | 8 | 9 | 7 |
| BMSCE_ISE | 9 | 10 | 10 | 10 | 10 | 10 |
| JUNLP | 10 | 3 | 12 | 13 | 12 | 12 |

Table 9: NLI results: rank per language

| Team | Run | Accuracy | Rank |
|---|---|---|---|
| DalTeam | 1 | 48.8 | 1 |
| MANGALORE | 1 | 47.3 | 2 |
| MANGALORE | 2 | 47.6 | 2 |
| MANGALORE | 3 | 45.2 | 2 |
| SEERNET | 1 | 46.6 | 3 |
| SEERNET | 2 | 46.4 | 3 |
| SEERNET | 3 | 46.9 | 3 |
| Bharathi_SSN | 1 | 43.6 | 4 |
| Baseline | - | 43.0 | - |
| SSN_NLP | 1 | 38.8 | 5 |
| ClassyPy | 1 | 38.2 | 6 |
| ClassyPy | 2 | 37.9 | 6 |
| ClassyPy | 3 | 28.9 | 6 |
| Anuj | 1 | 38.2 | 6 |
| Anuj | 2 | 36.8 | 6 |
| Anuj | 3 | 25.5 | 6 |
| IDRBT | 1 | 37.2 | 7 |
| DIG (IIT-Hyd) | 1 | 36.7 | 8 |
| DIG (IIT-Hyd) | 2 | 36.7 | 8 |
| DIG (IIT-Hyd) | 3 | 22.3 | 8 |
| team_CEC | 1 | 32.2 | 9 |
| Bits_Pilani | 1 | 26.9 | 10 |
| Bits_Pilani | 2 | 28.0 | 10 |
| Bits_Pilani | 3 | 26.7 | 10 |
| BMSCE_ISE | 1 | 22.2 | 11 |
| BMSCE_ISE | 2 | 27.8 | 11 |
| JUNLP | 1 | 17.8 | 12 |

Table 10: Overall results (accuracy in %)

Overall, the best performing system obtained an accuracy of 48.8%, which is 5.8% above the baseline, while the lowest overall accuracy was 17.8%, which is 25.2% below the baseline. Four of the systems performed better than the baseline; these systems used character and word n-grams, non-English words, and noun chunks as features, and it is notable that all of them used TF-IDF to represent the features. Among the top performing systems, two used an ensemble method, and all of them employed SVM. As future work, we believe that native language identification should also take socio-linguistic features into account in order to improve further.

6 CONCLUSION

In this paper we presented the INLI-2017 corpus, briefly described the approaches of the 13 teams that participated in the Indian Native Language Identification task at FIRE 2017, and discussed the results that they obtained. The participants had to identify the native language of the authors of English comments collected from the Facebook pages of various newspapers and television channels. Six native languages were addressed: Bengali, Hindi, Kannada, Malayalam, Tamil and Telugu. Code-mixed comments and comments related to regional topics were removed from the corpus, and comments with common keywords discussed across the regions were kept in order to avoid possible topic biases.
The participants used different feature sets to address the problem: content-based features (among others: bag of words, character n-grams, word n-grams, term vectors, word embeddings, non-English words) and stylistic features (among others: word frequencies, POS n-grams, noun and adjective POS tag counts). From the field of deep learning, two-layer neural networks with document vectors built from TF-IDF, as well as Recurrent Neural Networks (RNN) with word embeddings, were used; however, the deep learning approaches obtained lower accuracy than the baseline.

ACKNOWLEDGEMENT

Our special thanks go to F. Rangel, to all of INLI's participants, and to the students of the Computational Engineering and Networking Department for their efforts and time in developing the INLI-2017 corpus. The work of the last author was carried out in the framework of the SomEMBED TIN2015-71147-C2-1-P MINECO research project.

REFERENCES

[1] Hamada A. Nayel and H. L. Shashirekha. 2017. Indian Native Language Identification using Support Vector Machines and Ensemble Approach. In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation, Bangalore, India, 8th-10th December.
[2] B. Bharathi, M. Anirudh, and J. Bhuvana. 2017. SVM based approach for Indian native language identification. In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation, Bangalore, India, 8th-10th December.
[3] Rupal Bhargava, Jaspreet Singh, Shivangi Arora, and Yashvardhan Sharma. 2017. Indian Native Language Identification using Deep Learning. In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation, Bangalore, India, 8th-10th December.
[4] Julian Brooke and Graeme Hirst. 2012. Measuring Interlanguage: Native Language Identification with L1-influence Metrics. In LREC. 779-784.
[5] Serhiy Bykh and Detmar Meurers. 2014. Exploring Syntactic Features for Native Language Identification: A Variationist Perspective on Feature Encoding and Ensemble Optimization. In COLING. 1962-1973.
[6] Kunal Chakma and Amitava Das. 2016. CMIR: A corpus for evaluation of code mixed information retrieval of Hindi-English tweets. Computación y Sistemas 20, 3 (2016), 425-434.
[7] Christel and Mike. 2016. Participation at the Indian Native Language Identification task.
[8] Dominique Estival, Tanja Gaustad, Son Bao Pham, Will Radford, and Ben Hutchinson. 2007. Author profiling for English emails. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics. 263-272.
[9] Anupam Jamatia, Björn Gambäck, and Amitava Das. 2016. Collecting and Annotating Indian Social Media Code-Mixed Corpora. In the 17th International Conference on Intelligent Text Processing and Computational Linguistics. 3-9.
[10] Aditya Joshi, Ameya Prabhu, Manish Shrivastava, and Vasudeva Varma. 2016. Towards Sub-Word Level Compositions for Sentiment Analysis of Hindi-English Code Mixed Text. In COLING. 2482-2491.
[11] Moshe Koppel, Jonathan Schler, and Kfir Zigdon. 2005. Automatically determining an anonymous author's native language. Intelligence and Security Informatics (2005), 41-76.
[12] Dijana Kosmajac and Vlado Keselj. 2017. Native Language Identification using SVM with SGD Training. In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation, Bangalore, India, 8th-10th December.
[13] Sowmya Lakshmi B S and Shambhavi B R. 2017. A simple n-gram based approach for Native Language Identification: FIRE NLI shared task 2017. In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation, Bangalore, India, 8th-10th December.
[14] Shervin Malmasi. 2016. Native language identification: explorations and applications. Sydney, Australia: Macquarie University (2016). http://hdl.handle.net/1959.14/1110919
[15] Shervin Malmasi, Keelan Evanini, Aoife Cahill, Joel Tetreault, Robert Pugh, Christopher Hamill, Diane Napolitano, and Yao Qian. 2017. A Report on the 2017 Native Language Identification Shared Task. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications. 62-75.
[16] Sergiu Nisioi, Ella Rabinovich, Liviu P. Dinu, and Shuly Wintner. 2016. A Corpus of Native, Non-native and Translated Texts. In LREC.
[17] Francisco Rangel, Paolo Rosso, Martin Potthast, and Benno Stein. 2017. Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter. Working Notes Papers of the CLEF (2017).
[18] Venkatesh Duppada, Royal Jain, and Sushant Hiray. 2017. Hierarchical Ensemble for Indian Native Language Identification. In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation, Bangalore, India, 8th-10th December.
[19] Bernard Smith. 2001. Learner English: A teacher's guide to interference and other problems. Ernst Klett Sprachen.
[20] Joel Tetreault, Daniel Blanchard, Aoife Cahill, and Martin Chodorow. 2012. Native tongues, lost and found: Resources and empirical evaluations in native language identification. Proceedings of COLING 2012 (2012), 2585-2602.
[21] Joel R. Tetreault, Daniel Blanchard, and Aoife Cahill. 2013. A Report on the First Native Language Identification Shared Task. In BEA@NAACL-HLT. 48-57.
[22] D. Thenmozhi, Kawshik Kannan, and Chandrabose Aravindan. 2017. A Neural Network Approach to Indian Native Language Identification. In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation, Bangalore, India, 8th-10th December.
[23] Laura Mayfield Tomokiyo and Rosie Jones. 2001. You're not from 'round here, are you?: Naive Bayes detection of non-native utterance text. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies. Association for Computational Linguistics, 1-8.
[24] Rosemary Torney, Peter Vamplew, and John Yearwood. 2012. Using psycholinguistic features for profiling first language of authors. Journal of the Association for Information Science and Technology 63, 6 (2012), 1256-1269.
[25] Oren Tsur and Ari Rappoport. 2007. Using classifier features for studying the effect of native language on the choice of written second language words. In Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition. Association for Computational Linguistics, 9-16.
[26] Ajay P. Victor and K. Manju. 2017. Indian Native Language Identification. In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation, Bangalore, India, 8th-10th December.
[27] Sze-Meng Jojo Wong and Mark Dras. 2009. Contrastive analysis and native language identification. In Proceedings of the Australasian Language Technology Association Workshop. 53-61.
[28] Sze-Meng Jojo Wong and Mark Dras. 2011. Exploiting parse structures for native language identification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1600-1610.
[29] Sze-Meng Jojo Wong, Mark Dras, and Mark Johnson. 2012. Exploring adaptor grammars for native language identification. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 699-709.
[30] Marcos Zampieri, Alina Maria Ciobanu, and Liviu P. Dinu. 2017. Native Language Identification on Text and Speech. arXiv preprint arXiv:1707.07182 (2017).
[31] Marcos Zampieri, Liling Tan, Nikola Ljubešić, and Jörg Tiedemann. 2014. A report on the DSL shared task 2014. In Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects (VarDial). 58-67.