<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Mangalore University INLI@FIRE2018: Artificial Neural Network and Ensemble Based Models for INLI</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hamada A. Nayel</string-name>
          <email>hamada.ali@fci.bu.edu.eg</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>H. L. Shashirekha</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Faculty of Computers and Informatics, Benha University</institution>
          ,
          <addr-line>Benha</addr-line>
          ,
          <country country="EG">Egypt</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, Mangalore University</institution>
          ,
          <addr-line>Mangalore</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, the systems submitted by the Mangalore University team for the Indian Native Language Identification (INLI) task are described. Native Language Identification (NLI) has different applications such as social media analysis, authorship identification, second language acquisition and forensic investigation. We submitted three systems using an Artificial Neural Network (ANN) model and an Ensemble approach. All three submitted systems achieved the same accuracy of 35.30% and secured the second rank over all submissions for the task.</p>
      </abstract>
      <kwd-group>
        <kwd>Artificial Neural Network</kwd>
        <kwd>Native Language Identification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Native Language Identification (NLI) aims at identifying the native language
(L1) of users from text or speech produced in another, later-learned language (L2). NLI is
an important task that has many applications in different areas such as
social media analysis, authorship identification, second language acquisition and
forensic investigation. In forensic analysis [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], NLI helps to glean information about
the discriminant L1 cues in an anonymous text. Second Language Acquisition
(SLA) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] studies the transfer effects of native languages on a later-learned
language. In academics, automatic correction of grammatical errors is an
important application of NLI [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. NLI can be used as a feature in the authorship
identification task [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which aims at assigning a text to one of a predefined
list of authors. Authorship identification is used in the investigation of terrorist
communications [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and in digital crime investigation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
In this era, social media is overwhelming our lives. The majority of people
communicate and discuss different topics using different social media platforms
such as Google+, Facebook and Twitter. While communicating with each
other, Indians prefer to use English because their native languages are different. In
addition, most software and keyboards do not support input using Indian
language characters. So, people use a standard English keyboard to write their
own words as transliterated words.
      </p>
      <p>
        The task [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] aims at identifying the native language of the writer from the given
Facebook comment written in the English language. Six Indian languages - Tamil,
Hindi, Kannada, Malayalam, Bengali and Telugu are considered for this shared
task.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Many researchers have explored the task of NLI for various applications. Jarvis
et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] used SVM to create a model for NLI and reported an accuracy of 83.6%.
N-grams, PoS tags and lemmas have been used to create a feature space model
for training the classifier. They tested the performance of their system using the
TOEFL11 dataset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The TOEFL11 is a collection of essays written by
learners from the following native languages backgrounds: Arabic, Chinese, French,
German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish. In this
work, the feature set was not sufficient to cover the characteristics of different
languages. Tetreault et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] used an ensemble approach to build a classifier to
improve the performance of base classifiers. A wide range of features was used to
build an ensemble of logistic regression learners. Such features include word and
character n-gram, PoS, function words, writing quality markers and spelling
errors. In addition, a set of syntactic features such as Tree Substitution Grammars
and dependency features extracted using the Stanford parser (http://nlp.stanford.edu:8080/parser/) have been used.
The system, evaluated using the TOEFL11 and International Corpus of Learner
English (ICLE) datasets, achieved state-of-the-art accuracies of 90.1% and
80.9% respectively.
      </p>
      <p>
        Nayel and Shashirekha [
        <xref ref-type="bibr" rid="ref12 ref9">9, 12</xref>
        ] used SVM and an ensemble approach for the first
version of INLI and achieved accuracies of 47.60% and 47.30% respectively.
      </p>
      <sec id="sec-2-1">
        <title>Approaches</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Artificial Neural Networks</title>
      <p>Artificial Neural Networks (ANNs) are inspired by the mechanism of brain
computation, which consists of computational units called neurons. The connections
between ANNs and the brain are in fact rather slim. In the metaphor, a
neuron has scalar inputs with associated weights and an output. The neuron multiplies
each input by its weight, sums the products and transforms the sum into an output by
applying a non-linear function called the activation function. Table 1 shows examples
of activation functions. The structure of the biological neuron and an example
of an artificial neuron model with n inputs and one output are shown in Figures
1(a) and 1(b) respectively. In this example, a neuron receives simultaneous inputs
X = (x1, x2, ..., xn) associated with weights W = (w1, w2, ..., wn) and a bias b, and
calculates the output as:</p>
      <p>y = f(W · X + b)    (1)
where f is the activation function.
An ANN comprises a large number of neurons arranged in different layers. An ANN
model basically consists of three layers: an input layer, a number of hidden layers
and an output layer. The input layer contains a set of neurons called input nodes,
which receive the raw inputs directly. The hidden layers receive the data from the
input nodes and are responsible for processing these data by calculating the
weights of neurons at each layer. These weights are called connection weights
and are passed from one node to another. The number of nodes in the hidden layers
influences the number of connections. During the training phase, connection weights
are adjusted so as to predict the correct class label of the input. The output
layer receives the processed data and uses its activation function to generate the final
output. This kind of ANN, in which information flows in one direction, is called a
feed-forward ANN. Figure 2 shows an example of a feed-forward ANN with two hidden
layers. An ANN is called fully connected if each node in a layer is connected to
all nodes in the subsequent layer.
Fig. 1. (a) The structure of the biological neuron; (b) a simple artificial neuron model.</p>
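        <p>As a minimal sketch (not the authors' implementation), the single-neuron computation of Eq. (1) can be written in Python, here assuming the logistic function from Table 1 as the activation:</p>
        <p>
```python
import math

def logistic(z):
    # logistic (sigmoid) activation: 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b, f=logistic):
    # y = f(W . X + b): weighted sum of the inputs plus a bias, passed through f
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return f(z)

# with these weights the weighted sum is 0.5*1.0 + (-0.25)*2.0 + 0 = 0
y = neuron(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0)  # logistic(0) = 0.5
```
        </p>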
      <p>
        Most classification tasks use a single classifier. However, for some data
one classifier may give good results while another may not perform well.
Further, there is no generic rule that helps to choose a classifier for a particular
application and data. So, instead of experimenting with single classifiers one by
one in search of good results, it is beneficial to pool several such classifiers
and then take a collective decision, similar to a decision taken by a
committee rather than by an individual. This technique, which overcomes the weakness of
some classifiers using the strength of other classifiers, is termed an "ensemble".
The ensemble approach has been applied to different tasks such as BioNER [
        <xref ref-type="bibr" rid="ref11 ref13">11, 13</xref>
        ],
word segmentation [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and word sense disambiguation [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
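      <p>The majority-voting idea behind an ensemble can be sketched as follows (an illustrative fragment, not the authors' code): each base classifier predicts a label for an instance, and the most frequent label becomes the ensemble's decision.</p>
      <p>
```python
from collections import Counter

def majority_vote(predictions):
    # predictions: labels output by the base classifiers for one instance;
    # the most common label is the ensemble's decision (ties break arbitrarily)
    return Counter(predictions).most_common(1)[0][0]

# three of four hypothetical base classifiers vote "HI"
decision = majority_vote(["HI", "TA", "HI", "HI"])  # "HI"
```
      </p>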
      <p>Fig. 2. A Simple Feed-Forward ANN Structure</p>
      <p>The six languages are Kannada (KA), Tamil (TA),
Hindi (HI), Telugu (TE), Bengali (BE) and Malayalam (MA). Considering the
languages as a set of classes L = {KA, TA, HI, TE, BE, MA} and the comments
as individual instances, the task of identifying the native language can be
considered as a classification problem that assigns one of the predefined languages
of L to a new unlabelled comment.</p>
      <p>
        The general framework of our system is as described in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. A vector space model
using Term Frequency/Inverse Document Frequency (TF/IDF) has been used
to represent the comments. An ANN based classifier is designed for the first and second
submissions. The hidden layer of the first submission contains 70 neurons and the
activation function is the logistic function. The hidden layer of the second submission
contains 80 neurons and the activation function is the identity function. An
ensemble approach using the majority voting technique has been used for designing the
third submission. Four ANN based models with different parameters (shown in
Table 2) have been used as base classifiers to build the ensemble classifier.
      </p>
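      <p>The TF/IDF representation of comments can be sketched in pure Python as below. This is one common weighting scheme; the paper does not specify its exact variant, so the formula here is an assumption for illustration:</p>
      <p>
```python
import math
from collections import Counter

def tfidf(docs):
    # docs: list of tokenized comments; returns one {term: weight} dict per document
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # weight = term frequency * inverse document frequency
        vectors.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return vectors

vecs = tfidf([["good", "movie", "good"], ["movie", "bad"]])
# "movie" occurs in every document, so its idf (and hence its weight) is 0
```
      </p>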
      <p>Table 2 lists the activation functions of the four base classifiers: Logistic, Logistic, Tanh and Identity.</p>
      <sec id="sec-3-1">
        <title>Results and Discussion</title>
        <p>
          Accuracy and class-wise Precision (P), Recall (R) and F-measure have been used
for evaluating the submitted systems [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Cross-Validation (CV) technique has
been used while building the systems. Table 3 shows the 10-fold CV accuracy
for the three systems.
Performance evaluations of the first, second and third submissions are shown
in Tables 4, 5 and 6 respectively. The accuracy of each of the submitted systems
is 35.30% and all of them rank second among all the submissions.
In all the three submissions, the lowest and the best performance were reported
for the Hindi and Bengali languages respectively. Most
native speakers of Indian languages have some knowledge of Hindi, which affects
their comments when writing in English.
        </p>
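        <p>For reference, the evaluation measures used above can be computed as in the following self-contained sketch (toy labels, not the official task scorer):</p>
        <p>
```python
def accuracy(gold, pred):
    # fraction of instances whose predicted label matches the gold label
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def precision_recall_f1(gold, pred, label):
    # class-wise Precision, Recall and F-measure for one language label
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```
        </p>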
      </sec>
      <sec id="sec-3-2">
        <title>Conclusion</title>
        <p>In this work, ANN and Ensemble based classifiers have been used to design
systems for INLI 2018. All the designed classifiers reported the same accuracy and
achieved the second rank over all submissions for the task. This work can be
improved using different structures of ANN and deep learning models. In
addition, improving the input representation will improve the performance of the
systems.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abbasi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
          </string-name>
          , H.:
          <article-title>Applying Authorship Analysis to Extremist-Group Web Forum Messages</article-title>
          .
          <source>IEEE Intelligent Systems</source>
          <volume>20</volume>
          (
          <issue>5</issue>
          ),
          <fpage>67</fpage>
          –
          <lpage>75</lpage>
          (Sep
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Blanchard</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tetreault</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Higgins</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cahill</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chodorow</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <source>TOEFL11</source>
          :
          <article-title>A corpus of non-native English</article-title>
          .
          <source>ETS Research Report Series 2013(2)</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Chaski</surname>
            ,
            <given-names>C.E.</given-names>
          </string-name>
          :
          <article-title>Who's at the keyboard? Authorship attribution in digital evidence investigations</article-title>
          .
          <source>International Journal of Digital Evidence</source>
          <volume>4</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          –
          <lpage>13</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Estival</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaustad</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pham</surname>
            ,
            <given-names>S.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hutchinson</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Author profiling for English emails</article-title>
          .
          <source>In: "Proceedings of the 10th Conference of the Paci c Association for Computational Linguistics"</source>
          . pp.
          <fpage>263</fpage>
          –
          <lpage>272</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Gibbons</surname>
          </string-name>
          , J.:
          <article-title>Forensic linguistics: An introduction to language in the justice system</article-title>
          . Wiley-Blackwell (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Jarvis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bestgen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pepper</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Maximizing classification accuracy in native language identification</article-title>
          . pp.
          <fpage>111</fpage>
          –
          <lpage>118</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ilhan</surname>
            ,
            <given-names>H.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kamvar</surname>
            ,
            <given-names>S.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
          </string-name>
          , C.D.:
          <article-title>Combining heterogeneous classifiers for word-sense disambiguation</article-title>
          .
          <source>In: Proceedings of the ACL-02 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions - Volume 8</source>
          . pp.
          <fpage>74</fpage>
          –
          <lpage>80</lpage>
          . WSD '
          <volume>02</volume>
          ,
          <string-name>
            <surname>Stroudsburg</surname>
          </string-name>
          , PA, USA (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ganesh</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , P,
          <string-name>
            <surname>S.K.</surname>
          </string-name>
          :
          <article-title>Overview of the INLI@FIRE-2018 Track on Indian Native Language Identification</article-title>
          .
          <source>In: "workshop proceedings of FIRE</source>
          <year>2018</year>
          ,
          <article-title>FIRE2018"</article-title>
          . Gandhinagar, India, December 6-9,
          <string-name>
            <given-names>CEUR</given-names>
            <surname>Workshop Proceedings</surname>
          </string-name>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ganesh</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shivkaran</surname>
            , P,
            <given-names>S.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Overview of the INLI PAN at FIRE-2017 Track on Indian Native Language Identification</article-title>
          .
          <source>In: "Notebook Papers of FIRE</source>
          <year>2017</year>
          ,
          <article-title>FIRE-2017"</article-title>
          . Bangalore, India, December 8-10, CEUR Workshop Proceedings (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Min</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>BosonNLP: "An Ensemble Approach for Word Segmentation and POS Tagging"</article-title>
          .
          <source>In: Natural Language Processing and Chinese Computing</source>
          . pp.
          <fpage>520</fpage>
          –
          <lpage>526</lpage>
          . Springer International Publishing (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Nayel</surname>
            ,
            <given-names>H.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shashirekha</surname>
            ,
            <given-names>H.L.</given-names>
          </string-name>
          :
          <article-title>Improving NER for Clinical Texts by Ensemble Approach using Segment Representations</article-title>
          .
          <source>In: "Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)"</source>
          . pp.
          <fpage>197</fpage>
          –
          <lpage>204</lpage>
          . NLP Association of India, Kolkata,
          <source>India (December</source>
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Nayel</surname>
            ,
            <given-names>H.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shashirekha</surname>
            ,
            <given-names>H.L.</given-names>
          </string-name>
          :
          <string-name>
            <surname>Mangalore-University@</surname>
          </string-name>
          INLI-FIRE-
          <year>2017</year>
          :
          <article-title>Indian Native Language Identification using Support Vector Machines and Ensemble approach</article-title>
          .
          <source>In: Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation</source>
          , Bangalore, India, December 8-
          <issue>10</issue>
          ,
          <year>2017</year>
          . pp.
          <fpage>106</fpage>
          –
          <lpage>109</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Nayel</surname>
            ,
            <given-names>H.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shashirekha</surname>
            ,
            <given-names>H.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shindo</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matsumoto</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Improving Multi-Word Entity Recognition for Biomedical Texts</article-title>
          .
          <source>International Journal of Pure and Applied Mathematics</source>
          <volume>118</volume>
          (
          <issue>16</issue>
          ),
          <fpage>301</fpage>
          –
          <lpage>3019</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Ortega</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Understanding Second Language Acquisition</article-title>
          .
          <source>Hodder Education</source>
          , Oxford (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Rozovskaya</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Algorithm Selection and Model Adaptation for ESL Correction Tasks</article-title>
          .
          <source>In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies</source>
          . pp.
          <fpage>924</fpage>
          –
          <lpage>933</lpage>
          .
          <string-name>
            <surname>Portland</surname>
          </string-name>
          , Oregon, USA (
          <year>June 2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Tetreault</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blanchard</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cahill</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chodorow</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Native Tongues, Lost and Found: Resources and Empirical Evaluations in Native Language Identification</article-title>
          .
          <source>In: "Proceedings of COLING 2012"</source>
          . pp.
          <fpage>2585</fpage>
          –
          <lpage>2602</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>