<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>INLI@FIRE-2018: A Native Language Identification System using Convolutional Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Professor, CUSAT</institution>
          ,
          <addr-line>Cochin 682022</addr-line>
          ,
          <country country="IN">INDIA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Research Scholar, CUSAT</institution>
          ,
          <addr-line>Cochin 682022</addr-line>
          ,
          <country country="IN">INDIA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>Native Language Identification is the problem of identifying the first language of speakers based on their writings in another language. The proposed approach is a deep learning based methodology using convolutional neural networks. Convolutional neural networks are a class of neural networks that have proven very effective in areas such as pattern recognition and classification. They are able to capture the local texture within text and can be used to find the representative patterns in a text document. The proposed system consists of a language identification model, which is trained on a corpus of 1233 documents. The experiments were conducted using the dataset provided for INLI@FIRE-2018. The results indicate that the system is capable of giving performance comparable to methods employing more sophisticated approaches.</p>
      </abstract>
      <kwd-group>
        <kwd>Convolutional Neural Networks</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Native Language Identification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Native Language Identification is the process of distinguishing the native language of a writer from his/her writings in a second language (English) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It is a well-known task that finds important applications in fields such as forensics and educational settings. Native language has long been used as an essential feature for authorship profiling and identification. Nowadays, owing to the enormous usage of social media sites and online interactions, receiving threats is a common problem for users. If a comment or post conveys any type of threat, then recognizing the native language of the commenter (the one who commented or posted it) is one of the crucial steps in finding the source. Speakers of different languages make different types of errors when learning a new language [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Hence Native Language Identification also finds applications in educational environments, supplying targeted feedback to language students about their errors.
      </p>
      <p>
        Hindi is by far the most widely spoken language in India. Even though roughly 40% of the population speak Hindi, people use English as their major second language. English is spoken natively by around 375 million people across the globe. It is the second official language of India and is used for business, teaching, learning, and trade on a day-to-day basis. Around 10% of India's population speak English and use it in their day-to-day activities, but it is a first language for only 0.019% of people in the country, while being a second language for around 125 million people all over the world [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This 10% of the population comes from different parts of the country and has various native languages. Identification of the native language of such speakers is a challenging task that finds important applications in the social media world.
      </p>
      <p>The structure of this paper is as follows. Section 2 briefly reviews similar work in this area. Section 3 discusses the task description and details of the dataset. Section 4 explains the methodology, and Section 5 demonstrates the results and evaluation metrics. Section 6 concludes the article along with some routes for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related works</title>
      <p>
        Native Language Identification is of considerable importance in different areas of Natural Language Processing [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Most work on NLI takes English as the second language, treats NLI as a supervised classification task, and uses statistical models trained on data from various languages. The first work in the field was reported by Koppel et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], who explored a multitude of features for NLI, including average sentence length, average word length, word n-grams, character n-grams, POS n-grams, content words, function words, spelling errors, and grammatical errors. An SVM was used to train these features on the International Corpus of Learner English (ICLEv2) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Unigrams and bigrams are the most explored n-grams in previous works.
      </p>
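      <p>As an illustration of this family of feature-based approaches (not part of the proposed system), the character n-gram frequencies used in these early studies can be extracted in a few lines of Python; the classifier that would consume them, e.g. an SVM, is omitted.</p>
      <preformat>
```python
# Sketch: character n-gram frequency features, as used in early NLI work.
# The classifier that would consume these features (e.g. an SVM) is omitted.
from collections import Counter

def char_ngrams(text, n):
    """Return a frequency table of the character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

feats = char_ngrams("the theme", 3)
# 'the' occurs twice: once as the word and once inside 'theme'
```
      </preformat>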
      <p>
        Syntactic features of the text have also been the focus of recent works. Wong and Dras [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] used production rules from different parsers as features for a language identification system. Similarly, Swanson and Charniak [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] investigated the benefit of Tree Substitution Grammars for NLI. Tetreault [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] experimented with Tree Substitution Grammars along with dependency features extracted from the Stanford parser. Tree fragments returned by a Tree Substitution Grammar proved beneficial for distinguishing native from non-native English writers by capturing their syntactic structures. Similarly, CFG rules augmented with grandparent nodes were found to outperform simple CFG rules in authorship attribution tasks [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        Semantic features are the least explored for NLI. Gamon extracted semantic features from semantic dependency graphs [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. These include binary semantic features and semantic modification relations, used together as a feature set for classification. The semantic features capture number and gender information of nouns and pronouns as well as tense and aspectual features of verbs, while the semantic modification relations extract the semantic relations between a node and all its descendants within a semantic graph. Experiments showed that semantic features combined with syntactic features improved accuracy on authorship classification tasks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Throughout the literature, we found that none of the existing works utilizes deep learning based methodologies for language identification; hence we decided on an approach that uses a CNN for the above-mentioned problem.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Task Description and Dataset Details</title>
      <p>The task focuses on identifying the first language of an author from a given text/XML file containing a set of Facebook comments in English. Six Indian languages are considered for this study: Tamil, Hindi, Kannada, Malayalam, Bengali, and Telugu. Spoken English shows significant variations across the different states of India, and it is relatively easy to recognize the native language of a speaker from his English accent; finding the first language of a writer from his comments or posts in English is a far more difficult task.</p>
      <p>
        The shared dataset contains data from six different Indian languages. The training data is a set of files in XML format. Each language has around 200 files of Facebook comments, and each file contains around 150 words of comments. Sentence segmentation is carried out using regular expressions. Statistics of the training data are shown in Table 1. The testing data comes in two folders, test1 and test2: test1 consists of 783 files and test2 contains 1185 files from the above-mentioned languages.
      </p>
      <p>
        The proposed system is a CNN-based language identification model which predicts the native language of a writer from his scripts. CNNs are responsible for important breakthroughs in image classification and are at the core of most computer vision systems today, but they are less common in text analytics. CNNs have proved successful in various text classification problems in recent years [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. They have the important property of preserving 2D spatial orientation in computer vision problems; in text, these orientations have a one-dimensional structure. A generalized overview of convolutional neural networks is shown in Fig. 1.
      </p>
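      <p>The regex-based sentence segmentation mentioned above can be sketched as follows; the exact pattern used in our system is not specified, so this simple punctuation split is an assumption.</p>
      <preformat>
```python
# Sketch of regex-based sentence segmentation; the simple punctuation
# split below is an assumption, not the system's exact pattern.
import re

def segment(text):
    # Split on runs of ., ! or ? followed by whitespace; drop empty pieces.
    parts = re.split(r"[.!?]+\s+", text.strip())
    return [p for p in parts if p]

sentences = segment("This is fine. Is it? Yes!")
```
      </preformat>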
      <p>The problem is framed as a text classification task with language names as labels (classes); the number of classes equals the number of languages considered for the study. Text from each language in the training data is sent to a sentence segmentation module, where the raw text is converted into a set of sentences using regular expressions. Raw word sequences are meaningless to the network, so the words are converted into numeric values using dictionaries. For this we create a vocabulary of words, an array which stores each word in the training data exactly once, together with two dictionaries that map each word to its index and back. Two special words, 'ZERO' and 'UNKNOWN', are added to the vocabulary: 'ZERO' is used to pad all sequences to a uniform length and 'UNKNOWN' stands in for out-of-vocabulary words. The sequences of strings are then converted into sequences of numbers using these dictionaries. Sentences may have different lengths, but CNN training requires sequences of uniform length, so shorter sentences are padded with 'ZERO's (zero padding); that is why the word ZERO is in the vocabulary. Each sentence in the training data is labeled with the corresponding language, so our final training data consists of sentences and their labels; identifying the patterns within these sentences is our ultimate goal.</p>
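      <p>The preprocessing described above can be sketched in plain Python (the toy sentences and the sequence length are illustrative placeholders):</p>
      <preformat>
```python
# Minimal sketch of the described preprocessing: build a vocabulary with the
# special tokens 'ZERO' and 'UNKNOWN', map words to indices, and zero-pad
# every sequence to a common length.
def build_vocab(sentences):
    vocab = ["ZERO", "UNKNOWN"]
    for sent in sentences:
        for word in sent.split():
            if word not in vocab:
                vocab.append(word)
    word2idx = {w: i for i, w in enumerate(vocab)}
    idx2word = {i: w for i, w in enumerate(vocab)}
    return word2idx, idx2word

def encode(sentence, word2idx, maxlen):
    # Unknown words map to 'UNKNOWN'; sequences are cut or padded to maxlen.
    ids = [word2idx.get(w, word2idx["UNKNOWN"]) for w in sentence.split()]
    ids = ids[:maxlen]
    return ids + [word2idx["ZERO"]] * (maxlen - len(ids))

w2i, i2w = build_vocab(["the cat sat", "the dog ran"])
seq = encode("the cat barked", w2i, 5)  # 'barked' is out of vocabulary
```
      </preformat>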
      <p>
        The Sequential model of Keras is used for the implementation [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The network is designed with four convolutional layers, two max-pooling layers, two dense layers, and an embedding layer. The first layer is an embedding layer which performs the word embeddings; the embedding size is fixed at 100. The second is a convolutional layer, chosen for its ability to capture local context. The following layers alternate max-pooling and convolutional layers to acquire the patterns within a sentence. We used ReLU as the activation function to introduce nonlinearity. The number of filters in every convolutional layer is 256, and the kernel size is fixed at 7 for the first two convolutional layers and 3 for the remaining ones. The final dense layer uses softmax activation units. During training, the filters slide over full rows of the matrix (words), and the CNN automatically learns the values of its filters based on the task assigned to it. The architecture of the proposed network is shown in Table 2.
      </p>
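      <p>A sketch of this architecture in Keras follows. The exact interleaving of the four convolutional and two max-pooling layers and the width of the first dense layer are our assumptions, and <monospace>vocab_size</monospace>, <monospace>seq_len</monospace>, and <monospace>num_classes</monospace> are placeholders.</p>
      <preformat>
```python
# Sketch of the described network in Keras. Layer interleaving and the
# first dense layer's width are assumptions; sizes are placeholders.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Input, Embedding, Conv1D, MaxPooling1D,
                                     Flatten, Dense, Dropout)

vocab_size, seq_len, num_classes = 20000, 50, 6

model = Sequential([
    Input(shape=(seq_len,)),
    Embedding(vocab_size, 100),          # embedding size fixed at 100
    Conv1D(256, 7, activation="relu"),   # kernel size 7 for the first two
    MaxPooling1D(2),
    Conv1D(256, 7, activation="relu"),
    MaxPooling1D(2),
    Conv1D(256, 3, activation="relu"),   # kernel size 3 for the rest
    Conv1D(256, 3, activation="relu"),
    Flatten(),
    Dense(128, activation="relu"),
    Dropout(0.5),
    Dense(num_classes, activation="softmax"),  # one unit per language
])
```
      </preformat>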
      <p>
        Different configurations of the network were attempted, with experiments conducted on both deep and shallow convolutional neural networks. The performance of different CNN architectures on the test data is given in Table 3; the best results are given by the architecture described above. In our experiments, we selected the first 90% of the data for training and the remaining 10% for testing. The batch size is fixed at 64, categorical cross-entropy is used as the loss function, and we used Adam, an efficient gradient descent algorithm, as the optimizer. Dropout is used to prevent overfitting [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The model is compiled with TensorFlow as the backend, trained for 10 epochs, and saved for the testing phase.
      </p>
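      <p>The data split and label encoding implied by this setup can be sketched in plain Python (the pair list below is an illustrative placeholder):</p>
      <preformat>
```python
# Sketch of the evaluation setup described above: the first 90% of the
# sentence/label pairs train the model, the rest test it, and labels are
# one-hot encoded for the categorical cross-entropy loss.
def split_90_10(pairs):
    cut = int(len(pairs) * 0.9)
    return pairs[:cut], pairs[cut:]

def one_hot(label, num_classes):
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

pairs = [("sentence %d" % i, i % 6) for i in range(100)]  # placeholder data
train, test = split_90_10(pairs)
y0 = one_hot(train[0][1], 6)
```
      </preformat>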
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>Experiments are also conducted to measure the effect of training data size on the system performance. It is observed that the performance of the system increases with the increase in training data size. Figure 2 shows the effect of training data size on our best-performing CNN architecture. Hence it is better to have a larger training corpus when dealing with deep learning based classification methodologies.</p>
      <table-wrap id="tab3">
        <label>Table 3</label>
        <caption>
          <p>Performance of different CNN architectures on the test data.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Name</th><th>Network Configuration</th><th>Accuracy</th></tr>
          </thead>
          <tbody>
            <tr><td>CNN1</td><td>1 Conv, 1 Maxpool, 1 Dense, 1 Dropout</td><td>17.5%</td></tr>
            <tr><td>CNN2</td><td>2 Conv, 2 Maxpool, 2 Dense, 1 Dropout</td><td>21.2%</td></tr>
            <tr><td>CNN3</td><td>3 Conv, 2 Maxpool, 2 Dense, 2 Dropout</td><td>22.2%</td></tr>
            <tr><td>CNN4</td><td>4 Conv, 3 Maxpool, 2 Dense, 2 Dropout</td><td>25.7%</td></tr>
            <tr><td>CNN5</td><td>4 Conv, 4 Maxpool, 2 Dense, 2 Dropout</td><td>25.3%</td></tr>
            <tr><td>CNN6</td><td>5 Conv, 5 Maxpool, 2 Dense, 2 Dropout</td><td>24.7%</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>We used accuracy to quantify the performance of our model.
Accuracy computes the degree to which a prediction conforms to the true value. The proposed system was tested with the test datasets provided by the task organizers. Our system predicts a tag for each sentence in a post (comment), but the goal is to predict a tag for each XML file (post), so we labeled each post with the majority label among its sentence-level predictions. Table 4 demonstrates the results of our experiments on both datasets; it is clear from the table that the test2 results are far better than the test1 results. Different runs correspond to different architectures of the proposed network.
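The post-level voting can be sketched as follows (the label strings are illustrative placeholders):
      <preformat>
```python
# Sketch: each post is labeled with the most frequent label among the
# predictions for its individual sentences.
from collections import Counter

def label_post(sentence_predictions):
    return Counter(sentence_predictions).most_common(1)[0][0]

post_label = label_post(["HI", "TA", "HI", "HI", "BE"])
```
      </preformat>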
</p>
      <p>In this article, we have discussed a deep learning based native language identification system. The distinctive feature of our approach is the use of a convolutional neural network for this task. The main reason we preferred a CNN over traditional feature-based methods is its ability to capture local texture in a sequence. We found that the accuracy of the system increases with the size of the training data, so a larger training corpus should yield improved performance. Accuracy could also be improved by using pretrained word embeddings, which we could not attempt due to insufficient computing resources. Beyond NLI, convolutional neural networks can be applied efficiently to various language processing problems, and we hope to apply CNN-based methods to applications such as text classification and sentiment analysis.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. The top 10 most spoken languages in India. https://www.listenandlearnusa.com/blog/the-top-10-most-spoken-languagesin-india, accessed: 2018-08-03
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Anand Kumar</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>B.G.H.</surname>
          </string-name>
          , P,
          <string-name>
            <surname>S.K.</surname>
          </string-name>
          :
          <article-title>Overview of the INLI@FIRE-2018 track on Indian native language identification</article-title>
          .
          <source>In: workshop proceedings of FIRE</source>
          <year>2018</year>
          , FIRE-2018, Gandhinagar, India, December 6-9, CEUR Workshop Proceedings (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Anand Kumar</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barathi Ganesh</surname>
            <given-names>HB</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>S.K.P.</given-names>
            ,
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <surname>P.</surname>
          </string-name>
          :
          <article-title>Overview of the INLI PAN at FIRE-2017 track on Indian native language identification</article-title>
          .
          <source>In: Notebook Papers of FIRE</source>
          <year>2017</year>
          , FIRE-2017, Bangalore, India, December 8-10, CEUR Workshop Proceedings (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Chollet</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , et al.: Keras. https://github.com/fchollet/keras (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Banerjee</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Choi</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Syntactic stylometry for deception detection</article-title>
          .
          <source>In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume</source>
          <volume>2</volume>
          . pp.
          <fpage>171</fpage>
          -
          <lpage>175</lpage>
          .
          Association for Computational Linguistics (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Gamon</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Linguistic correlates of style: authorship classification with deep linguistic analysis features</article-title>
          .
          <source>In: Proceedings of the 20th international conference on Computational Linguistics</source>
          . p.
          <fpage>611</fpage>
          .
          Association for Computational Linguistics (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Granger</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dagneaux</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meunier</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paquot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>International corpus of learner english (</article-title>
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Convolutional neural networks for sentence classification</article-title>
          .
          <source>arXiv preprint arXiv:1408.5882</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zigdon</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Determining an author's native language by mining a text for errors</article-title>
          .
          <source>In: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining</source>
          . pp.
          <fpage>624</fpage>
          -
          <lpage>628</lpage>
          . ACM
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Learner English: A teacher's guide to interference and other problems</article-title>
          . Ernst Klett Sprachen (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
          </string-name>
          , R.:
          <article-title>Dropout: A simple way to prevent neural networks from overfitting</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          <volume>15</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1929</fpage>
          -
          <lpage>1958</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Swanson</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Charniak</surname>
          </string-name>
          , E.:
          <article-title>Extracting the native language signal for second language acquisition</article-title>
          .
          <source>In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          . pp.
          <fpage>85</fpage>
          -
          <lpage>94</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Tetreault</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blanchard</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cahill</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chodorow</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Native tongues, lost and found: Resources and empirical evaluations in native language identification</article-title>
          .
          <source>Proceedings of COLING 2012</source>
          pp.
          <fpage>2585</fpage>
          -
          <lpage>2602</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>S.M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dras</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Exploiting parse structures for native language identification</article-title>
          .
          <source>In: Proceedings of the Conference on Empirical Methods in Natural Language Processing</source>
          . pp.
          <fpage>1600</fpage>
          -
          <lpage>1610</lpage>
          . Association for Computational Linguistics (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>