<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>CIC-IPN@INLI2018: Indian Native Language Identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ilia Markov</string-name>
          <email>ilia.markov@inria.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Grigori Sidorov</string-name>
          <email>sidorov@cic.ipn.mx</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>INRIA Paris</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Instituto Politecnico Nacional (IPN), Center for Computing Research (CIC)</institution>
          ,
          <addr-line>Mexico City</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper, we describe the CIC-IPN submissions to the shared task on Indian Native Language Identification (INLI 2018). We use the Support Vector Machines algorithm trained on numerous feature types: word, character, part-of-speech tag, and punctuation mark n-grams, as well as character n-grams from misspelled words and emotion-based features. The features are weighted using the log-entropy scheme. Our team achieved 41.8% accuracy on test set 1 and 34.5% accuracy on test set 2, ranking 3rd in the official INLI shared task scoring.</p>
      </abstract>
      <kwd-group>
<kwd>Native Language Identification</kwd>
        <kwd>media</kwd>
        <kwd>feature engineering</kwd>
        <kwd>machine learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        The task of Native Language Identification (NLI) consists in identifying the
native language of a person based on their text production in the second language.
The underlying hypothesis is that the learner's native language (L1) influences
their second language (L2) production as a result of the language transfer effect
(native language interference) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], which is thoroughly studied in the field of
second language acquisition (SLA).
      </p>
      <p>The possible applications of the task include marketing and security, as NLI is
viewed as a subtask of author profiling, as well as education, where pedagogical
material can be tailored to learners' native languages, for example, by taking into
account the most common errors made by learners with a specific background
and adapting the materials to tackle such errors in more detail.</p>
      <p>
        Previous studies on identifying the native language from L2 writing – most
of which approached the task from a machine-learning perspective – explored a
wide range of L1 phenomena that appear in L2 production, e.g., lexical choices
made by learners, grammatical patterns used, the influence of cognates and
general etymology, spelling errors, punctuation, and emotions, among others, and
used corresponding features to capture these phenomena. Most NLI studies have
focused on English as a second language; however, NLI methods have also been
examined on other L2s with promising results [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        The interest in NLI has led to the organization of several NLI competitions,
including the first edition of the shared task on identifying Indian languages [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
which was held in 2017 and attracted a large number of participating teams. The
winning approach consisted in training a Support Vector Machines (SVM)
classifier with the SGD (Stochastic Gradient Descent) method on word n-gram and
character n-gram features [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Other approaches included various pre-processing
steps (e.g., removing digits, emoji, and stop words), classification algorithms (e.g.,
SVM, Logistic Regression, Naive Bayes), and features (e.g., non-English word
counts, adjectives and nouns as features, and average sentence and word length,
among others) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        In this paper, we present the CIC-IPN submissions to the 2018 INLI shared
task [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We use the SVM algorithm trained on word n-grams, traditional
(untyped) and typed character n-grams, part-of-speech (POS) tag n-grams,
punctuation mark n-grams, character n-grams from misspelled words, and
emotion-based features. In what follows, we describe in detail the features used and the
configuration of our runs.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Data</title>
      <p>
        The training dataset released by the organizers consists of Facebook comments in
the English language extracted from regional language newspapers. This dataset
was also used in the 2017 edition of the INLI competition [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The dataset
statistics in terms of the L1s covered, the number (No.) of documents per L1, and the
corresponding ratio are provided in Table 1. It can be seen that the 1,233 training
documents are nearly evenly balanced across the represented L1s.
      </p>
      <p>The submitted systems were evaluated on two test sets: test set 1 (also
used in INLI 2017; 783 documents) and test set 2 (the official test set of
INLI 2018; 1,185 documents).</p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>In this section, we describe the features incorporated in our runs
and the configuration of our system: the weighting scheme, the frequency threshold, and
the machine-learning classifier.</p>
      <sec id="sec-3-1">
        <title>Features</title>
        <p>
          Word n-grams capture lexical choices of the learner in L2 production and are
considered one of the most indicative individual feature types for the task of NLI [
          <xref ref-type="bibr" rid="ref4 ref9">4,
9</xref>
          ]. Word n-gram features were also incorporated in the winning approach to the
previous INLI shared task [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. In runs 1 and 3, we use word unigrams and
2-grams, while in run 2 we use word 1–3-grams. We lowercase the word-based
features and replace digits by a placeholder (e.g., 12345 → 0).
        </p>
        <p>
          Untyped character n-grams are considered very indicative features for NLI and
for other related tasks [
          <xref ref-type="bibr" rid="ref16 ref3">3, 16</xref>
          ]. In NLI, these features are hypothesized to capture
the phoneme transfer from the learner's L1 [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], among other L1 peculiarities.
They were also incorporated into the winning approach to INLI 2017 [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. We
use character n-grams with n = 2.
        </p>
        <p>
          Typed character n-grams – character n-grams categorized into ten different
categories – have been successfully applied to NLI [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. We conducted an ablation
study in order to identify the most indicative typed character n-gram categories.
We found that the middle-punctuation and the whole-word categories did not
contribute to the result, and they were therefore discarded. We use typed character
4-grams; 3-grams are used for the suffix category.
        </p>
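<p>Two of the retained typed categories (word prefixes and suffixes) can be sketched as follows; the function name and the simplified category definitions are ours, a reduction of the full ten-category scheme:</p>

```python
def affix_ngrams(word, n=4):
    # Simplified sketch of two typed character n-gram categories:
    # "prefix": the first n characters of a word longer than n characters;
    # "suffix": the last 3 characters (the paper uses 3-grams for suffixes)
    feats = []
    if len(word) > n:
        feats.append(("prefix", word[:n]))
    if len(word) > 3:
        feats.append(("suffix", word[-3:]))
    return feats

print(affix_ngrams("learning"))
# [('prefix', 'lear'), ('suffix', 'ing')]
```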
        <p>
          POS tag n-grams capture morpho-syntactic aspects of the native language in
NLI. They encode word order and grammatical properties of the native
language, capturing the use or misuse of grammatical structures. POS tag n-grams
have proved to be useful features for NLI, especially when combined with other
feature types [
          <xref ref-type="bibr" rid="ref4 ref9">4, 9</xref>
          ]. We use POS tag 3-grams; the POS tags are obtained with the
TreeTagger package [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
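<p>Once a text has been tagged, building POS tag 3-grams reduces to a sliding window over the tag sequence (a sketch; the tag sequence below is illustrative, while the paper obtains tags with TreeTagger):</p>

```python
def pos_ngrams(tags, n=3):
    # Slide a window of size n over the POS tag sequence
    return ["_".join(tags[i:i + n]) for i in range(len(tags) - n + 1)]

# Penn Treebank-style tags for "she writes the essay" (illustrative)
tags = ["PRP", "VBZ", "DT", "NN"]
print(pos_ngrams(tags))
# ['PRP_VBZ_DT', 'VBZ_DT_NN']
```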
      <p>
          Punctuation mark n-grams. The impact of punctuation marks (PMs) on NLI
was evaluated in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. The authors report that punctuation usage is a strong
indicator of the author's L1. We use punctuation mark n-grams (n = 3).
        </p>
        <p>
          Character n-grams from misspelled words were introduced by Chen et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
These features have been successfully used to tackle the NLI task in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. We
extract 8,937 misspelled words from the training dataset using the PyEnchant
package (https://pypi.org/project/pyenchant/) and build character 4-grams from them.
        </p>
        <p>
          Emotion polarity features. Emotion-based features for NLI were proposed in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
We use emotion polarity (emoP) features similar to [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]: we replace each word in the
text with the information from the NRC emotion lexicon [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], e.g., excellent →
"0000101001".
        </p>
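<p>The emotion polarity encoding can be sketched with a toy stand-in for the NRC lexicon (the dictionary entries below are illustrative, not real lexicon values, and we assume out-of-lexicon words are simply dropped):</p>

```python
# Toy stand-in for the NRC emotion lexicon: each word maps to a binary
# string over the lexicon's emotion/polarity categories (illustrative values)
toy_lexicon = {
    "excellent": "0000101001",
    "terrible": "1010000110",
}

def emotion_encode(text):
    # Replace each word with its lexicon bit string; skip unknown words
    return " ".join(toy_lexicon[w] for w in text.lower().split() if w in toy_lexicon)

print(emotion_encode("An excellent but terrible idea"))
# 0000101001 1010000110
```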
      </sec>
      <sec id="sec-3-2">
        <title>Weighting scheme and threshold</title>
        <p>
          We use the log-entropy (le) weighting scheme, which measures the importance of a
feature across the entire corpus. le is considered one of the best weighting schemes
for the NLI task [
          <xref ref-type="bibr" rid="ref2 ref4 ref9">2, 4, 9</xref>
          ]. In our experiments under 10-fold cross-validation, le
outperformed the other weighting schemes we examined (tf-idf, tf, and binary).
The accuracy improvement over the second best-performing weighting scheme (tf-idf)
was 3.2%–3.6%, depending on the run.
        </p>
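<p>In the standard le formulation, the local weight is log2(1 + tf) and the global weight is one plus the feature's normalized entropy over documents; a minimal numpy sketch of this formulation (ours, not the shared-task code):</p>

```python
import numpy as np

def log_entropy(counts):
    # counts: documents x features matrix of raw term counts
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    gf = counts.sum(axis=0)                              # global feature frequency
    p = np.divide(counts, gf, out=np.zeros_like(counts), where=gf > 0)
    # entropy term, with 0 * log(0) treated as 0
    plogp = np.where(p > 0, p * np.log2(np.where(p > 0, p, 1.0)), 0.0)
    g = 1.0 + plogp.sum(axis=0) / np.log2(n_docs)        # global entropy weight
    return np.log2(1.0 + counts) * g                     # local weight times global

# A feature spread evenly over all documents gets global weight 0,
# while a feature concentrated in one document keeps full weight
print(log_entropy([[2, 0], [2, 4]]).round(3))
```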
        <p>
          Tuning the size of the feature set (selecting optimal frequency threshold
values) is an effective strategy for NLP tasks in general [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and for NLI in
particular [
          <xref ref-type="bibr" rid="ref4 ref9">4, 9</xref>
          ]. In all our runs, we include only the features that appear in at least two
documents (min_df = 2). In run 3, we additionally set the frequency threshold value
to 3 (we include only the features that appear at least three times in the entire corpus).
        </p>
      </sec>
      <sec id="sec-3-3">
<title>Classifier</title>
        <p>
          We use the linear SVM algorithm, whose effectiveness has been proved by
numerous studies on NLI [
          <xref ref-type="bibr" rid="ref4 ref9">4, 9</xref>
          ]. SVM was also the most popular algorithm in the 2017
edition of the INLI shared task [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. We use SVM with the OvR (one vs. the rest) multi-class
strategy, as implemented in the scikit-learn package [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>Evaluation</title>
      <p>For the evaluation of our system, we conducted experiments under 10-fold
cross-validation, measuring the results in terms of classification accuracy on the
training corpus.</p>
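<p>The classifier and evaluation setup can be sketched as follows (a toy corpus with made-up texts and labels; scikit-learn's LinearSVC applies the one-vs-rest strategy by default, and cross_val_score runs the 10-fold evaluation):</p>

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny illustrative corpus (texts and L1 labels are made up); it is
# repeated so that every 10-fold split contains both classes
texts = ["good morning ji", "namaste friends", "vanakkam all",
         "good evening ji", "namaste everyone", "vanakkam friends"] * 5
labels = ["HI", "HI", "TA", "HI", "HI", "TA"] * 5

# Linear SVM with the default one-vs-rest multi-class strategy
clf = make_pipeline(CountVectorizer(), LinearSVC())
scores = cross_val_score(clf, texts, labels, cv=10)  # 10-fold cross-validation
print(round(scores.mean(), 2))
```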
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results and Discussion</title>
      <p>Run 3 showed the highest accuracy on the official test set due to its higher
frequency threshold value. The confusion matrix for this run on the training data
is shown in Figure 1; the class-wise accuracy results provided by the organizers
on test sets 1 and 2 are presented in Tables 3 and 4, respectively. The highest
10-fold cross-validation result was achieved for the Hindi language, while on
both test sets it was the hardest language to identify.</p>
      <p>We described the three runs submitted by the CIC-IPN team to the
2018 INLI shared task. Our approach uses the SVM algorithm trained on word,
character, POS tag, and punctuation mark n-grams, character n-grams from
misspelled words, and emotion-based features. The features are weighted using
the log-entropy weighting scheme. Our team achieved 41.8% accuracy on test
set 1 (run 1) and 34.5% accuracy on the official test set 2 (run 3), placing our
team 3rd (out of 12 participating teams) in the competition.</p>
      <p>In future work, we will evaluate the performance of our system without word
and character n-grams in order to investigate their impact on the accuracy drop
suffered by the system when evaluated on the test sets. We will also focus on
more abstract features that perform well in situations where topic bias may
occur.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Brooke</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hirst</surname>
          </string-name>
          , G.:
<article-title>Native language detection with 'cheap' learner corpora</article-title>
          .
          <source>In: Proceedings of the Conference of Learner Corpus Research</source>
          . pp.
          <fpage>37</fpage>
          –
          <lpage>47</lpage>
          . Presses universitaires de Louvain, Louvain-la-Neuve, Belgium (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strapparava</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nastase</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
<article-title>Improving native language identification by using spelling errors</article-title>
          .
          <source>In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics</source>
          . pp.
          <fpage>542</fpage>
          –
          <lpage>546</lpage>
          . ACL, Vancouver, Canada (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Gomez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baptista</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinto</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
<article-title>Discriminating between similar languages using a combination of typed and untyped character n-grams and words</article-title>
          .
          <source>In: Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects</source>
          . pp.
          <fpage>137</fpage>
          –
          <lpage>145</lpage>
          . ACL, Valencia, Spain (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Jarvis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bestgen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pepper</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
<article-title>Maximizing classification accuracy in native language identification</article-title>
          .
          <source>In: Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications</source>
          . pp.
          <fpage>111</fpage>
          –
          <lpage>118</lpage>
          . ACL, Atlanta, GA, USA (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Kosmajac</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keselj</surname>
          </string-name>
          , V.:
<article-title>DalTeam@INLI-FIRE-2017: Native language identification using SVM with SGD training</article-title>
          .
          <source>In: Working notes of FIRE</source>
          <year>2017</year>
          <article-title>- Forum for Information Retrieval Evaluation</article-title>
          . CEUR, Bangalore, India (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ganesh</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soman</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
<article-title>Overview of the INLI PAN at FIRE-2017 track on Indian native language identification</article-title>
          .
          <source>In: Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation</source>
          . vol.
          <volume>2036</volume>
          , pp.
          <fpage>99</fpage>
          –
          <lpage>105</lpage>
          . CEUR Workshop Proceedings, Bangalore, India (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ganesh</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soman</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
<article-title>Overview of the INLI@FIRE-2018 track on Indian native language identification</article-title>
          .
          <source>In: Workshop proceedings of FIRE 2018. CEUR Workshop Proceedings</source>
          , Gandhinagar, India (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dras</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
<article-title>Multilingual native language identification</article-title>
          .
          <source>Natural Language Engineering</source>
          <volume>23</volume>
          (
          <issue>2</issue>
          ),
          <fpage>163</fpage>
          –
          <lpage>215</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strapparava</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
<article-title>CIC-FBK approach to native language identification</article-title>
          .
          <source>In: Proceedings of the 12th Workshop on Building Educational Applications Using NLP</source>
          . pp.
          <fpage>374</fpage>
          –
          <lpage>381</lpage>
          . ACL, Copenhagen, Denmark (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gelbukh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
<article-title>The winning approach to cross-genre gender identification in Russian at RUSProfiling 2017</article-title>
          .
          <source>In: FIRE 2017 Working Notes</source>
          . vol.
          <volume>2036</volume>
          , pp.
          <fpage>20</fpage>
          –
          <lpage>24</lpage>
          . CEUR-WS.org, Bangalore, India (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nastase</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strapparava</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Punctuation as native language interference</article-title>
          .
          <source>In: Proceedings of the 27th International Conference on Computational Linguistics</source>
. pp.
          <fpage>3456</fpage>
          –
          <lpage>3466</lpage>
          . The COLING 2018 Organizing Committee
          , Santa Fe, New Mexico, USA (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nastase</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strapparava</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
<article-title>The role of emotions in native language identification</article-title>
          .
          <source>In: Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment &amp; Social Media Analysis</source>
          . ACL
          , Brussels, Belgium (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Mohammad</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Turney</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Crowdsourcing a word-emotion association lexicon</article-title>
          .
          <source>Computational Intelligence</source>
          <volume>29</volume>
          ,
<fpage>436</fpage>
          –
          <lpage>465</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Odlin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
<article-title>Language Transfer: cross-linguistic influence in language learning</article-title>
          . Cambridge University Press, Cambridge, UK (
          <year>1989</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanderplas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cournapeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brucher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perrot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duchesnay</surname>
          </string-name>
          , E.:
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          ,
<fpage>2825</fpage>
          –
          <lpage>2830</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Sanchez-Perez</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
          </string-name>
          , G.:
<article-title>Comparison of character n-grams and lexical features on author, gender, and language variety identification on the same Spanish news corpus</article-title>
          .
          <source>In: Experimental IR Meets Multilinguality, Multimodality, and Interaction</source>
          . vol.
          <volume>10456</volume>
          , pp.
<fpage>145</fpage>
          –
          <lpage>151</lpage>
          . Springer, Dublin, Ireland (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Schmid</surname>
          </string-name>
          , H.:
          <article-title>Improvements In Part-of-Speech Tagging With an Application to German</article-title>
          , pp.
<fpage>13</fpage>
          –
          <lpage>25</lpage>
          . Springer (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Tsur</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rappoport</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
<article-title>Using classifier features for studying the effect of native language on the choice of written second language words</article-title>
          .
          <source>In: Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition</source>
          . pp.
<fpage>9</fpage>
          –
          <lpage>16</lpage>
          . ACL, Stroudsburg, PA, USA (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>