<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Language- and Subtask-Dependent Feature Selection and Classifier Parameter Tuning for Author Profiling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ilia Markov</string-name>
          <email>imarkov@nlp.cic.ipn.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Helena Gómez-Adorno</string-name>
          <email>helena.adorno@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Grigori Sidorov</string-name>
          <email>sidorov@cic.ipn.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Instituto Politécnico Nacional, Center for Computing Research</institution>
          ,
          <addr-line>Mexico City</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <abstract>
        <p>We present the CIC's approach to the Author Profiling (AP) task at PAN 2017. This year's task consists of two subtasks: gender and language variety identification in English, Spanish, Portuguese, and Arabic. We use typed and untyped character n-grams, word n-grams, and non-textual features (domain names). We experimented with various feature representations (binary, raw frequency, normalized frequency, log-entropy weighting, tf-idf), machine-learning algorithms (liblinear and libSVM implementations of Support Vector Machines (SVM), multinomial naive Bayes, ensemble classifiers, meta-classifiers), and frequency threshold values. We adjusted the system configuration for each of the languages and subtasks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Author Profiling (AP) is the task of identifying author demographics based on the analysis of text samples. AP methods contribute to marketing, security, and forensic applications, among others. From the machine-learning perspective, the task is viewed as a multi-class, single-label classification problem, in which automatic methods have to assign class labels (e.g., male, female) to objects (text samples). The Author Profiling task at PAN 2017 [10,13] consists of predicting gender and language variety on a corpus of Twitter messages in English, Spanish, Portuguese, and Arabic.</p>
      <p>According to the AP task literature, combinations of character n-grams with word
n-gram features have proved to be highly discriminative for both gender and language
variety identification, regardless of the language the texts are written in or the genre of
the texts [12,11,14,16]. In this study, we use combinations of typed (introduced in [15])
and untyped character n-grams with word n-gram features, and exploit domain names
as non-textual features.</p>
      <p>We examine various feature representations (binary, raw frequency, normalized
frequency, log-entropy weighting, tf-idf), machine-learning algorithms (liblinear and
libSVM implementations of Support Vector Machines (SVM), multinomial naive Bayes,
ensemble classifier, meta-classifiers), and fine-tune the feature set for each of the
targeted languages and subtasks.</p>
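      <p>Of the feature representations listed above, log-entropy weighting is the least standard; the following minimal sketch shows one common formulation (local factor log(1 + tf), global entropy factor). The function name and toy data are illustrative, not the authors' code.</p>

```python
import math

def log_entropy(doc_term_counts):
    """Log-entropy weights for a list of per-document term-count dicts.

    Illustrative sketch of the scheme: weight(i, j) = log(1 + tf_ij) * g_i,
    where g_i = 1 + sum_j p_ij * log(p_ij) / log(n_docs), p_ij = tf_ij / gf_i.
    """
    n_docs = len(doc_term_counts)
    # global frequency of each term over the whole corpus
    gf = {}
    for doc in doc_term_counts:
        for term, tf in doc.items():
            gf[term] = gf.get(term, 0) + tf
    # global entropy factor: 1.0 for terms concentrated in one document,
    # approaching 0.0 for terms spread evenly over all documents
    entropy = {t: 1.0 for t in gf}
    for doc in doc_term_counts:
        for term, tf in doc.items():
            p = tf / gf[term]
            entropy[term] += p * math.log(p) / math.log(n_docs)
    # local factor log(1 + tf) scaled by the global factor
    return [{term: math.log(1 + tf) * entropy[term] for term, tf in doc.items()}
            for doc in doc_term_counts]

docs = [{"the": 2, "cat": 1}, {"the": 1, "dog": 3}]
weights = log_entropy(docs)
```

      <p>Terms that occur in only one document keep the full local weight, while terms spread evenly across documents (such as "the" above) are down-weighted.</p>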
    </sec>
    <sec id="sec-2">
      <title>Experimental Settings</title>
      <p>The Author Profiling task at PAN 2017 [13] consisted of predicting gender and language variety in Twitter. The training corpus covers the following languages and their varieties:1</p>
      <p>In order to determine the best system configurations for each of the considered
languages, we conducted experiments on the provided PAN AP 2017 training dataset
under 10-fold cross-validation.</p>
      <p>The examined features, machine learning algorithms, feature representations, and
threshold values are shown in Table 1.</p>
      <p>
        Typed character n-grams, that is, character n-grams classified into 10 categories based on affixes, words, and punctuation, were introduced by Sapkota et al. [15]. In our approach, we used the modified version of typed character n-grams proposed in [8]. We examined typed character n-grams with n varying between 3 and 4. These features have been shown to be predictive for both gender [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and language variety identification [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
Untyped character n-grams correspond to the more common approach of extracting
n-grams without dividing them into different categories. In this work, we examined
untyped character n-grams with n varying between 3 and 7.
      </p>
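      <p>As a rough illustration of the distinction between typed and untyped character n-grams, the sketch below tags each n-gram with a positional category in the spirit of Sapkota et al. It is a simplification for exposition: the full scheme has 10 categories, including punctuation-based ones, and our actual system follows the modified version of [8].</p>

```python
def typed_char_ngrams(text, n=3):
    """Tag each character n-gram with a word-position category.

    Simplified sketch covering only the affix/word categories
    (prefix, suffix, mid-word, whole-word); not the full 10-category scheme.
    """
    ngrams = []
    for word in text.split():
        if len(word) == n:
            # the n-gram spans the entire word
            ngrams.append(("whole-word", word))
            continue
        if len(word) > n:
            ngrams.append(("prefix", word[:n]))
            ngrams.append(("suffix", word[-n:]))
            # all interior n-grams that touch neither word boundary
            for i in range(1, len(word) - n):
                ngrams.append(("mid-word", word[i:i + n]))
    return ngrams
```

      <p>Dropping the category label from each pair yields the untyped variant: the same n-gram (e.g., "ell") then counts identically whether it occurs as a prefix or mid-word.</p>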
      <p>We evaluated the performance of word unigrams (henceforth, words) when including and excluding punctuation marks, and several implementations of word 2- and 3-grams: including and excluding punctuation marks, with and without splitting by a full stop.</p>
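      <p>The word n-gram variants above can be sketched as follows; the tokenization regexes are our own illustrative choices, not the system's exact implementation.</p>

```python
import re

def word_ngrams(text, n=2, keep_punct=True, split_sentences=False):
    """Word n-grams with the two variations described above:
    punctuation kept as separate tokens or dropped, and n-grams
    optionally prevented from crossing full-stop boundaries."""
    segments = re.split(r"\.", text) if split_sentences else [text]
    grams = []
    for seg in segments:
        if keep_punct:
            # words plus each punctuation mark as its own token
            tokens = re.findall(r"\w+|[^\w\s]", seg)
        else:
            tokens = re.findall(r"\w+", seg)
        for i in range(len(tokens) - n + 1):
            grams.append(tuple(tokens[i:i + n]))
    return grams
```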
      <p>The performance of each of the feature sets described above was evaluated separately and in combinations.
1 A detailed description of the PAN Author Profiling 2017 corpus can be found in [13].</p>
      <p>We applied several pre-processing steps: we removed @mention instances, picture links, and URL mentions. We used the information on the particular domain name to form our feature set of domain names (e.g., https://www.instagram.com → instagram → feature set of domain names).</p>
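      <p>A minimal sketch of these pre-processing steps, assuming simple regex matching for mentions and URLs (our own illustrative patterns, not the system's exact ones):</p>

```python
import re
from urllib.parse import urlparse

def preprocess(tweet):
    """Remove @mentions and URLs from a tweet, keeping each URL's
    second-level domain label as a non-textual feature
    (e.g. https://www.instagram.com/p/x -> 'instagram')."""
    domains = []
    for url in re.findall(r"https?://\S+", tweet):
        host = urlparse(url).netloc        # e.g. 'www.instagram.com'
        parts = host.split(".")
        if len(parts) > 1:
            domains.append(parts[-2])      # second-level label: 'instagram'
    text = re.sub(r"https?://\S+", " ", tweet)   # strip URLs
    text = re.sub(r"@\w+", " ", text)            # strip @mentions
    return " ".join(text.split()), domains
```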
      <p>
        We examined the performance of the machine learning classifiers, shown in
Table 1, using their scikit-learn [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] implementation. These classification algorithms are
considered among the best for text classification tasks [
        <xref ref-type="bibr" rid="ref5 ref6">14,5,16,6</xref>
        ]. We evaluated the performance of each of the classifiers separately, and also examined several ensemble setups and meta-classifiers, as described in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
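      <p>The per-classifier comparison can be sketched as below, with the liblinear (LinearSVC), libSVM (SVC), and multinomial naive Bayes implementations scored under cross-validation over a character n-gram pipeline. The toy corpus and labels are placeholders for the PAN tweets.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC, LinearSVC

# Toy corpus standing in for the PAN tweets (illustrative labels).
docs = ["she posted a photo", "he shared the match score",
        "she loved the dress", "he watched the game"] * 5
labels = ["female", "male", "female", "male"] * 5

# The classifier families examined: liblinear (LinearSVC), libSVM (SVC),
# and multinomial naive Bayes, each evaluated under 10-fold CV.
for clf in (LinearSVC(), SVC(kernel="linear"), MultinomialNB()):
    pipe = make_pipeline(CountVectorizer(analyzer="char", ngram_range=(3, 4)), clf)
    scores = cross_val_score(pipe, docs, labels, cv=10)
    print(type(clf).__name__, scores.mean())
```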
      <p>The most appropriate frequency threshold values were selected for each of the
languages based on grid search. The following frequency threshold values were examined:
1, 2, 3, 5, 10, 20, 30, that is, we considered only those features whose frequency in the
entire corpus is higher than the examined threshold value.</p>
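      <p>The thresholding step amounts to the following filter (a sketch with our own function name; the system grid-searched the threshold over the values 1, 2, 3, 5, 10, 20, 30):</p>

```python
from collections import Counter

def apply_threshold(doc_features, threshold):
    """Keep only features whose total frequency over the whole
    corpus is strictly higher than the threshold value."""
    corpus_freq = Counter()
    for doc in doc_features:
        corpus_freq.update(doc)
    keep = {f for f, c in corpus_freq.items() if c > threshold}
    return [{f: c for f, c in doc.items() if f in keep}
            for doc in doc_features]
```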
      <p>Table 2 shows the early bird system configurations. Here, word features contain punctuation marks; word 2-grams are split by a full stop, and punctuation marks are excluded. The 30 most frequent domain names were used for English and Spanish, 16 for Portuguese, and 7 for Arabic. As the machine-learning algorithm, we used the liblinear classifier with the ‘ovr’ multi-class strategy and default parameters, which produced high results across all the targeted languages. Ensemble and meta-classifiers showed similar results; however, they were discarded due to their high computational cost. For our early bird submission, we adjusted the system configuration for each of the languages and used the configuration with the highest average results over both subtasks.</p>
      <p>For our final submission, we adjusted the system configuration for each of the subtasks within each language. First, we selected the most predictive feature combination and the best-performing feature representation for each of the subtasks. Word features included punctuation marks, while word 2- and 3-gram implementations varied depending on the language and subtask. Then, we selected the optimal threshold values, which were the same for both subtasks within each language. We also filtered out the features that occurred in only one document in the corpus. Finally, we selected the optimal liblinear classifier parameters: penalty parameter (C), loss function (loss), and tolerance for stopping criteria (tol), based on grid search. The best final system 10-fold cross-validation results were obtained with the configurations shown in Table 3.</p>
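      <p>The parameter search over C, loss, and tol can be sketched with scikit-learn's GridSearchCV; the toy corpus, grid values, and pipeline names below are our own illustrative choices, not the configurations reported in Table 3.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data standing in for the PAN training tweets (variety labels).
docs = ["good morning mate", "g'day how are you", "hello there friend",
        "howdy y'all today"] * 5
labels = ["au", "au", "us", "us"] * 5

pipe = Pipeline([("vec", CountVectorizer(analyzer="char", ngram_range=(3, 4))),
                 ("clf", LinearSVC(dual=True))])  # hinge loss needs the dual form

# Grid over the liblinear parameters tuned in the paper:
# penalty C, loss function, and stopping tolerance, under 10-fold CV.
grid = GridSearchCV(pipe,
                    {"clf__C": [0.01, 0.1, 1, 10],
                     "clf__loss": ["hinge", "squared_hinge"],
                     "clf__tol": [1e-4, 1e-2]},
                    cv=10)
grid.fit(docs, labels)
print(grid.best_params_)
```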
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>The early bird 10-fold cross-validation (10FCV) results in terms of classification
accuracy on the PAN Author Profiling 2017 training corpus and the number of features (N)
for each language are shown in Table 4. Table 5 presents the results obtained on the
PAN Author Profiling 2017 test dataset, evaluated using the TIRA evaluation platform [9].</p>
      <p>As can be seen by comparing Tables 4 and 6, the 10-fold cross-validation results of our final system are higher than those of the early bird submission for all the languages and subtasks, except for Portuguese gender identification. This decrease in accuracy was caused by mistakenly using non-optimal classifier parameters and by filtering out the features that occurred in only one document in the corpus. The highest 10-fold cross-validation improvement, more than 5%, was achieved for English language variety classification. Overall, the results improved by approximately 1% for gender and 2% for variety identification.</p>
      <p>Similar to the 10-fold cross-validation results, our final system showed higher accuracy than the early bird submission when evaluated on the test set (see Tables 5 and 7) for all the languages except Portuguese (a drop of 2.1%). The highest improvements were achieved for the two languages that showed the lowest early bird evaluation results, English and Arabic (improvements of 6.2% and 2.7%, respectively). On average, our final system outperformed the early bird submission by 1.9% (72.76% vs. 70.86%) on the PAN AP 2017 test set.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>We described our system for gender and language variety identification that took part in
the Author Profiling task at PAN 2017. The system configurations are adjusted for each
of the languages and subtasks within the competition. The system uses combinations
of typed and untyped character n-grams with word n-grams and non-textual features.
Feature representations, classifier parameters, and threshold values vary depending on
the targeted language and subtask.</p>
      <p>
        One of the directions for future work would be to examine the contribution of other pre-processing steps, such as replacing digits, splitting punctuation marks, and replacing highly frequent words, as described in [8], as well as standardizing non-standard language expressions (slang words, contractions, and abbreviations), as proposed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was partially supported by the Mexican Government (CONACYT project 240844; SNI; COFAA-IPN; SIP-IPN projects 20162204, 20162064, 20171813, 20171344, and 20172008).</p>
      <p>8. Markov, I., Stamatatos, E., Sidorov, G.: Improving cross-topic authorship attribution: The role of pre-processing. In: Proceedings of the 18th International Conference on Computational Linguistics and Intelligent Text Processing. CICLing 2017, Springer (2017)</p>
      <p>9. Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Improving the Reproducibility of PAN’s Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling. In: Kanoulas, E., Lupu, M., Clough, P., Sanderson, M., Hall, M., Hanbury, A., Toms, E. (eds.) Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative. pp. 268–299. CLEF 2014, Springer, Berlin Heidelberg New York (2014)</p>
      <p>10. Potthast, M., Rangel, F., Tschuggnall, M., Stamatatos, E., Rosso, P., Stein, B.: Overview of PAN’17: Author Identification, Author Profiling, and Author Obfuscation. In: Jones, G., Lawless, S., Gonzalo, J., Kelly, L., Goeuriot, L., Mandl, T., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. 8th International Conference of the CLEF Initiative. CLEF 2017, Springer, Berlin Heidelberg New York (2017)</p>
      <p>11. Rangel, F., Celli, F., Rosso, P., Potthast, M., Stein, B., Daelemans, W.: Overview of the 3rd author profiling task at PAN 2015. In: Cappellato, L., Ferro, N., Jones, G., Juan, E.S. (eds.) CLEF 2015 Labs and Workshops, Notebook Papers. vol. 1391. CEUR (2015)</p>
      <p>12. Rangel, F., Rosso, P., Chugur, I., Potthast, M., Trenkmann, M., Stein, B., Verhoeven, B., Daelemans, W.: Overview of the 2nd author profiling task at PAN 2014. In: Cappellato, L., Ferro, N., Halvey, M., Kraaij, W. (eds.) CLEF 2014 Labs and Workshops, Notebook Papers. vol. 1180, pp. 898–927. CEUR (2014)</p>
      <p>13. Rangel, F., Rosso, P., Potthast, M., Stein, B.: Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) Working Notes Papers of the CLEF 2017 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (2017)</p>
      <p>14. Rangel, F., Rosso, P., Verhoeven, B., Daelemans, W., Potthast, M., Stein, B.: Overview of the 4th author profiling task at PAN 2016: Cross-genre evaluations. In: Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (2016)</p>
      <p>15. Sapkota, U., Bethard, S., Montes-y-Gómez, M., Solorio, T.: Not all character n-grams are created equal: A study in authorship attribution. In: Proceedings of the 2015 Annual Conference of the North American Chapter of the ACL: Human Language Technologies. pp. 93–102. NAACL-HLT ’15, Association for Computational Linguistics (2015)</p>
      <p>16. Zampieri, M., Malmasi, S., Ljubešić, N., Nakov, P., Ali, A., Tiedemann, J., Scherrer, Y., Aepli, N.: Findings of the VarDial Evaluation Campaign 2017. In: Proceedings of the 4th Workshop on NLP for Similar Languages, Varieties and Dialects. VarDial 2017 (2017)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Buitinck</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Louppe</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mueller</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Niculae</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grobler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Layton</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>VanderPlas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holt</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>API design for machine learning software: experiences from the scikit-learn project</article-title>
          .
          <source>In: ECML PKDD Workshop: Languages for Data Mining and Machine Learning</source>
          . pp.
          <fpage>108</fpage>
          -
          <lpage>122</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Gómez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baptista</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinto</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Discriminating between similar languages using a combination of typed and untyped character n-grams and words</article-title>
          .
          <source>In: Proceedings of the 4th Workshop on NLP for Similar Languages, Varieties and Dialects</source>
          . pp.
          <fpage>137</fpage>
          -
          <lpage>145</lpage>
          .
          <source>VarDial</source>
          <year>2017</year>
          , ACL (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Gómez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Posadas-Durán</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanchez-Perez</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chanona-Hernandez</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Improving feature representation based on a neural network for author profiling in social media texts</article-title>
          .
          <source>Computational Intelligence and Neuroscience</source>
          <year>2016</year>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dras</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Native language identification using stacked generalization</article-title>
          .
          <source>CoRR abs/1703.06541</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zampieri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ljubešić</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nakov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ali</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tiedemann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Discriminating between similar languages and Arabic dialect identification: A report on the third DSL shared task</article-title>
          .
          <source>In: Proceedings of the 3rd Workshop on Language Technology for Closely Related Languages, Varieties and Dialects</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          .
          <source>VarDial</source>
          <year>2016</year>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gómez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Posadas-Durán</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gelbukh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Author profiling with doc2vec neural network-based document embeddings</article-title>
          .
          <source>In: Proceedings of the 15th Mexican International Conference on Artificial Intelligence. MICAI</source>
          <year>2016</year>
          , vol.
          <volume>10062</volume>
          . LNAI, Springer (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gómez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gelbukh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Adapting cross-genre author profiling to language and corpus</article-title>
          .
          <source>In: Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings</source>
          , vol.
          <volume>1609</volume>
          , pp.
          <fpage>947</fpage>
          -
          <lpage>955</lpage>
          . CLEF and CEUR-WS.org
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>