<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>December</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>The Winning Approach to Cross-Genre Gender Identification in Russian at RUSProfiling 2017</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ilia Markov</string-name>
          <email>imarkov@nlp.cic.ipn.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Helena Gómez-Adorno</string-name>
          <email>helena.adorno@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Grigori Sidorov</string-name>
          <email>sidorov@cic.ipn.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Gelbukh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Author Profiling</institution>
          ,
          <addr-line>Gender Identification, Cross-Genre, Social Media, Russian, Machine Learning, Computational Linguistics</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>CIC, Instituto Politécnico Nacional Mexico City</institution>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <volume>0</volume>
      <fpage>8</fpage>
      <lpage>10</lpage>
      <abstract>
        <p>We present the CIC systems submitted to the 2017 PAN shared task on Cross-Genre Gender Identification in Russian texts (RUSProfiling). We submitted five systems. One of them was based on a statistical approach using only lexical features, and the other four on machine-learning techniques using combinations of gender-specific Russian grammatical features, word and character n-grams, and suffix n-grams. Our systems achieved the highest weighted accuracy across all the test datasets, occupying the first four places in the ranking.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>Author profiling (AP) is the task of identifying the author’s
demographics, such as age, gender, personality traits, or native language,
based on a sample of his or her writing. This task has numerous
practical applications in forensics, security, and marketing, to name
just a few. For example, in forensics and terrorism prevention
applications, knowing the characteristics of the suspect can narrow
down the search space for the author of a written threat; in
marketing applications, this information can be important to predict a
customer’s shopping preferences or develop new targeted products.</p>
      <p>The rapid growth of social media data available on the Internet
has significantly contributed to the increased interest in this task.
This interest led to the establishment of the annual PAN evaluation
campaign, which is considered one of the main fora on AP, authorship
attribution, plagiarism detection, and other tasks related to the
study of authorship and characteristics of the author of a text.</p>
      <p>
        Recent trends in the field include the cross-genre AP scenario [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ],
that is, the setting in which the training corpus consists of texts of one
genre, while the test set consists of texts of another genre.
Cross-genre AP conditions better match the requirements of real-life
forensic applications, in which the available texts by the
candidate authors can belong to a genre and thematic area different
from those of the texts under investigation.
      </p>
      <p>
        Following the recent trends in the field, the 2017 PAN shared task
on Gender Identification in Russian texts (RUSProfiling) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
provided a cross-genre AP scenario: the training corpus was composed
of tweets, while the provided test datasets covered five different
genres: offline texts (such as a letter to a friend or picture
descriptions), Facebook posts, tweets, product and service online reviews,
and gender imitation texts.
      </p>
      <p>
        Machine-learning methods are commonly used for the AP task.
From the machine-learning perspective, the task is viewed as a
multi-class, single-label classification problem, in which automatic
methods are to assign class labels (e.g., male or female) to the
text samples. Recently, deep-learning techniques [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], such as
character-, word-, and document-embedding approaches [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], have
been used for the task; however, linear models still perform better,
since they seem to be more robust in capturing stylistic information
in the author’s writing. Therefore, we employ commonly used
linear machine-learning approaches and also propose a novel
statistical approach that aims to identify the gender of an author
based on statistical analysis of lexical information.
      </p>
      <p>The paper is organized as follows. In Section 2, we discuss the
related work. In Section 3, we provide some characteristics of the
datasets used in the RUSProfiling shared task 2017. In Section 4, we
describe the conducted experiments, providing the experimental
settings for the submitted systems. In Section 5, we give the
obtained results and their evaluation. Finally, in Section 6 we draw
some conclusions and point to possible directions of future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2 RELATED WORK</title>
      <p>
        The PAN evaluation campaign has become one of the main
platforms for evaluation of AP approaches and methodologies. There
have been various profiling aspects covered by PAN since 2013 [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ],
including age, gender, personality traits, and language variety
identification, under both single- and cross-genre AP conditions.
      </p>
      <p>
        PAN 2017 [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] attracted 22 submissions. Most of the teams
(including the top three systems) used traditional machine-learning
algorithms, such as SVM [
        <xref ref-type="bibr" rid="ref11 ref20 ref9">9, 11, 20</xref>
        ] or logistic regression [
        <xref ref-type="bibr" rid="ref13 ref4">4, 13</xref>
        ].
This edition can be characterized by the increased use of
deep-learning techniques [
        <xref ref-type="bibr" rid="ref18 ref5">5, 18</xref>
        ], in particular word and character
embeddings [
        <xref ref-type="bibr" rid="ref19 ref2 ref4">2, 4, 19</xref>
        ], which are gaining popularity and achieving results on the
AP task that are competitive with, but still lower than, those of
linear models.
      </p>
      <p>Content-based and style-based features have been extensively
used in the previous editions of PAN. As content-based features, bag
of words, word n-grams, slang words, locations, brand names, topic
words, among others, were used by several teams. As style-based
features, character n-grams are the most popular feature type for
AP; other feature types include the ratio of links, character flooding,
typed character n-grams, emoticons, hashtags, and user mentions.</p>
      <p>
        Due to the scarcity of available training data, AP research in the
Russian language has been limited. The first corpus in the Russian
language annotated with the authors’ metadata information—the
Ruspersonality corpus—was introduced by Litvinova et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The
corpus is composed of texts labeled with the author’s gender, age,
personality traits, native language, neuropsychological testing data,
and educational level. The corpus also contains a subset of truthful
and deceptive texts. At the time of publication of [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the corpus
contained over 1,850 documents.
      </p>
      <p>
        Several experiments were carried out in order to illustrate the
usefulness of the Ruspersonality corpus [
        <xref ref-type="bibr" rid="ref6 ref8">6, 8</xref>
        ]. For gender
identification, Litvinova et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] used a range of context-independent
features such as part-of-speech (POS) tags, syntactic relations,
ratios of POS tags, punctuation marks, and emotion words. They also
evaluated different machine-learning algorithms: gradient boosting,
AdaBoost, random forest, SVM, and ReLU, among others. The best
performance was obtained by ReLU (mean F1-score of 74%).
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 DATASETS</title>
      <p>The focus of the RUSProfiling shared task 2017 is on cross-genre
gender identification. The organizers provided a training dataset
composed of tweets and five different test datasets in the following
genres:</p>
      <p>
        Test 1: Offline texts (such as picture descriptions or a letter to a
friend) from the Ruspersonality Corpus [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>Test 2: Facebook posts.</p>
      <p>Test 3: Twitter messages.</p>
      <p>Test 4: Product and service online reviews.</p>
      <p>Test 5: Gender imitation corpus, that is, women imitating men
and vice versa.</p>
      <p>Table 1 presents general statistics of the training and five test
datasets. In the table, No. of docs stands for the number of
documents in each dataset. The statistics of the average number (Avg.) of
words and characters per document, as well as the standard deviation
(Std.), were calculated after applying pre-processing steps, which
included lowercasing and removal of all non-Cyrillic characters
(punctuation marks were also removed). In terms of the average
number of words and characters, the Test 2 dataset is the most similar to
the training corpus. The main difference between the two datasets
is the standard deviation, which is larger in the training corpus.
The Test 3 dataset is of the same genre as the training corpus, but it
contains shorter documents, of 729.22 words on average. The Test 1
and Test 5 datasets have similar statistics in terms of the number
of words and characters, but differ in the number of documents
(370 and 94, respectively). Finally, the Test 4 dataset contains the
shortest documents, of 54.40 words on average.</p>
    </sec>
    <sec id="sec-4">
      <title>4 EXPERIMENTAL SETTINGS</title>
      <p>To evaluate our systems, we conducted experiments both on the
provided training dataset under 10-fold cross-validation and using
80%–20% dataset splitting, that is, we used 80% (480 documents) of
the training dataset for training and 20% (120 documents) for
evaluation. The splitting was balanced across the genders. Following
the official evaluation metric of the shared task, we measured the
performance in terms of classification accuracy.</p>
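      <p>The 80%–20% splitting described above can be illustrated with a minimal sketch. It assumes a training set of 600 documents with 300 per gender (the per-gender count is an assumption inferred from the balanced split), and the function name balanced_split and the random seed are hypothetical choices made for this illustration, not the procedure actually used.</p>
      <preformat><![CDATA[
```python
import random


def balanced_split(labels, train_fraction=0.8, seed=0):
    """Return (train, evaluation) index lists with the same gender balance in both parts."""
    rng = random.Random(seed)
    by_label = {}
    for index, label in enumerate(labels):
        by_label.setdefault(label, []).append(index)
    train_idx, eval_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label][:]
        rng.shuffle(indices)
        # Cut each gender separately so that both parts stay balanced.
        cut = int(round(len(indices) * train_fraction))
        train_idx.extend(indices[:cut])
        eval_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(eval_idx)


# Toy stand-in for the training set; 300 documents per gender is an assumption.
labels = ["male"] * 300 + ["female"] * 300
train_idx, eval_idx = balanced_split(labels)
```
]]></preformat>
      <p>With these toy labels, the split contains 480 training and 120 evaluation documents, half of each part per gender.</p>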
      <p>
        We applied several pre-processing steps before feature extraction.
Pre-processing has proved to be a useful strategy for author
profiling [
        <xref ref-type="bibr" rid="ref11 ref3">3, 11</xref>
        ] and related tasks, such as authorship attribution [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
Keeping in mind that the test datasets are in another genre, we kept
only Cyrillic characters (non-Cyrillic characters along with
punctuation marks were removed). We also performed lowercasing, which
yielded a slight improvement in accuracy. These pre-processing steps
were applied in all our runs (in the context of this shared task,
systems are officially called runs).
      </p>
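      <p>As an illustration of these pre-processing steps, the following minimal sketch lowercases a text and keeps only Cyrillic letters. It rests on two assumptions that are not stated in the text: "Cyrillic characters" is taken to mean the Russian alphabet including ё, and removed characters are replaced by a space so that neighbouring words do not merge. The function name preprocess is hypothetical.</p>
      <preformat><![CDATA[
```python
import re

# Assumption: only the Russian alphabet (including ё) counts as Cyrillic here;
# whitespace is kept so that word boundaries survive.
NON_CYRILLIC = re.compile(r"[^а-яё\s]")


def preprocess(text):
    """Lowercase the text and drop everything except Cyrillic letters and whitespace."""
    lowered = text.lower()
    # Assumption: removed characters become spaces, then runs of spaces are collapsed.
    cleaned = NON_CYRILLIC.sub(" ", lowered)
    return " ".join(cleaned.split())


example = preprocess("Я купила iPhone 7, и он ОЧЕНЬ классный!!! #happy")
```
]]></preformat>
      <p>Here the Latin word, the digit, the hashtag, and the punctuation are all discarded, leaving only lowercased Russian words.</p>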
      <p>
        In all the runs based on machine-learning techniques, we used
the Support Vector Machines (SVM) algorithm, which is considered
among the best-performing classification algorithms for text
categorization tasks, including the cross-genre AP scenario [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. We used
the liblinear scikit-learn [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] implementation of SVM with the OvR
multi-class strategy. We set the penalty hyper-parameter C to 100
based on the evaluation results. In our experiments on the training
dataset, SVM showed higher performance than the other classification
algorithms we tried, such as random forest, logistic regression,
multinomial Naïve Bayes, LDA, and an ensemble classifier.
      </p>
      <p>In our machine-learning approaches, we used two different
implementations of the term frequency–inverse document frequency
(tf-idf) weighting: the default scikit-learn implementation and
tf-idf with sublinear tf scaling, i.e., tf was replaced with 1 + log(tf).
In our experiments on the training dataset, tf-idf systematically
outperformed the other examined weighting schemes, such as binary,
tf, and log entropy.</p>
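      <p>The difference between the two tf-idf variants can be sketched in plain Python. The sketch below is not the scikit-learn code itself; it assumes the smoothed idf formula and L2 normalization that scikit-learn documents as its defaults, and the function name tfidf_vectors is hypothetical.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def tfidf_vectors(documents, sublinear_tf=False):
    """Compute L2-normalized tf-idf weights for a list of tokenized documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in documents:
        weights = {}
        for term, tf in Counter(tokens).items():
            if sublinear_tf:
                tf = 1.0 + math.log(tf)  # tf is replaced with 1 + log(tf)
            # Assumption: smoothed idf, as documented for scikit-learn's default.
            idf = math.log((1.0 + n_docs) / (1.0 + doc_freq[term])) + 1.0
            weights[term] = tf * idf
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors


docs = [["она", "сказала", "сказала", "сказала"], ["он", "сказал"]]
plain = tfidf_vectors(docs)
sublinear = tfidf_vectors(docs, sublinear_tf=True)
```
]]></preformat>
      <p>In the first toy document, the repeated word weighs three times as much as the single word under plain tf, but only 1 + log(3) times as much under sublinear scaling, which dampens the effect of repetition.</p>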
      <p>The configurations of the five runs of the CIC team are described
below.</p>
    </sec>
    <sec id="sec-5">
      <title>4.1 Run CIC-1 (machine learning)</title>
      <p>Features. Since singular past tense verb forms in Russian
change by gender (singular masculine forms have
the ending -л “-l”, while an indicator of singular feminine forms is
the ending -ла “-la”), we used “word ending in -ла” as a feature.
Moreover, since past tense reflexive verbs maintain the reflexive
ending -сь “-s’”, we also used the feature “word ending in -лась”
“-las’”. We employed the features -ла “-la” and -лась “-las’” in
isolation, as well as in combination with the subject of the sentence
if the subject was the first-person singular pronoun я “ya” and if
this subject was within the window of 6 words after, or 3 words
before, the verb. This gave four additional composite features:
“я -ла”, “я -лась”, “-ла я”, and “-лась я”, with meanings such
as “I -ed<sub>feminine</sub> myself”, as in I dressed myself in a skirt. The window
size (+6/−3) was selected based on grid search.</p>
      <p>In addition, since Russian adjectives agree with pronouns
in gender, we used the ending -ая “-aya” (nominative feminine
singular form) in combination with the first-person singular
pronoun я “ya” as a feature if the pronoun was within the same +6/−3
window as above. This gave two more features: “я -ая” and “-ая
я”, with meanings such as “I -feminine-singular-adjective”, as in I am
a professor emerita.</p>
      <p>Additionally, we used the last three (Cyrillic) characters of each
word as features (suffix n-grams, n = 3), which, in particular,
indirectly accounted for other grammatically meaningful endings
such as -ный “-nyĭ” (hinting at a masculine adjective, as in I am a professor
emeritus).</p>
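      <p>The feature description above can be sketched as follows. This is an illustrative reading rather than the submitted implementation: the feature names (such as "suf3=") are hypothetical labels, the input is assumed to be one lowercased, tokenized sentence, and, in line with the text, endings are matched on the word form alone without part-of-speech tagging.</p>
      <preformat><![CDATA[
```python
PRONOUN = "я"
VERB_ENDINGS = ("лась", "ла")  # feminine past tense: reflexive and plain
ADJ_ENDING = "ая"              # nominative feminine singular adjective


def cic1_features(tokens, after=6, before=3):
    """Extract suffix 3-grams and gender-marking features from one tokenized sentence."""
    features = []
    for i, word in enumerate(tokens):
        # Suffix 3-gram: the last three characters of every word.
        features.append("suf3=" + word[-3:])
        # Verb endings are used in isolation ...
        endings = [e for e in VERB_ENDINGS if word.endswith(e)]
        for ending in endings:
            features.append("-" + ending)
        # ... while the adjective ending is used only together with the pronoun.
        if word.endswith(ADJ_ENDING):
            endings.append(ADJ_ENDING)
        preceding = tokens[max(0, i - before):i]   # up to 3 words before
        following = tokens[i + 1:i + 1 + after]    # up to 6 words after
        for ending in endings:
            if PRONOUN in preceding:
                features.append(PRONOUN + " -" + ending)
            if PRONOUN in following:
                features.append("-" + ending + " " + PRONOUN)
    return features


features = cic1_features("вчера я оделась и пошла гулять".split())
```
]]></preformat>
      <p>For this toy sentence the sketch produces, among others, the isolated features "-лась" and "-ла" and the composite features "я -лась" and "я -ла", since the pronoun precedes both verbs within the window.</p>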
      <p>
        Frequency threshold. Fine-tuning the size of the feature set has
proved to be of great importance in AP [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. It makes it possible to reduce
the size of the feature set significantly and, at the same time, to
improve the results in most cases. In this run, we selected only those
features that occurred in at least two documents in the training
corpus and occurred at least five times in the entire training corpus
(min_df = 2; threshold = 5).
      </p>
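      <p>The two-part frequency threshold can be illustrated with a short sketch. It assumes that each document is already represented as a list of extracted features; the function name select_features and the toy corpus are hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def select_features(doc_features, min_df=2, threshold=5):
    """Keep features seen in at least `min_df` documents and at least `threshold` times overall."""
    doc_freq = Counter()
    total_freq = Counter()
    for features in doc_features:
        total_freq.update(features)      # counts every occurrence
        doc_freq.update(set(features))   # counts each document once
    return {f for f in total_freq
            if doc_freq[f] >= min_df and total_freq[f] >= threshold}


# Toy corpus of three documents (made-up data for illustration only).
corpus = [
    ["-ла", "-ла", "-ла", "suf3=шла"],
    ["-ла", "-ла", "suf3=ный"],
    ["suf3=ный", "suf3=ный", "suf3=ный", "suf3=ный", "suf3=ный"],
]
kept = select_features(corpus)
```
]]></preformat>
      <p>In this toy corpus, "suf3=шла" is dropped because it occurs in a single document, while the other two features pass both conditions.</p>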
      <p>Weighting scheme. Tf-idf weighting with sublinear tf scaling.</p>
    </sec>
    <sec id="sec-6">
      <title>4.2 Run CIC-2 (machine learning)</title>
      <p>
        Features. Word features represent the lexical choice of a writer.
These features have proved to be indicative of the author’s gender in
other languages, such as English, Spanish, Portuguese, and
Arabic [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. In this run, we used word unigram features (bag-of-words
approach) in combination with the last three characters of each
word (suffix 3-grams).
      </p>
      <p>Frequency threshold. The threshold was the same as in the CIC-1
run.</p>
      <p>Weighting scheme. Tf-idf weighting without sublinear tf scaling.</p>
    </sec>
    <sec id="sec-7">
      <title>4.3 Run CIC-3 (statistical)</title>
      <p>First, we labeled the words that occur in the training corpus as
male’s or female’s, depending on whether the word was used (not
counting repetitions) more frequently in male’s or female’s
documents, except when the difference was less than 2.</p>
      <p>Next, for each document we calculated the ratio of such male’s to
female’s words (not counting repetitions). We labeled a document
as male’s if this ratio was above a threshold; otherwise, as female’s.
Since the dataset was balanced, as the threshold we used the median
of the distribution of this ratio.</p>
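      <p>The two steps above can be sketched as follows. Several details are assumptions made only for this illustration: "not counting repetitions" is read as counting each word once per document, add-one smoothing is used to avoid division by zero (the text does not say how that case was handled), and the median is taken over the set of documents being classified. All function names are hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import Counter
from statistics import median


def label_words(train_docs, train_labels, min_diff=2):
    """Mark each training word as 'male' or 'female' by comparing document frequencies."""
    df = {"male": Counter(), "female": Counter()}
    for tokens, label in zip(train_docs, train_labels):
        df[label].update(set(tokens))  # set(): repetitions are not counted
    word_label = {}
    for word in set(df["male"]) | set(df["female"]):
        diff = df["male"][word] - df["female"][word]
        if abs(diff) >= min_diff:  # differences below 2 leave the word unlabeled
            word_label[word] = "male" if diff > 0 else "female"
    return word_label


def male_female_ratio(tokens, word_label):
    """Ratio of distinct male-labeled to distinct female-labeled words in one document."""
    counts = Counter(word_label[w] for w in set(tokens) if w in word_label)
    # Assumption: add-one smoothing, so a document without female words is still defined.
    return (counts["male"] + 1) / (counts["female"] + 1)


def classify(docs, word_label):
    """Label a document 'male' if its ratio is above the median ratio, else 'female'."""
    ratios = [male_female_ratio(tokens, word_label) for tokens in docs]
    cutoff = median(ratios)
    return ["male" if r > cutoff else "female" for r in ratios]


# Made-up toy data for illustration only.
train_docs = [["пошел", "домой"], ["пошел", "работа"], ["пошла", "домой"], ["пошла", "кино"]]
train_labels = ["male", "male", "female", "female"]
word_label = label_words(train_docs, train_labels)
predictions = classify([["я", "пошел", "пошел"], ["я", "пошла"]], word_label)
```
]]></preformat>
      <p>Using the median as the cutoff splits the classified documents roughly in half, which is reasonable only when the data are expected to be balanced across genders.</p>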
      <p>We also experimented with taking repetitions of words into
account, thresholds other than 2 for classifying words, as well as
with some formulas other than ratio for classifying documents;
however, we observed a lower performance.</p>
    </sec>
    <sec id="sec-8">
      <title>4.4 Run CIC-4 (machine learning)</title>
      <p>
        Features. A combination of word and character n-gram features
usually provides good results for AP; for instance, a combination
of word and character n-grams was used by the best-performing
system [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] at this year’s PAN shared task [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. In this run, we used
a combination of word unigrams with character n-grams (n = 2–3).
      </p>
      <p>Frequency threshold. We selected only those features that
occurred in at least two documents in the training corpus and
occurred at least four times in the entire training corpus (min_df = 2;
threshold = 4).</p>
      <p>Frequency threshold. In run CIC-5, we set a high frequency
threshold value: we selected only those features that occurred in at least
two documents in the training corpus and occurred at least 50 times
in the entire training corpus (min_df = 2; threshold = 50).
However, setting this high frequency threshold value only marginally
affected 10-fold cross-validation and 80%–20% accuracy, making it
very slightly higher or very slightly lower.</p>
      <sec id="sec-8-1">
        <title>Weighting scheme</title>
        <p>Tf-idf with sublinear tf scaling.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>5 RESULTS</title>
      <p>The 10-fold cross-validation results, in terms of classification
accuracy (acc.) for each run, as well as the results under 80%–20%
dataset splitting, are shown in Table 2. For each experiment, the
results for 10-fold cross-validation (10FCV) and 80%–20% splitting,
as well as the number of features (No. of features), are provided. The
best result for each evaluation procedure is highlighted in bold
typeface.</p>
      <p>Our first run, which included gender-specific Russian
grammatical features, showed the highest 10-fold cross-validation accuracy
with the smallest number of features. Three out of five of our
runs (CIC-1, CIC-2, and CIC-5) showed the same accuracy under
80%–20% splitting, probably due to the small size of the dataset. The
statistical approach (run CIC-3) showed the lowest accuracy under both
10-fold cross-validation and 80%–20% setting, though, surprisingly,
it showed the best results on several of the final test datasets, as
shown in Table 3. We attribute this, again, to the small size of the
datasets available for development.</p>
      <p>
        A comparison of the participating systems, including the official
ranking, is presented in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We show the detailed results of our five
runs on the five test datasets, along with the highest result achieved
on each test set among all participating systems and the system
that showed this result, in Table 3. The best result on each test
dataset is highlighted in bold typeface. Avg. stands for the average
accuracy of each run across the five test datasets; if a system was
not tested on some test set, we counted its accuracy on this test set
as zero. Weighted stands for the accuracy weighted by the number
of documents in each test set (again, counting as zero if a system
was not evaluated on a test set); this was the measure used for the
official ranking. Norm. is similar to Weighted, but is normalized by
the highest accuracy on each test set (note that this is not accuracy;
it is the average closeness of the given system to the best system).
      </p>
      <sec id="sec-9-1">
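        <p>The three summary measures can be sketched as follows. This is one reading of the definitions above rather than the official scoring script: in particular, Norm. is assumed to be weighted by the number of documents in the same way as Weighted. The function name summarize and all numbers in the example are made up for illustration.</p>
        <preformat><![CDATA[
```python
def summarize(accuracies, n_docs, best):
    """Compute (Avg., Weighted, Norm.) for one run.

    accuracies: accuracy per test set, with None where the run was not evaluated
    n_docs:     number of documents in each test set
    best:       highest accuracy achieved by any system on each test set
    """
    # A test set on which the run was not evaluated counts as zero accuracy.
    acc = [a if a is not None else 0.0 for a in accuracies]
    total = sum(n_docs)
    avg = sum(acc) / len(acc)
    weighted = sum(a * n for a, n in zip(acc, n_docs)) / total
    # Assumption: Norm. uses the same document weighting as Weighted.
    norm = sum((a / b) * n for a, b, n in zip(acc, best, n_docs)) / total
    return avg, weighted, norm


# Made-up numbers for two hypothetical test sets; the run skipped the second one.
avg, weighted, norm = summarize([0.8, None], [100, 300], [0.8, 0.9])
```
]]></preformat>
        <p>In this toy case the run matches the best system on the first test set but is penalized on all three measures for skipping the larger second one.</p>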
        <p>As one can see from Table 3, none of the runs consistently
outperformed other runs across all the test datasets. The Test 3 set
consisted of documents that were collections of various tweets of
the same author, similarly to the training corpus, so it was not
exactly a cross-genre scenario, but the documents in the Test 3 set
contained fewer tweets than those of the training corpus. On this
dataset, as well as on Test 4 with the shortest documents (online
reviews), the best performance among our runs was achieved by run
CIC-3, which was based on the statistical approach. Test 2
(Facebook posts) was the only test set on which our statistical approach
(CIC-3) failed to produce a good result.</p>
        <p>Surprisingly, on the gender imitation corpus (Test 5), CIC-1 was
our second-best run (after CIC-3), even though CIC-1 was based
on gender-specific Russian grammatical (morphological) features,
such as the grammatical gender of verbs and adjectives, which in
imitated text follow the patterns of the gender being imitated.</p>
        <p>Runs CIC-4 and CIC-5, in spite of showing similar 10-fold
cross-validation and 80%–20% accuracy, performed worse on the test
datasets than our first three runs. This may be due to the inclusion
of character n-grams, which probably caused overfitting. Another
reason for the relatively poor performance of CIC-5 could be the
excessively high frequency threshold value set for this run.</p>
        <p>For a more in-depth analysis of the obtained results, access to
the gold standard for the test datasets would be required.</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>6 CONCLUSIONS</title>
      <p>
        We have presented the description of the five systems submitted by
the CIC team to the 2017 PAN shared task on Gender Identification
in Russian texts (RUSProfiling), four of them occupying the first four
places in the official ranking [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The task focused on a cross-genre
author profiling (AP) scenario: the training corpus was composed
of tweets, while the provided test datasets were composed of offline
texts, Facebook posts, tweets, online reviews, and gender imitation
texts.
      </p>
      <p>Our systems, which were not tuned for a specific genre, showed
the highest accuracy on three out of five test datasets (Facebook
posts, tweets, and product and service online reviews), performing worse
on two test datasets than more genre-specific systems, which were
used only for some of the genres. Our first run, based on a
machine-learning approach using gender-specific Russian grammatical
features, showed the highest average accuracy across all the test datasets,
while our statistical approach based on lexical features showed the
best performance according to the weighted (official) and
normalized evaluation.</p>
      <p>One of the directions for future work would be to examine in
more detail the importance of morphological features for gender
identification in Russian texts, as well as to improve our statistical
approach by automatically tuning the threshold value according to
the size and genre of the test data.</p>
    </sec>
    <sec id="sec-11">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was partially supported by the Mexican Government
(CONACYT projects 240844, SNI, COFAA-IPN, SIP-IPN 20171813,
20172008, and 20172044).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Angelo</given-names>
            <surname>Basile</surname>
          </string-name>
          , Gareth Dwyer, Maria Medvedeva, Josine Rawee, Hessel Haagsma, and
          <string-name>
            <given-names>Malvina</given-names>
            <surname>Nissim</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>N-GrAM: New Groningen Author-profiling Model</article-title>
          .
          In
          <source>Working Notes Papers of the CLEF 2017 Evaluation Labs (CEUR Workshop Proceedings)</source>
          , Vol.
          <volume>1866</volume>
          . CLEF and CEUR-WS.org, Dublin, Ireland.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Marc</given-names>
            <surname>Franco-Salvador</surname>
          </string-name>
          , Nataliia Plotnikova, Neha Pawar, and
          <string-name>
            <given-names>Yassine</given-names>
            <surname>Benajiba</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Subword-based Deep Averaging Networks for Author Profiling in Social Media</article-title>
          .
          In
          <source>Working Notes Papers of the CLEF 2017 Evaluation Labs (CEUR Workshop Proceedings)</source>
          , Vol.
          <volume>1866</volume>
          . CLEF and CEUR-WS.org, Dublin, Ireland.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Helena</given-names>
            <surname>Gómez-Adorno</surname>
          </string-name>
          , Ilia Markov, Grigori Sidorov, Juan-Pablo Posadas-Durán, Miguel A. Sanchez-Perez, and
          <string-name>
            <given-names>Liliana</given-names>
            <surname>Chanona-Hernandez</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Improving Feature Representation Based on a Neural Network for Author Profiling in Social Media Texts</article-title>
          .
          <source>Computational Intelligence and Neuroscience</source>
          <volume>2016</volume>
          (October 2016), 13 pages. https://doi.org/10.1155/2016/1638936
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Andrey</given-names>
            <surname>Ignatov</surname>
          </string-name>
          , Liliya Akhtyamova, and
          <string-name>
            <given-names>John</given-names>
            <surname>Cardiff</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Twitter Author Profiling Using Word Embeddings and Logistic Regression</article-title>
          .
          In
          <source>Working Notes Papers of the CLEF 2017 Evaluation Labs (CEUR Workshop Proceedings)</source>
          , Vol.
          <volume>1866</volume>
          . CLEF and CEUR-WS.org, Dublin, Ireland.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Don</given-names>
            <surname>Kodiyan</surname>
          </string-name>
          , Florin Hardegger, Stephan Neuhaus, and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Cieliebak</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Author Profiling with Bidirectional RNNs using Attention with GRUs</article-title>
          .
          In
          <source>Working Notes Papers of the CLEF 2017 Evaluation Labs (CEUR Workshop Proceedings)</source>
          , Vol.
          <volume>1866</volume>
          . CLEF and CEUR-WS.org, Dublin, Ireland.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Tatiana</given-names>
            <surname>Litvinova</surname>
          </string-name>
          , Olga Litvinova, Olga Zagorovskaya, Pavel Seredin, Aleksandr Sboev, and
          <string-name>
            <given-names>Olga</given-names>
            <surname>Romanchenko</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>“Ruspersonality”: A Russian Corpus for Authorship Profiling and Deception Detection</article-title>
          .
          In
          <source>Proceedings of the 2016 International FRUCT Conference on Intelligence, Social Media and Web, ISMW FRUCT 2016</source>
          . IEEE, St. Petersburg, Russia,
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Tatiana</given-names>
            <surname>Litvinova</surname>
          </string-name>
          , Francisco Rangel, Paolo Rosso, Pavel Seredin, and
          <string-name>
            <given-names>Olga</given-names>
            <surname>Litvinova</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Overview of the RUSProfiling PAN at FIRE Track on Cross-genre Gender Identification in Russian</article-title>
          .
          In
          <source>Notebook Papers of FIRE 2017, FIRE 2017 (CEUR Workshop Proceedings)</source>
          . CEUR-WS.org, Bangalore, India.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Tatiana</given-names>
            <surname>Litvinova</surname>
          </string-name>
          , Pavel Seredin, Olga Litvinova, Olga Zagorovskaya, Aleksandr Sboev, Dmitry Gudovskih, Ivan Moloshnikov, and
          <string-name>
            <given-names>Roman</given-names>
            <surname>Rybka</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Gender Prediction for Authors of Russian Texts Using Regression and Classification Techniques</article-title>
          .
          <source>In Proceedings of the 3rd Workshop on Concept Discovery in Unstructured Data co-located with the 13th International Conference on Concept Lattices and Their Applications</source>
          ,
          <source>CDUD@CLA</source>
          , Vol.
          <volume>1625</volume>
          . CEUR-WS.org,
          <fpage>44</fpage>
          -
          <lpage>53</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A. Pastor</given-names>
            <surname>López-Monroy</surname>
          </string-name>
          , Manuel Montes-y-Gómez, Hugo Jair-Escalante, Luis Villaseñor Pineda, and
          <string-name>
            <given-names>Thamar</given-names>
            <surname>Solorio</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Social-Media Users can be Profiled by their Similarity with other Users</article-title>
          .
          <source>In Working Notes Papers of the CLEF 2017 Evaluation Labs (CEUR Workshop Proceedings)</source>
          , Vol.
          <volume>1866</volume>
          . CLEF and CEUR-WS.org, Dublin, Ireland.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Ilia</given-names>
            <surname>Markov</surname>
          </string-name>
          , Helena Gómez-Adorno, Juan-Pablo Posadas-Durán, Grigori Sidorov
          , and
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Author Profiling with Doc2vec Neural Network-Based Document Embeddings</article-title>
          .
          <source>In Proceedings of the 15th Mexican International Conference on Artificial Intelligence, MICAI 2016</source>
          , Vol.
          <volume>10062</volume>
          .
          Part II, LNAI, Springer, Cancún, Mexico,
          <fpage>117</fpage>
          -
          <lpage>131</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Ilia</given-names>
            <surname>Markov</surname>
          </string-name>
          , Helena Gómez-Adorno, and
          <string-name>
            <given-names>Grigori</given-names>
            <surname>Sidorov</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Language- and Subtask-Dependent Feature Selection and Classifier Parameter Tuning for Author Profiling</article-title>
          .
          <source>In Working Notes Papers of the CLEF 2017 Evaluation Labs (CEUR Workshop Proceedings)</source>
          , Vol.
          <volume>1866</volume>
          . CLEF and CEUR-WS.org, Dublin, Ireland.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Ilia</given-names>
            <surname>Markov</surname>
          </string-name>
          , Efstathios Stamatatos, and
          <string-name>
            <given-names>Grigori</given-names>
            <surname>Sidorov</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Improving Cross-Topic Authorship Attribution: The Role of Pre-Processing</article-title>
          .
          <source>In Proceedings of the 18th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2017</source>
          . Springer, Budapest, Hungary.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Matej</given-names>
            <surname>Martinc</surname>
          </string-name>
          , Iza Škrjanec, Katja Zupan, and
          <string-name>
            <given-names>Senja</given-names>
            <surname>Pollak</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>PAN 2017: Author Profiling - Gender and Language Variety Prediction</article-title>
          .
          <source>In Working Notes Papers of the CLEF 2017 Evaluation Labs (CEUR Workshop Proceedings)</source>
          , Vol.
          <volume>1866</volume>
          . CLEF and CEUR-WS.org, Dublin, Ireland.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Fabian</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          , Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          , Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau,
          <string-name>
            <given-names>Matthieu</given-names>
            <surname>Brucher</surname>
          </string-name>
          , Matthieu Perrot, and
          <string-name>
            <given-names>Édouard</given-names>
            <surname>Duchesnay</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Scikit-learn: Machine Learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (November 2011),
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          . http://dl.acm.org/citation.cfm?id=1953048.2078195
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Francisco</given-names>
            <surname>Rangel</surname>
          </string-name>
          , Paolo Rosso, Moshe Koppel, Efstathios Stamatatos, and
          <string-name>
            <given-names>Giacomo</given-names>
            <surname>Inches</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Overview of the Author Profiling Task at PAN 2013</article-title>
          .
          <source>In Working Notes Papers of the CLEF 2013 Evaluation Labs (CEUR Workshop Proceedings)</source>
          . CLEF and CEUR-WS.org, Valencia, Spain,
          <fpage>23</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Francisco</given-names>
            <surname>Rangel</surname>
          </string-name>
          , Paolo Rosso,
          <string-name>
            <given-names>Martin</given-names>
            <surname>Potthast</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Benno</given-names>
            <surname>Stein</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter</article-title>
          .
          <source>In Working Notes Papers of the CLEF 2017 Evaluation Labs (CEUR Workshop Proceedings)</source>
          . CLEF and CEUR-WS.org, Dublin, Ireland.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Francisco</given-names>
            <surname>Rangel</surname>
          </string-name>
          , Paolo Rosso, Ben Verhoeven, Walter Daelemans,
          <string-name>
            <given-names>Martin</given-names>
            <surname>Potthast</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Benno</given-names>
            <surname>Stein</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Overview of the 4th Author Profiling Task at PAN 2016: Cross-genre Evaluations</article-title>
          .
          <source>In Working Notes Papers of the CLEF 2016 Evaluation Labs (CEUR Workshop Proceedings)</source>
          . CLEF and CEUR-WS.org, Évora, Portugal.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Nils</given-names>
            <surname>Schaetti</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>UniNE at CLEF 2017: TF-IDF and Deep-Learning for Author Profiling</article-title>
          .
          <source>In Working Notes Papers of the CLEF 2017 Evaluation Labs (CEUR Workshop Proceedings)</source>
          , Vol.
          <volume>1866</volume>
          . CLEF and CEUR-WS.org, Dublin, Ireland.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Sierra</surname>
          </string-name>
          , Manuel Montes-y-Gómez, Thamar Solorio, and
          <string-name>
            <given-names>Fabio A.</given-names>
            <surname>González</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Convolutional Neural Networks for Author Profiling</article-title>
          .
          <source>In Working Notes Papers of the CLEF 2017 Evaluation Labs (CEUR Workshop Proceedings)</source>
          , Vol.
          <volume>1866</volume>
          . CLEF and CEUR-WS.org, Dublin, Ireland.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Eric S.</given-names>
            <surname>Tellez</surname>
          </string-name>
          , Sabino Miranda-Jiménez, Mario Graf, and
          <string-name>
            <given-names>Daniela</given-names>
            <surname>Moctezuma</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Gender and Language-Variety Identification with microTC</article-title>
          .
          <source>In Working Notes Papers of the CLEF 2017 Evaluation Labs (CEUR Workshop Proceedings)</source>
          , Vol.
          <volume>1866</volume>
          . CLEF and CEUR-WS.org, Dublin, Ireland.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>