<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the RusProfiling PAN at FIRE Track on Cross-genre Gender Identification in Russian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tatiana Litvinova</string-name>
          <email>centr_rus_yaz@mail.ru</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francisco Rangel</string-name>
          <email>francisco.rangel@autoritas.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Rosso</string-name>
          <email>prosso@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pavel Seredin</string-name>
          <email>paul@phys.vsu.ru</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olga Litvinova</string-name>
          <email>olga_litvinova_teacher@mail.ru</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Autoritas Consulting</institution>
          ,
          <addr-line>Valencia</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>PRHLT Research Center, Universitat Politècnica de València</institution>
          ,
          <addr-line>València</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>RusProfiling Lab &amp;, Kurchatov Institute</institution>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>RusProfiling Lab &amp;, Kurchatov Institute</institution>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>RusProfiling Lab</institution>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Author profiling consists of predicting an author's traits (e.g. age, gender, personality) from her writing. After mainly addressing age and gender identification at PAN@CLEF<sup>1</sup>, in this RusProfiling PAN@FIRE track we have addressed the problem of predicting an author's gender in Russian from a cross-genre perspective: given a training set from Twitter, the systems have been evaluated on five different genres (essays, Facebook, Twitter, reviews and texts in which the authors imitated the other gender, thereby changing their idiostyle). In this paper, we analyse the 22 runs sent by 5 participant teams. The best results (although also the most dispersed ones) have been obtained on Facebook.</p>
        <p>Keywords: author profiling; gender identification; cross-genre profiling; Russian</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Author profiling involves predicting an author's
demographics, personality traits, education and so on from her
writing, with gender identification being the most popular
task [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref15 ref16 ref2 ref4 ref5 ref6 ref8">10, 8, 12, 13, 11, 2, 5, 6, 15, 16, 4</xref>
        ]. Author profiling
tasks are popular among participants of PAN, which is
a series of scientific events and shared tasks on digital text
forensics.<sup>2</sup> Slavic languages, however, are less investigated
from an author profiling standpoint and have never been
addressed at PAN.
      </p>
      <p>This year at FIRE we have introduced a PAN shared task
on Cross-genre Gender Identification in Russian texts
(RusProfiling shared task), where we provided tweets as the training
dataset and Facebook posts, online reviews, texts
describing images or letters to a friend, as well as tweets, as test
datasets. The focus is especially on cross-genre gender
profiling.</p>
      <p>The rest of the overview paper is structured as follows. In
Section 2, we describe the construction of the corpus and the
evaluation metrics. In Section 3, participants' approaches
are presented, and in Section 4 the obtained results are
discussed. Finally, in Section 5 we draw some conclusions.</p>
      <p><sup>1</sup> http://pan.webis.de/</p>
      <p><sup>2</sup> http://pan.webis.de/index.html</p>
    </sec>
    <sec id="sec-2">
      <title>2. EVALUATION FRAMEWORK</title>
      <p>In this section we describe the construction of the
corpus, covering particular properties, challenges and novelties.
Moreover, the evaluation measures are described.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Corpus</title>
      <p>In this section, we describe the datasets that have been
released for the task described in the previous section. We
have designed these datasets using manual and automated
techniques and made them available to participants through
the task web page.<sup>3</sup></p>
      <p>
        Twitter dataset: 500 users per gender, split into a
training set (300 users per gender) and a test set (200
users per gender). Annotation is what makes designing
social media corpora particularly challenging. Some
researchers have built Twitter corpora automatically, while
others have solved this problem by using labor-intensive
methods. For example, Rao et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] use a focused search
methodology followed by manual annotation to produce a
dataset of 500 English users labeled with gender. The
gender tag was ascribed based on the screen name, profile
picture, self-description ('bio') and (in the few cases where this was
not sufficient) the use of gender markings when users refer to
themselves. For this research we used the same approach,
with manual labeling of tweet author gender. In those
cases where the gender information was not clear, we
discarded the user. Retweets were removed.
      </p>
      <p>The number of tweets from one user varied from 1 to 200
(depending on how active the users were at the time the
data was collected, September 2016). All tweets from one
user were merged together and considered as one text. As
the analysis suggests, the tweets contain a lot of non-original
information (hashtags, hidden citations such as copied
newsfeeds, hyperlinks, etc.), which makes them extremely
challenging to analyze.</p>
      <p><sup>3</sup> http://en.rusprofilinglab.ru/rusprofiling-at-pan/korpus/</p>
      <p>Facebook dataset: 228 users (114 authors per gender) of
different age groups (20+, 30+, 40+) from different Russian
cities were randomly chosen (to minimize mutual
friendships). We used the same principles for gender labeling as
were used for Twitter. All posts from one user were merged
into one text, with an average length of 1000 words.</p>
      <p>As in the Twitter data collection, Facebook
pages of famous people involved in administration or
government, and accounts of heads of major companies, were not
used in the study. As the analysis shows, Russian
Facebook texts contain less non-original information than
tweets.</p>
      <p>
        Essays dataset: 185 authors per gender, one or two texts
per author (when there were two texts, they were merged together
and considered as one text). The texts were taken randomly
from the manually collected RusPersonality corpus [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
RusPersonality is the first Russian-language corpus of written texts
labeled with data on their authors. A unique aspect of the
corpus is the breadth of the metadata (gender, age,
personality, neuropsychological testing data, education level, etc.).
The texts were written by respondents especially for this
corpus, do not contain any borrowings and are not edited.
The topics of the texts were: a letter to a friend, a picture
description, and a letter to an employer trying to convince her to hire the
respondent. The average text length in this dataset was 150
words.
      </p>
      <p>Reviews dataset: 388 authors per gender, one text per
author. The texts were collected from Trustpilot<sup>4</sup>; the
author's gender was identified based on the profile information.
The average text length was 80 words.</p>
      <p>Gender-imitated dataset: 47 authors per gender, three
texts from each author that were merged together and
considered as one text. The texts were randomly selected from
an existing corpus we have collected, the Gender
Imitation Corpus. The Gender Imitation Corpus is the first
Russian corpus for studies of stylistic deception. Each
respondent (n=142) was instructed to write three texts on
the same topic (from a list). An example of
the task is: "Last summer you bought a package tour from a
travel agency, but you were not at all pleased with your
experience with that company and the trip was not worth the
price. You are about to ask for a refund. Write three texts
describing your negative experience providing a detailed
account of it. Give a warning that you are intending to sue
the company". The first text is supposed to be written in
the way usual for whoever writes it (without any deception);
the second one should be written as if by someone of the
opposite gender ("imitation"); the third one should be written as if
by another individual of the same gender, so that her
personal writing style will not be recognized (what is referred
to as "obfuscation"). Most of the texts are 80-150 words
long. All of the respondents are students of Russian
universities. Besides the texts, the corpus includes metadata with
the authors' characteristics: gender, age, native language,
handedness, psychological gender (femininity/masculinity).
Therefore, the corpus provides many opportunities for
investigating how properties of
written speech are imitated, including gender
(biological and psychological) imitation in texts. To the best of our
knowledge, this is the first corpus of this kind. Presently,
the corpus is being prepared to be made available on the
RusProfiling Lab website.</p>
      <p>Table 1 summarizes the number of authors per
dataset.</p>
      <p>To evaluate the submitted approaches we
have used accuracy, following the author profiling tasks at PAN.
In the RusProfiling shared task, we have calculated the
accuracy per dataset as the number of authors correctly identified
divided by the total number of authors in this dataset. The
global ranking has been obtained by calculating the average
accuracy among all the datasets weighted by the number of
documents in each dataset:
        <disp-formula id="eq1">
          <label>(1)</label>
          <tex-math>\mathit{global\_acc} = \frac{\sum_{ds} \mathit{accuracy}(ds) \cdot \mathit{size}(ds)}{\sum_{ds} \mathit{size}(ds)}</tex-math>
        </disp-formula>
      </p>
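      <p>The two measures above can be sketched in a few lines of Python. This is a minimal illustration rather than the organisers' evaluation script: the function names and the figures in the example are assumptions made for this sketch and are not results of the task.</p>
      <preformat preformat-type="code">
```python
from typing import Dict, Sequence, Tuple


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Share of authors whose predicted gender matches the gold label."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and aligned")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def global_accuracy(results: Dict[str, Tuple[float, int]]) -> float:
    """Average of per-dataset accuracies weighted by dataset size.

    results maps a dataset name to (accuracy, number of documents).
    """
    total = sum(size for _, size in results.values())
    if total == 0:
        raise ValueError("no documents to average over")
    weighted = sum(acc * size for acc, size in results.values())
    return weighted / total


# Hypothetical accuracies and sizes, for illustration only.
example = {"genre_a": (0.60, 100), "genre_b": (0.80, 300)}
print(global_accuracy(example))
```
      </preformat>
      <p>Because the average is weighted by size, a run that scores well on a small dataset gains less in the global ranking than one that scores well on a large dataset.</p>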
    </sec>
    <sec id="sec-4">
      <title>2.3 Baselines</title>
      <p>
        To understand the complexity of the task per genre and
with the aim of comparing the performance of the
participants' approaches, we propose the following baselines, as
we did at PAN at CLEF in 2017 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]:
      </p>
      <p>majority. A statistical baseline that emulates random
choice. The baseline depends on the number of classes:
two in the case of gender identification.</p>
      <p>bow. This method represents documents as a
bag-of-words with the 5,000 most common words in the
training set, weighted by absolute frequency of occurrence,
and it uses an SVM as the machine learning algorithm. The
texts are preprocessed as follows: lowercasing of words,
removal of punctuation signs and numbers, and removal
of stop words for the corresponding language.</p>
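      <p>The preprocessing and counting steps of the bow baseline can be sketched as follows. This is a simplified illustration using only the Python standard library, not the baseline's actual implementation: the tokenisation rule, the function names and the tiny stop-word set are assumptions of this sketch, and the SVM training step is left out.</p>
      <preformat preformat-type="code">
```python
import re
from collections import Counter
from typing import Iterable, List, Set

# Runs of letters only, so punctuation signs and numbers are dropped.
TOKEN = re.compile(r"[^\W\d_]+")


def tokenize(text: str, stop_words: Set[str]) -> List[str]:
    """Lowercase, keep alphabetic tokens, remove stop words."""
    return [t for t in TOKEN.findall(text.lower()) if t not in stop_words]


def build_vocabulary(docs: Iterable[str], stop_words: Set[str],
                     max_size: int = 5000) -> List[str]:
    """Most common words of the training documents (5,000 by default)."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc, stop_words))
    return [word for word, _ in counts.most_common(max_size)]


def vectorize(doc: str, vocabulary: List[str], stop_words: Set[str]) -> List[int]:
    """Absolute frequency of each vocabulary word in the document."""
    counts = Counter(tokenize(doc, stop_words))
    return [counts[word] for word in vocabulary]


# Hypothetical training texts and stop words, for illustration only.
stop = {"я", "он", "и"}
train = ["Я люблю чай, чай!", "Он любит кофе 2 раза"]
vocab = build_vocabulary(train, stop, max_size=3)
print(vocab, vectorize("чай и чай", vocab, stop))
```
      </preformat>
      <p>The resulting count vectors would then be passed to a classifier such as an SVM.</p>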
      <p>
        LDR [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. This method represents documents on the
basis of the probability distribution of occurrence of
their words in the different classes. The key concept
of LDR is a weight, representing the probability of a
term to belong to one of the different categories (e.g.
female vs. male). The distribution of weights for a
given document should be closer to the weights of its
corresponding category. LDR takes advantage of the
whole vocabulary.
      </p>
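      <p>The idea of a per-class term weight can be illustrated with the sketch below. It is a deliberately simplified stand-in, not the LDR method of [9]: here the weight of a term for a class is assumed to be the share of the term's raw occurrences that fall in that class, and a document is summarised only by the mean weight of its known terms per class. The published method may weight terms and summarise documents differently; all names are hypothetical.</p>
      <preformat preformat-type="code">
```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple


def term_weights(docs: Sequence[Tuple[List[str], str]]) -> Dict[str, Dict[str, float]]:
    """weights[term][label] = share of the term's occurrences found in label."""
    per_class = defaultdict(Counter)
    totals = Counter()
    for tokens, label in docs:
        per_class[label].update(tokens)
        totals.update(tokens)
    return {
        term: {label: per_class[label][term] / n for label in per_class}
        for term, n in totals.items()
    }


def represent(tokens: List[str], weights: Dict[str, Dict[str, float]],
              labels: Sequence[str]) -> Dict[str, float]:
    """Mean class weight of the document's known terms, one value per class."""
    known = [t for t in tokens if t in weights]
    if not known:
        return {label: 0.0 for label in labels}
    return {label: sum(weights[t][label] for t in known) / len(known)
            for label in labels}


# Hypothetical tokenised training documents, for illustration only.
train = [
    (["она", "сказала"], "female"),
    (["он", "сказал"], "male"),
    (["она", "пришла"], "female"),
]
weights = term_weights(train)
print(represent(["она", "сказал"], weights, ["female", "male"]))
```
      </preformat>
      <p>A document written by a given class is expected to obtain a higher mean weight for that class, which is what a downstream classifier would exploit.</p>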
    </sec>
    <sec id="sec-6">
      <title>3. OVERVIEW OF THE SUBMITTED APPROACHES</title>
      <p>In the following, we briefly describe the systems submitted by
the five participants of the task from three perspectives:
preprocessing, features to represent the authors' texts and
classification approaches. In Table 3 the teams and the
corresponding references are presented.</p>
      <p>
        Preprocessing. Preprocessing was carried out to obtain
plain text [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Various participants removed stopwords [
        <xref ref-type="bibr" rid="ref1 ref17">1,
17</xref>
        ], short words [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and Twitter-specific elements (user
mentions, hashtags and links) [
        <xref ref-type="bibr" rid="ref1 ref17">1, 17</xref>
        ]. Some of them also
removed punctuation marks [
        <xref ref-type="bibr" rid="ref1 ref7">7, 1</xref>
        ] as well as numbers [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and
the authors in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] removed non-Cyrillic characters. Finally,
lemmatisation has been performed by the authors in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
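      <p>A cleaning step of the kind described above could look like the following sketch. It does not reproduce any participant's code: the regular expressions and the function name are assumptions of this sketch, and it applies all of the removals at once (Twitter-specific elements, punctuation, numbers and non-Cyrillic characters), whereas each team applied only some of them.</p>
      <preformat preformat-type="code">
```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
# Anything that is neither a Cyrillic letter nor whitespace:
# this covers punctuation, digits and Latin characters.
NON_CYRILLIC = re.compile(r"[^а-яёА-ЯЁ\s]")
SPACES = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Strip links, user mentions, hashtags and non-Cyrillic characters."""
    for pattern in (URL, MENTION, HASHTAG):
        text = pattern.sub(" ", text)
    text = NON_CYRILLIC.sub(" ", text)
    return SPACES.sub(" ", text).strip().lower()


# Hypothetical tweet, for illustration only.
print(clean_tweet("@ivan Привет, мир! 123 #утро https://t.co/abc ok"))
```
      </preformat>
      <p>Links, mentions and hashtags are removed first so that their Cyrillic parts (for example a Russian hashtag) do not survive the later character filter.</p>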
      <p>
        Features. Traditionally, author profiling tasks have been
approached with content and style-based features. In this
vein, the authors in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] extracted features such as the
number of user mentions, hashtags and urls, emoticons,
punctuation marks, and average word length, combined with tf-idf
bag-of-words. Similarly, the authors in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] combined
different kinds of features in their systems, such as word and
character n-grams, the words most frequently used per gender, and
linguistic patterns such as word endings or the use of first
person singular pronouns within a given distance of a verb in the past
tense. This linguistic rule has been combined with
deep learning techniques in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Finally, the authors in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]
performed topic modelling and the authors in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] developed
a representation scheme based on the texts belonging to the
corresponding target classes.
      </p>
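      <p>The linguistic rule mentioned above relies on the fact that Russian past-tense singular verbs are marked for gender (for example, the masculine form typically ends in "-л" and the feminine form in "-ла"). The sketch below is a crude heuristic written for illustration, not the participants' rule: the suffix lists, the window size and the function name are assumptions, and the suffix test will also fire on some words that are not verbs.</p>
      <preformat preformat-type="code">
```python
import re
from typing import Tuple

WORD = re.compile(r"[а-яё]+")
FEMININE = ("ла", "лась")   # e.g. "купила", "вернулась"
MASCULINE = ("л", "лся")    # e.g. "купил", "вернулся"


def gender_marker_counts(text: str, window: int = 3) -> Tuple[int, int]:
    """Count feminine and masculine past-tense endings near the pronoun "я".

    Returns (feminine, masculine). A token is inspected if it lies at most
    `window` tokens before or after an occurrence of "я".
    """
    tokens = WORD.findall(text.lower())
    feminine = masculine = 0
    for i, token in enumerate(tokens):
        if token != "я":
            continue
        start = max(0, i - window)
        for neighbour in tokens[start:i] + tokens[i + 1:i + 1 + window]:
            if neighbour.endswith(FEMININE):
                feminine += 1
            elif neighbour.endswith(MASCULINE):
                masculine += 1
    return feminine, masculine


# Hypothetical sentences, for illustration only.
print(gender_marker_counts("Вчера я купила билет"))
print(gender_marker_counts("Я пошёл домой"))
```
      </preformat>
      <p>The two counts could be used directly as a rule or added as features alongside n-grams; the feminine suffixes are tested first because every word ending in "-ла" would otherwise never match before the shorter masculine pattern is considered.</p>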
      <p>
        Classification Approaches. Traditional features have
been used with machine learning methods such as Support
Vector Machines (SVM) [
        <xref ref-type="bibr" rid="ref18 ref3 ref7">18, 7, 3</xref>
        ], Random Forest [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and
AdaBoost [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. The authors in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] used Additive
Regularization for Topic Modelling. Finally, the authors in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], who
combined a rule-based approach with deep learning, have
used variations of Long Short-Term Memory (LSTM) networks.
      </p>
    </sec>
    <sec id="sec-8">
      <title>4. EVALUATION AND DISCUSSION OF THE SUBMITTED APPROACHES</title>
      <p>Due to the cross-genre perspective of the task, five datasets
were provided. Five teams submitted a total of 22 runs,
whose distribution per dataset is shown in Table 3. As can
be seen, a total of 93 runs have been analysed, with 18-19
runs per dataset.</p>
      <p>The distribution of the results per dataset is shown in
Figure 1. Notably, the highest accuracy was obtained on
Facebook, with a median of about 75% and a maximum
over 90%. However, results on this genre are also the
most dispersed, with a standard deviation of 0.16. On the
other hand, results on the gender-imitated corpus are the
lowest ones, with most of the participants obtaining
accuracies close to 50%, which corresponds to the majority
class baseline. However, there were two participants who
obtained results of about 65%. In the following subsections we
analyse the results per dataset in more depth.</p>
      <p>Results on the essays dataset (Table 4) show an
average accuracy of 55.39%, a median of 54.86% and a total
of seven runs below the majority class and bow baselines.
Apart from these low results, there are four runs
that improve on this baseline by more than 10 percentage points, with accuracies between
60.27% and 78.38%.</p>
      <p>The best result (78.38%) has been obtained by BITS Pilani,
who combined linguistic rules with deep learning techniques.
The second best result (68.11%) has been obtained by
AmritaNLP, who used stylistic features with traditional
machine learning algorithms. As can be seen, the first result is
more than 10 percentage points higher than the second one, and about 23
points higher than the average, showing the power of deep
learning in this task when training on Twitter and evaluating on
essays. However, neither of these systems overcame the LDR
baseline (81.41%), whose accuracy was 3 and 13 percentage points
higher, respectively.</p>
      <p>In Table 5 the results on the Facebook dataset are shown.
The average value (71.19%), the median (75%), the third quartile (Q3,
86.19%) and the best value (93.42%) are all the highest among the
datasets. Indeed, they are even higher than those obtained on
the Twitter dataset (shown in Table 6). However, the
systems behaved heterogeneously on this dataset,
obtaining the most dispersed results, with an inter-quartile range
of 34.44%. This is due to five runs at or below the
majority baseline, and another run from the same
participant very close to 50%. Furthermore, 12 systems performed
worse than the bow baseline, which obtained an accuracy of
76.32%, higher than both the mean (71.19%) and the median
(75%).</p>
      <p>The four best results have been obtained by CIC, who
trained SVMs with combinations of n-grams and linguistic
rules, among others. The fifth and sixth best results have
been obtained by BITS Pilani with linguistic rules combined
with deep learning. The best runs outperformed the LDR
baseline by 2% and 12%, respectively.
In this case, although the deep learning techniques obtained
good results, they are more than 5% lower than those of the traditional
approaches.</p>
    </sec>
    <sec id="sec-9">
      <title>4.3 Twitter</title>
      <p>The results obtained on the Twitter dataset are shown
in Table 6. The two best results (68.25%, 66.50%) have
been obtained by the CIC team, with the next result tied with
BITS Pilani (65.25%). These results are very similar to the
one obtained by the LDR baseline (67.59%). The average
result falls to 57.87%, below the median of 61.12%,
due to the low results obtained by most of the runs sent by
the RBG team. In this vein, it is noteworthy that the
result obtained by the bow baseline (49.37%) is below the
majority baseline.</p>
      <p>Although the results on the Twitter dataset were expected
to be the highest ones, they are much lower than those
obtained on the Facebook dataset. On Facebook, besides
maintaining the spontaneity of Twitter, posts tend to be longer
and grammatically richer, with fewer syntactic errors and
misspellings. This may be the cause of the increase in
accuracy. Furthermore, although the mean is higher, the best
result on Twitter (68.25%) is 10 percentage points lower than the one obtained
on the essays dataset (78.38%).</p>
    </sec>
    <sec id="sec-10">
      <title>4.4 Reviews</title>
      <p>Results on the reviews dataset (Table 7) are lower than
on the previous datasets, although with the lowest dispersion: most
of the participants obtained results close to the average and
median (52.87% and 52.06% respectively). As can be
observed, these results are very close to the majority class
(50%) and the bow baseline (50%), with five runs at or
below them, and nine runs with less than 5% of improvement.
These low results expose the difficulty of the task on this
genre when the training data comes from Twitter.</p>
      <p>The best results have been achieved by the CIC (61.86% and
59.79%) and BITS Pilani (57.86% and 57.73%) teams,
as in the previous datasets (although about 4% lower than
the 65.81% obtained by the LDR baseline). However, the
difference with respect to the other genres is more than 7% in the case of Twitter, 17% in the case
of essays and 30% in the case of Facebook.</p>
      <p>In the gender-imitated corpus, the authors were asked to
write texts as if they were of the other gender or
obfuscating their style, in addition to texts without imitation. In Table 8
the results of the gender identification task on this genre are
shown. The average and median accuracies obtained by the
systems on this dataset are the lowest (51.90% and 50%
respectively). Most participants obtained accuracies close to
the majority class and the bow baseline: 11 runs with an
accuracy equal to or lower than 50% and 6 runs with less than
5% of improvement. Only two runs of the BITS Pilani team
obtained a significant improvement, of 13% and 15% over the
majority class. This team combined linguistic rules with
deep learning techniques, showing the robustness of these
techniques when the authors imitate the other gender and
style. In this vein, we should highlight that the LDR baseline
(55.32%), AmritaNLP (54.26%) and CIC (54.26%), which
obtained similar results to one another, performed about 10%
worse than the aforementioned deep learning techniques.</p>
      <p>The global ranking shown in Table 9 has been calculated
following Equation 1. It is noteworthy that most participants
obtained a weighted accuracy between 47% and 57%, with
a median of 54.42%. That means that most of the
participants obtained results close to the majority class (50%) and
the bow baseline (53.13%). There are also three runs that
obtained results much lower than the majority class because
they participated only on some of the datasets.</p>
      <p>At the top of the ranking, the CIC
team obtained the four best results, with accuracies
ranging from 58.62% to 64.56%, showing the robustness and
homogeneity of their approach. However, it should be
highlighted that, as BITS Pilani ran different systems on the
different datasets, a fair comparison has not been
possible, although they obtained one of the best
results on each of them. For example, run 4 obtained 78.38% accuracy on essays
(more than 10 percentage points above the next one) but was run neither
on the Facebook nor on the gender-imitated sets, which lowered its overall
accuracy. It is worth mentioning that none of the
systems outperformed the LDR baseline (71.21%), which
performed 6.65 percentage points better than the best
system.</p>
    </sec>
    <sec id="sec-11">
      <title>5. CONCLUSION</title>
      <p>This paper describes the 22 systems sent by 5
participants to the RusProfiling shared task at PAN-FIRE 2017.
Participants submitted a total of 93 runs on the five
different datasets, with 18-19 runs per dataset. They had to
address the identification of the author's gender from a
cross-genre perspective: given a training set of Twitter data, the
systems have been evaluated on five different sets (essays,
Facebook, Twitter, reviews and gender-imitated texts).</p>
      <p>Participants have used different kinds of approaches, from
traditional ones based on hand-crafted features and machine
learning techniques such as Support Vector Machines, to the
currently fashionable deep learning techniques. Depending
on the genre, the latter performed best, as
in the case of essays and gender-imitated texts, where they
improved on the traditional approaches by more than 10%.</p>
      <p>
        Contrary to what was expected, the best results have not
been achieved on Twitter but on Facebook. The reason may
be that, although Facebook maintains the spontaneity of
Twitter, its posts tend to be longer and grammatically
richer, with fewer syntactic errors and misspellings. On the
other hand, some of the worst results have been obtained on
reviews. Similar cross-genre effects were also observed at
PAN-2014 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>In the case of the gender-imitated texts, most systems failed,
with 11 runs at or below the majority baseline, and 6
runs with less than 5% of improvement. Only two systems
of BITS Pilani obtained results with more than 10% of
improvement over the baseline. In this more difficult scenario,
the deep learning approaches showed their superiority over
traditional approaches.</p>
    </sec>
    <sec id="sec-12">
      <title>6. ACKNOWLEDGMENTS</title>
      <p>The creation of the Gender Imitation Corpus
was supported by the Russian Science Foundation, project
No. 16-18-10050 "Identifying the Gender and Age of Online
Chatters Using Formal Parameters of their Texts". Texts
with style obfuscation were collected in the framework of the
project "Lie Detection in a Written Text: A Corpus Study"
supported by the Russian Foundation for Basic Research, project No.
15-34-01221. The third author acknowledges the
SomEMBED TIN2015-71147-C2-1-P MINECO research project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bhargava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Goel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shah</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sharma</surname>
          </string-name>
          .
          <article-title>Gender identification in Russian texts</article-title>
          .
          <source>In Working Notes for PAN-RUSProfiling at FIRE'17. Workshops Proceedings of the 9th International Forum for Information Retrieval Evaluation (Fire'17)</source>
          , Bangalore, India. CEUR-WS.org,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Celli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lepri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-I.</given-names>
            <surname>Biel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gatica-Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Riccardi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Pianesi</surname>
          </string-name>
          .
          <article-title>The workshop on computational personality recognition 2014</article-title>
          .
          <source>In Proceedings of the ACM International Conference on Multimedia</source>
          , pages
          <fpage>1245</fpage>
          -
          <lpage>1246</lpage>
          . ACM,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ganesh HB</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar M</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>KP</surname>
          </string-name>
          .
          <article-title>Representation of target classes for text classification - amrita cen nlp@RusProfiling PAN 2017</article-title>
          .
          <source>In Working Notes for PAN-RUSProfiling at FIRE'17. Workshops Proceedings of the 9th International Forum for Information Retrieval Evaluation (Fire'17)</source>
          , Bangalore, India. CEUR-WS.org,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Litvinova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gudovskikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sboev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Seredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Litvinova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pisarevskaya</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          .
          <article-title>Author gender prediction in Russian social media texts</article-title>
          .
          <source>In Conf. on Analysis of Images, Social networks, and Texts</source>
          , AIST-
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Litvinova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Litvinova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Zagorovskaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Seredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sboev</surname>
          </string-name>
          , and
          <string-name>
            <given-names>O.</given-names>
            <surname>Romanchenko</surname>
          </string-name>
          .
          <article-title>"RusPersonality": A Russian corpus for authorship profiling and deception detection</article-title>
          .
          <source>In Intelligence, Social Media and Web (ISMW FRUCT)</source>
          ,
          <year>2016</year>
          International FRUCT Conference on, pages
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . IEEE,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Litvinova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Seredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Litvinova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Zagorovskaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sboev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gudovskih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Moloshnikov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Rybka</surname>
          </string-name>
          .
          <article-title>Gender prediction for authors of Russian texts using regression and classification techniques</article-title>
          .
          <source>In CDUD@CLA</source>
          , pages
          <fpage>44</fpage>
          –
          <lpage>53</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>I.</given-names>
            <surname>Markov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gomez-Adorno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          .
          <article-title>The winning approach to cross-genre gender identification in Russian at RusProfiling 2017</article-title>
          .
          <source>In Working Notes for PAN-RUSProfiling at FIRE'17. Workshops Proceedings of the 9th International Forum for Information Retrieval Evaluation (Fire'17)</source>
          , Bangalore, India. CEUR-WS.org,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Chugur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Trenkmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Verhoeven</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Daelemans</surname>
          </string-name>
          .
          <article-title>Overview of the 2nd author profiling task at PAN 2014</article-title>
          . In Cappellato L.,
          <string-name>
            <surname>Ferro</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halvey</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kraaij</surname>
            <given-names>W</given-names>
          </string-name>
          . (Eds.)
          <article-title>CLEF 2014 labs and workshops, notebook papers</article-title>
          .
          <source>CEUR-WS.org</source>
          , vol.
          <volume>1180</volume>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Franco-Salvador</surname>
          </string-name>
          .
          <article-title>A low dimensionality representation for language variety identification</article-title>
          .
          <source>In 17th International Conference on Intelligent Text Processing and Computational Linguistics</source>
          , CICLing. Springer-Verlag, LNCS, arXiv:1705.10754,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koppel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Inches</surname>
          </string-name>
          .
          <article-title>Overview of the author profiling task at PAN 2013</article-title>
          . In Forner P.,
          <string-name>
            <surname>Navigli</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tufis</surname>
            <given-names>D</given-names>
          </string-name>
          . (Eds.),
          <article-title>CLEF 2013 labs and workshops, notebook papers</article-title>
          .
          <source>CEUR-WS.org</source>
          , vol.
          <volume>1179</volume>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          .
          <article-title>Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter</article-title>
          .
          <source>In Working Notes Papers of the CLEF 2017 Evaluation Labs, CEUR Workshop Proceedings. CLEF and CEUR-WS.org, Sept</source>
          .
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Daelemans</surname>
          </string-name>
          .
          <article-title>Overview of the 3rd author profiling task at PAN 2015</article-title>
          . In Cappellato L.,
          <string-name>
            <surname>Ferro</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            <given-names>G.</given-names>
          </string-name>
          , San Juan E. (Eds.)
          <article-title>CLEF 2015 labs and workshops, notebook papers</article-title>
          .
          <source>CEUR Workshop Proceedings. CEUR-WS.org</source>
          , vol.
          <volume>1391</volume>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Verhoeven</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Daelemans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          .
          <article-title>Overview of the 4th author profiling task at PAN 2016: cross-genre evaluations</article-title>
          .
          <source>In Working Notes Papers of the CLEF 2016 Evaluation Labs, CEUR Workshop Proceedings. CLEF and CEUR-WS.org, Sept</source>
          .
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yarowsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shreevats</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Gupta</surname>
          </string-name>
          .
          <article-title>Classifying latent user attributes in Twitter</article-title>
          .
          <source>In Proceedings of the 2nd international workshop on Search and mining user-generated contents</source>
          , pages
          <fpage>37</fpage>
          –
          <lpage>44</lpage>
          . ACM,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sboev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Litvinova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gudovskikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rybka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Moloshnikov</surname>
          </string-name>
          .
          <article-title>Machine learning models of text categorization by author gender using topic-independent features</article-title>
          .
          <source>Procedia Computer Science</source>
          ,
          <volume>101</volume>
          :
          <fpage>135</fpage>
          –
          <lpage>142</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sboev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Litvinova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Voronina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gudovskikh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Rybka</surname>
          </string-name>
          .
          <article-title>Deep learning network models to categorize texts according to author's gender and to identify text sentiment</article-title>
          .
          <source>In Computational Science and Computational Intelligence (CSCI)</source>
          , 2016 International Conference on, pages
          <fpage>1101</fpage>
          –
          <lpage>1106</lpage>
          . IEEE,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>G.</given-names>
            <surname>Skitalinskaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Akhtyamova</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Cardiff</surname>
          </string-name>
          .
          <article-title>Cross-genre gender identification in Russian texts using topic modeling working note: Team dubl</article-title>
          .
          <source>In Working Notes for PAN-RUSProfiling at FIRE'17. Workshops Proceedings of the 9th International Forum for Information Retrieval Evaluation (Fire'17)</source>
          , Bangalore, India. CEUR-WS.org,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>V.</given-names>
            <surname>Vinayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>J.R.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>NB</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar M</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>K P</surname>
          </string-name>
          .
          <article-title>AmritaNLP@PAN-RusProfiling: Author profiling using machine learning techniques</article-title>
          .
          <source>In Working Notes for PAN-RUSProfiling at FIRE'17. Workshops Proceedings of the 9th International Forum for Information Retrieval Evaluation (Fire'17)</source>
          , Bangalore, India. CEUR-WS.org,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>