
Overview of the RusProfiling PAN at FIRE Track on Cross-genre Gender Identification in Russian

Tatiana Litvinova
RusProfiling Lab, Russia
centr_rus_yaz@mail.ru

Francisco Rangel
Autoritas Consulting, Valencia, Spain
francisco.rangel@autoritas.es

Paolo Rosso
PRHLT Research Center, Universitat Politècnica de València, Spain
prosso@dsic.upv.es

Pavel Seredin
RusProfiling Lab & Kurchatov Institute, Russia
paul@phys.vsu.ru

Olga Litvinova
RusProfiling Lab & Kurchatov Institute, Russia
olga_litvinova_teacher@mail.ru

Author profiling consists of predicting an author's traits (e.g. age, gender, personality) from her writing. After addressing mainly age and gender identification at PAN@CLEF¹, in this RusProfiling PAN@FIRE track we have addressed the problem of predicting an author's gender in Russian from a cross-genre perspective: given a training set of Twitter data, the systems have been evaluated on five different genres (essays, Facebook, Twitter, reviews, and texts in which the authors imitated the other gender and thus changed their idiostyle). In this paper, we analyse the 22 runs sent by 5 participant teams. The best results (although also the most dispersed ones) have been obtained on Facebook.

Keywords: author profiling; gender identification; cross-genre profiling; Russian

1. INTRODUCTION

Author profiling involves predicting an author's demographics, personality traits, education and so on from her writing, with gender identification being the most popular task [10, 8, 12, 13, 11, 2, 5, 6, 15, 16, 4]. Author profiling tasks are popular among participants of PAN, which is a series of scientific events and shared tasks on digital text forensics.² Slavic languages, however, are less investigated from an author profiling standpoint and have never been addressed at PAN.

This year at FIRE we have introduced a PAN shared task on Cross-genre Gender Identification in Russian texts (RusProfiling shared task), where we provided tweets as a training dataset and Facebook posts, online reviews, texts describing images or letters to a friend, as well as tweets, as test datasets. The focus is especially on cross-genre gender profiling.

The rest of the overview paper is structured as follows. In Section 2, we describe the construction of the corpus and the evaluation metrics. In Section 3, participants' approaches are presented, and in Section 4 the obtained results are discussed. Finally, in Section 5 we draw some conclusions.

¹ http://pan.webis.de/
² http://pan.webis.de/index.html

2. EVALUATION FRAMEWORK

In this section we describe the construction of the corpus, covering particular properties, challenges and novelties. Moreover, the evaluation measures are described.

2.1 Corpus

In this section, we describe the datasets that have been released for the task described in the previous section. We have designed these datasets using manual and automated techniques and made them available to participants through the task web page.³

Twitter dataset: the data (500 users per gender) were split into training (300 users per gender) and test (200 users per gender) sets. Annotating social media texts is what makes designing such corpora particularly challenging. Some researchers have built Twitter corpora automatically, while others have solved this problem with labor-intensive methods. For example, Rao et al. [14] use a focused search methodology followed by manual annotation to produce a dataset of 500 English users labeled with gender. The gender tag was ascribed based on the screen name, profile picture, self-description ('bio') and, in the few cases where this was not sufficient, the use of gender markings when the users referred to themselves. For this research we used the same approach, manually labeling the gender of each tweet author. Where the gender information was not clear, we discarded the user. Retweets were removed.

The number of tweets from one user varied from 1 to 200 (depending on how active the users were at the time the data was collected, September 2016). All tweets from one user were merged together and considered as one text. As the analysis suggests, the tweets contain a lot of non-original information (hashtags, hidden citations such as copied newsfeeds, hyperlinks, etc.), which makes them extremely challenging to analyze.

³ http://en.rusprofilinglab.ru/rusprofiling-at-pan/korpus/

Facebook dataset: 228 users (114 authors per gender) of different age groups (20+, 30+, 40+) from different Russian cities were randomly chosen (to minimise mutual friendships). We used the same principles for gender labeling as were used for Twitter. All posts from one user were merged into one text with an average length of 1000 words.

As with the Twitter data, Facebook pages of famous people involved in administration or government, and accounts of heads of major companies, were not used in the study. As the analysis shows, Russian Facebook texts contain less non-original information than tweets.

Essays dataset: 185 authors per gender, one or two texts per author (in the case of two texts, they were merged together and considered as one text). The texts were taken randomly from the manually collected RusPersonality corpus [5]. RusPersonality is the first Russian-language corpus of written texts labeled with data on their authors. A unique aspect of the corpus is the breadth of the metadata (gender, age, personality, neuropsychological testing data, education level, etc.). The texts were written by respondents especially for this corpus, do not contain any borrowings and are not edited. The topics of the texts were: a letter to a friend, a picture description, and a letter to an employer trying to convince her to hire the respondent. The average text length in this dataset was 150 words.

Reviews dataset: 388 authors per gender, one text per author. The texts were collected from Trustpilot, and the author's gender was identified based on the profile information. The average text length was 80 words.

Gender-imitated dataset: 47 authors per gender, three texts from each author that were merged together and considered as one text. The texts were randomly selected from an existing corpus that we have collected, the Gender Imitation Corpus. The Gender Imitation Corpus is the first Russian corpus for studies of stylistic deception. Each respondent (n=142) was instructed to write three texts on the same topic (from a list). An example of the task is: "Last summer you bought a package tour from a travel agency, but you were not at all pleased with your experience with that company and the trip was not worth the price. You are about to ask for a refund. Write three texts describing your negative experience providing a detailed account of it. Give a warning that you are intending to sue the company". The first text is supposed to be written in the way usual for whoever writes it (without any deception); the second one should be written as if by someone of the opposite gender ("imitation"); the third one should be written as if by another individual of the same gender, so that her personal writing style will not be recognized (what is referred to as "obfuscation"). Most of the texts are 80-150 words long. All of the respondents are students of Russian universities. Besides the texts, the corpus includes metadata with the authors' characteristics: gender, age, native language, handedness, psychological gender (femininity/masculinity). Therefore, the corpus provides many opportunities for investigating how properties of written speech are imitated in different respects, as well as gender (biological and psychological) imitation in texts. To the best of our knowledge, this is the first corpus of this kind. Presently, the corpus is being prepared to be made available on the RusProfiling Lab website.

Table 1 summarises the number of authors per dataset.

To evaluate the participants' approaches we have used accuracy, following the author profiling tasks at PAN. In the RusProfiling shared task, we have calculated the accuracy per dataset as the number of authors correctly identified divided by the total number of authors in this dataset. The global ranking has been obtained by calculating the average accuracy among all the datasets, weighted by the number of documents in each dataset:

\[
\mathit{global\_acc} = \frac{\sum_{ds} \mathit{accuracy}(ds) \cdot \mathit{size}(ds)}{\sum_{ds} \mathit{size}(ds)} \qquad (1)
\]
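As an illustration, Formula 1 can be sketched in a few lines of Python. The dataset sizes below are the test-set author counts given in Section 2.1 (one merged document per author); the function name `global_accuracy`, the dictionary keys, and the choice to count a dataset without a submitted run as accuracy 0 are assumptions made for this sketch, not details confirmed by the task description.

```python
from typing import Dict, Optional

# Test-set sizes (authors = merged documents) as reported in Section 2.1.
DATASET_SIZES: Dict[str, int] = {
    "essays": 2 * 185,
    "facebook": 2 * 114,
    "twitter": 2 * 200,
    "reviews": 2 * 388,
    "gender_imitated": 2 * 47,
}


def global_accuracy(
    accuracies: Dict[str, Optional[float]],
    sizes: Optional[Dict[str, int]] = None,
) -> float:
    """Size-weighted mean of per-dataset accuracies (Formula 1).

    Assumption: a dataset with no run (missing key or None) contributes
    accuracy 0 to the numerator but still counts in the denominator.
    """
    if sizes is None:
        sizes = DATASET_SIZES
    total = sum(sizes.values())
    weighted = sum(
        (accuracies.get(name) or 0.0) * size for name, size in sizes.items()
    )
    return weighted / total
```

Under this assumption, a run that reaches a high accuracy on a single dataset but skips the others ends up with a low global score, because the skipped datasets still count towards the total size.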

2.3 Baselines

To understand the complexity of the task per genre and to compare the performance of the participants' approaches, we propose the following baselines, as we did at PAN at CLEF in 2017 [11]:

majority. A statistical baseline that emulates random choice. The baseline depends on the number of classes: two in the case of gender identification.

bow. This method represents documents as a bag-of-words with the 5,000 most common words in the training set, weighted by absolute frequency of occurrence, and it uses an SVM as the machine learning algorithm. The texts are preprocessed as follows: lowercasing of words, removal of punctuation signs and numbers, and removal of stop words for the corresponding language.

LDR [9]. This method represents documents on the basis of the probability distribution of occurrence of their words in the different classes. The key concept of LDR is a weight representing the probability that a term belongs to one of the different categories (e.g. female vs. male). The distribution of weights for a given document should be closer to the weights of its corresponding category. LDR takes advantage of the whole vocabulary.
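The representation step of the bow baseline described above might look roughly like the following sketch. It covers only the preprocessing and the absolute-frequency vectors; the SVM step is omitted, and the function names, the tokenisation regex and the caller-supplied stop-word set are assumptions made for illustration rather than the organisers' actual implementation.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Set

# Runs of letters only: punctuation signs and numbers are dropped.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text: str, stop_words: Set[str]) -> List[str]:
    """Lowercase the text, keep alphabetic tokens, remove stop words."""
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in stop_words]


def build_vocabulary(
    train_texts: Iterable[str], stop_words: Set[str], max_size: int = 5000
) -> Dict[str, int]:
    """Map the max_size most common training words to vector positions."""
    counts: Counter = Counter()
    for text in train_texts:
        counts.update(tokenize(text, stop_words))
    return {word: i for i, (word, _) in enumerate(counts.most_common(max_size))}


def vectorize(text: str, vocabulary: Dict[str, int], stop_words: Set[str]) -> List[int]:
    """Represent a text by the absolute frequency of each vocabulary word."""
    vector = [0] * len(vocabulary)
    for token in tokenize(text, stop_words):
        index = vocabulary.get(token)
        if index is not None:
            vector[index] += 1
    return vector
```

The resulting count vectors would then be passed to an SVM implementation of one's choice; words outside the 5,000-word vocabulary are simply ignored.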

3. OVERVIEW OF THE SUBMITTED APPROACHES

In the following, we briefly describe the systems submitted by the five participants of the task from three perspectives: preprocessing, features used to represent the authors' texts, and classification approaches. In Table 3 the teams and the corresponding references are presented.

Preprocessing. Preprocessing was carried out to obtain plain text [1]. Various participants removed stopwords [1, 17], short words [17] and Twitter-specific elements (user mentions, hashtags and links) [1, 17]. Some of them also removed punctuation marks [7, 1] as well as numbers [1], and the authors in [7] removed non-Cyrillic characters. Finally, lemmatisation has been performed by the authors in [17].
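A combination of the cleaning steps listed above can be sketched with regular expressions. This is an illustrative pipeline and not the code of any participating team: the patterns, the function name `clean_tweet` and the `min_word_length` parameter are assumptions, and stop-word removal and lemmatisation are left out because they require external resources.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # links
MENTION_RE = re.compile(r"@\w+")                # user mentions
HASHTAG_RE = re.compile(r"#\w+")                # hashtags
NON_CYRILLIC_RE = re.compile(r"[^а-яёА-ЯЁ\s]")  # punctuation, digits, Latin, etc.


def clean_tweet(text: str, min_word_length: int = 1) -> str:
    """Strip Twitter-specific elements and non-Cyrillic characters.

    Words shorter than min_word_length are dropped (short-word removal).
    """
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = NON_CYRILLIC_RE.sub(" ", text)
    words = [w for w in text.lower().split() if len(w) >= min_word_length]
    return " ".join(words)
```

Links are removed first so that their fragments are not left behind as stray tokens once punctuation is stripped.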

Features. Traditionally, author profiling tasks have been approached with content and style-based features. In this vein, the authors in [18] extracted features such as the number of user mentions, hashtags and URLs, emoticons, punctuation marks, and average word length, combined with tf-idf bag-of-words. Similarly, the authors in [7] combined different kinds of features in their systems, such as word and character n-grams, words most frequently used per gender, and linguistic patterns such as word endings or the use of first person singular pronouns within a distance to a verb in past tense. The mentioned linguistic rule has been combined with deep learning techniques in [1]. Finally, the authors in [17] performed topic modelling and the authors in [3] developed a representation scheme based on the texts belonging to the corresponding target classes.
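The linguistic rule mentioned above exploits the fact that Russian past-tense verbs agree with the subject in gender (masculine forms end in "-л" or "-лся", feminine forms in "-ла" or "-лась"). A rough sketch of such a cue counter is given below. The window size, the restriction to words following the pronoun "я", and the purely suffix-based verb detection are simplifying assumptions; the participants' actual rules may differ, and nouns ending in "-л" or "-ла" will produce false positives.

```python
import re
from typing import Dict

TOKEN_RE = re.compile(r"[а-яё]+")
FEMININE_ENDINGS = ("ла", "лась")   # e.g. "купила", "вернулась"
MASCULINE_ENDINGS = ("л", "лся")    # e.g. "купил", "вернулся"


def past_tense_gender_cues(text: str, window: int = 3) -> Dict[str, int]:
    """Count gendered past-tense endings within `window` tokens after "я"."""
    tokens = TOKEN_RE.findall(text.lower())
    cues = {"feminine": 0, "masculine": 0}
    for i, token in enumerate(tokens):
        if token != "я":
            continue
        # Only the first matching candidate after each pronoun is counted.
        for candidate in tokens[i + 1 : i + 1 + window]:
            if candidate.endswith(FEMININE_ENDINGS):
                cues["feminine"] += 1
                break
            if candidate.endswith(MASCULINE_ENDINGS):
                cues["masculine"] += 1
                break
    return cues
```

The two counts could be used directly as a rule (predict the gender with more cues) or appended to other feature vectors.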

Classification Approaches. Traditional features have been used with machine learning methods such as Support Vector Machines (SVM) [18, 7, 3], Random Forest [18] and AdaBoost [18]. The authors in [17] used Additive Regularization for Topic Modelling. Finally, the authors in [1], who combined a rule-based approach with deep learning, have used variations of Long Short-Term Memory networks.

4. EVALUATION AND DISCUSSION OF THE SUBMITTED APPROACHES

Due to the cross-genre perspective of the task, five datasets were provided. Five teams submitted a total of 22 runs, whose distribution per dataset is shown in Table 3. As can be seen, a total of 93 runs have been analysed, with 18-19 runs per dataset.

The distribution of the results per dataset is shown in Figure 1. The highest accuracy is obtained on Facebook, with a median of about 75% and a maximum over 90%. However, results on this genre are also the most dispersed, with a standard deviation of 0.16. On the other hand, results on the gender-imitated corpus are the lowest, with most participants obtaining accuracies close to 50%, which corresponds to the majority class baseline. However, two participants obtained results of about 65%. In the following subsections we analyse the results per dataset in more depth.
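The summary statistics used throughout this section (mean, median, quartiles, inter-quartile range, standard deviation) can be computed per dataset as in the sketch below. The function name is hypothetical, and both the quartile method ("inclusive") and the use of the population standard deviation are assumptions, since the exact conventions behind the reported figures are not stated.

```python
import statistics
from typing import Dict, List


def summarize_runs(accuracies: List[float]) -> Dict[str, float]:
    """Summarise the accuracies of all runs submitted for one dataset."""
    q1, median, q3 = statistics.quantiles(accuracies, n=4, method="inclusive")
    return {
        "mean": statistics.mean(accuracies),
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,                          # inter-quartile range
        "std": statistics.pstdev(accuracies),    # population standard deviation
    }
```

A call such as `summarize_runs([0.50, 0.60, 0.70, 0.80, 0.90])` (made-up values, not results from the task) returns the dispersion measures for that list of runs.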

Results on the essays dataset (Table 4) show an average accuracy of 55.39%, a median of 54.86% and a total of seven runs below the majority class and bow baselines. Apart from these low results, four runs improve on this baseline by more than 10%, with accuracies between 60.27% and 78.38%.

The best result (78.38%) has been obtained by BITS Pilani, who combined linguistic rules with deep learning techniques. The second best result (68.11%) has been obtained by AmritaNLP, who used stylistic features with traditional machine learning algorithms. As can be seen, the first result is more than 10 percentage points higher than the second one, and about 23 points higher than the average, showing the power of deep learning in this task when training on Twitter and evaluating on essays. However, neither of these systems outperformed the LDR baseline (81.41%), whose accuracy was 3 and 13 points higher, respectively.

In Table 5 the results on the Facebook dataset are shown. The average value (71.19%), the median (75%), the third quartile Q3 (86.19%) and the best value (93.42%) are the highest of all datasets. Indeed, they are even higher than those obtained on the Twitter dataset (shown in Table 6). However, the systems behaved heterogeneously on this dataset, yielding the most dispersed results, with an inter-quartile range of 34.44%. This is due to five runs equal to or below the majority baseline, and another run from the same participant very close to 50%. Furthermore, 12 systems performed worse than the bow baseline, which obtained an accuracy of 76.32%, higher than both the mean (71.19%) and the median (75%).

The four best results have been obtained by CIC, who trained SVMs with combinations of n-grams and linguistic rules, among others. The fifth and sixth best results have been obtained by BITS Pilani with linguistic rules combined with deep learning. The best runs outperformed the LDR baseline by 2% and 12%, respectively. In this case, although the deep learning techniques obtained good results, they are more than 5% lower than the traditional approaches.

4.3 Twitter

The results obtained on the Twitter dataset are shown in Table 6. The two best results (68.25%, 66.50%) have been obtained by the CIC team, with the next result tied with BITS Pilani (65.25%). These results are very similar to the one obtained by the LDR baseline (67.59%). The average result falls to 57.87%, below the median of 61.12%, due to the low results obtained by most of the runs sent by the RBG team. In this vein, it is noteworthy that these results are below the majority baseline, as is the result obtained by the bow baseline (49.37%).

Although the results on the Twitter dataset were expected to be the highest ones, they are much lower than those obtained on the Facebook dataset. On Facebook, besides maintaining the spontaneity of Twitter, posts tend to be longer and grammatically richer, with fewer syntactic errors and misspellings. This may be the cause of the increase in accuracy. Furthermore, although the mean is higher, the best result on Twitter (68.25%) is 10% lower than the best obtained on the essays dataset (78.38%).

4.4 Reviews

Results on the reviews dataset (Table 7) are lower than on the previous datasets, although with the lowest dispersion: most of the participants obtained results close to the average and median (52.87% and 52.06%, respectively). As can be observed, these results are very close to the majority class (50%) and the bow baseline (50%), with five runs equal to or below them, and nine runs with less than 5% of improvement. These low results expose the difficulty of the task on this genre when the training data comes from Twitter.

The best results have been achieved by the CIC (61.86% and 59.79%) and BITS Pilani (57.86% and 57.73%) teams, as in the previous datasets (although about 4% lower than the 65.81% obtained by the LDR baseline). However, the difference is more than 7% in the case of Twitter, 17% in the case of essays and 30% in the case of Facebook.

In the gender-imitated corpus, the authors were asked to write texts as if they were of the other gender or obfuscating their style, besides texts without imitation. In Table 8 the results of the gender identification task on this genre are shown. The average and median accuracies obtained by the systems on this dataset are the lowest (51.90% and 50%, respectively). Most participants obtained accuracies close to the majority class and the bow baseline: 11 runs with an accuracy equal to or lower than 50% and 6 runs with less than 5% of improvement. Only two runs of the BITS Pilani team obtained a significant improvement of 13% and 15% over the majority class. This team combined linguistic rules with deep learning techniques, showing the robustness of these techniques when the authors imitate the other gender and style. In this vein, we should highlight that the LDR baseline (55.32%), AmritaNLP (54.26%) and CIC (54.26%), which obtained similar results, performed about 10% worse than the aforementioned deep learning techniques.

The global ranking shown in Table 9 has been calculated following Formula 1. It is noteworthy that most participants obtained a weighted accuracy between 47% and 57%, with a median of 54.42%. That means that most of the participants obtained results close to the majority class (50%) and the bow baseline (53.13%). There are also three runs that obtained results much lower than the majority class due to their participation only on some datasets.

At the top of the ranking, we can highlight that the CIC team obtained the four best results, with accuracies ranging from 58.62% to 64.56%, showing the robustness and homogeneity of their approach. However, it should be highlighted that, as BITS Pilani ran different systems on the different datasets, a fair comparison has not been possible, although they obtained one of the best results on each of them. For example, run 4 obtained 78.38% accuracy on essays (more than 10% higher than the next one), but was run neither on the Facebook nor on the gender-imitated sets, so its overall accuracy was lower. It is worth mentioning that none of the systems outperformed the LDR baseline (71.21%), which performed 6.65% better than the best system.

5. CONCLUSION

This paper describes the 22 systems sent by 5 participants to the RusProfiling shared task at PAN-FIRE 2017. Participants submitted a total of 93 runs on the five different datasets, with 18-19 runs per dataset. They had to address the identification of the author's gender from a cross-genre perspective: given a training set of Twitter data, the systems have been evaluated on five different sets (essays, Facebook, Twitter, reviews and gender-imitated texts).

Participants have used different kinds of approaches, from traditional ones based on hand-crafted features and machine learning techniques such as Support Vector Machines, to the currently fashionable deep learning techniques. Depending on the genre, the deep learning approaches performed best, as in the case of essays and the gender-imitated texts, where they obtained an improvement of more than 10% over the traditional ones.

Contrary to what was expected, the best results have not been achieved on Twitter but on Facebook. The reason may be that, although Facebook maintains the spontaneity of Twitter, its posts tend to be longer and grammatically richer, with fewer syntactic errors and misspellings. On the other hand, almost the worst results have been obtained on reviews. Similar cross-genre effects were also observed at PAN-2014 [8].

In the case of the gender-imitated texts, most systems failed, with 11 runs equal to or below the majority baseline, and 6 runs with less than 5% of improvement. Only two systems of BITS Pilani obtained results with more than 10% of improvement over the baseline. In this more difficult scenario, the deep learning approaches showed their superiority over traditional approaches.

6. ACKNOWLEDGMENTS

The creation of the Gender Imitation Corpus was supported by the Russian Science Foundation, project No. 16-18-10050 "Identifying the Gender and Age of Online Chatters Using Formal Parameters of their Texts". Texts with style obfuscation were collected within the framework of the project "Lie Detection in a Written Text: A Corpus Study", supported by the Russian Foundation for Basic Research, No. 15-34-01221. The third author acknowledges the SomEMBED TIN2015-71147-C2-1-P MINECO research project.

7. REFERENCES

[1] Bhargava, Goel, Shah, and Sharma. Gender identification in Russian texts. In Working Notes for PAN-RusProfiling at FIRE'17. Workshops Proceedings of the 9th International Forum for Information Retrieval Evaluation (FIRE'17), Bangalore, India. CEUR-WS.org, 2017.

[2] Celli, Lepri, J.-I. Biel, Gatica-Perez, Riccardi, and Pianesi. The workshop on computational personality recognition 2014. In Proceedings of the ACM International Conference on Multimedia, pages 1245-1246. ACM, 2014.

[3] Ganesh HB, A. Kumar M, and KP. Representation of target classes for text classification - Amrita CEN NLP@RusProfiling PAN 2017. In Working Notes for PAN-RusProfiling at FIRE'17. Workshops Proceedings of the 9th International Forum for Information Retrieval Evaluation (FIRE'17), Bangalore, India. CEUR-WS.org, 2017.

[4] Litvinova, Gudovskikh, Sboev, Seredin, Litvinova, Pisarevskaya, and Rosso. Author gender prediction in Russian social media texts. In Conf. on Analysis of Images, Social networks, and Texts, AIST-2017.

[5] Litvinova, Litvinova, Zagorovskaya, Seredin, Sboev, and Romanchenko. "RusPersonality": A Russian corpus for authorship profiling and deception detection. In Intelligence, Social Media and Web (ISMW FRUCT), 2016 International FRUCT Conference on, pages 1-7. IEEE, 2016.

[6] Litvinova, Seredin, Litvinova, Zagorovskaya, Sboev, Gudovskih, I. Moloshnikov, and Rybka. Gender prediction for authors of Russian texts using regression and classification techniques. In CDUD@CLA, pages 44-53, 2016.

[7] Markov, Gomez-Adorno, G. Sidorov, and Gelbukh. The winning approach to cross-genre gender identification in Russian at RusProfiling 2017. In Working Notes for PAN-RusProfiling at FIRE'17. Workshops Proceedings of the 9th International Forum for Information Retrieval Evaluation (FIRE'17), Bangalore, India. CEUR-WS.org, 2017.

[8] Rangel, Rosso, I. Chugur, Potthast, Trenkmann, Stein, Verhoeven, and Daelemans. Overview of the 2nd author profiling task at PAN 2014. In Cappellato L., Ferro, Halvey, Kraaij (Eds.), CLEF 2014 labs and workshops, notebook papers. CEUR-WS.org, vol. 1180, 2014.

[9] Rangel, Rosso, and Franco-Salvador. A low dimensionality representation for language variety identification. In 17th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing. Springer-Verlag, LNCS, arXiv:1705.10754, 2016.

[10] Rangel, Rosso, M. Moshe Koppel, Stamatatos, and Inches. Overview of the author profiling task at PAN 2013. In Forner P., Navigli, Tufis D. (Eds.), CLEF 2013 labs and workshops, notebook papers. CEUR-WS.org, vol. 1179, 2013.

[11] Rangel, Rosso, Potthast, and Stein. Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter. In Working Notes Papers of the CLEF 2017 Evaluation Labs, CEUR Workshop Proceedings. CLEF and CEUR-WS.org, Sept. 2017.

[12] Rangel, Rosso, Potthast, Stein, and Daelemans. Overview of the 3rd author profiling task at PAN 2015. In Cappellato L., Ferro, Jones, San Juan E. (Eds.), CLEF 2015 labs and workshops, notebook papers. CEUR Workshop Proceedings. CEUR-WS.org, vol. 1391, 2015.

[13] Rangel, Rosso, Verhoeven, Daelemans, Potthast, and Stein. Overview of the 4th author profiling task at PAN 2016: cross-genre evaluations. In Working Notes Papers of the CLEF 2016 Evaluation Labs, CEUR Workshop Proceedings. CLEF and CEUR-WS.org, Sept. 2016.

[14] Rao, Yarowsky, Shreevats, and Gupta. Classifying latent user attributes in Twitter. In Proceedings of the 2nd international workshop on Search and mining user-generated contents, pages 37-44. ACM, 2010.

[15] Sboev, Litvinova, Gudovskikh, Rybka, and I. Moloshnikov. Machine learning models of text categorization by author gender using topic-independent features. Procedia Computer Science, 101:135-142, 2016.

[16] Sboev, Litvinova, Voronina, Gudovskikh, and Rybka. Deep learning network models to categorize texts according to author's gender and to identify text sentiment. In Computational Science and Computational Intelligence (CSCI), 2016 International Conference on, pages 1101-1106. IEEE, 2016.

[17] Skitalinskaya, Akhtyamova, and Cardi. Cross-genre gender identification in Russian texts using topic modeling working note: Team dubl. In Working Notes for PAN-RusProfiling at FIRE'17. Workshops Proceedings of the 9th International Forum for Information Retrieval Evaluation (FIRE'17), Bangalore, India. CEUR-WS.org, 2017.

[18] Vinayan, N. J.R., NB, A. Kumar, and S. K P. AmritaNLP@PAN-RusProfiling: Author profiling using machine learning techniques. In Working Notes for PAN-RusProfiling at FIRE'17. Workshops Proceedings of the 9th International Forum for Information Retrieval Evaluation (FIRE'17), Bangalore, India. CEUR-WS.org, 2017.