
Profile-based Approach for Age and Gender Identification

Ma. Jose Garciarena Ucelay

Ma. Paula Villegas

Dario G. Funez

funezdariog@gmail.com 1

Leticia C. Cagnina

0 1

Marcelo L. Errecalde

merrecaldeg@gmail.com 1

Gabriela Ramírez-de-la-Rosa

Esau Villatoro-Tello

2

0 Consejo Nacional de Investigaciones Científicas y Técnicas, CONICET
1 LIDIC Research Group, Universidad Nacional de San Luis, Argentina
2 Language and Reasoning Research Group, Information Technologies Dept., Universidad Autónoma Metropolitana (UAM) Unidad Cuajimalpa, Mexico

This paper describes the joint participation of the LIDIC research group of the UNSL (Argentina) and the Language and Reasoning research group of the UAM Cuajimalpa (Mexico) in the PAN 2016 Author Profiling task. We adopted a profile-based approach, which has been successfully applied to the Authorship Attribution problem, and we propose a variation of this technique for tackling the Author Profiling task. Our experiments show that, using about the 8000 most frequent character n-grams to build the different profiles, the proposed method performs well both when training and test documents share the same genre and in the cross-genre scenario.

Keywords: Profile-based approach, Author Profiling, Natural Language Processing

1 Introduction

Lately, the Author Profiling (AP) task has become one of the challenges most attractive to the scientific community, especially in fields such as Natural Language Processing, Forensics, Marketing, and Internet Security. The main goal of AP is to distinguish, from a given text, among different categories of authors rather than to identify the author; the latter is known as Authorship Attribution [1]. Thus, the AP task aims at modelling groups of authors through a more general set of features. Ideally, such features represent, to some extent, how different categories of authors use language depending on their age, gender, native language, political preference, personality, etc. [2].

Among the very first works addressing the AP problem are [2,3], which showed that statistical techniques are pertinent for distinguishing authors' gender and age. Since then, many approaches have been proposed for the AP challenge [3,4,5,6,7]. A common approach among these works is the use of textual representations, which have proven effective enough when the analysed documents are formal texts, for instance news reports, scientific papers, books, etc. Nonetheless, most traditional approaches face several difficulties when the documents come from a more informal source, such as blogs, chats, or social media texts (e.g., tweets).

As part of the efforts to provide effective solutions to the AP challenge, PAN@CLEF⁴ proposes a competitive evaluation exercise for uncovering plagiarism, authorship, and social software misuse. In this year's PAN campaign, the focus of the AP shared task is on cross-genre age and gender identification [8]; that is, the training documents come from one genre (e.g., Twitter, blogs, social media) and the evaluation is performed on a different one.

The rest of this document is organized as follows. Section 2 describes some of the most relevant research works that have addressed the AP problem with a profile-based paradigm. Section 3 describes the ideas that motivate this work. Section 4 describes our proposed method for approaching the AP problem, and Section 5 shows the results obtained on the PAN 2016 dataset. Finally, Section 6 presents our conclusions and ideas for future work.

2 Related work

In the field of Author Analysis, several tasks fall under the same type of stylistic analysis: Author Attribution, Plagiarism Detection and Author Profiling. In the Author Attribution problem there are two predominant paradigms: the instance-based paradigm and the profile-based paradigm. The former is the most common and is also the most used in the other Author Analysis tasks; it treats each document of an author independently. The profile-based paradigm, in which all the documents of the same author are treated as one, is not very common despite its simplicity.

The most recent research that uses the profile-based paradigm is the one proposed by Potha and Stamatatos [9]. They evaluated the profile-based paradigm for the author attribution task and compared it against the instance-based methods of the PAN-2013 participants. The authors established four parameters for their method: the length of the n-grams, the length of the unknown document, the length of the profile and the dissimilarity function. Results showed that their method, using a set of global and local settings, outperforms the individual methods of the PAN 2013 participants in the authorship track.

Other studies, also on Author Attribution, use hybrid approaches, that is, they take characteristics from both paradigms (i.e., instance-based and profile-based) [10,11]. In these studies the authors treat each document of each author as independent, as the instance-based paradigm does, but a profile is built for each author.

⁴ http://pan.webis.de/clef16/pan16-web/

As the previous works show, profile-based approaches have given competitive results for author attribution tasks. We therefore want to test this simple approach on another author analysis problem, namely the author profiling task. As in [9], we set some parameters, such as the length of the profile and the length of the n-grams, in a cross-domain scenario.

3 Profile-based approaches

Profile-based methods have been successfully used for addressing problems related to the authorship attribution (AA) task [1]. In a typical AA problem, a text of unknown authorship is assigned to a candidate author, given a set of candidate authors for which texts of undisputed authorship are available. In this context, for each class of author these methods build a profile containing information extracted from a collection of documents written by the author [12]. Figure 1 summarizes graphically the process of generating the profile of each author.

The information extracted from the documents for the construction of the profiles can be related to the writing style or to the text content, as we briefly describe below.

- Style-based features: such as the frequency or number of pronouns, articles and prepositions, the number of hyperlinks, word averages, etc. [13]. Among the most used are the frequencies of character n-grams, which are substrings of n consecutive characters [14]. For English in particular, character n-grams with n=3 have proven effective. These features capture interesting information depending on the gender and the age of the author. For example, women in blogs use more pronouns and affirmative and negative words.
- Content-based features: consider the words related to different topics [13]. For example, women usually write words related to personal concerns such as shopping, mom, etc., whereas men usually write about politics and technology.

In order to obtain the author profiles, these methods take a set of documents of each author and extract the set of features. As this set could be too large, the profile keeps only the L most frequent features of the whole set. Then, to classify a target document, the method builds a profile from that single document and, using a similarity measure with respect to all the authors' profiles, determines the authorship [15].
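The profile construction described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `char_ngrams` and `build_profile` are hypothetical, and storing relative (rather than raw) frequencies is an assumption, since the text does not say which one is kept.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Return every substring of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_profile(documents: list[str], n: int = 3, L: int = 8000) -> dict[str, float]:
    """Treat all documents of one class as a single text and keep the
    L most frequent character n-grams with their relative frequency.

    Assumption: documents are joined with a single space, so a few
    n-grams span document boundaries.
    """
    counts = Counter(char_ngrams(" ".join(documents), n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.most_common(L)}


# Toy usage: two tiny "documents" of the same class, profile of size 4.
profile = build_profile(["the cat", "the hat"], n=3, L=4)
```

The same function can be applied to a single target document to obtain its document profile.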

Some similarity (or distance) measures used in profile-based approaches are:

1. Keselj's Relative Distance (KRD) [16]: calculates the distance K between two profiles P1 and P2, summing either over the terms of both profiles (1) or only over the terms they share (2):

$$K = \sum_{x \in X_{P_1} \cup X_{P_2}} \left( \frac{2\,(P_1(x) - P_2(x))}{P_1(x) + P_2(x)} \right)^2 \qquad (1)$$

$$K = \sum_{x \in X_{P_1} \cap X_{P_2}} \left( \frac{2\,(P_1(x) - P_2(x))}{P_1(x) + P_2(x)} \right)^2 \qquad (2)$$

where $P_i(x)$ is the frequency of the term x in profile $P_i$, and $X_{P_i}$ is the set of all terms that occur in profile $P_i$.

2. Simplified Profile Intersection (SPI) [17]: calculates the number of features that belong to both profiles P1 and P2 as:

$$SPI(P_1, P_2) = \left| X_{P_1} \cap X_{P_2} \right|$$

As profile-based approaches have been successfully used for the AA task, we propose to use them for the Author Profiling task.

4 Sistema de Perfiles: the proposed method

Our study focuses on predicting the age and gender (female or male) of the author, for English, Spanish and Dutch. For age, the task considers the following ranges: 18-24, 25-34, 35-49, 50-64 and 65-xx years old, only for the English and Spanish texts [18].

In order to use a profile-based approach, we represent a specific class of author with a profile. Then, for predicting gender and age, we built 10 different profiles that combine information about the possible gender and age of the authors. Thus, we obtained profiles for the following categories: female 18-24, male 18-24, female 25-34, male 25-34, female 35-49, male 35-49, female 50-64, male 50-64, female 65-XX and male 65-XX.

Regarding the features for the construction of the profiles, preliminary experiments showed that character n-grams were adequate. The complete system, named Sistema de Perfiles (SP), was implemented in two stages. In the first one, we constructed the profiles for each category and each language separately, using the documents (i.e., the training set) provided by the Author Profiling task at PAN-PC-2016 [8]. To obtain the profile of each category (for each language separately), we applied the following steps over the whole training set:

- Unification of the separate xml files into a single txt file (concatenation), one per category.
- Preprocessing of the txt file obtained for each category: tags and images are removed.
- Generation of the n-grams from the txt file and calculation of the frequency of each one. The n-grams are sorted from most to least frequent⁵. This step is performed for each category.
- Saving the profile of the category, keeping only the L most frequent n-grams obtained in the previous step.
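The concatenation and preprocessing steps of this first stage could look roughly as follows. The exact layout of the PAN xml files is not described here, so this sketch makes assumptions: markup is stripped with a simple regular expression, CDATA markers and HTML entities are handled explicitly, and the names `clean_document` and `concatenate_category` are hypothetical.

```python
import re
from html import unescape

# Matches any markup tag, including image tags such as <img .../>.
TAG_RE = re.compile(r"<[^>]+>")


def clean_document(xml_text: str) -> str:
    """Remove tags (and therefore images) and collapse whitespace."""
    # Drop CDATA markers first so the tag pattern does not swallow text.
    text = xml_text.replace("<![CDATA[", " ").replace("]]>", " ")
    # Decode entities so that escaped markup is also removed below.
    text = unescape(text)
    text = TAG_RE.sub(" ", text)
    return " ".join(text.split())


def concatenate_category(xml_documents: list[str]) -> str:
    """Join the cleaned documents of one category into a single text."""
    return "\n".join(clean_document(doc) for doc in xml_documents)


# Toy usage with made-up input.
sample = '<doc><![CDATA[Hello <b>world</b> <img src="a.png"/>]]></doc>'
cleaned = clean_document(sample)
```

A regular expression is a deliberately crude choice; text that legitimately contains angle brackets would be damaged, so a real system may prefer a proper XML or HTML parser.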

The second stage is the classification of a test document in a particular language (this information is provided). SP receives an input xml file and performs the following steps:

- Preprocessing of the input file: tags and images are removed and the result is saved as a txt file.
- Obtaining the n-grams, sorting them by frequency and keeping only the L most frequent (document profile).
- Checking the similarity with the profile of each category using the SPI function described above. The document profile is compared with the profile of each category, and the label of the closest one is returned. Only the profiles built for the language of the input file are considered in this step.
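The comparison step can be illustrated with a small Python sketch of the two measures from Section 3 and of the SPI-based decision. It assumes that a profile is a dictionary mapping each n-gram to a positive frequency; the function names and the toy profiles are hypothetical, and `krd` implements the variant that sums over the union of both profiles.

```python
def spi(p1: dict, p2: dict) -> int:
    """Simplified Profile Intersection: n-grams present in both profiles."""
    return len(set(p1) & set(p2))


def krd(p1: dict, p2: dict) -> float:
    """Keselj's relative distance over the union of both profiles.

    An n-gram missing from one profile has frequency 0 there and
    therefore contributes (2 * f / f) ** 2 = 4 to the distance.
    """
    total = 0.0
    for gram in set(p1) | set(p2):
        f1, f2 = p1.get(gram, 0.0), p2.get(gram, 0.0)
        total += (2.0 * (f1 - f2) / (f1 + f2)) ** 2
    return total


def classify(doc_profile: dict, category_profiles: dict) -> str:
    """Return the category whose profile shares most n-grams with the document.

    Ties are broken by dictionary order (an arbitrary choice).
    """
    return max(category_profiles,
               key=lambda label: spi(doc_profile, category_profiles[label]))


# Toy profiles with made-up frequencies, for illustration only.
categories = {
    "female 18-24": {"the": 0.5, "he ": 0.3, "lol": 0.2},
    "male 18-24": {"the": 0.6, "ing": 0.4},
}
document = {"lol": 0.7, "the": 0.3}
label = classify(document, categories)
```

Since SPI is a similarity, the closest category is the one with the largest value; with a distance such as KRD, the smallest value would be selected instead.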

5 Experiments and results

5.1 Intra-Domain Study

We first studied the performance of SP in intra-domain experiments. Regarding the parameter L of SP, choosing an appropriate value is important to achieve a good balance between an acceptable execution time and a good percentage of correctly classified instances. Moreover, if the value of L is too small, underfitting occurs. On the contrary, if the value of L is excessively large, SP can overfit, because the generated profiles would be adjusted too closely to the training corpus.

⁵ We used the MorphAdorner library for this step, an open-access Java library for NLP provided by Northwestern University.

We then carried out some preliminary intra-domain experiments using only the training corpus provided by the PAN 2016 competition. Although the competition stated that the Author Profiling task would focus on cross-genre age and gender identification, we found it convenient to try different values of L using the same corpus for both training and testing. The PAN 2016 corpus consists of 436 documents written in English, 250 in Spanish and 384 in Dutch. We split this collection, taking 80% for training and leaving the remaining 20% for testing.
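A minimal sketch of such an evaluation protocol is given below. The text does not state how the 80/20 partition was drawn, so a seeded random shuffle is assumed here; `split_80_20` and `accuracy` are hypothetical helper names.

```python
import random


def split_80_20(items, seed: int = 0):
    """Shuffle the items reproducibly and return (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def accuracy(gold: list, predicted: list) -> float:
    """Percentage of instances whose predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    hits = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * hits / len(gold)


# Example with the size of the English collection (436 documents).
train_ids, test_ids = split_80_20(range(436))
```

With 436 documents this yields 348 training and 88 test documents; a stratified split per category would be a reasonable alternative.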

Tables 1 and 2 show the results of the experiments for gender in Dutch, English and Spanish, as well as for age in the case of the latter two. As the performance measure we use accuracy, that is, the percentage of correctly classified instances. The rows of Tables 1 and 2 indicate the different values of L (from 2000 to 8000), and the columns indicate the different representation models, that is, only character 3-grams, the combination of 3-grams to 5-grams, and so on.

We can observe that, in general, the best accuracy values were reached when L was 4000 and 3-grams were used. In some cases, 5-grams performed similarly to 3-grams, but we chose the latter because of execution time: building the profiles based on 5-grams took twice as long as building the profiles based on 3-grams.

5.2 Cross-genre Study

As mentioned before, this year's PAN Author Profiling task was stated as cross-genre classification [8]. In this context, "genre" refers to the type of source from which the texts come, for example Twitter, blogs and social media. For the experiments we built the profiles for SP from the complete training corpus provided by the PAN-2016 competition.

In order to test our SP method in a cross-genre scenario, we used two different corpora: a representative subset of the collection supplied by the PAN-2014 competition [19], and the complete corpus of the PAN-2015 competition [20]. From the former we only considered the texts obtained from blogs and social media, both in Spanish and English. From the latter we used all the available texts, which were obtained from Twitter; for Dutch we only evaluated the gender identification problem.

First, we obtained a general baseline in order to have values to compare with. Using the training and test sets mentioned in the previous paragraph, with a Naive Bayes classifier and a tf-idf word representation, we obtained the results shown in Table 3.
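To make the baseline concrete, here is a small self-contained sketch of a Naive Bayes classifier over tf-idf word weights. The paper does not name the tool, the Naive Bayes variant or the tf-idf formula used, so everything below is an assumption: a multinomial model with Laplace smoothing, a smoothed idf, whitespace tokenisation, and hypothetical function names and toy data.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs, idf=None):
    """Return (vectors, idf); each vector maps a word to tf * idf."""
    tokenised = [doc.lower().split() for doc in docs]
    if idf is None:
        # Learn a smoothed idf from these documents (training case).
        df = Counter(word for tokens in tokenised for word in set(tokens))
        idf = {w: math.log((1 + len(docs)) / (1 + d)) + 1.0 for w, d in df.items()}
    vectors = [{w: c * idf[w] for w, c in Counter(tokens).items() if w in idf}
               for tokens in tokenised]
    return vectors, idf


def train_nb(vectors, labels, alpha=1.0):
    """Multinomial Naive Bayes over tf-idf weights with Laplace smoothing."""
    vocab = {w for vector in vectors for w in vector}
    class_counts = Counter(labels)
    weights = defaultdict(Counter)
    for vector, label in zip(vectors, labels):
        weights[label].update(vector)  # accumulates the float weights
    model = {}
    for label, count in class_counts.items():
        total = sum(weights[label].values()) + alpha * len(vocab)
        log_prior = math.log(count / len(labels))
        log_lik = {w: math.log((weights[label][w] + alpha) / total) for w in vocab}
        model[label] = (log_prior, log_lik)
    return model


def predict_nb(model, vector):
    """Return the label with the highest log posterior score."""
    def score(label):
        log_prior, log_lik = model[label]
        return log_prior + sum(v * log_lik[w] for w, v in vector.items() if w in log_lik)
    return max(model, key=score)


# Toy data, unrelated to the PAN corpora.
train_docs = ["red apple fruit", "green apple fruit", "fast car engine", "car engine repair"]
train_labels = ["food", "food", "auto", "auto"]
train_vecs, idf = tfidf_vectors(train_docs)
model = train_nb(train_vecs, train_labels)
test_vecs, _ = tfidf_vectors(["apple fruit salad", "car repair"], idf)
predictions = [predict_nb(model, v) for v in test_vecs]
```

Words unseen during training (such as "salad" above) are simply ignored at prediction time.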

The results obtained with our SP method are shown in Table 4 and Table 5. We show the accuracy obtained for classification by gender and age, with different L values using 3-grams. The results in both tables correspond to the PAN-2014 test collection for English and Spanish. As we can see, SP with L=8000 achieves, in most cases, the highest classification accuracy (above the baseline).

Table 6 shows the accuracy obtained using the PAN-2015 collection for testing with different L values. Although no single L value is the best in all languages for both age and gender, we can conclude that L=8000 still performs well in most cases. In fact, for Dutch (for which gender identification was the only experiment performed), SP obtained the best result with this value of L.

Finally, for simplicity, we configured our SP system with L=8000 and SPI as the similarity measure, for all categories and all languages, for the final submission to the PAN competition. This decision was based on the averages of the results shown in the tables above. All the experiments were run on the TIRA platform [21,22].

Figure 2 summarizes the performance of our system when it is tested with different corpora, using the PAN-2016 data set to build the profiles. It is worth noting that in all considered cases (PAN-2014 and PAN-2015) the accuracy values are good when L=8000.

6 Conclusions and future work

This paper described the joint participation of the LIDIC research group of the UNSL from Argentina and the LyR research group of the UAM Cuajimalpa from Mexico at the PAN-2016 Author Profiling task.

We presented a profile-based method for the Author Profiling task. Our proposal uses profiles of character 3-grams to represent information about the different categories of authors. We performed experiments in intra- and cross-genre scenarios and showed that, using the 8000 most frequent character 3-grams, our method obtains its best classification performance for gender and age.

In future work we plan to test different features for the construction of the profiles and different similarity measures for comparing the profiles.

Acknowledgments. This work was partially funded by CONACyT under the Thematic Networks program (Language Technologies Thematic Network projects 260178, 271622). We also thank UAM Cuajimalpa, CONACyT (Project grant number 258588) and SNI-CONACyT for their support.

References

1. E. Stamatatos, "A survey of modern authorship attribution methods," J. Am. Soc. Inf. Sci. Technol., vol. 60, pp. 538-556, Mar. 2009.

2. S. Argamon, M. Koppel, J. Fine, and A. R. Shimoni, "Gender, genre, and writing style in formal written texts," TEXT, vol. 23, pp. 321-346, 2003.

3. M. Koppel, S. Argamon, and A. R. Shimoni, "Automatically categorizing written texts by author gender," Literary and Linguistic Computing, vol. 17, no. 4, pp. 401-412, 2002.

4. J. D. Burger, J. Henderson, G. Kim, and G. Zarrella, "Discriminating gender on twitter," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, (Stroudsburg, PA, USA), pp. 1301-1309, Association for Computational Linguistics, 2011.

5. C. Peersman, W. Daelemans, and L. Van Vaerenbergh, "Predicting age and gender in online social networks," in Proceedings of the 3rd International Workshop on Search and Mining User-generated Contents, SMUC '11, (New York, NY, USA), pp. 37-44, ACM, 2011.

6. D. Nguyen, R. Gravel, D. Trieschnigg, and T. Meder, "How old do you think i am?; a study of language and age in twitter," in Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media, AAAI Press, 2013.

7. A. P. López-Monroy, M. Montes-y-Gómez, H. J. Escalante, L. Villaseñor-Pineda, and E. Stamatatos, "Discriminative subprofile-specific representations for author profiling in social media," Knowledge-Based Systems, vol. 89, pp. 134-147, 2015.

8. F. Rangel, P. Rosso, B. Verhoeven, W. Daelemans, M. Potthast, and B. Stein, "Overview of the 4th Author Profiling Task at PAN 2016: Cross-genre Evaluations," in Working Notes Papers of the CLEF 2016 Evaluation Labs, CEUR Workshop Proceedings, CLEF and CEUR-WS.org, Sept. 2016.

9. N. Potha and E. Stamatatos, "A Profile-Based Method for Authorship Verification," in Artificial Intelligence: Methods and Applications: 8th Hellenic Conference on AI, SETN 2014, Ioannina, Greece, May 15-17, 2014, Proceedings, pp. 313-326, Cham: Springer International Publishing, 2014.

10. H. V. Halteren, "Author verification by linguistic profiling: An exploration of the parameter space," ACM Trans. Speech Lang. Process., vol. 4, pp. 1:1-1:17, Feb. 2007.

11. J. Grieve, "Quantitative authorship attribution: An evaluation of techniques," Literary and Linguistic Computing, vol. 22, no. 3, pp. 251-270, 2007.

12. H. J. Escalante, M. Montes-y-Gómez, and T. Solorio, "A weighted profile intersection measure for profile-based authorship attribution," in Proceedings of MICAI 2011, vol. 7094, pp. 232-243, 2011.

13. J. Schler, M. Koppel, S. Argamon, and J. W. Pennebaker, "Effects of age and gender on blogging," in AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, pp. 199-205, 2006.

14. W. B. Cavnar and J. M. Trenkle, "N-gram-based text categorization," in Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, pp. 161-175, 1994.

15. R. Layton, P. Watters, and R. Dazeley, "Recentred local profiles for authorship attribution," Natural Language Engineering, vol. 18, pp. 293-312, 2012.

16. V. Keselj, F. Peng, N. Cercone, and C. Thomas, "N-gram-based author profiles for authorship attribution," Proceedings of the Conference Pacific Association for Computational Linguistics, PACLING, vol. 3, pp. 255-264, 2003.

17. G. Frantzeskou, E. Stamatatos, S. Gritzalis, and S. Katsikas, "Source code author identification based on n-gram author profiles," in Artificial Intelligence Applications and Innovations, vol. 204 of IFIP, pp. 508-515, Springer, 2006.

18. "9th evaluation lab on uncovering plagiarism, authorship, and social software misuse (PAN 2013)." http://pan.webis.de/, 2013.

19. F. Rangel, P. Rosso, I. Chugur, M. Potthast, M. Trenkmann, B. Stein, B. Verhoeven, and W. Daelemans, "Overview of the 2nd Author Profiling Task at PAN 2014," in CLEF 2014 Evaluation Labs and Workshop, pp. 15-18, CEUR-WS.org, 2014.

20. F. Rangel, F. Celli, P. Rosso, M. Potthast, B. Stein, and W. Daelemans, "Overview of the 3rd Author Profiling Task at PAN 2015," in CLEF 2015 Evaluation Labs and Workshop, pp. 8-11, CEUR-WS.org, 2015.

21. T. Gollub, B. Stein, S. Burrows, and D. Hoppe, "TIRA: Configuring, Executing, and Disseminating Information Retrieval Experiments," in 9th International Workshop on Text-based Information Retrieval (TIR 12) at DEXA (A. Tjoa, S. Liddle, K.-D. Schewe, and X. Zhou, eds.), (Los Alamitos, California), pp. 151-155, IEEE, Sept. 2012.

22. M. Potthast, T. Gollub, F. Rangel, P. Rosso, E. Stamatatos, and B. Stein, "Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling," in Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14) (E. Kanoulas, M. Lupu, P. Clough, M. Sanderson, M. Hall, A. Hanbury, and E. Toms, eds.), (Berlin Heidelberg New York), pp. 268-299, Springer, Sept. 2014.