Quanti anni hai? Age Identification for Italian

Aleksandra Maslennikova•, Paolo Labruna•, Andrea Cimino⋄, Felice Dell'Orletta⋄

• Università di Pisa
a.maslennikova@studenti.unipi.it, pielleunipi@gmail.com
⋄ Istituto di Linguistica Computazionale "Antonio Zampolli" (ILC-CNR)
ItaliaNLP Lab - www.italianlp.it
{andrea.cimino, felice.dellorletta}@ilc.cnr.it

Abstract

English. We present the first work, to our knowledge, on automatic age identification for Italian texts. For this work we built a dataset consisting of more than 2,400,000 posts extracted from publicly available forums and containing authorship attribution metadata, such as age and gender. We developed an age classifier and performed a set of experiments with the aim of evaluating whether the correct age of a user can be assigned, and which information is useful to tackle this task: lexical information, or linguistic information spanning different levels of linguistic description. The performed experiments show the importance of lexical information in age classification, but also that a writing style exists that relates to the age of a user.

Italiano. In this paper we present the first work, to our knowledge, on automatic age recognition for the Italian language. To carry out this work we built a dataset composed of more than 2,400,000 posts extracted from public forums and associated with information about the age and gender of their authors. We developed a system for classifying the age of the writer of a text and conducted a series of experiments to evaluate whether age can be determined, and through which information extracted from the text: lexical information, or linguistic description at different levels. The results obtained demonstrate the importance of the lexicon in classification, but also the existence of a writing style correlated with age.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Social media platforms such as Facebook, Twitter and public forums allow users to communicate, share their opinions and build social relations. The proliferation of such platforms has allowed the scientific community to study many communication phenomena, such as sentiment analysis (Pak and Paroubek, 2010) or irony detection (Hernández Farías et al., 2016). Another related research field is "author profiling", in which the features that allow one to discriminate the age, gender or native language of a person are analyzed. These studies are conducted for both forensic and marketing reasons, since the classification of these characteristics allows companies to better focus their marketing campaigns. In the author profiling scenario, many studies have been conducted by the scientific community, generally focused on the English and Spanish languages. The majority of these studies were performed in PAN¹ (Rangel et al., 2016), a lab held each year at CLEF² in which many shared tasks related to the authorship attribution research topic are run. In these shared tasks, participants were asked to identify the gender or age of authors using manually annotated training data from social media platforms. Among the approaches proposed by participants, the ones that achieved the best results (op Vollenbroek et al., 2016; Modaresi et al., 2016) are based on SVM classifiers exploiting a wide variety of lexical and linguistic features, such as word n-grams, part-of-speech tags and syntax. Only recently have deep learning based approaches been proposed, and they have shown very good results, especially when dealing with multi-modal data, i.e. text and images posted on Twitter (Takahashi et al., 2018).

¹ https://pan.webis.de/
² http://www.clef-initiative.eu/association/steering-committee

In the present work we tackle a specific authorship attribution task: age detection for the Italian language.
We didn’t have a preassigned settled ian language. To our knowledge, this is the first list of possible topics. Instead, we were adding time that such task is performed on Italian. For this them in the process. For example, if we have an reason, we built a multi–topic corpus, developed a entire forum which discusses about only watches, classifier which exploits a wide range of linguis- we wouldn’t assign some general ”Hobby” tag, but tic features, and conducted several experiments to we would create a special group ”Watches” specif- evaluate both the newly introduced corpus and the ically for this forum. classifier. At the and of the collection process, we col- The main contributions of this work are: i) an lected 2.445.012 posts from 7.023 different users automatically built corpus for the age detection and 162 forums, that we divided in 30 different task for the Italian language; ii) the development topic groups. All the information regarding the of an age detection system; iii) the study of the dataset are shown in Table 1. impact of linguistic and lexical features. 3 The Age classifier 2 Dataset construction We implemented a document age classifier that With the aim of building an automatic dataset from operates on morpho–syntactically tagged and de- the web, we needed a set of Italian texts with the pendency parsed texts. The classifier exploits age of authors publicly available. Nowadays col- widely used lexical, morpho-syntatic and syntac- lecting this information is a challenging task, since tic features that are used to build the final statisti- the majority of the available platforms, for the sake cal model. This statistical model is finally used of privacy, prefer not to make the user’s age public. to predict the age range of unseen documents. So, first-of-all, we had to find a website with such We used linear SVM implemented in LIBLIN- data. 
We choose the ForumFree platform3 which EAR (Rong-En et al., 2008) as machine learning allows users to create their own forums without algorithm. The input documents were automati- any coding skills, using an existing template. Hav- cally POS tagged by the Part–Of–Speech tagger ing all the forums based on the same templates described in (Cimino and Dell’Orletta, 2016) and makes them perfect for automated crawling. We dependency–parsed by the DeSR parser (Attardi et extracted all the posts of the users that decided to al., 2009). show publicly their age. We tried to collect the data from the top 200 most active forums. Not all 3.1 Features the forums had users with all the user information Raw and Lexical Text Features filled and, in the end of the processes, we fetched Word n-grams, calculated as presence or absence messages from 162 different forums. Since our of a word n-gram in the text. goal was to build a corpus with author profiling Lemma n-grams, calculated as the frequency of purposes, and such task is very difficult with very each lemma n-gram in the text and normalized small comments, we selected only posts with a with respect to the number of tokens in the text. minimum length of 20 words. Morpho–syntactic Features Another problem we faced is that users are not Coarse and fine grained Part-Of-Speech n- age-balanced in the forums: for example, anime grams, calculated as the logarithm of the fre- dedicated forum have mostly users aged under quency of each coarse/fine grained PoS n-gram in 35. Another example are cars dedicated forums, the text and normalized with respect to the number where usually users are more mature with respect of tokens of the text. to anime forums. Only a couple of forums have Syntactic Features very balanced information, which usually is the Linear dependency types n-grams, calculated as best data for training machine learning based clas- the frequency of each dependency n-gram in the sifiers. 
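As a concrete illustration, the lexical and morpho-syntactic n-gram feature computations described above can be sketched as follows. This is a minimal sketch with invented function names and toy Italian tokens, not the authors' code; the resulting sparse feature dictionaries would then be fed to a linear SVM such as LIBLINEAR.

```python
import math
from collections import Counter

def ngrams(seq, n):
    """All contiguous n-grams of a token sequence."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def extract_features(tokens, lemmas, pos_tags, max_n=2):
    """Sparse feature dict mirroring the feature sets above (illustrative)."""
    feats = {}
    n_tok = len(tokens)
    for n in range(1, max_n + 1):
        # Word n-grams: binary presence/absence.
        for g in set(ngrams(tokens, n)):
            feats[("word", g)] = 1.0
        # Lemma n-grams: frequency normalized by the token count.
        for g, c in Counter(ngrams(lemmas, n)).items():
            feats[("lemma", g)] = c / n_tok
        # PoS n-grams: logarithm of the normalized frequency.
        for g, c in Counter(ngrams(pos_tags, n)).items():
            feats[("pos", g)] = math.log(c / n_tok)
    return feats

doc_tokens = ["gli", "orologi", "sono", "belli"]
doc_lemmas = ["il", "orologio", "essere", "bello"]
doc_pos    = ["DET", "NOUN", "VERB", "ADJ"]
fv = extract_features(doc_tokens, doc_lemmas, doc_pos)
print(fv[("word", ("orologi",))])         # 1.0
print(fv[("lemma", ("il", "orologio"))])  # 0.25
```

Keeping every feature family in one dictionary keyed by (type, n-gram) makes it easy to train the Lexicon, Syntax and All model variants by simply filtering on the feature type.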
Topic                      ≤20   21-30   31-40   41-50   51-60    ≥61
Cars              Users     36     158     187     209     158     45
                  Posts   6056   50281   46746   62002   48939  15867
Bicycles          Users     10      11      12      35      25      1
                  Posts   2056    2284    5532   13418   16959      6
Smoking           Users      3      52      78      69      46     18
                  Posts      7   21399   41470   38149   17981   4742
Anime/Manga       Users    392     438     142      62      16      6
                  Posts  60367   99165   39939   29086    3873    228
Role playing      Users    115     104      14       8       6      7
                  Posts  22953   40652    3893    3945     534   2060
Gaming            Users    235     358     113     131      48      7
                  Posts  54584   81535   20379   20055    4560   1323
Spirituality      Users     11      25      21      13      11      2
                  Posts    336    1427    1342    1095    1517    965
Aesthetic medicine Users     7      36      27      29      17      1
                  Posts   1345    6135   11767    8208    3384      1
Sport             Users    215     338     192     136      52     24
                  Posts  82495  310220  158382  103027   34627  16084
Culinary          Users      0       1       4      10       4      4
                  Posts      0      52   10130    2414     747    438
Pets              Users     10      21      11       4       2      3
                  Posts   4307   13222    7357    2592    5383  10353
Celebrities       Users     21      76      26      24      17      4
                  Posts    548   21114    5820    6150    3139   1248
Politics          Users      0       2       4      10       6      0
                  Posts      0     330    2801    3548     576      0
Different topics  Users     52      45      34      43      34     15
                  Posts   9453   12000   21667   16316    4759  24418
Fishing           Users     11      57      79      62      30      5
                  Posts   3040   14805   24306   17131   13155   8356
Institution community Users  6       6       0       2       5      1
                  Posts     13      12       0      18   11130   4364
Rail transport modelling Users 0     6       7       5       5      1
                  Posts      0    3597    2289     999    2470    751
Culture           Users      4      10       4       7       4      0
                  Posts   1855     560     653    1174     219      0
Tourism           Users      0       2       2       4       1      2
                  Posts      0      16      10    1378       2     14
Sexuality         Users     11      31      18      10       2      1
                  Posts    185    2540    8201    1421       7   1179
Metal Detecting   Users     25      34      78     121      55     11
                  Posts   7750    9830   19299   31288   16547   3529
Music             Users     12      25      15       0       0      0
                  Posts   8731   15720    5276       0       0      0
Parenting         Users      1       4       1       1       0      0
                  Posts    719    2250     626     420       0      0
Technologies      Users     37      47      12       4       8      5
                  Posts    185     266     431      26      19     23
Nature            Users      5       9      10       6       6      2
                  Posts    998    1304    3653    2171     292     10
Religion          Users      0       5       6       1       0      0
                  Posts      0    2618    4125     896       0      0
Films             Users     25      26      10       5       1      2
                  Posts   9476    6135     503      43       4   2477
Psychology        Users     12      14       2       0       1      2
                  Posts    291     912      44       0       1     11
Gambling          Users      0       3       3      10      11      7
                  Posts      0     458     134     364     715    274
Watches           Users     29     153     317     302     109     32
                  Posts   5158   52623  114074  101869   50243  18085

Table 1: Distribution of the number of users and posts per age gap in the different topics of the corpus

4 Experiments

In order to test the corpus and the classifier, we performed a set of experiments. The experiments were devised to test real-world scenarios in which 1) we were interested in classifying a set of posts written by a single user, rather than a single post; and 2) we always classified unseen users, i.e. no training data was available for those users. For these reasons, we merged all the posts of each single user in the original corpus into a single document.
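The per-user document construction just described can be sketched as below. This is an illustrative Python sketch, not the authors' code; the 200-token minimum, the 1,000-token soft limit and the 5-class age groups are the values used in the experiments, while all function and variable names are invented.

```python
def merge_user_posts(posts, min_tokens=200, soft_limit=1000):
    """Merge one user's tokenized posts into a single document.

    Users with fewer than `min_tokens` tokens overall are discarded;
    posts are concatenated until `soft_limit` is reached, and the post
    that crosses the limit is still included in full.
    """
    if sum(len(p) for p in posts) < min_tokens:
        return None  # not enough text for this user
    merged = []
    for post in posts:
        merged.extend(post)
        if len(merged) >= soft_limit:
            break  # soft limit exceeded: keep the whole last post, stop
    return merged

def age_class_5(age):
    """Map an age to the 5-class label (20-29 ... 60-69), else None."""
    if 20 <= age <= 69:
        decade = age // 10 * 10
        return f"{decade}-{decade + 9}"
    return None

posts = [["ciao"] * 150, ["come"] * 900, ["stai"] * 50]
print(len(merge_user_posts(posts)))  # 1050: the crossing post is kept whole
print(age_class_5(34))               # 30-39
```

Because each user contributes exactly one merged document, a document-level split automatically guarantees that no user appears in both training and test data.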
We then considered only the users who wrote a minimum of 200 tokens, and limited the final merged document to a 'soft' limit of 1,000 tokens per user: when the soft limit was exceeded, we still included the whole post that exceeded it. The described procedure ensures that the training and test splits never contain the same user. For the age detection task, similarly to (Rangel et al., 2016), we considered age ranges as the classification classes. More precisely, we took into account two different age group splits. The first, which we will refer to as 5-class, splits the documents into 5 age groups: 20-29, 30-39, 40-49, 50-59 and 60-69. The second, which we will refer to as 2-class, is composed of the age groups ≤29 and 50-69 (excluding all documents written by users who do not belong to these groups).

We conducted two different kinds of experiments. In the first (in-domain), we evaluated the performance of the classifier on in-domain texts: we selected three different topics from the main corpus and, for each topic, trained the classifier on 80% of the data and evaluated it on the remaining 20%. For this experiment we chose the following domains: Sport, Watches and Cars. In the second experiment (out-domain), we trained the classifier on all three topics used for the in-domain experiments and evaluated its performance on three other topics (Smoking, Celebrities, Metal Detecting).

In addition, we devised three different machine learning models based on three different feature sets: the first (Lexicon) uses only the word and lemma features; the second (Syntax) uses only the morpho-syntactic and syntactic features; the last (All) uses the lexical, morpho-syntactic and syntactic features together. As a baseline we considered a model that always predicts the most frequent class.

4.1 Results

Tables 2 and 3 report the results achieved by the classifier in the in-domain and out-domain experiments, respectively. In all the experiments, the results achieved by our classifier are higher than the baseline, showing that there are features able to discriminate among the considered classes. The in-domain results show that the lexical features have the most discriminative power with respect to the syntactic ones. The f-score achieved by the Lexicon model is 3-4 times better than the baseline in the 5-class setting, and on average about 2 times better in the 2-class setting. The Syntax model showed very good results, although, as expected, lower than those achieved by the Lexicon model. This is an important result, since it shows that syntax and morpho-syntax are relevant characteristics of each age group, both in the 5-class and 2-class settings. Surprisingly, the All model did not show a consistent increase in classification performance in any experiment. The classification patterns revealed in the in-domain experiments also appear in the out-domain experiments. The results achieved in this setting are, as expected, lower than those achieved in the in-domain setting. The 5-class experiments show a drop in performance of 8-10 f-score points on average with respect to the in-domain experiments. When we move to the 2-class experiments, no significant drop in performance is observed. This shows that, in the case of domain shifting, the machine learning models are still able to discriminate well between young and older people.

Figures 1 and 2 report the confusion matrices of the in-domain and out-domain experiments using the 5-class age groups. More precisely, the in-domain confusion matrix is obtained by training the All model on the three in-domain training topics and testing it on the respective test sets (f-score: 0.47). Similarly, the out-domain confusion matrix is obtained by training the All model on all the in-domain topics (including their test sets) and testing it on the out-domain documents of the three selected topics.
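Under assumed data structures, the in-domain/out-domain split protocol described above can be sketched as follows. This is a hypothetical helper, not the authors' code; only the topic names and the 80/20 proportion come from the paper, while the shuffling, seeding and the `docs` mapping are illustrative assumptions.

```python
from collections import defaultdict
import random

def make_splits(docs, in_domain=("Sport", "Watches", "Cars"),
                out_domain=("Smoking", "Celebrities", "Metal Detecting"),
                train_frac=0.8, seed=0):
    """Build the experimental splits.

    `docs` maps topic -> list of per-user documents.
    In-domain topics: 80/20 train/test split inside each topic.
    Out-domain topics: train on all in-domain data, test on the topic.
    """
    rng = random.Random(seed)
    splits = {}
    out_train = []
    for topic in in_domain:
        pool = docs[topic][:]
        rng.shuffle(pool)
        cut = int(len(pool) * train_frac)
        splits[topic] = {"train": pool[:cut], "test": pool[cut:]}
        out_train.extend(pool)  # out-domain training uses all in-domain data
    for topic in out_domain:
        splits[topic] = {"train": out_train, "test": docs[topic]}
    return splits

docs = {t: [f"{t}_{i}" for i in range(10)]
        for t in ("Sport", "Watches", "Cars", "Smoking",
                  "Celebrities", "Metal Detecting")}
s = make_splits(docs)
print(len(s["Sport"]["train"]), len(s["Sport"]["test"]))  # 8 2
print(len(s["Smoking"]["train"]))                          # 30
```

Note that, as stated above, the out-domain training set deliberately includes the in-domain test portions, since those documents are still out of domain with respect to the out-domain test topics.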
                           5-class                           2-class
Topic           Baseline  Lexicon  Syntax   All   Baseline  Lexicon  Syntax   All
Sport               0.27     0.45    0.42  0.48       0.74     0.74    0.75  0.75
Watches             0.19     0.43    0.35  0.42       0.44     0.85    0.75  0.83
Cars                0.12     0.54    0.34  0.45       0.47     0.87    0.77  0.84

Table 2: Results achieved in the in-domain experiments in terms of f-score

                           5-class                           2-class
Topic           Baseline  Lexicon  Syntax   All   Baseline  Lexicon  Syntax   All
Smoking             0.14     0.30    0.25  0.32       0.42     0.79    0.68  0.79
Celebrities         0.33     0.45    0.39  0.47       0.62     0.83    0.73  0.81
Metal Detecting     0.21     0.36    0.27  0.34       0.52     0.80    0.66  0.78

Table 3: Results achieved in the out-domain experiments in terms of f-score

[Figure 1: Confusion matrix calculated on the documents belonging to the in-domain topics]
[Figure 2: Confusion matrix calculated on the documents belonging to the out-domain topics]

As can be seen, the errors in both the in-domain and out-domain experiments show a very good behaviour of the classifier: in case of error, it usually misses the correct class by a range of ±10 years. These results also show that the automatically built corpus is a very useful resource for the age classification task. Finally, it is interesting to notice that the best predicted classes are the ranges 20-29 and 40-49, both in the in-domain and out-domain settings, while the worst predicted class in both experiments is the 60-69 age range, most probably because it is the most underrepresented class in the training set.

5 Conclusions

We presented the first automatically built corpus for the age detection task for the Italian language. By exploiting the publicly available information on the ForumFree platform, we built a corpus consisting of more than 2,400,000 posts by more than 7,000 different users, containing the users' age information. The first experiments, performed with a machine learning based classifier that uses a wide range of linguistic features, showed promising results in two different range classification tasks, both in the in-domain and out-domain settings.
The conducted experiments show that the lexicon plays a fundamental role in the age classification task, both in in-domain and out-domain scenarios. Lastly, the experiments showed that the corpus, even though automatically generated, is suitable for real-world applications. We plan to release the full corpus as soon as the privacy and legal issues have been fully investigated.

Acknowledgments

This work was partially supported by the 2-year project ARTILS (Augmented RealTime Learning for Secure workspace), funded by Regione Toscana (BANDO POR FESR 2014-2020).

References

Giuseppe Attardi, Felice Dell'Orletta, Maria Simi and Joseph Turian. 2009. Accurate dependency parsing with a stacked multilayer perceptron. In Proceedings of the 2nd Workshop of Evalita 2009, December, Reggio Emilia, Italy.

Andrea Cimino and Felice Dell'Orletta. 2016. Building the state-of-the-art in POS tagging of Italian tweets. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA), December 5-7.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874.

Delia Irazú Hernández Farías, Viviana Patti and Paolo Rosso. 2016. Irony detection in Twitter: The role of affective content. ACM Transactions on Internet Technology (TOIT), volume 15, number 3.

Pashutan Modaresi, Matthias Liebeck and Stefan Conrad. 2016. Exploring the effects of cross-genre machine learning for author profiling in PAN 2016. In Working Notes of CLEF 2016 - Conference and Labs of the Evaluation Forum, Évora, Portugal, 5-8 September, 2016.

Alexander Pak and Patrick Paroubek. 2010. Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2010), 17-23 May 2010, Valletta, Malta.

Francisco Manuel Rangel Pardo, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin Potthast and Benno Stein. 2016. Overview of the 4th Author Profiling Task at PAN 2016: Cross-genre evaluations. In Working Notes of CLEF 2016 - Conference and Labs of the Evaluation Forum, Évora, Portugal, 5-8 September, 2016.

Takumi Takahashi, Takuji Tahara, Koki Nagatani, Yasuhide Miura, Tomoki Taniguchi and Tomoko Ohkuma. 2018. Text and image synergy with feature cross technique for gender identification. In Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, 10-14 September, 2018.

Mart Busger op Vollenbroek, Talvany Carlotto, Tim Kreutz, Maria Medvedeva, Chris Pool, Johannes Bjerva, Hessel Haagsma and Malvina Nissim. 2016. GronUP: Groningen user profiling. In Working Notes of CLEF 2016 - Conference and Labs of the Evaluation Forum, Évora, Portugal, 5-8 September, 2016.