Quanti anni hai? Age Identification for Italian

Aleksandra Maslennikova•, Paolo Labruna•, Andrea Cimino⋄, Felice Dell'Orletta⋄

• Università di Pisa
a.maslennikova@studenti.unipi.it, pielleunipi@gmail.com
⋄ Istituto di Linguistica Computazionale "Antonio Zampolli" (ILC-CNR)
ItaliaNLP Lab - www.italianlp.it
{andrea.cimino, felice.dellorletta}@ilc.cnr.it

Abstract

English. We present the first work, to our knowledge, on automatic age identification for Italian texts. For this work we built a dataset consisting of more than 2,400,000 posts extracted from publicly available forums and containing authorship attribution metadata, such as age and gender. We developed an age classifier and performed a set of experiments with the aim of evaluating whether the correct age of a user can be assigned, and which information is useful to tackle this task: lexical information, or linguistic information spanning different levels of linguistic description. The performed experiments show the importance of lexical information in age classification, but also that a writing style exists that relates to the age of a user.

Italiano. In this paper we present the first work, to our knowledge, on automatic age recognition for the Italian language. To carry out this work we built a dataset composed of more than 2,400,000 posts extracted from public forums and associated with information about the age and gender of their authors. We developed a system for classifying the age of the writer of a text and conducted a series of experiments to evaluate whether age can be determined, and through which information extracted from the text: lexical information, or linguistic description at different levels. The results obtained demonstrate the importance of the lexicon in classification, but also the existence of a writing style correlated with age.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Social media platforms such as Facebook, Twitter and public forums allow users to communicate, share their opinions and build social relations. The proliferation of such platforms has allowed the scientific community to study many communication phenomena, such as sentiment analysis (Pak and Paroubek, 2010) or irony detection (Hernández Farías et al., 2016). Another related research field is "author profiling", in which the features that allow one to discriminate the age, gender or native language of a person are analyzed. These studies are conducted for both forensic and marketing reasons, since the classification of these characteristics allows companies to better focus their marketing campaigns. In the author profiling scenario, many studies have been conducted by the scientific community, generally focused on the English and Spanish languages. The majority of these studies were performed in PAN¹ (Rangel et al., 2016), a lab held each year at CLEF² in which many shared tasks related to the authorship attribution research topic are run. In these shared tasks, participants were asked to identify the gender or age of authors using manually annotated training data from social media platforms. Among the approaches proposed by participants, the ones that achieved the best results (op Vollenbroek et al., 2016; Modaresi et al., 2016) are based on SVM classifiers exploiting a wide variety of lexical and linguistic features, such as word n-grams, part-of-speech tags and syntax. Only recently have deep learning based approaches been proposed, and they have shown very good results, especially when dealing with multi-modal data, i.e. text and images posted on Twitter (Takahashi et al., 2018).

¹ https://pan.webis.de/
² http://www.clef-initiative.eu/association/steering-committee

In the present work we tackle a specific authorship attribution task: age detection for the Italian language.
We didn’t have a preassigned settled ian language. To our knowledge, this is the first list of possible topics. Instead, we were adding time that such task is performed on Italian. For this them in the process. For example, if we have an reason, we built a multi–topic corpus, developed a entire forum which discusses about only watches, classifier which exploits a wide range of linguis- we wouldn’t assign some general ”Hobby” tag, but tic features, and conducted several experiments to we would create a special group ”Watches” specif- evaluate both the newly introduced corpus and the ically for this forum. classifier. At the and of the collection process, we col- The main contributions of this work are: i) an lected 2.445.012 posts from 7.023 different users automatically built corpus for the age detection and 162 forums, that we divided in 30 different task for the Italian language; ii) the development topic groups. All the information regarding the of an age detection system; iii) the study of the dataset are shown in Table 1. impact of linguistic and lexical features. 3 The Age classifier 2 Dataset construction We implemented a document age classifier that With the aim of building an automatic dataset from operates on morpho–syntactically tagged and de- the web, we needed a set of Italian texts with the pendency parsed texts. The classifier exploits age of authors publicly available. Nowadays col- widely used lexical, morpho-syntatic and syntac- lecting this information is a challenging task, since tic features that are used to build the final statisti- the majority of the available platforms, for the sake cal model. This statistical model is finally used of privacy, prefer not to make the user’s age public. to predict the age range of unseen documents. So, first-of-all, we had to find a website with such We used linear SVM implemented in LIBLIN- data. 
We choose the ForumFree platform3 which EAR (Rong-En et al., 2008) as machine learning allows users to create their own forums without algorithm. The input documents were automati- any coding skills, using an existing template. Hav- cally POS tagged by the Part–Of–Speech tagger ing all the forums based on the same templates described in (Cimino and Dell’Orletta, 2016) and makes them perfect for automated crawling. We dependency–parsed by the DeSR parser (Attardi et extracted all the posts of the users that decided to al., 2009). show publicly their age. We tried to collect the data from the top 200 most active forums. Not all 3.1 Features the forums had users with all the user information Raw and Lexical Text Features filled and, in the end of the processes, we fetched Word n-grams, calculated as presence or absence messages from 162 different forums. Since our of a word n-gram in the text. goal was to build a corpus with author profiling Lemma n-grams, calculated as the frequency of purposes, and such task is very difficult with very each lemma n-gram in the text and normalized small comments, we selected only posts with a with respect to the number of tokens in the text. minimum length of 20 words. Morpho–syntactic Features Another problem we faced is that users are not Coarse and fine grained Part-Of-Speech n- age-balanced in the forums: for example, anime grams, calculated as the logarithm of the fre- dedicated forum have mostly users aged under quency of each coarse/fine grained PoS n-gram in 35. Another example are cars dedicated forums, the text and normalized with respect to the number where usually users are more mature with respect of tokens of the text. to anime forums. Only a couple of forums have Syntactic Features very balanced information, which usually is the Linear dependency types n-grams, calculated as best data for training machine learning based clas- the frequency of each dependency n-gram in the sifiers. 
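As a concrete illustration, the lexical and morpho-syntactic n-gram feature computations described above can be sketched as follows. This is a minimal sketch with invented function names and toy Italian tokens, not the authors' code; the resulting sparse feature dictionaries would then be fed to a linear SVM such as LIBLINEAR.

```python
import math
from collections import Counter

def ngrams(seq, n):
    """All contiguous n-grams of a token sequence."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def extract_features(tokens, lemmas, pos_tags, max_n=2):
    """Sparse feature dict mirroring the feature sets above (illustrative)."""
    feats = {}
    n_tok = len(tokens)
    for n in range(1, max_n + 1):
        # Word n-grams: binary presence/absence.
        for g in set(ngrams(tokens, n)):
            feats[("word", g)] = 1.0
        # Lemma n-grams: frequency normalized by the token count.
        for g, c in Counter(ngrams(lemmas, n)).items():
            feats[("lemma", g)] = c / n_tok
        # PoS n-grams: logarithm of the normalized frequency.
        for g, c in Counter(ngrams(pos_tags, n)).items():
            feats[("pos", g)] = math.log(c / n_tok)
    return feats

doc_tokens = ["gli", "orologi", "sono", "belli"]
doc_lemmas = ["il", "orologio", "essere", "bello"]
doc_pos    = ["DET", "NOUN", "VERB", "ADJ"]
fv = extract_features(doc_tokens, doc_lemmas, doc_pos)
print(fv[("word", ("orologi",))])         # 1.0
print(fv[("lemma", ("il", "orologio"))])  # 0.25
```

Keeping every feature family in one dictionary keyed by (type, n-gram) makes it easy to train the Lexicon, Syntax and All model variants by simply filtering on the feature type.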
Topic                      ≤20   21-30   31-40   41-50   51-60    ≥61
Cars              Users     36     158     187     209     158     45
                  Posts   6056   50281   46746   62002   48939  15867
Bicycles          Users     10      11      12      35      25      1
                  Posts   2056    2284    5532   13418   16959      6
Smoking           Users      3      52      78      69      46     18
                  Posts      7   21399   41470   38149   17981   4742
Anime/Manga       Users    392     438     142      62      16      6
                  Posts  60367   99165   39939   29086    3873    228
Role playing      Users    115     104      14       8       6      7
                  Posts  22953   40652    3893    3945     534   2060
Gaming            Users    235     358     113     131      48      7
                  Posts  54584   81535   20379   20055    4560   1323
Spirituality      Users     11      25      21      13      11      2
                  Posts    336    1427    1342    1095    1517    965
Aesthetic medicine Users     7      36      27      29      17      1
                  Posts   1345    6135   11767    8208    3384      1
Sport             Users    215     338     192     136      52     24
                  Posts  82495  310220  158382  103027   34627  16084
Culinary          Users      0       1       4      10       4      4
                  Posts      0      52   10130    2414     747    438
Pets              Users     10      21      11       4       2      3
                  Posts   4307   13222    7357    2592    5383  10353
Celebrities       Users     21      76      26      24      17      4
                  Posts    548   21114    5820    6150    3139   1248
Politics          Users      0       2       4      10       6      0
                  Posts      0     330    2801    3548     576      0
Different topics  Users     52      45      34      43      34     15
                  Posts   9453   12000   21667   16316    4759  24418
Fishing           Users     11      57      79      62      30      5
                  Posts   3040   14805   24306   17131   13155   8356
Institution community Users  6       6       0       2       5      1
                  Posts     13      12       0      18   11130   4364
Rail transport modelling Users 0     6       7       5       5      1
                  Posts      0    3597    2289     999    2470    751
Culture           Users      4      10       4       7       4      0
                  Posts   1855     560     653    1174     219      0
Tourism           Users      0       2       2       4       1      2
                  Posts      0      16      10    1378       2     14
Sexuality         Users     11      31      18      10       2      1
                  Posts    185    2540    8201    1421       7   1179
Metal Detecting   Users     25      34      78     121      55     11
                  Posts   7750    9830   19299   31288   16547   3529
Music             Users     12      25      15       0       0      0
                  Posts   8731   15720    5276       0       0      0
Parenting         Users      1       4       1       1       0      0
                  Posts    719    2250     626     420       0      0
Technologies      Users     37      47      12       4       8      5
                  Posts    185     266     431      26      19     23
Nature            Users      5       9      10       6       6      2
                  Posts    998    1304    3653    2171     292     10
Religion          Users      0       5       6       1       0      0
                  Posts      0    2618    4125     896       0      0
Films             Users     25      26      10       5       1      2
                  Posts   9476    6135     503      43       4   2477
Psychology        Users     12      14       2       0       1      2
                  Posts    291     912      44       0       1     11
Gambling          Users      0       3       3      10      11      7
                  Posts      0     458     134     364     715    274
Watches           Users     29     153     317     302     109     32
                  Posts   5158   52623  114074  101869   50243  18085

Table 1: Distribution of the number of users and posts per age gap in the different topics of the corpus

4 Experiments

In order to test the corpus and the classifier, we performed a set of experiments. The experiments were devised to test real-world scenarios in which 1) we were interested in classifying a set of posts written by a single user, rather than a single post; and 2) we always classified unseen users, i.e. no training data was available for those users. For these reasons, we merged all the posts of each single user in the original corpus into a single document.
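The per-user document construction just described can be sketched as below. This is an illustrative Python sketch, not the authors' code; the 200-token minimum, the 1,000-token soft limit and the 5-class age groups are the values used in the experiments, while all function and variable names are invented.

```python
def merge_user_posts(posts, min_tokens=200, soft_limit=1000):
    """Merge one user's tokenized posts into a single document.

    Users with fewer than `min_tokens` tokens overall are discarded;
    posts are concatenated until `soft_limit` is reached, and the post
    that crosses the limit is still included in full.
    """
    if sum(len(p) for p in posts) < min_tokens:
        return None  # not enough text for this user
    merged = []
    for post in posts:
        merged.extend(post)
        if len(merged) >= soft_limit:
            break  # soft limit exceeded: keep the whole last post, stop
    return merged

def age_class_5(age):
    """Map an age to the 5-class label (20-29 ... 60-69), else None."""
    if 20 <= age <= 69:
        decade = age // 10 * 10
        return f"{decade}-{decade + 9}"
    return None

posts = [["ciao"] * 150, ["come"] * 900, ["stai"] * 50]
print(len(merge_user_posts(posts)))  # 1050: the crossing post is kept whole
print(age_class_5(34))               # 30-39
```

Because each user contributes exactly one merged document, a document-level split automatically guarantees that no user appears in both training and test data.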
We then considered only the users who wrote a minimum of 200 tokens, and limited the final merged document to a 'soft' limit of 1,000 tokens per user: when the soft limit was exceeded, we still included the whole post that exceeded it. The described procedure ensures that the training and test splits never contain the same user. For the age detection task, similarly to (Rangel et al., 2016), we considered age ranges as the classification classes. More precisely, we took into account two different age group splits. The first, which we will refer to as 5-class, splits the documents into 5 age groups: 20-29, 30-39, 40-49, 50-59 and 60-69. The second, which we will refer to as 2-class, is composed of the age groups ≤29 and 50-69 (excluding all documents written by users who do not belong to these groups).

We conducted two different kinds of experiments. In the first (in-domain), we evaluated the performance of the classifier on in-domain texts: we selected three different topics from the main corpus and, for each topic, trained the classifier on 80% of the data and evaluated it on the remaining 20%. For this experiment we chose the following domains: Sport, Watches and Cars. In the second experiment (out-domain), we trained the classifier on all three topics used for the in-domain experiments and evaluated its performance on three other topics (Smoking, Celebrities, Metal Detecting).

In addition, we devised three different machine learning models based on three different feature sets: the first (Lexicon) uses only the word and lemma features; the second (Syntax) uses only the morpho-syntactic and syntactic features; the last (All) uses the lexical, morpho-syntactic and syntactic features together. As a baseline we considered a model that always predicts the most frequent class.

4.1 Results

Tables 2 and 3 report the results achieved by the classifier in the in-domain and out-domain experiments, respectively. In all the experiments, the results achieved by our classifier are higher than the baseline, showing that there are features able to discriminate among the considered classes. The in-domain results show that the lexical features have the most discriminative power with respect to the syntactic ones. The f-score achieved by the Lexicon model is 3-4 times better than the baseline in the 5-class setting, and on average about 2 times better in the 2-class setting. The Syntax model showed very good results, although, as expected, lower than those achieved by the Lexicon model. This is an important result, since it shows that syntax and morpho-syntax are relevant characteristics of each age group, both in the 5-class and 2-class settings. Surprisingly, the All model did not show a consistent increase in classification performance in any experiment. The classification patterns revealed in the in-domain experiments also appear in the out-domain experiments. The results achieved in this setting are, as expected, lower than those achieved in the in-domain setting. The 5-class experiments show a drop in performance of 8-10 f-score points on average with respect to the in-domain experiments. When we move to the 2-class experiments, no significant drop in performance is observed. This shows that, in the case of domain shifting, the machine learning models are still able to discriminate well between young and older people.

Figures 1 and 2 report the confusion matrices of the in-domain and out-domain experiments using the 5-class age groups. More precisely, the in-domain confusion matrix is obtained by training the All model on the three in-domain training topics and testing it on the respective test sets (f-score: 0.47). Similarly, the out-domain confusion matrix is obtained by training the All model on all the in-domain topics (including their test sets) and testing it on the out-domain documents of the three selected topics.
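Under assumed data structures, the in-domain/out-domain split protocol described above can be sketched as follows. This is a hypothetical helper, not the authors' code; only the topic names and the 80/20 proportion come from the paper, while the shuffling, seeding and the `docs` mapping are illustrative assumptions.

```python
from collections import defaultdict
import random

def make_splits(docs, in_domain=("Sport", "Watches", "Cars"),
                out_domain=("Smoking", "Celebrities", "Metal Detecting"),
                train_frac=0.8, seed=0):
    """Build the experimental splits.

    `docs` maps topic -> list of per-user documents.
    In-domain topics: 80/20 train/test split inside each topic.
    Out-domain topics: train on all in-domain data, test on the topic.
    """
    rng = random.Random(seed)
    splits = {}
    out_train = []
    for topic in in_domain:
        pool = docs[topic][:]
        rng.shuffle(pool)
        cut = int(len(pool) * train_frac)
        splits[topic] = {"train": pool[:cut], "test": pool[cut:]}
        out_train.extend(pool)  # out-domain training uses all in-domain data
    for topic in out_domain:
        splits[topic] = {"train": out_train, "test": docs[topic]}
    return splits

docs = {t: [f"{t}_{i}" for i in range(10)]
        for t in ("Sport", "Watches", "Cars", "Smoking",
                  "Celebrities", "Metal Detecting")}
s = make_splits(docs)
print(len(s["Sport"]["train"]), len(s["Sport"]["test"]))  # 8 2
print(len(s["Smoking"]["train"]))                          # 30
```

Note that, as stated above, the out-domain training set deliberately includes the in-domain test portions, since those documents are still out of domain with respect to the out-domain test topics.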
                           5-class                           2-class
Topic           Baseline  Lexicon  Syntax   All   Baseline  Lexicon  Syntax   All
Sport               0.27     0.45    0.42  0.48       0.74     0.74    0.75  0.75
Watches             0.19     0.43    0.35  0.42       0.44     0.85    0.75  0.83
Cars                0.12     0.54    0.34  0.45       0.47     0.87    0.77  0.84

Table 2: Results achieved in the in-domain experiments in terms of f-score

                           5-class                           2-class
Topic           Baseline  Lexicon  Syntax   All   Baseline  Lexicon  Syntax   All
Smoking             0.14     0.30    0.25  0.32       0.42     0.79    0.68  0.79
Celebrities         0.33     0.45    0.39  0.47       0.62     0.83    0.73  0.81
Metal Detecting     0.21     0.36    0.27  0.34       0.52     0.80    0.66  0.78

Table 3: Results achieved in the out-domain experiments in terms of f-score

[Figure 1: Confusion matrix calculated on the documents belonging to the in-domain topics]
[Figure 2: Confusion matrix calculated on the documents belonging to the out-domain topics]

As can be seen, the errors in both the in-domain and out-domain experiments show a very good behaviour of the classifier: in case of error, it usually misses the correct class by a range of ±10 years. These results also show that the automatically built corpus is a very useful resource for the age classification task. Finally, it is interesting to notice that the best predicted classes are the ranges 20-29 and 40-49, both in the in-domain and out-domain settings, while the worst predicted class in both experiments is the 60-69 age range, most probably because it is the most underrepresented class in the training set.

5 Conclusions

We presented the first automatically built corpus for the age detection task for the Italian language. By exploiting the publicly available information on the ForumFree platform, we built a corpus consisting of more than 2,400,000 posts by more than 7,000 different users, containing the users' age information. The first experiments, performed with a machine learning based classifier that uses a wide range of linguistic features, showed promising results in two different range classification tasks, both in the in-domain and out-domain settings.
The conducted experiments show that the lexicon plays a fundamental role in the age classification task, both in in-domain and out-domain scenarios. Lastly, the experiments showed that the corpus, even though automatically generated, is suitable for real-world applications. We plan to release the full corpus as soon as the privacy and legal issues have been fully investigated.

Acknowledgments

This work was partially supported by the 2-year project ARTILS (Augmented RealTime Learning for Secure workspace), funded by Regione Toscana (BANDO POR FESR 2014-2020).

References

Giuseppe Attardi, Felice Dell'Orletta, Maria Simi and Joseph Turian. 2009. Accurate dependency parsing with a stacked multilayer perceptron. In Proceedings of the 2nd Workshop of Evalita 2009, December, Reggio Emilia, Italy.

Andrea Cimino and Felice Dell'Orletta. 2016. Building the state-of-the-art in POS tagging of Italian tweets. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA), December 5-7.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874.

Delia Irazú Hernández Farías, Viviana Patti and Paolo Rosso. 2016. Irony detection in Twitter: The role of affective content. ACM Transactions on Internet Technology (TOIT), volume 15, number 3.

Pashutan Modaresi, Matthias Liebeck and Stefan Conrad. 2016. Exploring the effects of cross-genre machine learning for author profiling in PAN 2016. In Working Notes of CLEF 2016 - Conference and Labs of the Evaluation Forum, Évora, Portugal, 5-8 September, 2016.

Alexander Pak and Patrick Paroubek. 2010. Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2010), 17-23 May 2010, Valletta, Malta.

Francisco Manuel Rangel Pardo, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin Potthast and Benno Stein. 2016. Overview of the 4th Author Profiling Task at PAN 2016: Cross-genre evaluations. In Working Notes of CLEF 2016 - Conference and Labs of the Evaluation Forum, Évora, Portugal, 5-8 September, 2016.

Takumi Takahashi, Takuji Tahara, Koki Nagatani, Yasuhide Miura, Tomoki Taniguchi and Tomoko Ohkuma. 2018. Text and image synergy with feature cross technique for gender identification. In Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, 10-14 September, 2018.

Mart Busger op Vollenbroek, Talvany Carlotto, Tim Kreutz, Maria Medvedeva, Chris Pool, Johannes Bjerva, Hessel Haagsma and Malvina Nissim. 2016. GronUP: Groningen user profiling. In Working Notes of CLEF 2016 - Conference and Labs of the Evaluation Forum, Évora, Portugal, 5-8 September, 2016.