=Paper= {{Paper |id=Vol-1180/CLEF2014wn-Pan-RangelEt2014 |storemode=property |title=Overview of the Author Profiling Task at PAN 2014 |pdfUrl=https://ceur-ws.org/Vol-1180/CLEF2014wn-Pan-RangelEt2014.pdf |volume=Vol-1180 |dblpUrl=https://dblp.org/rec/conf/clef/PardoRCPTSVD14 }} ==Overview of the Author Profiling Task at PAN 2014== https://ceur-ws.org/Vol-1180/CLEF2014wn-Pan-RangelEt2014.pdf
Overview of the 2nd Author Profiling Task at PAN 2014

Francisco Rangel1,2, Paolo Rosso2, Irina Chugur3, Martin Potthast4,
Martin Trenkmann4, Benno Stein4, Ben Verhoeven5, Walter Daelemans5

1 Autoritas Consulting, S.A., Spain
2 Natural Language Engineering Lab, Universitat Politècnica de València, Spain
3 Universidad Nacional de Educación a Distancia, Madrid, Spain
4 Web Technology & Information Systems, Bauhaus-Universität Weimar, Germany
5 CLiPS - Computational Linguistics Group, University of Antwerp, Belgium

                        pan@webis.de          http://pan.webis.de



         Abstract This overview presents the framework and the results of the Author
         Profiling task at PAN 2014. This year's objective is to analyse how well the
         detection approaches adapt to different genres. For this purpose, a corpus with
         four different parts (subcorpora) has been compiled: social media, Twitter,
         blogs, and hotel reviews. The Twitter subcorpus was constructed in cooperation
         with RepLab in order to also investigate a reputational perspective. Altogether,
         the approaches of 10 participants are evaluated.


1      Introduction
Despite the enormous impact of social media on our daily life, we observe a lack of
information about those who create the contents. In this regard, author profiling tries to
determine the gender, age, native language, or personality type of authors by analysing
their published texts. Author profiling is of growing importance: for example, from a
marketing viewpoint, companies may be interested in knowing the demographics of their
target group in order to achieve a better market segmentation; from a forensic viewpoint,
determining the linguistic profile of a person who wrote a "suspicious text" may provide
valuable background information.
    In the Author Profiling task at PAN 2013,1 the identification of age and gender
relied on a large corpus collected from social media [28]. This year, in PAN 2014,2 we
continue focusing on age and gender aspects but, in addition, compiled a corpus of four
different genres, namely social media, blogs, Twitter, and hotel reviews. Except for the
hotel review subcorpus, which is available in English only, all documents are provided
in both English and Spanish. Note that most of the existing research in computational
linguistics [3] and social psychology [26] focuses on the English language, and the
question is whether the observed relations pertain to other languages as well.
    The remainder of this paper is organised as follows. Section 2 covers the state of
the art, Section 3 describes the corpus and evaluation measures, and Section 4 presents
the approaches submitted by the participants. Sections 5 and 6 discuss the results and
draw conclusions, respectively.

 1 http://webis.de/research/events/pan-13/pan13-web/author-profiling.html
 2 http://webis.de/research/events/pan-14/pan14-web/author-profiling.html


2   Related Work
The study of how certain linguistic features vary according to the profile of their authors
is a subject of interest for several different areas such as psychology, linguistics and,
more recently, computational linguistics. Pennebaker et al. [27] connected language use
with personality traits, studying how the variation of linguistic characteristics in a text
can provide information regarding the gender and age of its author. Argamon et al. [3]
analysed formal written texts extracted from the British National Corpus, combining
function words with part-of-speech features and achieving approximately 80% accu-
racy in gender prediction. Other researchers (Holmes and Meyerhoff [14], Burger and
Henderson [5]) have also investigated how to obtain age and gender information from
formal texts.
    With the rise of social media, the focus has shifted to other kinds of writing that
are more colloquial and less structured and formal, such as blogs or fora. Koppel et
al. [15] studied the problem of automatically determining an author’s gender by proposing
combinations of simple lexical and syntactic features, achieving approximately 80%
accuracy. Schler et al. [29] studied the effect of age and gender on writing style in
blogs; they gathered over 71,000 blogs and obtained a set of stylistic features such as
non-dictionary words, parts of speech, function words and hyperlinks, combined with
content features such as the word unigrams with the highest information gain. They
obtained an accuracy of about 80% for gender identification and about 75% for age
identification. They modelled age in three classes: 10s (13-17), 20s (23-27) and 30s
(33-47). They demonstrated that language features in blogs correlate with age, as
reflected in, for example, the use of prepositions and determiners. Goswami et al. [12]
added new features such as slang words and the average sentence length, improving
accuracy to 80.3% in age group identification and to 89.2% in gender detection.
    Note that the previously described studies were conducted with texts of at least
250 words. The amount of data is known, however, to be an important factor in machine
learning algorithms of this type. In fact, Zhang and Zhang [34] experimented with short
segments of blog posts, specifically 10,000 segments with 15 tokens per segment, and
obtained 72.1% accuracy for gender prediction, as opposed to more than 80% in the
previous studies. Similarly, Nguyen et al. [23] studied the use of language and age
among Dutch Twitter users, where the documents are really short, with an average length
of less than 10 terms. They modelled age as a continuous variable (as they had
previously done in [22]), and used an approach based on logistic regression. They also
measured the effect of gender on the performance of age identification, considering
both variables as inter-dependent, and achieved correlations up to 0.74 and mean
absolute errors between 4.1 and 6.8 years.
    One common problem when investigating author profiling is the need to obtain
labelled data for the authors, that is, their age and gender. Studies of classical
literature deal with a small number of well-known authors, where manual labelling can
easily be applied. However, at the scale of actual social media data this is a more
difficult task, which should be automated. In some cases, researchers manually label
the collection [23] with some risk of bias. In other cases, as in the vast majority of
the aforementioned studies, researchers took into account information provided by the
authors themselves. For example, in blog platforms, the contributors self-specify their
profiles. This is the case for Peersman et al. [25] who retrieved a dataset from Netlog,3
where authors report their gender and exact age, and Koppel et al. [15], who retrieved
the dataset from Blogspot.4 This is likely to introduce some noise to the evaluation set,
but it also reflects the realistic state of the available data.
    The task of obtaining author profiles is of growing interest in the scientific
community, as can be seen in the number of related tasks that have arisen in the last
two years: a) the shared task on Native Language Identification at the BEA-8 Workshop at
NAACL-HLT 2013;5 b) the task on Computational Personality Recognition (WCPR) at
ICWSM 20136 and at ACM Multimedia 2014;7 and c) the task on Author Profiling at
PAN 2013 and PAN 2014.
    With respect to the task on Author Profiling at PAN 2013 [28], most of the par-
ticipants used combinations of style-based features such as frequency of punctuation
marks, capital letters, quotations, and so on, together with POS tags and content-based
features such as Latent Semantic Analysis, bag-of-words, TF-IDF, dictionary-based
words, topic-based words, and so on. It is worth mentioning the usage of second order
representations based on relationships between documents and profiles by the winner
of the PAN-AP 2013 task [16] and the use of collocations for the winner of the English
task [21].
    Last but not least, the interest in different author profile aspects is also evident
on the Kaggle platform,8 where companies and research departments shared their needs and
independent researchers joined challenges such as Psychopathy Prediction Based on Twitter
Usage,9 Personality Prediction Based on Twitter Stream,10 or Gender Prediction from
Handwriting.11 This shows the rise of interest from industry in author profiling.

 3 http://www.netlog.com
 4 http://blogspot.com
 5 https://sites.google.com/site/nlisharedtask2013/
 6 http://mypersonality.org/wiki/doku.php?id=wcpr13
 7 https://sites.google.com/site/wcprst/home/wcpr14
 8 http://www.kaggle.com/
 9 http://www.kaggle.com/c/twitter-psychopathy-prediction
10 http://www.kaggle.com/c/twitter-personality-prediction
11 http://www.kaggle.com/c/icdar2013-gender-prediction-from-handwriting

3     Evaluation Framework
In this section we describe the construction of the corpus, covering particular properties,
challenges, and novelties. Finally, the evaluation measures are described.

3.1   Corpus
In order to study how the different author profiling approaches apply to different genres,
we have built a corpus with four different genres: social media, blogs, Twitter, and hotel
reviews. The respective subcorpora cover English and Spanish, with the exception of
the hotel reviews, which have been provided in English only. The corpus documents
are encoded as XML files, one per author, with the contents between  tags.
The author is labeled with age and gender information. For labeling age, instead of the
three age classes a) 10s (13-17); b) 20s (23-27); c) 30s (33-47) used in PAN-AP 2013,
this year we opted for modelling age in a more fine-grained way and considered the
following classes: a) 18-24; b) 25-34; c) 35-49; d) 50-64; e) 65+ .
    As in the previous edition, each subcorpus was split into three parts for training,
early birds, and test respectively.
Social Media We have built the social media subcorpus by selecting a part of the
PAN-AP-13 corpus. We selected those authors whose posts contain more than 100 words on
average. We also manually reviewed the documents in order to remove authors who appear
to be fake profiles, such as bots: for example, authors selling the same product (e.g.,
mobiles, ads) in most of their posts, or authors with a large amount of text reuse (e.g.,
teenagers sharing poetry or homework). The final distribution of the number of authors is
shown in Table 1. The social media subcorpus is balanced by gender, so each gender
accounts for half of the authors.

        Table 1. Distribution of social media with respect to age classes per language.

                       Training                Early birds                Test
                 English      Spanish     English     Spanish      English       Spanish
        18-24       1550           330        140             30       680          150
        25-34       2098           426        180             36       900          180
        35-49       2246           324        200             28       980          138
        50-64       1838           160        160             14       790           70
        65+           14            32         12             14        26           28
        Σ           7746          1272        692            122     3376           566


Blogs The objective of collecting blogs is to build a gold standard for author profiling
in this specific genre. To achieve this objective, we manually selected and annotated
the documents. Firstly, we looked for public LinkedIn profiles that share a personal
blog URL. We verified that the blog exists, that it is written in one of the languages we
are interested in (English or Spanish), and that it is updated by only one person who is
easily identifiable. We discarded organizational blogs when we were not sure that the
blog was updated by the person identified in the LinkedIn profile. Secondly, we looked
for age information. In some cases the birth date is published in the user’s profile, but
in most cases it is not, so we looked for the degree starting date in the education
section. We used the information shown in Table 2 to determine the age range. We
discarded users whose education dates were not clear. Thirdly, if we could determine the
age, we identified the gender from the user’s photograph and name. Again, in cases where
the gender information was not clear, we discarded the user. Finally, this process was
carried out by two independent annotators, and a third one decided in case of
disagreement. For each blog, we provided up to 25 posts. We provided contents obtained
from the RSS feed, but we allowed users to download the full text from the permalink.




                              Table 2. Age range by degree starting date.

                                  Degree starting date      Age group
                                       2006-. . .             18-24
                                      1997-2006               25-34
                                      1982-1996               35-49
                                      1967-1981               50-64
                                       . . . -1966             65+


   The final distribution of the number of authors is shown in Table 3. The blogs
subcorpus is balanced by gender, so each gender accounts for half of the authors.

              Table 3. Distribution of blogs with respect to age classes per language.

                          Training                   Early birds              Test
                    English       Spanish     English       Spanish     English      Spanish
          18-24           6             4              4            2       10            4
          25-34          60            26              6            4       24           12
          35-49          54            42              8            4       32           26
          50-64          23            12              4            2       10           10
          65+             4             4              2            2        2            2
          Σ             147            88            24            14       78           56


Twitter We manually selected and annotated the documents, following the same
methodology as for the blogs. We built this subcorpus in collaboration with RepLab,12
where the main goal of author profiling, viewed in the context of reputation monitoring
on Twitter, is to decide how influential a given user is in the domain to which the
entity under study belongs. This includes determining the type of author (e.g.,
journalist, stakeholder, professional) and their degree of influence on opinions within
the domain. For the shared PAN-RepLab author profiling task, 131 Twitter profiles from
several domains (energy, environmental, banking, automotive, and Corporate Social
Responsibility sectors) were annotated with age and gender. The profiles were selected
from the RepLab 2013 corpus and from a list of influential authors provided by the online
division of a leading Public Relations consultancy (Llorente & Cuenca).13 Note that
balancing the list of profiles by age and gender turned out to be a challenging task,
because influential Twitter authors in the considered economic domains tend to be male
and of quite a narrow age range (35-49). In addition to age and gender, tweets in RepLab
were manually tagged by reputation experts with a) type of author and b) opinion-maker
labels (Influencer, Non-influencer, and Undecidable).
    For more details on the RepLab 2014 author profiling data set please refer to [2].
Due to Twitter's terms of service, we provided the tweet URLs so that participants could
download them. For each Twitter profile, we provided up to 1000 tweets. The final
distribution of the number of authors is shown in Table 4. The Twitter subcorpus is
balanced by gender, so half of the authors are male and the other half are female.

12 http://nlp.uned.es/replab2014
13 http://www.llorenteycuenca.com/

            Table 4. Distribution of Twitter with respect to age classes per language.

                        Training                Early birds                Test
                   English     Spanish      English     Spanish     English       Spanish
        18-24           20          12            2            2          12              4
        25-34           88          42            6            4          56             26
        35-49          130          86           16           12          58             46
        50-64           60          32            4            6          26             12
        65+              8           6            2            2           2              2
        Σ              306         178           30           26        154              90


Hotel Reviews To study the applicability of author profiling approaches to the review
genre, we have compiled the Webis-TripAd-13 corpus, a large corpus of hotel reviews, a
subset of which forms part of the PAN 2014 author profiling evaluation corpus. The corpus
has been carefully constructed to ensure its quality with regard to text cleanliness and
annotation accuracy.
    The Webis-TripAd-13 corpus is derived from another corpus that was originally
used for aspect-level rating prediction [31].14 The original corpus was crawled from the
hotel review site TripAdvisor15 in the period of one month from mid-February to
mid-March 2009, and contains 235,793 reviews of 1,850 different hotels. Each review
comprises its author’s user name, the review text, and the date the review was written.
In addition, there are seven numerical aspect ratings and an overall rating score assigned
by the user, which serve as ground-truth for aspect-level rating prediction or sentiment
analysis tasks in general. However, the original dataset does not feature age and gender
annotations.
    In order to make this dataset applicable to author profiling and to ensure its quality,
we applied the following four post-processing steps. First, we removed short reviews of
fewer than 10 words, which were found to be malformed due to parsing errors. Second, we
removed reviews whose text was not identified as English by a language detector. Third,
since the original dataset does not provide any age and gender information, we compiled a
list of the user names of those who submitted the reviews and crawled the corresponding
user profiles from the TripAdvisor website. Fourth, given this metadata, we discarded all
reviews written by authors whose age and gender were not given on their user profile or
whose user profile was inactive. Moreover, to ensure data quality, we reviewed user
profiles and reviews with regard to sanity (i.e., whether the information given made
sense). The final Webis-TripAd-13 corpus contains 58,101 reviews and covers six age
classes. The distribution of reviews across these classes is shown in columns 3 and 4 of
Table 5.16
14
   http://times.cs.uiuc.edu/~wang296/Data
15
   http://www.tripadvisor.com
16
   This version of the corpus has been released at: http://www.webis.de/research/corpora




    To match the requirements of PAN’s author profiling evaluation corpus, we adapted
the Webis-TripAd-13 corpus as follows: to obtain a nearly uniform age class distribution,
we sampled 700 authors per gender from each of the three major classes (25–34, 35–49,
50–64). For the two minor classes (18–24, 65+), however, the number of authors available
was limited by the size of the smaller gender group within the class, so that 254 authors
(18–24) and 547 authors (65+) per gender remained, respectively. Class 13–17 was
discarded completely since the number of available authors was too small to be
representative for evaluation purposes. The final distribution of the subset of the
Webis-TripAd-13 corpus that forms part of the PAN author profiling evaluation corpus is
shown in Table 5, columns 5–8.
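
The sampling described above can be sketched as follows. This is only an illustration, not the organisers' actual script: the function name balanced_sample, the cap parameter, and the dictionary layout are hypothetical, and the sketch assumes that each age class is capped both by a fixed maximum and by its smaller gender group, which is consistent with the author counts in Table 5.

```python
import random


def balanced_sample(authors_by_class, cap=700, seed=0):
    """Draw a gender-balanced sample of authors per age class.

    authors_by_class maps (gender, age_class) to a list of author ids.
    For each age class, the same number of authors is drawn for both
    genders: at most `cap`, and never more than the smaller gender group.
    (Hypothetical helper; the layout and names are assumptions.)
    """
    rng = random.Random(seed)
    sample = {}
    age_classes = sorted({age for _, age in authors_by_class})
    for age in age_classes:
        females = authors_by_class[("female", age)]
        males = authors_by_class[("male", age)]
        n = min(cap, len(females), len(males))
        sample[("female", age)] = rng.sample(females, n)
        sample[("male", age)] = rng.sample(males, n)
    return sample


# Author counts per gender and age class taken from Table 5 (columns 3).
sizes = {
    ("female", "18-24"): 656, ("male", "18-24"): 254,
    ("female", "25-34"): 7517, ("male", "25-34"): 3816,
    ("female", "65+"): 547, ("male", "65+"): 1079,
}
pool = {key: list(range(n)) for key, n in sizes.items()}
picked = balanced_sample(pool)
counts = {key: len(ids) for key, ids in picked.items()}
```

With the Table 5 counts as input, the sketch yields 254 authors per gender for 18-24, 700 for 25-34, and 547 for 65+, matching the sum of the training and test columns of the table.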

             Table 5. Distribution of reviews with respect to age and gender classes.

                      Webis-TripAd-13        PAN 2014 training set      PAN 2014 test set
Gender     Age      # authors    # reviews    # authors   # reviews    # authors    # reviews
female     13-17          23           23             -           -            -           -
           18-24         656          741           180         208           74          84
           25-34        7517         9504           500         651          200         247
           35-49       10554        13552           500         659          200         255
           50-64        5850         7449           500         617          200         242
           65+           547          682           400         494          147         188
male       13-17          22           25             -           -            -           -
           18-24         254          314           180         228           74          86
           25-34        3816         5144           500         700          200         250
           35-49        8586        12044           500         707          200         302
           50-64        5413         7229           500         669          200         268
           65+          1079         1394           400         520          147         178



3.2    Performance measures
To evaluate the participants’ approaches we used accuracy. More specifically, we
calculated the ratio of the number of authors correctly predicted to the total number
of authors. We calculated accuracy separately for each subcorpus, language, gender,
and age class. Moreover, we calculated a combined accuracy for the joint identification
of age and gender. The final score used to rank the participants is the average of the
combined accuracies over all subcorpora and languages.
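
As a minimal sketch of these measures, the following code computes gender, age, and joint accuracy for one subcorpus and language, and averages the joint accuracies into a final score. The function names and the dictionary layout are illustrative assumptions; treating a missing prediction as an error is also an assumption, since the text does not state how missing predictions affect accuracy.

```python
from statistics import mean


def accuracies(gold, pred):
    """Return gender, age, and joint accuracy for one subcorpus/language.

    gold and pred map an author id to a (gender, age_class) tuple.
    Authors without a prediction count as wrong (an assumption).
    """
    total = len(gold)
    gender_ok = sum(1 for a, (g, _) in gold.items()
                    if a in pred and pred[a][0] == g)
    age_ok = sum(1 for a, (_, c) in gold.items()
                 if a in pred and pred[a][1] == c)
    joint_ok = sum(1 for a, truth in gold.items() if pred.get(a) == truth)
    return {
        "gender": gender_ok / total,
        "age": age_ok / total,
        "joint": joint_ok / total,
    }


def final_score(joint_by_partition):
    """Average the joint accuracies over all subcorpus/language pairs."""
    return mean(joint_by_partition.values())


gold = {
    "a": ("male", "18-24"),
    "b": ("female", "25-34"),
    "c": ("female", "35-49"),
    "d": ("male", "65+"),
}
pred = {
    "a": ("male", "18-24"),    # both correct
    "b": ("female", "35-49"),  # gender correct, age wrong
    "c": ("male", "35-49"),    # gender wrong, age correct
}                              # author "d" has no prediction
scores = accuracies(gold, pred)
ranking_score = final_score({"twitter-en": 0.25, "twitter-es": 0.75})
```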
    We computed the statistical significance of performance differences between systems
using approximate randomisation testing [24].17 As noted by Yeh [33], for comparing
output from classifiers, frequently used statistical significance tests such as paired
t-tests make assumptions that do not hold for precision scores and F-scores. Approximate
randomisation testing does not make these assumptions and can handle complicated
distributions as well as normal distributions. We did a pairwise comparison of the
accuracies of all systems and consider two systems to be significantly different from
each other when p < 0.05. The complete set of statistical significance tests is given
in Appendix A.

17 We used the implementation by Vincent Van Asch available from the CLiPS website:
   http://www.clips.uantwerpen.be/scripts/art
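
To illustrate the idea, the following is a generic sketch of a paired approximate randomisation test on accuracy. It is not the CLiPS implementation referenced above; the function name, the number of rounds, and the smoothed p-value estimate (count + 1) / (rounds + 1) are assumptions of this sketch. Each system is represented by a list with 1 for a correctly predicted author and 0 otherwise, aligned by author.

```python
import random


def approximate_randomisation(correct_a, correct_b, rounds=10000, seed=0):
    """Estimate the p-value for the accuracy difference of two systems.

    correct_a and correct_b are equally long lists of 0/1 values, one per
    author. In every round the two outputs for each author are swapped with
    probability 0.5, and the absolute difference in correct predictions is
    compared with the observed one.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be evaluated on the same authors")
    rng = random.Random(seed)
    observed = abs(sum(correct_a) - sum(correct_b))
    at_least_as_extreme = 0
    for _ in range(rounds):
        sum_a = sum_b = 0
        for a, b in zip(correct_a, correct_b):
            if rng.random() < 0.5:
                a, b = b, a
            sum_a += a
            sum_b += b
        if abs(sum_a - sum_b) >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (rounds + 1)


# Identical systems: every shuffle is at least as extreme, so p is 1.
p_same = approximate_randomisation([1, 0, 1, 1], [1, 0, 1, 1], rounds=200)
# One system always right, the other always wrong: p is very small.
p_diff = approximate_randomisation([1] * 50, [0] * 50, rounds=2000)
```

Because accuracy is the count of correct authors divided by a constant, comparing counts is equivalent to comparing accuracies here.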
    For age identification we also measured the average and standard deviation of
the distance between the predicted and the true class. We define the distance between
classes as the number of hops between them, with a maximum distance of 4 for the two
most distant classes (18-24 and 65+). If a participant did not provide a prediction, we
added 1 to the maximum distance, penalising the missing value with a distance of 5. We
also measured the total time needed to process the test data, in order to investigate
the applicability of the approaches in a real-world setting.
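
The distance measure can be sketched as below. The names are illustrative, a missing prediction is represented by None, and the use of the population standard deviation is an assumption, since the text does not say which variant was used.

```python
from statistics import mean, pstdev

AGE_CLASSES = ["18-24", "25-34", "35-49", "50-64", "65+"]
MAX_DISTANCE = len(AGE_CLASSES) - 1   # 4 hops between 18-24 and 65+
MISSING_PENALTY = MAX_DISTANCE + 1    # 5 for a missing prediction


def age_distance(true_class, predicted_class):
    """Number of hops between two age classes; 5 if no prediction was made."""
    if predicted_class is None:
        return MISSING_PENALTY
    return abs(AGE_CLASSES.index(true_class)
               - AGE_CLASSES.index(predicted_class))


def distance_summary(pairs):
    """Mean and (population) standard deviation over (true, predicted) pairs."""
    distances = [age_distance(t, p) for t, p in pairs]
    return mean(distances), pstdev(distances)


far = age_distance("18-24", "65+")
missing = age_distance("25-34", None)
avg, std = distance_summary([("18-24", "18-24"), ("18-24", "35-49")])
```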

3.3   Software Submissions
For the second time, we invited software submissions instead of run submissions. With
software submissions, participants are asked to submit executables of their author
profiling software instead of just its output (i.e., runs) on a given test set. Our
rationale is to increase the sustainability of our shared task and to allow for the
re-evaluation of approaches to author profiling later on, for example, on future
evaluation corpora. To facilitate software submissions, we develop the TIRA
experimentation platform [9, 10], which makes handling software submissions at scale as
simple as handling run submissions. Using TIRA, participants deploy their software into
virtual machines at our site, which allows us to keep them in a running state [11].


4     Overview of the Submitted Approaches

Ten teams participated in the Author Profiling task. Eight of them submitted a notebook
paper, one further team (liau14) provided us with a description of its approach, and
castillojuarez14 did not report any change with respect to last year’s system [1].
    Pre-processing. Several participants cleaned the HTML and XML to obtain plain
text [18, 19, 4, 13, 32]. One participant [13] removed URLs, user mentions and hashtags
from the Twitter texts. In [4], participants carried out case conversion and deleted
invalid characters and multiple white spaces; the participants in [32] did likewise and
also escaped invalid characters. Only in [30] and [32] did participants perform
tokenisation; in [32] they additionally studied the effect of subset selection, and in
[19] they tried to remove spam bots by deleting contents with a high percentage of the
% character.
    Features. Many participants [20, 19, 13, 4, 32, 18] and liau14 considered different
kinds of stylistic features. For example, frequencies of different punctuation signs
were used in [13, 20, 4, 18, 32]; the size of sentences, words that appear once and
twice, and the use of deflections in [20]; and the number of characters, words and
sentences in [32]. In [19] participants measured the number of posts per user and the
frequency of capital letters and capital words, whereas in [32] participants measured the
correctness, cleanliness and diversity of the texts. Only in [32] and [19] did
participants take advantage of the HTML information, using the occurrence of tags such as
img, href or br. Different readability features were used in [20, 19, 13, 4, 32], for
example the Automated Readability Index [19, 13], the Coleman-Liau Index [19, 13], the
Rix Readability Index [19, 13], the Gunning Fog Index [13], and Flesch-Kincaid [32]. A
lexical analysis was carried out in [20] and [13], where participants employed parts of
speech as features together with the identification of proper nouns or words with
character flooding (e.g., hellooooo). The occurrence of emoticons was used in [18], [19]
and by liau14.
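
To give an idea of what such readability features look like, the sketch below computes two of the indices named above with their standard published formulas. It does not reproduce any participant's implementation; the regular-expression tokenisation and sentence splitting are deliberately naive assumptions.

```python
import re


def text_counts(text):
    """Return (characters, words, sentences) using naive regex splitting."""
    words = re.findall(r"\w+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    characters = sum(len(word) for word in words)
    return characters, len(words), len(sentences)


def automated_readability_index(text):
    """ARI = 4.71 * chars/words + 0.5 * words/sentences - 21.43."""
    characters, words, sentences = text_counts(text)
    return 4.71 * characters / words + 0.5 * words / sentences - 21.43


def coleman_liau_index(text):
    """CLI = 0.0588 * L - 0.296 * S - 15.8, with L and S per 100 words."""
    characters, words, sentences = text_counts(text)
    letters_per_100 = characters / words * 100
    sentences_per_100 = sentences / words * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


sample = "The cat sat. The dog ran."
ari = automated_readability_index(sample)
cli = coleman_liau_index(sample)
```

Both functions assume a non-empty text with at least one word and one sentence; a feature extractor for real posts would need to guard against empty input.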
    With respect to content features, in [30, 18] and (liau14) participants modeled the
language with n-grams or bag-of-words. In [20] they extracted topic words such as
money, home, smartphone, games, sports, job, marketing, etc. In [19] participants used
MRC and LIWC features to extract frequency of words related to different psycholin-
guistic concepts such as familiarity, concreteness, imagery, motion, emotion, religion,
and so on. Some participants used dictionaries to differentiate words per subcorpus and
class [4], identify lexical errors [19], foreign words [13] or specific phrases such as my
husband or my wife [19] and liau14.
    Specific features were used in [32], where participants obtained features employed
in information retrieval (IR) such as the cosine similarity or the Okapi BM25. Finally,
in [19] participants estimated the sentiment of the sentences and in [17] participants
used a second order representation based on relationships among terms, documents,
profiles and subprofiles.
    Classification approaches. All participants approached the task as a machine
learning task. For example, logistic regression was used in [18] and by liau14, and also
in [32], where participants used a different algorithm per subcorpus, for instance
LogitBoost, rotation forest, multi-class classifier, multilayer perceptron and simple
logistic. In [30] participants used multinomial Naïve Bayes, in [17] LIBLINEAR, in [13]
random forests, in [19] support vector machines and in [20] decision tables. In [4]
participants implemented their own frequency-based prediction function.


5   Evaluation and Discussion of the Submitted Approaches

We divided the evaluation into two steps, providing an early bird option for those
participants who wanted to receive some feedback. There were 7 early bird submissions and
eventually 10 for the final evaluation. We show results separately for each corpus part
and each language. Results are given as accuracy in the identification of age and
gender, as well as in the joint identification of both. Results for early birds are
shown in Tables 6 to 9, whereas final results are shown in Tables 10 to 13. For the final
evaluation, a baseline was provided for comparison purposes. This baseline considered
the 1,000 most frequent character trigrams. Some participants did not run their systems
on all of the subcorpora.
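
The feature side of such a baseline can be sketched as follows. Only the use of the 1,000 most frequent character trigrams comes from the text; the function names, the relative-frequency weighting, and the absence of any text normalisation are assumptions of this sketch, and the classifier used on top of the features is not specified here.

```python
from collections import Counter


def char_trigrams(text):
    """Return all overlapping character trigrams of a text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def top_trigrams(documents, k=1000):
    """Select the k most frequent character trigrams in a training set."""
    counts = Counter()
    for document in documents:
        counts.update(char_trigrams(document))
    return [trigram for trigram, _ in counts.most_common(k)]


def vectorise(text, vocabulary):
    """Represent a text by the relative frequency of each selected trigram."""
    counts = Counter(char_trigrams(text))
    total = sum(counts.values()) or 1  # avoid division by zero on short texts
    return [counts[trigram] / total for trigram in vocabulary]


vocabulary = top_trigrams(["aaaa", "aaab"], k=1)
vector = vectorise("aaab", ["aaa", "aab", "zzz"])
```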
    As can be seen in the early bird results, the best results were obtained for Twitter,
both in English and Spanish, with no large differences between the two languages. For
blogs, there are similar results for gender identification, but for age and joint
identification the best results were obtained on the Spanish partition. The English blogs
subcorpus is the one with the lowest results in age and joint identification, together
with social media in English and hotel reviews in joint identification. For social media,
the results are better in Spanish than in English for all predictions. Spanish social
media got one of the highest accuracies in gender identification, together with hotel
reviews and Twitter texts. With respect to hotel reviews, gender accuracies are close
to those for Twitter, but age and joint identification are among the lowest of all
subcorpora. The highest values were obtained by shrestha14 [18] on Spanish Twitter, with
0.8846 in gender identification, 0.6923 in age identification and 0.6154 in joint
identification of both age and gender.

Table 6. Evaluation results for early birds in social media in terms of accuracy on English (left)
and Spanish (right) texts.
                   English                                           Spanish
Team                Joint    Gender       Age     Team                Joint     Gender      Age
liau14             0.2153     0.5390    0.3728    shrestha14         0.3033     0.6803     0.4016
shrestha14         0.2009     0.5332    0.3627    liau14             0.2787     0.7295     0.4262
lopezmonroy14      0.1893     0.5332    0.3338    lopezmonroy14      0.2377     0.6639     0.3689
castillojuarez14   0.1517     0.5231    0.3035    marquardt14        0.1639     0.6803     0.2705
marquardt14        0.1517     0.5260    0.2717    baker14            0.1557     0.5000     0.3115
ashok14            0.1272     0.5072    0.2558    castillojuarez14   0.0656     0.4754     0.2049
baker14            0.1257     0.5000    0.2529    ashok14               -          -          -



Table 7. Evaluation results for early birds in blogs in terms of accuracy on English (left) and
Spanish (right) texts.
                   English                                           Spanish
Team                Joint    Gender       Age     Team                Joint     Gender      Age
lopezmonroy14      0.2083     0.6250    0.2500    lopezmonroy14      0.3571     0.5000     0.4286
liau14             0.1667     0.5000    0.2083    marquardt14        0.2857     0.6429     0.3571
ashok14            0.1667     0.4583    0.1667    shrestha14         0.2857     0.5714     0.4286
shrestha14         0.1667     0.5417    0.2500    castillojuarez14   0.2143     0.5000     0.3571
marquardt14        0.1250     0.5417    0.2500    baker14            0.1429     0.5000     0.2857
castillojuarez14   0.0833     0.5833    0.2500    liau14             0.0714     0.4286     0.2857
baker14            0.0417     0.5000    0.2083    ashok14               -          -          -



Table 8. Evaluation results for early birds in Twitter in terms of accuracy on English (left) and
Spanish (right) texts.
                   English                                           Spanish
Team                Joint    Gender       Age     Team                Joint     Gender      Age
lopezmonroy14      0.5333     0.7667    0.6333    shrestha14         0.6154     0.8846     0.6923
shrestha14         0.4000     0.7333    0.4333    lopezmonroy14      0.5385     0.7692     0.5769
liau14             0.3667     0.6667    0.5667    liau14             0.3846     0.6923     0.5385
marquardt14        0.3000     0.5667    0.5333    marquardt14        0.3846     0.7692     0.5000
baker14            0.2667     0.5333    0.5000    baker14            0.1923     0.5000     0.4615
ashok14            0.2333     0.5000    0.4667    ashok14               -          -          -
castillojuarez14      -          -         -      castillojuarez14      -          -          -



Table 9. Evaluation results for early birds in hotel reviews in terms of accuracy on English texts.
                                             English
                         Team                 Joint    Gender      Age
                         liau14              0.2622    0.7317     0.3415
                         lopezmonroy14       0.2500    0.6524     0.3720
                         shrestha14          0.2012    0.6280     0.2805
                         marquardt14         0.1585    0.5976     0.2561
                         ashok14             0.1220    0.5854     0.2317
                         baker14             0.1037    0.5427     0.2439
                         castillojuarez14    0.0854    0.4756     0.1951


     As with the early birds, the best results in the final evaluation were achieved on
Twitter. Here, gender identification accuracies are higher in English, whereas age and
joint identification accuracies are higher in Spanish. In any case, all the results are
much lower than those of the early birds, where the size of the set was approximately 10%.
With respect to the blogs, the best results in gender identification were achieved in
English and the best results in age identification in Spanish. Although joint
identification obtained similar values in both languages, more participants achieved
higher results in English. The lowest accuracy for gender identification was reported
for the Spanish blogs, with values very close to random chance. These results are even
worse than those of the early birds. Most of the participants obtained better results
for English than in the early birds, except marquardt14 [19], who obtained worse results.
Results in social media and hotel reviews are very similar to those of the early birds,
probably because of the large number of authors. The results for blogs are very similar
to those for social media in the case of age identification. The lowest results in joint
identification were achieved in English social media and in hotel reviews, where the
lowest results in age identification were also obtained. The lowest results in gender
identification were achieved in English blogs, with values very close to random chance.
In contrast, the highest results for gender identification were achieved in hotel reviews
and on Twitter. The high ranking of the baseline approach in hotel reviews is noteworthy,
with a gender identification accuracy of 0.6626 and a joint identification accuracy in the
middle of the ranking.
     The highest effectiveness values were achieved by liau14 in gender identification
on English Twitter (accuracy of 0.7338) and by shrestha14 [18] in both age identification
(accuracy of 0.6111) and joint identification (accuracy of 0.4333) on Spanish Twitter.
It is difficult to establish a correlation between approaches and results, but looking
at the three highest accuracies per subcorpus and task (gender, age and joint
identification), it seems that overall simple content features such as bag-of-words or
word n-grams achieve the best results. In particular, the bag-of-words used by liau14,
the word n-grams used by shrestha14 [18] and the term vector model used by
villenaroman14 [30] achieved the best results for almost all genres. Also noteworthy is
the contribution of the IR features used by weren14 [32] in all three identifications in
English blogs, joint identification in English social media, age identification in
Spanish Twitter, Spanish social media and hotel reviews, and gender identification in
Spanish blogs. The mix of content and style features of marquardt14 [19] gave good
results in gender identification in Spanish Twitter and in
the three identifications in Spanish blogs. The second rank in gender identification in
Spanish social media was obtained by the character n-grams baseline, but its low rankings
in the other subcorpora suggest that character n-grams are not a good approach for author
profiling in general. The overall best performance was obtained by lopezmonroy14 [17],
employing a second order representation based on terms. Table 14 shows the joint
identification accuracies per subcorpus and their average.

Table 10. Evaluation results in social media in terms of accuracy on English (left) and Spanish
(right) texts.
                   English                                          Spanish
Team                Joint    Gender      Age     Team                Joint    Gender      Age
shrestha14         0.2062    0.5382    0.3652    liau14             0.3357    0.6837     0.4894
liau14             0.1952    0.5385    0.3605    shrestha14         0.2845    0.6449     0.4276
weren14            0.1914    0.5361    0.3489    lopezmonroy14      0.2809    0.6431     0.4523
villenaroman14     0.1905    0.5421    0.3581    weren14            0.2792    0.6307     0.4382
lopezmonroy14      0.1902    0.5237    0.3552    marquardt14        0.2102    0.6431     0.3445
castillojuarez14   0.1445    0.5053    0.2855    villenaroman14     0.1961    0.5724     0.3622
marquardt14        0.1428    0.5216    0.2701    baseline           0.1820    0.6555     0.2862
ashok14            0.1318    0.5198    0.2515    baker14            0.1678    0.5000     0.3445
baker14            0.1277    0.5012    0.2494    castillojuarez14   0.1254    0.4982     0.2509
mechti14           0.1244    0.5198    0.2355    mechti14           0.1060    0.5919     0.2191
baseline           0.0930    0.5074    0.1925    ashok14               -         -          -



Table 11. Evaluation results in blogs in terms of accuracy on English (left) and Spanish (right)
texts.
                   English                                          Spanish
Team                Joint    Gender      Age     Team                Joint    Gender      Age
lopezmonroy14      0.3077    0.6795    0.3974    lopezmonroy14      0.3214    0.5893     0.4821
villenaroman14     0.3077    0.6410    0.3974    marquardt14        0.2679    0.5179     0.4821
weren14            0.2949    0.6410    0.4615    shrestha14         0.2500    0.4286     0.4643
liau14             0.2692    0.6538    0.3462    baker14            0.2321    0.5000     0.4464
shrestha14         0.2308    0.5769    0.3846    liau14             0.2321    0.5000     0.4464
castillojuarez14   0.1795    0.5128    0.3333    villenaroman14     0.2321    0.5179     0.4643
ashok14            0.1282    0.4231    0.2564    mechti14           0.1786    0.5000     0.2857
baker14            0.1282    0.5000    0.2949    weren14            0.1786    0.5357     0.2500
marquardt14        0.1282    0.4615    0.2692    castillojuarez14   0.0893    0.4464     0.2679
baseline           0.0897    0.5769    0.1410    baseline           0.0536    0.5357     0.1607
mechti14           0.0897    0.5897    0.1795    ashok14               -         -          -



Table 12. Evaluation results in Twitter in terms of accuracy on English (left) and Spanish (right)
texts.
                   English                                            Spanish
Team                Joint     Gender        Age     Team               Joint    Gender        Age
lopezmonroy14      0.3571     0.7208     0.4935     shrestha14        0.4333    0.6556     0.6111
liau14             0.3506     0.7338     0.5065     lopezmonroy14     0.3444    0.6000     0.5333
shrestha14         0.3052     0.6688     0.4416     liau14            0.3222    0.6333     0.5000
villenaroman14     0.2078     0.5130     0.4156     marquardt14       0.3111    0.6111     0.5222
weren14            0.2013     0.5714     0.3312     weren14           0.2778    0.5333     0.5222
ashok14            0.1948     0.5000     0.3896     villenaroman14    0.2667    0.5444     0.5000
marquardt14        0.1948     0.5260     0.3766     baseline          0.2333    0.4778     0.4667
baker14            0.1688     0.5065     0.3377     baker14           0.2111    0.5000     0.4889
baseline           0.1494     0.5974     0.2792     mechti14          0.1444    0.5111     0.2222
mechti14           0.0584     0.5390     0.1104     ashok14              -         -          -
castillojuarez14      -          -          -       castillojuarez14     -         -          -



       Table 13. Evaluation results in hotel reviews in terms of accuracy on English texts.
                                             English
                         Team                 Joint     Gender       Age
                         liau14              0.2564     0.7259   0.3502
                         lopezmonroy14       0.2247     0.6809   0.3337
                         shrestha14          0.2223     0.6687   0.3331
                         weren14             0.2211     0.6778   0.3343
                         villenaroman14      0.2199     0.6845   0.3143
                         baseline            0.1821     0.6626   0.2753
                         marquardt14         0.1437     0.5700   0.2436
                         baker14             0.1382     0.5292   0.2594
                         ashok14             0.1291     0.5189   0.2454
                         castillojuarez14    0.1236     0.5091   0.2418
                         mechti14            0.0451     0.5012   0.0901


    Table 14 shows the joint identification accuracies per subcorpus and their average.
From this table we can infer that: a) the best results were obtained on Twitter, possibly
due to the higher number of documents (tweets) per author in comparison to the other
genres, and quite likely also to the spontaneous way in which people express themselves
there; b) the lowest results were achieved in English social media and hotel reviews, due
to the lowest results in gender identification in the former and age identification in
the latter.



                         Table 14. Average results in terms of accuracy.
Ranking    Team         Average    Social Media        Blogs             Twitter        Reviews
                                   EN       ES       EN      ES        EN      ES         EN
1    lopezmonroy14      0.2895    0.1902   0.2809   0.3077   0.3214   0.3571   0.3444   0.2247
2    liau14             0.2802    0.1952   0.3357   0.2692   0.2321   0.3506   0.3222   0.2564
3    shrestha14         0.2760    0.2062   0.2845   0.2308   0.2500   0.3052   0.4333   0.2223
4    weren14            0.2349    0.1914   0.2792   0.2949   0.1786   0.2013   0.2778   0.2211
5    villenaroman14     0.2315    0.1905   0.1961   0.3077   0.2321   0.2078   0.2667   0.2199
6    marquardt14        0.1998    0.1428   0.2102   0.1282   0.2679   0.1948   0.3111   0.1437
7    baker14            0.1677    0.1277   0.1678   0.1282   0.2321   0.1688   0.2111   0.1382
8    baseline           0.1404    0.0930   0.1820   0.0897   0.0536   0.1494   0.2333   0.1821
9    mechti14           0.1067    0.1244   0.1060   0.0897   0.1786   0.0584   0.1444   0.0451
10   castillojuarez14   0.0946    0.1445   0.1254   0.1795   0.0893     -        -      0.1236
11   ashok14            0.0834    0.1318     -      0.1282     -      0.1948     -      0.1291
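
The Average column of Table 14 can be reproduced with the following sketch. The treatment of missing runs is an inference from the numbers rather than something stated in the text: the reported averages are consistent with dividing by all seven subcorpora and counting a missing run as zero. The function name is hypothetical.

```python
from typing import Optional, Sequence


def average_joint_accuracy(per_subcorpus: Sequence[Optional[float]]) -> float:
    """Mean joint accuracy over all subcorpora.

    Assumption (inferred from Table 14): a missing run, given as None,
    contributes 0.0 and still counts in the denominator.
    """
    if not per_subcorpus:
        raise ValueError("at least one subcorpus is required")
    total = sum(acc if acc is not None else 0.0 for acc in per_subcorpus)
    return total / len(per_subcorpus)


# Order: social media EN, ES; blogs EN, ES; Twitter EN, ES; reviews EN.
lopezmonroy14 = [0.1902, 0.2809, 0.3077, 0.3214, 0.3571, 0.3444, 0.2247]
ashok14 = [0.1318, None, 0.1282, None, 0.1948, None, 0.1291]

print(round(average_joint_accuracy(lopezmonroy14), 4))  # 0.2895
print(round(average_joint_accuracy(ashok14), 4))        # 0.0834
```

Under this reading, teams that skipped a subcorpus or a language are penalised in the overall ranking, which explains the low positions of castillojuarez14 and ashok14.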


    Figure 1 shows the average and standard deviation of the distances between predicted
and true classes per subcorpus. The highest average distance is produced for reviews,
with a value of 1.69. The lowest average distance and standard deviation are produced
for Twitter. The similarity in distances between the social media subcorpora and the
Spanish blogs is noteworthy. The complete list of distances among participants for each
subcorpus is shown in Appendix B.




             Figure 1. Distances between predicted and true classes per subcorpus.
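
A distance of this kind can be sketched as follows. The sketch rests on two assumptions that are not confirmed by this section: that the age classes are ordered as listed in AGE_CLASSES, and that the distance between two classes is the absolute difference of their positions in that order. All names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

# Assumed ordered age classes; adjust to the classes actually used.
AGE_CLASSES = ["18-24", "25-34", "35-49", "50-64", "65+"]


def class_distances(truth: Sequence[str], predicted: Sequence[str],
                    classes: Sequence[str] = AGE_CLASSES) -> List[int]:
    """Absolute difference between the positions of true and predicted classes."""
    index = {label: i for i, label in enumerate(classes)}
    return [abs(index[t] - index[p]) for t, p in zip(truth, predicted)]


def distance_summary(truth: Sequence[str],
                     predicted: Sequence[str]) -> Tuple[float, float]:
    """Average and (population) standard deviation of the class distances."""
    distances = class_distances(truth, predicted)
    return mean(distances), pstdev(distances)
```

With such a measure, predicting 25-34 for an author aged 18-24 is penalised less than predicting 65+, which plain accuracy does not distinguish.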


    Appendix A details the statistical significance of all pairwise system comparisons.
As can be seen in Table A17, although lopezmonroy14 ranks first overall, this system is
not significantly different from shrestha14, villenaroman14 and weren14. All systems are
significantly different from the baseline, although weren14, villenaroman14 and
marquardt14 form a group close to the baseline. It is noteworthy that
most of the systems are statistically indistinguishable regarding English social media,
Spanish Twitter, and blogs (both languages).
     With respect to age identification, all systems are significantly different from the
baseline except ashok14 (the latter team did not participate in the Spanish task). For
some pairs of systems the differences are not statistically significant, such as
lopezmonroy14 and liau14 or weren14 and villenaroman14. In blogs most of the systems are
indistinguishable from one another but significantly different from the baseline. On the
other subcorpora, most of the systems are also different from the baseline. Together
with the accuracies, these results show that most of the systems work significantly
better than the baseline in age identification.
     With respect to gender identification, all the systems are significantly different
from the baseline, but lopezmonroy14, marquardt14, shrestha14, villenaroman14 and
weren14 form a closer group. In English social media, English and Spanish blogs and
Spanish Twitter, most of the systems are not significantly different from one another.
Thus, although all the systems differ from the baseline, most of them are statistically
indistinguishable, and we cannot conclude that the systems perform better or worse than
the baseline in gender identification. For example, in English social media all systems
that are different from the baseline performed better in gender identification, and in
Twitter most of them performed better, but for Spanish social media the opposite
happened and all the systems performed worse. The same happened in hotel reviews (in
English), where most of the systems performed worse.
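
A pairwise comparison of this kind can be sketched with an approximate randomization test. This is an assumption: the section does not name the test that was used, and the sketch only follows the family of computer-intensive methods cited in the bibliography [24]. The function name, the number of trials and the seed are hypothetical choices.

```python
import random
from typing import Sequence


def approximate_randomization(correct_a: Sequence[bool],
                              correct_b: Sequence[bool],
                              trials: int = 10000,
                              seed: int = 0) -> float:
    """Two-sided p-value for the accuracy difference of two systems.

    correct_a[i] and correct_b[i] state whether system A and system B
    classified author i correctly. In every trial, the outcomes of the
    two systems are swapped per author with probability 0.5.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("both systems need outcomes for the same authors")
    rng = random.Random(seed)
    observed = abs(sum(correct_a) - sum(correct_b))
    at_least_as_extreme = 0
    for _ in range(trials):
        diff = 0
        for a, b in zip(correct_a, correct_b):
            if a != b:  # identical outcomes never change the difference
                if rng.random() < 0.5:
                    a, b = b, a
                diff += int(a) - int(b)
        if abs(diff) >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (trials + 1)
```

Because only the authors on which the two systems disagree contribute to the statistic, systems with similar accuracies on a small subcorpus such as blogs will rarely be separated by such a test.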
     Table 15 shows the runtime results. The fastest team was liau14 with bag-of-words
features. With regard to the smallest data sets (Twitter and blogs), we can distinguish
two groups by runtime. The fastest teams used bag-of-words (liau14), word n-grams [18],
style features [4], style and content features [20] or, in some cases, the second order
features of [17]. For the largest subcorpora, such as social media and reviews, the
differences among runtimes are more evident. The fastest teams again used simple content
features and, in some cases, stylistic ones. The slowest ones, by a wide margin, used
IR-based features [32], parts-of-speech [13] or combinations of style and content-based
features [19]. One of the slowest approaches [30] used term vectors, but the team
reported that the slow runtime was due to the Weka library.

                     Table 15. Runtime performance (efficiency) per subcorpus.
 Team                       Twitter                   Blogs                 Social Media             Reviews
                       EN             ES         EN             ES         EN           ES               EN
 ashok14            3:23:36.00            -      5:57.22            -   18:26:49.00            -     19:03.24
 baker14               5:43.02      3:52.23      0:56.05      0:39.77    2:24:15.00     18:01.23      1:21.96
 castillojuarez14            -            -      5:13.49      0:59.76   11:36:32.00     20:23.85     18:06.34
 liau14                0:55.39      0:27.29      0:06.02      0:04.30      12:53.09      0:27.05      0:12.65
 lopezmonroy14         7:02.91      5:36.05      3:47.04      3:22.02      34:06.53      6:25.89      4:01.40
 marquardt14        1:47:15.00     35:06.63   not-known       7:36.18   36:05:51.00   2:08:14.00   5:44:45.00
 mechti14              8:12.00      0:32.00      4:13.00      0:11.00    2:43:56.00      1:24.00   1:21:33.00
 shrestha14            2:31.40      1:10.59      1:56.50      0:39.83      26:31.50      3:26.41      2:13.22
 villenaroman14     1:12:22.00     38:28.70     10:06.74      8:04.18   69:55:12.00   9:14:15.00   5:38:07.00
 weren14              41:32.38   1:33:48.00      4:46.46      4:06.79   30:18:02.00   2:34:33.00   1:17:29.00
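
To compare the runtimes in Table 15 numerically, the entries can be converted to seconds. The sketch below assumes that an entry with two colons reads as hours:minutes:seconds and an entry with one colon as minutes:seconds; the table does not state its units, so this reading and the function name are assumptions.

```python
from typing import Optional


def runtime_to_seconds(entry: str) -> Optional[float]:
    """Convert a Table 15 entry such as '3:23:36.00' or '0:55.39' to seconds.

    Returns None for missing entries ('-' or 'not-known').
    Assumption: fields are separated by ':' and ordered from the largest
    unit (hours or minutes) to seconds.
    """
    entry = entry.strip()
    if entry in {"", "-", "not-known"}:
        return None
    seconds = 0.0
    for field in entry.split(":"):
        seconds = seconds * 60 + float(field)
    return seconds


# Example: ratio between the slowest and the fastest run on English social media.
slowest = runtime_to_seconds("69:55:12.00")  # villenaroman14
fastest = runtime_to_seconds("12:53.09")     # liau14
ratio = slowest / fastest
```

Under this reading, the slowest run on English social media took several hundred times longer than the fastest one.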



    We executed the PAN-AP 2013 approaches for gender identification on the social media
documents of PAN-AP 2014 (social media was the data used in PAN-AP 2013). A comparison
for age identification was not possible due to the different age classes in PAN-AP 2013
and PAN-AP 2014. Most of the approaches failed at execution time, so we only show those
which could be executed. The only team with results for both years is lopezmonroy (see
footnote 18). Table 16 shows the comparison. In English, although the best result was
obtained by lopezmonroy13 [16], the majority of the PAN-AP 2014 approaches obtained
better results than those of PAN-AP 2013. In Spanish, results are more balanced between
teams of the two years, although the two best results were obtained by cagnina13 and
haro13 [7], respectively. The high number of approaches below the baseline in Spanish is
noteworthy, as are the higher accuracies obtained in Spanish than in English (Spanish
being a gender-marked language). With respect to the team that participated in both
years, lopezmonroy13 achieved better results than lopezmonroy14 in English but not in
Spanish.

Table 16. PAN-AP 2013 approaches evaluation results in PAN-AP 2014 social media in terms of
accuracy on English (left) and Spanish (right) texts (gender identification).

                       English                                   Spanish
             Team                Gender                Team                Gender
             lopezmonroy13       0.5438                cagnina13            0.6943
             villenaroman14      0.5421                haro13               0.6855
             liau14              0.5385                liau14               0.6837
             shrestha14          0.5382                baseline             0.6555
             weren14             0.5361                shrestha14           0.6449
             cagnina13           0.5287                lopezmonroy14        0.6431
             lopezmonroy14       0.5237                marquardt14          0.6431
             marquardt14         0.5216                lopezmonroy13        0.6336
             ashok14             0.5198                weren14              0.6307
             mechti14            0.5198                jimenez13            0.6237
             baseline            0.5074                mechti14             0.5919
             castillojuarez14    0.5053                villenaroman14       0.5724
             haro13              0.5036                ramirez13            0.5459
             baker14             0.5012                baker14              0.5000
             ramirez13           0.4982                castillojuarez14     0.4982
             jimenez13           0.4967
             patra13             0.4917



6      Conclusion
In this paper we present the results of the 2nd International Author Profiling Task at
PAN-2014 within CLEF-2014. Given four different genres, namely, social media, blogs,
Twitter, and hotel reviews, in the two languages English and Spanish, the 10 participants
of the task had to identify gender and age of anonymous authors.
18
     The lopezmonroy team was identified as pastor in PAN-AP 2013 (the team obtaining the
     best performance).

    The participants used several different features to approach the problem:
content-based (bag of words, word n-grams, term vectors, named entities, dictionary
words, slang words, contractions, sentiment words, and so on) and style-based
(frequencies, punctuation, POS, HTML use, readability measures and many different
statistics). One participant [32] also combined many different IR-based features such as
the cosine similarity or Okapi BM25. This evaluation showed that, apart from the second
order representation in [17] and the IR-based features in [32], good results were
obtained by approaches that used simple content features, for example bag-of-words
(liau14), word n-grams [18] and term vectors [30]. Character n-grams proved not to be a
good approach for author profiling in general. The best results were obtained with a
second order representation based on relationships among terms, documents, profiles
and subprofiles [17].
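
The simple content features mentioned above can be illustrated with a short sketch of word n-gram counts and the cosine similarity between two count vectors. It is not the implementation of any participant: the tokenisation (lowercasing and splitting on word characters) and both function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def word_ngrams(text: str, n: int = 1) -> Counter:
    """Count word n-grams; n=1 yields a bag-of-words.

    Assumption: tokens are lowercased runs of word characters.
    """
    tokens = re.findall(r"\w+", text.lower())
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A document could, for instance, be compared with the aggregated counts of each profile, such as all training texts of one gender and age class, and assigned to the most similar one.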
    We draw the following conclusions with respect to the different corpus parts: a) the
highest accuracies were achieved on Twitter. We think this is due to the larger number
of documents (tweets) per profile and the more spontaneous way of communicating in this
social medium; b) the lowest results were obtained in English social media and hotel
reviews, due to the lowest results in gender and age identification respectively; c) the
highest distance between predicted and true classes in age identification occurs in hotel
reviews. Further analysis is needed in order to understand whether, for instance, there
are cases of deceptive opinions.
Acknowledgements The PAN task on author profiling has been organised in the
framework of the WIQ-EI IRSES project (Grant No. 269180) within the FP 7 Marie
Curie People Framework of the European Commission. We would like to thank Atribus
by Corex for sponsoring the award for the winning team. We thank Julio Gonzalo,
Jorge Carrillo and Damiano Spina from UNED for helping with the Twitter subcorpus.
The work of the first author was partially funded by Autoritas Consulting SA and
by Ministerio de Economía y Competitividad de España under grants ECOPORTUNITY
IPT-2012-1220-430000 and CSO2013-43054-R. The work of the second author was carried
out in the framework of the DIANA-APPLICATIONS-Finding Hidden Knowledge in Texts:
Applications (TIN2012-38603-C02-01) project, and the VLC/CAMPUS Microcluster on
Multimodal Interaction in Intelligent Systems.


Bibliography
 1. Yuridiana Aleman, Nahun Loya, Darnes Vilarino Ayala, and David Pinto. Two
    Methodologies Applied to the Author Profiling Task—Notebook for PAN at
    CLEF 2013. In Forner et al. [8].
 2. Enrique Amigó, Jorge Carrillo-de-Albornoz, Irina Chugur, Adolfo Corujo, Julio
    Gonzalo, Edgar Meij, Maarten de Rijke, and Damiano Spina. Overview of
    RepLab 2014: author profiling and reputation dimensions for Online Reputation
    Management. In Proceedings of the Fifth International Conference of the CLEF
    Initiative, September 2014.
 3. Shlomo Argamon, Moshe Koppel, Jonathan Fine, and Anat Rachel Shimoni.
    Gender, genre, and writing style in formal written texts. TEXT, 23:321–346, 2003.
 4. Christopher Ian Baker. Proof of Concept Framework for Prediction—Notebook
    for PAN at CLEF 2014. In Cappellato et al. [6].
 5. John D. Burger, John Henderson, George Kim, and Guido Zarrella.
    Discriminating gender on twitter. In Proceedings of the Conference on Empirical
    Methods in Natural Language Processing, EMNLP ’11, pages 1301–1309,
    Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
 6. Linda Cappellato, Nicola Ferro, Martin Halvey, and Wessel Kraaij, editors. CLEF
    2014 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings
    (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org/Vol-1180/, 2014.
 7. Fermin Cruz, Rafa Haro, and Javier Ortega. ITALICA at PAN 2013: An Ensemble
    Learning Approach to Author Profiling—Notebook for PAN at CLEF 2013. In
    Forner et al. [8].
 8. Pamela Forner, Roberto Navigli, and Dan Tufis, editors. CLEF 2013 Evaluation
    Labs and Workshop – Working Notes Papers, 23-26 September, Valencia, Spain,
    2013.
 9. Tim Gollub, Benno Stein, and Steven Burrows. Ousting Ivory Tower Research:
    Towards a Web Framework for Providing Experiments as a Service. In Bill Hersh,
    Jamie Callan, Yoelle Maarek, and Mark Sanderson, editors, 35th International
    ACM Conference on Research and Development in Information Retrieval (SIGIR
    12), pages 1125–1126. ACM, August 2012. ISBN 978-1-4503-1472-5. doi:
    http://dx.doi.org/10.1145/2348283.2348501.
10. Tim Gollub, Benno Stein, Steven Burrows, and Dennis Hoppe. TIRA:
    Configuring, Executing, and Disseminating Information Retrieval Experiments. In
    A Min Tjoa, Stephen Liddle, Klaus-Dieter Schewe, and Xiaofang Zhou, editors,
    9th International Workshop on Text-based Information Retrieval (TIR 12) at
    DEXA, pages 151–155, Los Alamitos, California, September 2012. IEEE. ISBN
    978-1-4673-2621-6. doi:
    http://doi.ieeecomputersociety.org/10.1109/DEXA.2012.55.
11. Tim Gollub, Martin Potthast, Anna Beyer, Matthias Busse, Francisco Rangel,
    Paolo Rosso, Efstathios Stamatatos, and Benno Stein. Recent Trends in Digital
    Text Forensics and its Evaluation. In Pamela Forner, Henning Müller, Roberto
    Paredes, Paolo Rosso, and Benno Stein, editors, Information Access Evaluation
    meets Multilinguality, Multimodality, and Visualization. 4th International
    Conference of the CLEF Initiative (CLEF 13), pages 282–302, Berlin Heidelberg
    New York, September 2013. Springer. ISBN 978-3-642-40801-4. doi:
    http://dx.doi.org/10.1007/978-3-642-40802-1_28.
12. Sumit Goswami, Sudeshna Sarkar, and Mayur Rustagi. Stylometric analysis of
    bloggers’ age and gender. In Eytan Adar, Matthew Hurst, Tim Finin, Natalie S.
    Glance, Nicolas Nicolov, and Belle L. Tseng, editors, ICWSM. The AAAI Press,
    2009.
13. Gilad Gressel, Hrudya P, Surendran K, Thara S, Aravind A, and Prabaharan
    Poornachandran. Ensemble Learning Approach for Author Profiling—Notebook
    for PAN at CLEF 2014. In Cappellato et al. [6].
14. Janet Holmes and Miriam Meyerhoff. The Handbook of Language and Gender.
    Blackwell Handbooks in Linguistics. Wiley, 2003.
15. Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically
    categorizing written texts by author gender. Literary and Linguistic Computing,
    17(4), 2002.
16. A. Pastor Lopez-Monroy, Manuel Montes-Y-Gomez, Hugo Jair Escalante, Luis
    Villasenor-Pineda, and Esau Villatoro-Tello. INAOE’s Participation at PAN’13:
    Author Profiling task—Notebook for PAN at CLEF 2013. In Forner et al. [8].
17. A. Pastor López-Monroy, Manuel Montes y Gómez, Hugo Jair-Escalante, and
    Luis Villaseñor Pineda. Using Intra-Profile Information for Author
    Profiling—Notebook for PAN at CLEF 2014. In Cappellato et al. [6].
18. Suraj Maharjan, Prasha Shrestha, and Thamar Solorio. A Simple Approach to
    Author Profiling in MapReduce—Notebook for PAN at CLEF 2014. In
    Cappellato et al. [6].
19. James Marquardt, Golnoosh Farnadi, Gayathri Vasudevan, Marie-Francine
    Moens, Sergio Davalos, Ankur Teredesai, and Martine De Cock. Age and Gender
    Identification in Social Media—Notebook for PAN at CLEF 2014. In Cappellato
    et al. [6].
20. Seifeddine Mechti, Maher Jaoua, and Lamia Hadrich Belguith. Machine learning
    for classifying authors of anonymous tweets, blogs and reviews—Notebook for
    PAN at CLEF 2014. In Cappellato et al. [6].
21. Michal Meina, Karolina Brodzinska, Bartosz Celmer, Maja Czokow, Martyna
    Patera, Jakub Pezacki, and Mateusz Wilk. Ensemble-based Classification for
    Author Profiling Using Various Features—Notebook for PAN at CLEF 2013. In
    Forner et al. [8].
22. Dong Nguyen, Noah A. Smith, and Carolyn P. Rosé. Author age prediction from
    text using linear regression. In Proceedings of the 5th ACL-HLT Workshop on
    Language Technology for Cultural Heritage, Social Sciences, and Humanities,
    LaTeCH ’11, pages 115–123, Stroudsburg, PA, USA, 2011. Association for
    Computational Linguistics.
23. Dong Nguyen, Rilana Gravel, Dolf Trieschnigg, and Theo Meder. "how old do
    you think i am?"; a study of language and age in twitter. Proceedings of the
    Seventh International AAAI Conference on Weblogs and Social Media, 2013.
24. Eric W. Noreen. Computer intensive methods for testing hypotheses: an
    introduction. Wiley, New York, 1989.
25. Claudia Peersman, Walter Daelemans, and Leona Van Vaerenbergh. Predicting
    age and gender in online social networks. In Proceedings of the 3rd international
    workshop on Search and mining user-generated contents, SMUC ’11, pages
    37–44, New York, NY, USA, 2011. ACM.
26. James W. Pennebaker. The Secret Life of Pronouns: What Our Words Say About
    Us. Bloomsbury USA, 2013.
27. James W. Pennebaker, Mathias R. Mehl, and Kate G. Niederhoffer. Psychological
    aspects of natural language use: Our words, our selves. Annual review of
    psychology, 54(1):547–577, 2003.
28. Francisco Rangel, Paolo Rosso, Moshe Koppel, Efstathios Stamatatos, and
    Giacomo Inches. Overview of the Author Profiling Task at PAN
    2013—Notebook for PAN at CLEF 2013. In Forner et al. [8].
29. Jonathan Schler, Moshe Koppel, Shlomo Argamon, and James W. Pennebaker.
    Effects of age and gender on blogging. In AAAI Spring Symposium:
    Computational Approaches to Analyzing Weblogs, pages 199–205. AAAI, 2006.
30. Julio Villena-Román and José-Carlos González-Cristóbal. DAEDALUS at PAN
    2014: Guessing Tweet Author’s Gender and Age—Notebook for PAN at CLEF
    2014. In Cappellato et al. [6].
31. Hongning Wang, Yue Lu, and Chengxiang Zhai. Latent Aspect Rating Analysis
    on Review Text Data: A Rating Regression Approach. In Proceedings of the 16th
    ACM SIGKDD International Conference on Knowledge Discovery and Data
    Mining, pages 783–792, 2010.
32. Edson R.D. Weren, Viviane P. Moreira, and José P.M. de Oliveira. Exploring
    Information Retrieval features for Author Profiling—Notebook for PAN at CLEF
    2014. In Cappellato et al. [6].
33. Alexander Yeh. More accurate tests for the statistical significance of result
    differences. In Proceedings of the 18th Conference on Computational Linguistics
    - Volume 2, pages 947–953, Stroudsburg, PA, USA, 2000. Association for
    Computational Linguistics.
34. Cathy Zhang and Pengyu Zhang. Predicting gender from blog posts. Technical
    report, University of Massachusetts Amherst, USA, 2010.




Appendix A                 Pairwise Comparison of All Systems
For all subsequent tables, the significance levels are encoded as follows:
     Symbol            Significance Level
         =                    p > 0.05        ∼    not significant
         *             0.05 ≥ p > 0.01        ∼    significant
         **            0.01 ≥ p > 0.001       ∼    very significant
        ***                   p ≤ 0.001       ∼    highly significant
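
As an illustration of this encoding, the following minimal sketch maps a p-value
to its symbol. The function name significance_symbol is hypothetical and is not
part of the task's evaluation software; only the thresholds are taken from the
list above.

```python
def significance_symbol(p: float) -> str:
    """Map a p-value to the symbol used in the tables of this appendix.

    Hypothetical helper for illustration; the thresholds follow the
    legend above (0.05, 0.01, 0.001).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p > 0.05:
        return "="    # not significant
    if p > 0.01:
        return "*"    # significant: 0.05 >= p > 0.01
    if p > 0.001:
        return "**"   # very significant: 0.01 >= p > 0.001
    return "***"      # highly significant: p <= 0.001
```

Note that the boundary values fall into the stricter class: p = 0.05 is encoded
as *, p = 0.01 as **, and p = 0.001 as ***.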



Table A1. Significance of accuracy differences between system pairs for age identification in the
entire corpus.
                 ashok baker castillojuarez liau lopezmonroy marquardt mechti shrestha villenaroman weren baseline
ashok                   ***       ***       ***      ***       ***      ***     ***         ***      ***     =
baker                               =       ***      ***        =       ***     ***         ***      ***    ***
castillojuarez                              ***      ***        *       ***     ***         ***      ***    ***
liau                                                  =        ***      ***      =          ***       **    ***
lopezmonroy                                                    ***      ***      =            *       =     ***
marquardt                                                               ***     ***         ***      ***    ***
mechti                                                                          ***         ***      ***    ***
shrestha                                                                                     **       =     ***
villenaroman                                                                                          =     ***
weren                                                                                                       ***
baseline




Table A2. Significance of accuracy differences between system pairs for age identification in
English social media.
                 ashok baker castillojuarez liau lopezmonroy marquardt mechti shrestha villenaroman weren baseline
ashok                    =        ***       ***      ***        **       *      ***         ***      ***    ***
baker                              **       ***      ***        =        =      ***         ***      ***    ***
castillojuarez                              ***      ***        =       ***     ***         ***      ***    ***
liau                                                  =        ***      ***      =           =        =     ***
lopezmonroy                                                    ***      ***      =           =        =     ***
marquardt                                                               ***     ***         ***      ***    ***
mechti                                                                          ***         ***      ***    ***
shrestha                                                                                     =        =     ***
villenaroman                                                                                          =     ***
weren                                                                                                       ***
baseline




Table A3. Significance of accuracy differences between system pairs for age identification in
Spanish social media.
                 ashok baker castillojuarez liau lopezmonroy marquardt mechti shrestha villenaroman weren baseline
ashok                   ***       ***       ***      ***       ***      ***     ***         ***      ***    ***
baker                              **       ***      ***        =       ***      **          =       ***      *
castillojuarez                              ***      ***       ***       =      ***         ***      ***     =
liau                                                  =        ***      ***      **         ***       *     ***
lopezmonroy                                                    ***      ***      =          ***       =     ***
marquardt                                                               ***     ***          =       ***      *
mechti                                                                          ***         ***      ***     **
shrestha                                                                                     **       =     ***
villenaroman                                                                                         **     **
weren                                                                                                       ***
baseline




Table A4. Significance of accuracy differences between system pairs for age identification in
English blogs.
                 ashok baker castillojuarez liau lopezmonroy marquardt mechti shrestha villenaroman weren baseline
ashok                    =          =         =       =         =        =       =           =       **      =
baker                               =         =       =         =        =       =           =       =       **
castillojuarez                                =       =         =        =       =           =       =       **
liau                                                  =         =        =       =           =       =       **
lopezmonroy                                                     =        *       =           =       =      ***
marquardt                                                                =       =           =       **      =
mechti                                                                           *           *       **      =
shrestha                                                                                     =       =       **
villenaroman                                                                                         =      **
weren                                                                                                       ***
baseline




Table A5. Significance of accuracy differences between system pairs for age identification in
Spanish blogs.
                 ashok baker castillojuarez liau lopezmonroy marquardt mechti shrestha villenaroman weren baseline
ashok                   ***       ***       ***      ***       ***      ***     ***         ***      ***     **
baker                               =         =       =         =        =       =           =        *      **
castillojuarez                                =       *         *        =       *           *        =      =
liau                                                  =         =        =       =           =        *      **
lopezmonroy                                                     =        =       =           =        *      **
marquardt                                                                =       =           =       ***    ***
mechti                                                                           =           =        =      =
shrestha                                                                                     =        *     ***
villenaroman                                                                                          *     **
weren                                                                                                        =
baseline




Table A6. Significance of accuracy differences between system pairs for age identification in
English Twitter.
                 ashok baker castillojuarez liau lopezmonroy marquardt mechti shrestha villenaroman weren baseline
ashok                    =        ***         *       =         =       ***      =           =        =      *
baker                             ***        **       **        =       ***      =           =        =      =
castillojuarez                              ***      ***       ***      ***     ***         ***      ***    ***
liau                                                  =          *      ***      =           *       **     ***
lopezmonroy                                                     **      ***      =           =       **     ***
marquardt                                                               ***      =           =        =      =
mechti                                                                          ***         ***      ***    ***
shrestha                                                                                     =        *     ***
villenaroman                                                                                          =     **
weren                                                                                                        =
baseline




Table A7. Significance of accuracy differences between system pairs for age identification in
Spanish Twitter.
                 ashok baker castillojuarez liau lopezmonroy marquardt mechti shrestha villenaroman weren baseline
ashok                   ***         =       ***      ***       ***      ***     ***         ***      ***    ***
baker                             ***         =       =         =        **      *           =        =      =
castillojuarez                              ***      ***       ***      ***     ***         ***      ***    ***
liau                                                  =         =        **      **          =        =      =
lopezmonroy                                                     =       ***      =           =        =      =
marquardt                                                                **      =           =        =      =
mechti                                                                          ***          **      ***     **
shrestha                                                                                     **       =      *
villenaroman                                                                                          =      =
weren                                                                                                        =
baseline




Table A8. Significance of accuracy differences between system pairs for age identification in
English hotel reviews.
                 ashok baker castillojuarez liau lopezmonroy marquardt mechti shrestha villenaroman weren baseline
ashok                    =          =       ***      ***        =       ***     ***         ***      ***     *
baker                               =       ***      ***        =       ***     ***         ***      ***     =
castillojuarez                              ***      ***        =       ***     ***         ***      ***     *
liau                                                  =        ***      ***      =           *        =     ***
lopezmonroy                                                    ***      ***      =           =        =     ***
marquardt                                                               ***     ***         ***      ***     *
mechti                                                                          ***         ***      ***    ***
shrestha                                                                                     =        =     ***
villenaroman                                                                                          =      *
weren                                                                                                       ***
baseline




Table A9. Significance of accuracy differences between system pairs for gender identification in
the entire corpus.
                 ashok baker castillojuarez liau lopezmonroy marquardt mechti shrestha villenaroman weren baseline
ashok                   ***        **       ***      ***       ***      ***     ***         ***      ***    ***
baker                               *       ***      ***       ***       =      ***         ***      ***    ***
castillojuarez                              ***      ***       ***      ***     ***         ***      ***    ***
liau                                                 ***       ***      ***      **         ***       **    ***
lopezmonroy                                                    ***      ***      =           =        =      *
marquardt                                                               ***     ***         ***      ***     *
mechti                                                                          ***         ***      ***    ***
shrestha                                                                                     =        =      *
villenaroman                                                                                          =      *
weren                                                                                                        *
baseline




Table A10. Significance of accuracy differences between system pairs for gender identification
in English social media.
                 ashok baker castillojuarez liau lopezmonroy marquardt mechti shrestha villenaroman weren baseline
ashok                    =          =         =       =         =        =       =           =       =       =
baker                               =        **       =         =        =       **          **      **      =
castillojuarez                               **       =         =        =       **         ***       *      =
liau                                                  =         =        =       =           =       =       *
lopezmonroy                                                     =        =       =           *       =       =
marquardt                                                                =       =           =       =       =
mechti                                                                           =           *       =       =
shrestha                                                                                     =       =       *
villenaroman                                                                                         =      **
weren                                                                                                        *
baseline




Table A11. Significance of accuracy differences between system pairs for gender identification
in Spanish social media.
                 ashok baker castillojuarez liau lopezmonroy marquardt mechti shrestha villenaroman weren baseline
ashok                   ***       ***       ***      ***       ***      ***     ***         ***      ***    ***
baker                               =       ***      ***       ***       **     ***           *      ***    ***
castillojuarez                              ***      ***       ***       **     ***          =       ***    ***
liau                                                  =         =       ***      *          ***       **     =
lopezmonroy                                                     =        *       =           **       =      =
marquardt                                                                *       =           **       =      =
mechti                                                                           *           =        =      *
shrestha                                                                                     **       =      =
villenaroman                                                                                         **     ***
weren                                                                                                        =
baseline




Table A12. Significance of accuracy differences between system pairs for gender identification
in English blogs.
                 ashok baker castillojuarez liau lopezmonroy marquardt mechti shrestha villenaroman weren baseline
ashok                    =          =        **       **        =        =       =           *       **      =
baker                               =         =       *         =        =       =           *       =       =
castillojuarez                                =       *         =        =       =           =       =       =
liau                                                  =          *       =       =           =       =       =
lopezmonroy                                                     **       =       =           =       =       =
marquardt                                                                =       =           *       =       =
mechti                                                                           =           =       =       =
shrestha                                                                                     =       =       =
villenaroman                                                                                         =       =
weren                                                                                                        =
baseline




Table A13. Significance of accuracy differences between system pairs for gender identification
in Spanish blogs.
                 ashok baker castillojuarez liau lopezmonroy marquardt mechti shrestha villenaroman weren baseline
ashok                   ***       ***       ***      ***       ***      ***     ***         ***      ***    ***
baker                               =         =       =         =        =       =           =        =      =
castillojuarez                                =       =         =        =       =           =        =      =
liau                                                  =         =        =       =           =        =      =
lopezmonroy                                                     =        =       =           =        =      =
marquardt                                                                =       =           =        =      =
mechti                                                                           =           =        =      =
shrestha                                                                                     =        =      =
villenaroman                                                                                          =      =
weren                                                                                                        =
baseline




Table A14. Significance of accuracy differences between system pairs for gender identification
in English Twitter.
                 ashok baker castillojuarez liau lopezmonroy marquardt mechti shrestha villenaroman weren baseline
ashok                    =        ***       ***       **        =        =       **          =        =      **
baker                             ***       ***       **        =        =       **          =        =       *
castillojuarez                              ***      ***       ***      ***     ***         ***      ***    ***
liau                                                  =        ***      ***      =          ***      **      **
lopezmonroy                                                    ***      ***      =          ***      **       *
marquardt                                                                =       *           =        =      =
mechti                                                                           *           =        =      =
shrestha                                                                                     **       =      =
villenaroman                                                                                          =      *
weren                                                                                                        =
baseline




Table A15. Significance of accuracy differences between system pairs for gender identification
in Spanish Twitter.
                 ashok baker castillojuarez liau lopezmonroy marquardt mechti shrestha villenaroman weren baseline
ashok                   ***         =       ***      ***       ***      ***     ***         ***      ***    ***
baker                             ***         =       =         =        =       *           =        =      =
castillojuarez                              ***      ***       ***      ***     ***         ***      ***    ***
liau                                                  =         =        =       =           =        =      =
lopezmonroy                                                     =        =       =           =        =      =
marquardt                                                                =       =           =        =      =
mechti                                                                           *           =        =      =
shrestha                                                                                     =        =     **
villenaroman                                                                                          =      =
weren                                                                                                        =
baseline




Table A16. Significance of accuracy differences between system pairs for gender identification
in English hotel reviews.
                 ashok baker castillojuarez liau lopezmonroy marquardt mechti shrestha villenaroman weren baseline
ashok                    =          =       ***      ***       ***       =      ***         ***      ***    ***
baker                               =       ***      ***       ***      ***     ***         ***      ***    ***
castillojuarez                              ***      ***        *        =      ***         ***      ***    ***
liau                                                 ***       ***      ***     ***         ***      ***    ***
lopezmonroy                                                    ***      ***      =           =        =      =
marquardt                                                               ***     ***         ***      ***    ***
mechti                                                                          ***         ***      ***    ***
shrestha                                                                                     =        =      =
villenaroman                                                                                          =      =
weren                                                                                                        =
baseline




Table A17. Significance of accuracy differences between system pairs for joint identification in
the entire corpus.
                 ashok baker castillojuarez liau lopezmonroy marquardt mechti shrestha villenaroman weren baseline
ashok                   ***        **       ***      ***       ***      ***     ***         ***      ***    ***
baker                               *       ***      ***       ***       =      ***         ***      ***    ***
castillojuarez                              ***      ***       ***      ***     ***         ***      ***    ***
liau                                                 ***       ***      ***      **         ***      ***    ***
lopezmonroy                                                    ***      ***      =           =        =      **
marquardt                                                               ***     ***         ***      ***      *
mechti                                                                          ***         ***      ***    ***
shrestha                                                                                     =        =     **
villenaroman                                                                                          =      *
weren                                                                                                        *
baseline




Table A18. Significance of accuracy differences between system pairs for joint identification in
English social media.
                 ashok baker castillojuarez liau lopezmonroy marquardt mechti shrestha villenaroman weren baseline
ashok                    =          =         =       =         =        =       =           =       =       =
baker                               =        **       =         =        =       **          **      **      =
castillojuarez                               **       =         =        =       **          **       *      =
liau                                                  =         =        =       =           =       =       **
lopezmonroy                                                     =        =       =           =       =       =
marquardt                                                                =       =           =       =       =
mechti                                                                           =           *       =       =
shrestha                                                                                     =       =       *
villenaroman                                                                                         =      **
weren                                                                                                        *
baseline




Table A19. Significance of accuracy differences between system pairs for joint identification in
Spanish social media.
                 ashok baker castillojuarez liau lopezmonroy marquardt mechti shrestha villenaroman weren baseline
ashok                   ***       ***       ***      ***       ***      ***     ***         ***      ***    ***
baker                               =       ***      ***       ***       **     ***          =       ***    ***
castillojuarez                              ***      ***       ***       **     ***          =       ***    ***
liau                                                  =         =       ***      =          ***       **     =
lopezmonroy                                                     =        *       =           **       =      =
marquardt                                                                *       =           **       =      =
mechti                                                                           *           =        =      *
shrestha                                                                                     **       =      =
villenaroman                                                                                         **     ***
weren                                                                                                        =
baseline




Table A20. Significance of accuracy differences between system pairs for joint identification in
English blogs.
                 ashok baker castillojuarez liau lopezmonroy marquardt mechti shrestha villenaroman weren baseline
ashok                    =          =        **       **        =        =       =           *       **      =
baker                               =         =       *         =        =       =           *       =       =
castillojuarez                                =       *         =        =       =           =       =       =
liau                                                  =          *       =       =           =       =       =
lopezmonroy                                                     **       =       =           =       =       =
marquardt                                                                =       =           *       =       =
mechti                                                                           =           =       =       =
shrestha                                                                                     =       =       =
villenaroman                                                                                         =       =
weren                                                                                                        =
baseline




Table A21. Significance of accuracy differences between system pairs for joint identification in
Spanish blogs.
                 ashok baker castillojuarez liau lopezmonroy marquardt mechti shrestha villenaroman weren baseline
ashok                   ***       ***       ***      ***       ***      ***     ***         ***      ***    ***
baker                               =         =       =         =        =       =           =        =      =
castillojuarez                                =       =         =        =       =           =        =      =
liau                                                  =         =        =       =           =        =      =
lopezmonroy                                                     =        =       =           =        =      =
marquardt                                                                =       =           =        =      =
mechti                                                                           =           =        =      =
shrestha                                                                                     =        =      =
villenaroman                                                                                          =      =
weren                                                                                                        =
baseline




Table A22. Significance of accuracy differences between system pairs for joint identification in
English Twitter.
                 ashok baker castillojuarez liau lopezmonroy marquardt mechti shrestha villenaroman weren baseline
ashok                    =        ***       ***      ***        =        =       **          =        =      **
baker                             ***       ***      ***        =        =       **          =        =      **
castillojuarez                              ***      ***       ***      ***     ***         ***      ***    ***
liau                                                  =         **      **       =          ***      **       *
lopezmonroy                                                    ***      **       =           **      **       *
marquardt                                                                =       *           =        =      =
mechti                                                                           *           =        =      =
shrestha                                                                                     **       =      =
villenaroman                                                                                          =     **
weren                                                                                                        =
baseline




Table A23. Significance of accuracy differences between system pairs for joint identification in
Spanish Twitter.
                ashok  baker  castillojuarez  liau  lopezmonroy  marquardt  mechti  shrestha  villenaroman  weren  baseline
ashok                  ***    =               ***   ***          ***        ***     ***       ***           ***    ***
baker                         ***             =     =            =          =       *         =             =      =
castillojuarez                                ***   ***          ***        ***     ***       ***           ***    ***
liau                                                =            =          =       =         =             =      =
lopezmonroy                                                      =          =       =         =             =      =
marquardt                                                                   =       =         =             =      =
mechti                                                                              *         =             =      =
shrestha                                                                                      =             =      *
villenaroman                                                                                                =      =
weren                                                                                                              =
baseline




Table A24. Significance of accuracy differences between system pairs for joint identification in
English hotel reviews.
                ashok  baker  castillojuarez  liau  lopezmonroy  marquardt  mechti  shrestha  villenaroman  weren  baseline
ashok                  =      =               ***   ***          ***        =       ***       ***           ***    ***
baker                         =               ***   ***          ***        ***     ***       ***           ***    ***
castillojuarez                                ***   ***          **         =       ***       ***           ***    ***
liau                                                ***          ***        ***     ***       **            ***    ***
lopezmonroy                                                      ***        ***     =         =             =      =
marquardt                                                                   ***     ***       ***           ***    ***
mechti                                                                              ***       ***           ***    ***
shrestha                                                                                      =             =      =
villenaroman                                                                                                =      =
weren                                                                                                              =
baseline
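
The markers in the tables above summarise pairwise tests on accuracy differences. This appendix does not restate which test or thresholds produced them, so the following is only a sketch of one common option, a paired approximate randomisation test. The function names, the per-author 0/1 correctness input, and the thresholds mapping p-values to "=", "*", "**" and "***" are all assumptions made for illustration.

```python
import random


def approx_randomization_p(correct_a, correct_b, trials=10000, seed=0):
    """Two-sided p-value for the accuracy difference of two systems.

    correct_a and correct_b are per-author 0/1 correctness flags,
    aligned so that position i refers to the same author in both lists.
    (Hypothetical helper, not taken from the paper.)
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same authors")
    rng = random.Random(seed)
    # Difference in number of correct predictions; dividing by n would
    # give the accuracy difference but does not change the comparison.
    observed = abs(sum(correct_a) - sum(correct_b))
    hits = 0
    for _ in range(trials):
        diff = 0
        for a, b in zip(correct_a, correct_b):
            # Under the null hypothesis the two labels are exchangeable,
            # so swap each pair of outcomes with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (hits + 1) / (trials + 1)


def significance_marker(p):
    """Map a p-value to a table marker (thresholds are assumed)."""
    if p <= 0.001:
        return "***"
    if p <= 0.01:
        return "**"
    if p <= 0.05:
        return "*"
    return "="
```

Calling `significance_marker(approx_randomization_p(flags_a, flags_b))` for every pair of systems would fill one upper-triangular table of this kind.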




Appendix B      Distances in Age Identification




Figure B1. Distances between predicted and true classes in English social media.




Figure B2. Distances between predicted and true classes in Spanish social media.




Figure B3. Distances between predicted and true classes in English blogs.




Figure B4. Distances between predicted and true classes in Spanish blogs.




Figure B5. Distances between predicted and true classes in English Twitter.




Figure B6. Distances between predicted and true classes in Spanish Twitter.




Figure B7. Distances between predicted and true classes in English hotel reviews.
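
The figures in this appendix show how far predicted age classes fall from the correct ones. A minimal sketch of how such distances could be tabulated follows. It assumes the five PAN 2014 age classes in their natural order and takes the distance to be the signed difference between class positions; both the helper names and this definition are assumptions, and the definition behind the plotted figures may differ.

```python
from collections import Counter

# Ordered age classes, assumed here to be those of the PAN 2014 corpus.
AGE_CLASSES = ["18-24", "25-34", "35-49", "50-64", "65+"]


def class_distance(predicted, truth, classes=AGE_CLASSES):
    """Signed number of class steps from the true to the predicted class.

    Positive values mean the author was predicted to be older than the
    true class, negative values younger, and 0 is a correct prediction.
    """
    return classes.index(predicted) - classes.index(truth)


def distance_histogram(pairs, classes=AGE_CLASSES):
    """Count how often each distance occurs in (predicted, truth) pairs."""
    return Counter(class_distance(p, t, classes) for p, t in pairs)
```

For example, `class_distance("35-49", "18-24")` yields 2, and a histogram concentrated around 0 would indicate that most errors land in neighbouring classes.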



