<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the Celebrity Profiling Task at PAN 2020</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matti Wiegmann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benno Stein</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Potthast</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bauhaus-Universität Weimar</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Leipzig University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Celebrity profiling is author profiling applied to celebrities. As a subpopulation of social media users, celebrities are prolific authors for whom all kinds of personal information are public knowledge, and whose large followership enables new kinds of author profiling tasks: At this year's PAN, we study for the first time author profiling of social media users where an author's age, gender, and occupation have to be predicted by analyzing ten of their followers, rather than the author's original writing. This paper presents this novel approach to profiling, the 2,320-author dataset we created to study it, and the three models that participants proposed to solve the problem in diverse ways. The participants' follower-based profiling models achieve F1-scores that far exceed random guessing, even reaching the performance level of a baseline author profiling model when predicting occupations. Our evaluation reveals that follower-based profiling models share the strengths and weaknesses of author-based profiling models for celebrity profiling: They work best if the classes are topically coherent, as for the “sports” occupation, but less so in the opposite case, as for the “creator” occupation. Additionally, while predicting the age of the celebrities is still difficult, the follower-based models show a trend to predict younger users better than the author-based ones on our dataset.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Author profiling, the task of predicting the demographics of an author from their texts,
is a central task in authorship analysis with many applications in the social sciences,
forensics, and marketing. Author profiling technology has been developed on, and
applied to, many demographics, genres, and related tasks and often achieves good results,
but all common approaches require large amounts of high-quality training text from the
authors in question. Especially on social media, which is currently the dominant genre in
the field of author profiling, authors with both many public, high-quality texts and
verified personal demographics are few and far between. With current technology, it is not
possible to profile users who write only a few textual posts and otherwise interact only by reading,
liking, and forwarding the messages of other authors. Since these passive authors are
very frequent on social media, they can only be profiled based on other factors. One
such factor that provides information about passive authors is the set of messages posted by
other authors who are closely connected to them. Social media theory points out that
users with similar demographics and interests form online communities and that online
communities develop sociolects (language variation [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]), so inspecting the author’s
friends, followers, and their social graph relations may also hint at an author’s
demographics. Since celebrities are well-connected, influential, and elevated figures in their
communities, they are a suitable subpopulation to study algorithms that profile passive
users based on the social graph using the posts of connected authors.
      </p>
      <p>After introducing the task of celebrity profiling [46] for the first time, we organized a
corresponding shared task at PAN 2019 [47], asking participants to profile the age,
gender, fame, and occupation of a celebrity given their Twitter timeline. For the shared task
on celebrity profiling at PAN 2020, we tackle the problem of profiling passive authors,
namely by asking participants to predict three demographics of a celebrity (age as a
60-class problem, gender as a 2-class problem, and occupation as a 4-class problem),
given only the tweets of the celebrity’s followers. For this task, we constructed a new
dataset containing 2,320 celebrities, each annotated with all three demographics, and
the Twitter posts of ten randomly selected but active followers, each with a sufficient
number of original English tweets. For consistency, we reused the ranking measure
from the previous celebrity profiling task: the harmonic mean of the macro-averaged
multi-class F1 for gender, occupation, and a leniently calculated F1 for age. Three teams
submitted a diverse range of models, all outperforming a baseline model trained on the
followers’ texts, improving strongly above random guessing, and closing in on another
baseline trained on the celebrities’ tweets. We thus demonstrated that the task is, in fact,
solvable. An in-depth evaluation reveals similar strengths and weaknesses of the models
compared to the previous celebrity profiling task: Topically homogeneous occupations
(e.g., sports) are easier to predict than heterogeneous ones (e.g., creators), and younger
users are easier to predict than older ones.</p>
      <p>After reviewing the related work in Section 2, we describe in more detail the task,
the construction of the task’s datasets, the reasoning underlying our performance
measures, and our baselines in Section 3. In Section 4, we survey the software submissions,
and in Section 5, we report the evaluation results and present our analysis concerning the
performance of different approaches and individual demographics of the task.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        The study of author profiling techniques has a rich history, with the pioneering works
done by Pennebaker et al. [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], Koppel et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], Schler et al. [42], and Argamon et al.
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], focusing on age, gender, and personality from genres with longer, grammatical
documents such as blogs and essays. The most commonly used genre in recent years is
Twitter tweets, first used in 2011 to predict gender [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and age [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. Later work also used
Facebook posts [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], Reddit [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], and Sina Weibo [45]. Recently added demographics
include education [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], ethnicity [44], family status [45], income [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], occupation [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ],
location of origin [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], religion [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ], and location of residence [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        At PAN, author profiling has been studied since 2013, covering different
demographics, including age and gender [39, 38, 41], personality [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ], and language variety [40];
genres like blog posts, reviews, and social media messages [41]; prediction across
genres [36]; and author characteristics outside the domain of demographics, such
as an author’s inclination to spread fake news [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ] or whether an author writes like
a bot [37]. The population of celebrities, introduced to author profiling by Wiegmann
et al. [46], has been studied at PAN since 2019, when the first shared task on celebrity
profiling [47] set the goal of predicting the age, gender, occupation, and fame of 48,335
celebrities given their respective Twitter timelines.
      </p>
      <p>
        Methodologically, author profiling has been comparatively stable over the last
decade: most approaches utilize supervised machine learning based on the authors’
texts and varying stylometric and psycholinguistic features to encode non-lexical
information. The additional features proved to be important to the degree that even
advanced neural network architectures are only competitive if these features are explicitly
encoded [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The biggest methodological improvements, experimentally shown for
selected demographics, are the usage of message-level attention, recently proposed by
Lynn et al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] and of network homophily by encoding information from the social
graph. The pioneering work by Kosinski et al. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] shows that the common likes of
Facebook users suffice to predict demographics like gender, sexual orientation,
ethnicity, and substance use behavior with up to 0.9 accuracy. Recent advances in graph
encoding algorithms [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] motivated the use of node embeddings as supplemental
features when predicting age and gender on Facebook [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], occupation and income [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
racism and sexism [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], and suicide ideation [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] on Twitter. Similar approaches have
also been explored in related fields to, for example, profile the bias and factuality of
news agencies [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. An even more advanced approach to predict the occupation of
authors was suggested by Pan et al. [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], who jointly encoded the adjacency matrix of the
follower graph with the biographies of all authors in the network using graph
convolutional neural networks. Additionally, the metadata of related authors in the social graph
is central in other user analysis tasks, like geolocation prediction [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Besides text-based author profiling through the homophily of social networks,
several studies explore language variation and convergence on social media.
Essentially, language variation and convergence explain how groups of people adopt lexical
changes and are, together with the psycholinguistic preferences of social groups studied
by Pennebaker et al. [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], the reason author profiling is possible. The works that explore
language variation have shown, for example, that online language does not converge
to a common “netspeak” but often follows the geographic and demographic [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
similarities of online communities. Besides real-world factors, a significant impact on
lexical variation is attributed to social factors. For example, Pavalanathan and Eisenstein
[
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] show that lexical variation decreases with the size of the intended audience, which
means that social media texts have less lexical variation if they are addressed to a larger
audience. Similarly, Tamburrini et al. [43] have shown that an author’s choice of words depends
on the social identity of the conversation partner. The specific impact of the network
structure on language variation was studied in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], which found that language
variation is adopted more quickly if individuals are more closely connected. Based on the
related work, it is reasonable to assume that the same linguistic processes of lexical
variation and convergence used by Pennebaker et al. [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] to profile individuals based
on the individuals’ texts also apply to social groups, and it is also possible to profile
individuals to a degree based on the social groups’ texts.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Task Description: Follower-based Author Profiling</title>
      <p>
        We introduce the task of follower-based author profiling: Its goal is to predict the three
demographics of a social media user from the writing of their followers. We
operationalize the task using celebrities and their followership on Twitter, asking for the
prediction of their age, gender, and occupation. Our training dataset contains, for each
of 1,920 celebrities, the timelines of ten randomly chosen followers with at least 100
original English tweets each, balanced by gender and occupation. Likewise,
the test dataset contains another 400 celebrities. The performance of the submissions
was judged by the harmonic mean of the multi-class F1 scores of each demographic,
and evaluated using the TIRA evaluation platform [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]. All data and code are publicly
available.<sup>1</sup>
      </p>
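      <p>As a minimal sketch of the ranking measure described above, the harmonic mean of the three per-demographic F1 scores could be computed as follows. The function name c_rank is a hypothetical helper and not taken from the official evaluation code.</p>
      <preformat><![CDATA[
```python
from statistics import harmonic_mean


def c_rank(f1_age: float, f1_gender: float, f1_occupation: float) -> float:
    """Combine the three per-demographic F1 scores into one ranking score.

    Hypothetical helper: the harmonic mean is dominated by the weakest
    demographic, so a model cannot rank well by solving only one subtask.
    """
    scores = [f1_age, f1_gender, f1_occupation]
    # statistics.harmonic_mean yields 0 as soon as one score is 0.
    return harmonic_mean(scores)
```
]]></preformat>
      <p>Under this sketch, a model with F1 scores of 0.2, 0.8, and 0.8 obtains a combined score of 0.4, well below the arithmetic mean of 0.6.</p>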
      <sec id="sec-3-1">
        <title>Evaluation Data</title>
        <p>
          The dataset for our shared task was sampled from the Webis Celebrity
Profiling Corpus 2019 [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. This corpus contains the Twitter IDs of 71,706 celebrities and
extensive demographic information collected about them on Wikidata. We started by
extracting all celebrities from the corpus for whom the target demographics age, gender,
and occupation are all known, omitting all celebrities with demographics
outside of the targeted demographic spectrum:
        </p>
        <p>– Gender. From the eight different gender-related Wikidata labels, only male and
female were kept, since all others are rare and too diverse for a meaningful grouping.</p>
        <p>– Occupation. From the 1,379 different occupation-related Wikidata labels, only
those belonging to the following manually determined super-classes were kept (all
others are too rare):</p>
        <p>Sports for occupations involving participation in professional sports, primarily athletes.</p>
        <p>Performer for creative activities primarily involving a performance, such as actors,
entertainers, TV hosts, and musicians.</p>
        <p>Creator for creative activities with a focus on creating a work or piece of art, for
example, writers, journalists, designers, composers, producers, and architects.</p>
        <p>Politics for politicians and political advocates, lobbyists, and activists.</p>
        <p>– Age. Unlike the profiling literature on age prediction, we did not define a static set
of age groups, but used the year of birth between 1940 and 1999 as extracted from
Wikidata’s date of birth property. Figure 1 shows the distribution of the years
of birth in the training and test datasets.</p>
        <p>Figure 1: Distribution of the years of birth in the training and test datasets (x-axis: birth year; series: training, test).</p>
        <p><sup>1</sup> See https://pan.webis.de/data.html and https://github.com/pan-webis-de/pan-code.</p>
        <p>For this selection of celebrity profiles, we downloaded the Twitter IDs of up to
100,000 followers, starting with the most recent. To limit excessive downloading of
follower profiles, we first acquired the user descriptions for all followers and discarded
all but the most active users, that is, those with more than ten followers, more than ten followees,
and at least fifteen messages. Afterward, all users with more than 100,000 followers
or 1,000 followees were discarded. Finally, the timelines of all remaining followers
were downloaded, omitting all retweets, replies, and non-English tweets. To compile
the dataset, we randomly selected ten followers per celebrity who had at least 100
tweets left. This initial compilation of the evaluation dataset contained 10,585 celebrity
profiles with ten followers per celebrity and with at least 100 original English tweets
per follower. From this initial compilation, we selected the largest possible sample of
profiles balanced by occupation and by gender, yielding 2,320 celebrities for training
and test, and leaving 8,265 celebrities for an unbalanced, supplemental dataset. We
split the 2,320-celebrity dataset roughly 80:20 into a 1,920-author training dataset and
a 400-author test dataset, and handed out the training and supplemental datasets to
the participants, keeping the test data hidden for the cloud-based evaluation on TIRA.</p>
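        <p>The follower selection described above can be sketched as a simple filter followed by random sampling. The Follower record, its field names, and the helper functions are hypothetical and only illustrate the stated thresholds; in particular, we assume that the upper followee limit means "more than 1,000 followees".</p>
        <preformat><![CDATA[
```python
import random
from dataclasses import dataclass


@dataclass
class Follower:
    """Hypothetical record of the profile statistics used for filtering."""
    followers: int  # accounts following this user
    followees: int  # accounts this user follows
    messages: int   # messages posted by this user
    original_english_tweets: int  # tweets left without retweets, replies, non-English


def is_eligible(user: Follower) -> bool:
    """Apply the activity and size thresholds described in the text."""
    active = user.followers > 10 and user.followees > 10 and user.messages >= 15
    # Assumption: "more than" applies to both upper limits.
    too_large = user.followers > 100_000 or user.followees > 1_000
    return active and not too_large and user.original_english_tweets >= 100


def sample_followers(candidates, k=10, seed=0):
    """Randomly pick k eligible followers, or None if fewer than k qualify."""
    eligible = [user for user in candidates if is_eligible(user)]
    if len(eligible) < k:
        return None
    return random.Random(seed).sample(eligible, k)
```
]]></preformat>
        <p>Celebrities for which sample_followers returns None would be dropped in this sketch, mirroring the requirement of ten followers per celebrity.</p>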
        <p>Let T denote the set of class labels of a given demographic (e.g., gender), where
t ∈ T is a given class label (e.g., female). The prediction performance for T ∈ {gender,
occupation} is measured using the macro-averaged multi-class F1-score. This measure
averages the harmonic mean of precision and recall over all classes of a demographic,
weighting each class equally, and thus promoting correct predictions of small classes:
          <disp-formula><tex-math><![CDATA[F_{1,T} = \frac{2}{|T|} \sum_{t_i \in T} \frac{\text{precision}(t_i) \cdot \text{recall}(t_i)}{\text{precision}(t_i) + \text{recall}(t_i)}.]]></tex-math></disp-formula>
        </p>
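        <p>For illustration, the macro-averaged F1-score can be computed directly from two label lists. This is a minimal sketch and not the official evaluator; it assumes that the class set T consists of the labels occurring in the ground truth and that a class with undefined precision or recall contributes an F1 of zero.</p>
        <preformat><![CDATA[
```python
def macro_f1(truth, predicted):
    """Macro-averaged multi-class F1 over the class labels found in `truth`."""
    labels = sorted(set(truth))
    pairs = list(zip(truth, predicted))
    total = 0.0
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            # Harmonic mean of precision and recall for this class.
            total += 2 * precision * recall / (precision + recall)
    # Every class carries the same weight, regardless of its size.
    return total / len(labels)
```
]]></preformat>
        <p>Because every class is weighted equally, a model that ignores a small class is penalized as much as one that ignores a large class.</p>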
        <p>We also apply this measure to evaluate the prediction performance for the
demographic T = age, but change the computation of true positives: a predicted year is
counted as correct if it is within an ε-environment of the true year, where ε increases
linearly from 2 to 9 years with the true age of the celebrity in question:
          <disp-formula><tex-math><![CDATA[\varepsilon = -0.1 \cdot \text{truth} + 202.8,]]></tex-math></disp-formula>
where truth denotes the true year of birth.
This way of measuring the prediction performance for the age demographic addresses a
shortcoming of the traditional “fixed age interval” scheme: Defining strict age intervals
(e.g., 10-20 years, 20-30, etc.) overly penalizes small prediction errors made at the
interval boundaries, such as predicting an age of 21 instead of 20. Furthermore, we
decided against combining precise predictions with an error function like mean squared
error, because we presume that age prediction is more difficult for older users, as the
writing style presumably changes more slowly with increasing maturity.</p>
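        <p>The lenient matching rule can be sketched as follows. The function names are hypothetical, and two details are assumptions rather than statements from the task definition: the boundary is treated as inclusive, and ε is not rounded.</p>
        <preformat><![CDATA[
```python
def tolerance_tenths(true_year: int) -> int:
    """Ten times epsilon = -0.1 * true_year + 202.8, kept as an exact integer."""
    return 2028 - true_year


def is_lenient_hit(true_year: int, predicted_year: int) -> bool:
    """True if the prediction lies within epsilon years of the true birth year."""
    # Compare in tenths of a year to avoid floating-point rounding at the boundary.
    return 10 * abs(predicted_year - true_year) <= tolerance_tenths(true_year)
```
]]></preformat>
        <p>Under these assumptions, a celebrity born in 1999 tolerates an error of up to 2.9 years, and one born in 1940 an error of up to 8.8 years.</p>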
      </sec>
      <sec id="sec-3-2">
        <title>Baselines</title>
        <p>
          Since our task is rather novel, few competitive baselines are available, so we resort
to two basic approaches instead: the n-gram baseline, which uses the follower
timelines, and the oracle baseline, which is identical to n-gram but uses the celebrities’
timelines instead of the follower timelines. Both baselines solve the task with a
multinomial logistic regression [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], where the inputs are the TF-IDF vectors of the
respective tweets. The texts are preprocessed by lowercasing, replacing hashtags, usernames,
emoticons, emojis, time expressions, and numbers with respective special tokens,
removing all remaining newlines and non-ASCII characters, and collapsing spaces. The
TF-IDF vectors are constructed from the word 1-grams and 2-grams of all concatenated
tweets of the celebrities or followers, respectively, with a per-celebrity frequency of at
least 3. We added special separator tokens to encode the end of a tweet and the end
of a follower timeline. Due to the lenient calculation of F1,age, the age prediction was
simplified to five years of birth: 1947, 1963, 1975, 1985, and 1994.
        </p>
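        <p>A simplified sketch of the preprocessing and n-gram extraction steps is given below. The token names and regular expressions are assumptions chosen for illustration, and the handling of emoticons, emojis, time expressions, and separator tokens is omitted; the baselines' actual rules may differ.</p>
        <preformat><![CDATA[
```python
import re

# Hypothetical replacement rules; the token names are illustrative only.
_RULES = [
    (re.compile(r"#\w+"), " <hashtag> "),
    (re.compile(r"@\w+"), " <user> "),
    (re.compile(r"\d+(?:[.,:]\d+)*"), " <number> "),
]


def preprocess(tweet: str) -> str:
    """Lowercase, insert special tokens, drop non-ASCII, collapse spaces."""
    text = tweet.lower()
    for pattern, token in _RULES:
        text = pattern.sub(token, text)
    # Remove remaining newlines and all non-ASCII characters.
    text = text.replace("\n", " ")
    text = text.encode("ascii", errors="ignore").decode("ascii")
    return re.sub(r"\s+", " ", text).strip()


def word_ngrams(text: str, n_values=(1, 2)):
    """Return the word 1-grams and 2-grams of a preprocessed text."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + n])
        for n in n_values
        for i in range(len(tokens) - n + 1)
    ]
```
]]></preformat>
        <p>The resulting n-grams would then be counted per celebrity and passed to a TF-IDF weighting and a multinomial logistic regression, which this sketch leaves out.</p>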
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Survey of the Submitted Approaches</title>
      <p>Three participants submitted their software to our shared task. Altogether, the
submissions were methodologically diverse, covering creative feature engineering, thorough
feature selection, and contemporary deep learning methods. In contrast to last year,
no approach is generally superior to the others; each shows individual
strengths and weaknesses in some demographics. The overall ranking of the approaches
is shown in Table 1; in what follows, each approach is reviewed in more detail.</p>
      <p>
        The approach of Price and Hodge [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] utilizes a logistic regression classifier for
each individual demographic. The model does not directly use representations of the
text, but entirely relies on hand-crafted features as input; specifically: the average tweet
length per celebrity, the average of all word vectors of the followers’ tweets, and the
to-token-ratios of the POS-tags, stop words, named entity types, number of links,
hashtags, mentions, and emojis. To optimize their model, the authors used 20% of the
training dataset for validation in order to pre-evaluate three competing algorithms for each
demographic: logistic regression, random forest, and support vector machines. The
optimal setting of hyperparameters was determined via five-fold cross-validation on the
remaining 80% of the training dataset for each evaluated algorithm, where the optimal
parameters were determined using the macro-F1 score. The final model selection on the
left-out validation dataset using the official evaluation measures showed that the logistic
regression model was best-suited for all demographics.
      </p>
      <p>Table 1: Results of the submitted approaches and the baselines (columns: cRank, Age, Gender, Occupation).</p>
      <p>
        The approach of Koloski et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] utilizes a logistic regression classifier to predict
the age in eight classes, another logistic regression classifier to predict the occupation,
and an SVM to predict the gender of the celebrities. The model primarily uses lexical
representations as features, but limits the input text to 20 tweets per follower and thus
200 tweets in total per celebrity. Specifically, the features are computed by (1)
preprocessing the text into three versions: the original tweets, the tweets without punctuation,
and the tweets without punctuation and stop words; (2) computing the top 20,000 most
frequent character 1-grams and 2-grams and word 1-grams, 2-grams, and 3-grams; and
(3) extracting 512 dimensions with a singular value decomposition to be used as
features. To optimize their model, the authors first split the training dataset 90:10 into a
training and validation set, and used the training split in a five-fold cross-validation
to find the optimal n-gram limit, feature dimensionality, and age prediction strategy.
Specifically, six alternative feature counts between 2,500 and 50,000 were tested, seven
alternative feature dimensions between 128 and 2048, and three different strategies to
solve the age prediction task: as a regression task, as a classification task with 60 classes,
and as a classification task with eight classes. After optimizing the parameters, the authors
selected the final model for each demographic based on its performance on the validation dataset, comparing
XGBoost, logistic regression, and linear SVMs.
      </p>
      <p>
        The approach of Alroobaea et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] utilizes an LSTM neural network for
classification; however, no further details about its architecture are given. The model
exclusively uses a TF-IDF matrix of the followers’ texts as input. The text itself is
preprocessed by removing links, HTML-style tags, stop words, non-alphanumeric tokens, and
typical punctuation marks, replacing mentions with @, and stemming all remaining
tokens with NLTK’s Snowball stemmer. The authors did not report on any experiments to
optimize their model.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Results and Discussion</title>
      <p>Table 1 shows the results of the participants with successful submissions as well as the
performance of the three aforementioned baselines. All participants managed to surpass
the random expectation and improve on the n-gram baseline, with the winning approach
improving by 0.11 F1 in the combined metric cRank. The best performance of the submitted
solutions already closes in on the oracle baseline, which shows that the followers’
texts contain noticeable hints about the demographics of the followee. Table 2 shows
the F1 scores for each individual class. The results show that, although the submitted
approaches are quite diverse, their weaknesses are structural and allow some cautious
conclusions about the underlying profiling problem. First, it is easier to predict the age
of the youngest celebrities from follower tweets than from their own, but age prediction
gets increasingly difficult with increasing age. Second, predicting the celebrities’
gender from their follower tweets works better for male celebrities. Third, predicting the
occupation based on follower tweets competes with the oracle baseline.</p>
      <p>The best-performing submission for predicting the age of the celebrities from their
follower tweets, by Price and Hodge, achieved an F1 score of 0.432, which lies directly
between the n-gram and oracle baselines, at a distance of 0.07 from either. Judging by
the multi-class F1 scores shown in Table 1, age is the most difficult
demographic to predict this year. For ease of analysis, we evaluate the age prediction
subtask as a five-class problem over the ranges of birth years with the centroids 1994,
1985, 1975, 1963, and 1947. The results of the multi-class F1 scores shown in Table 1,
the class-wise F1 scores shown in Table 2, and the misclassifications depicted in the
confusion matrices in Figure 2 (top) allow for three observations: First, most submitted
models simply perform better on the majority classes. Since no participant employed
resampling to balance the training data, this effect may be due to the unbalanced
training data. The confusion matrices illustrate this effect, where all models skew towards
the center range of birth years, except for the one of Koloski et al., who optimized the
age-prediction strategy to achieve the opposite effect: their model skews towards never
predicting the center age group. Second, both the n-gram baseline and the model of
Koloski et al. significantly outperform the oracle baseline on predicting the youngest
celebrities, born between 1990 and 1999. This observation is not explained by the class
imbalance or sampling: Although both the model of Koloski et al. and the n-gram baseline
reduce the age classes from 60 down to eight and five, respectively, they still
significantly outperform the oracle baseline, which also reduces the number of age
groups to predict. The results do not fully explain this behavior, but it may hint at useful
information contained in follower tweets towards better detecting the youngest
celebrities. However, the increased performance when predicting young celebrities does not
improve the performance in general, since the oracle baseline, followed by the model
of Price and Hodge, still achieve better multi-class F1 scores and mean absolute errors.
Third, all models poorly predict the oldest celebrities born between 1940 and 1955,
although, as shown in Figure 1, this class has as many subjects as the 1990–1999 year
range while covering a broader age spectrum.</p>
      <p>The best-performing submission for predicting the gender of the celebrities from
their follower tweets, by Alroobaea et al., achieved an F1 score of 0.696, which is
closer to the oracle baseline (a distance of 0.057) than to the n-gram baseline (a distance of 0.112).
Predicting the binary gender has been included as a baseline task since it is very
commonly done when predicting demographics, and typically achieves accuracies above
0.9. Based on the observed results, gender prediction is more difficult for
the sampled celebrities. The F1 scores and the confusion matrices, as shown in Figure 2
(middle), allow for one observation: The models tend towards predicting a celebrity as
male rather than as female. This kind of skew is typically explained by imbalanced data
or dataset sampling. However, both explanations are unlikely, since our dataset is
balanced and has 200 celebrities per class, which is usually sufficient to avoid biased data.
In contrast, the best-performing model in this demographic tends to predict female over male, and
the oracle baseline, using the celebrities’ timelines, does so, too.</p>
      <p>The best-performing submission for predicting the occupation of the celebrities
from their follower tweets from Price and Hodge achieved an F1 score of 0.707, which
is marginally better than the oracle baseline by 0.007, on average. Predicting the
occupation is the easiest part of our shared task. We assume that occupation
prediction relies heavily on topic markers in the text, and that these topics are the common
ground for discussion between the followers of a celebrity. In this respect, it is
surprising that the submission supposedly encoding the least lexical but most stylometric
features achieved the best performance. The results of the F1 scores and the confusion
matrices shown in Figure 2 (bottom) allow for one further observation: Although the
class-wise results are mixed between the different submissions, politicians, performers,
and athletes (sports) are consistently predicted well, while creators are consistently
misclassified as either performers or politicians. These results are mostly consistent with the
results of the 2019 task, although this year, politicians were less frequently misclassified
than athletes.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and Outlook</title>
      <p>This paper overviews the shared task on celebrity profiling at PAN 2020. The goal of
the task was to determine three demographics of celebrities on Twitter based on their
followers’ writing, rather than their own: the age as a 60-class problem with lenient
evaluation, the gender as a two-class problem, and the occupation as a four-class
problem. The submitted models rely on a variety of proven methods: feature-based machine
learning with stylometric or n-gram features, and LSTMs on TF-IDF matrices. The
individual demographics’ results point towards similar difficulties as were found in the
corresponding shared task of 2019: the topically more diverse occupations “creator” and
“performer” are harder to profile, as are older authors compared to younger ones. Our results
demonstrate that it is possible to profile authors based on their followers’
texts almost as well as on their own. However, there is still much potential to explore
different approaches and gain further insights. Technologically, utilizing the messages
of followers to improve author profiling models is a promising future direction.</p>
      <sec id="sec-6-1">
        <title>Acknowledgments</title>
        <p>We thank our participants for their effort and dedication, and the CLEF organizers for
hosting PAN and the shared task on celebrity profiling.</p>
        <p>[36] Rangel, F., Montes-y-Gómez, M., Potthast, M., Stein, B.: Overview of the 6th Author
Profiling Task at PAN 2018: Cross-domain Authorship Attribution and Style Change
Detection. In: Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L. (eds.) CLEF 2018 Evaluation
Labs and Workshop – Working Notes Papers, 10-14 September, Avignon, France, CEUR
Workshop Proceedings, CEUR-WS.org (Sep 2018), ISSN 1613-0073, URL
http://ceur-ws.org/Vol-2125/
[37] Rangel, F., Rosso, P.: Overview of the 7th Author Profiling Task at PAN 2019: Bots and</p>
        <p>
          Gender Profiling. In: [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], URL http://ceur-ws.org/Vol-2380/
[38] Rangel, F., Rosso, P., Chugur, I., Potthast, M., Trenkmann, M., Stein, B., Verhoeven, B.,
Daelemans, W.: Overview of the 2nd Author Profiling Task at PAN 2014. In: Cappellato,
L., Ferro, N., Halvey, M., Kraaij, W. (eds.) Working Notes Papers of the CLEF 2014
Evaluation Labs, CEUR Workshop Proceedings, CEUR-WS.org (Sep 2014), ISSN
1613-0073, URL http://ceur-ws.org/Vol-1180/
[39] Rangel, F., Rosso, P., Koppel, M., Stamatatos, E., Inches, G.: Overview of the Author
Profiling Task at PAN 2013. In: Forner, P., Navigli, R., Tufis, D. (eds.) CLEF 2013
Evaluation Labs and Workshop – Working Notes Papers, 23-26 September, Valencia,
Spain, CEUR-WS.org (Sep 2013), ISBN 978-88-904810-3-1, ISSN 2038-4963, URL
http://ceur-ws.org/Vol-1179
[40] Rangel, F., Rosso, P., Potthast, M., Stein, B.: Overview of the 5th Author Profiling Task at
PAN 2017: Gender and Language Variety Identification in Twitter. In: Cappellato, L.,
Ferro, N., Goeuriot, L., Mandl, T. (eds.) CLEF 2017 Evaluation Labs and Workshop –
Working Notes Papers, 11-14 September, Dublin, Ireland, CEUR Workshop Proceedings,
CEUR-WS.org (Sep 2017), ISSN 1613-0073, URL http://ceur-ws.org/Vol-1866/
[41] Rangel, F., Rosso, P., Verhoeven, B., Daelemans, W., Potthast, M., Stein, B.: Overview of
the 4th Author Profiling Task at PAN 2016: Cross-Genre Evaluations. In: Balog, K.,
Cappellato, L., Ferro, N., Macdonald, C. (eds.) CLEF 2016 Evaluation Labs and
Workshop – Working Notes Papers, 5-8 September, Évora, Portugal, CEUR Workshop
Proceedings, CEUR-WS.org (Sep 2016), ISSN 1613-0073, URL
http://ceur-ws.org/Vol-1609/16090750.pdf
[42] Schler, J., Koppel, M., Argamon, S., Pennebaker, J.: Effects of Age and Gender on
Blogging. In: AAAI Spring Symposium: Computational Approaches to Analyzing
Weblogs, pp. 199–205, AAAI (2006)
[43] Tamburrini, N., Cinnirella, M., Jansen, V., Bryden, J.: Twitter users change word usage
according to conversation-partner social identity. Social Networks 40, 84?89 (01 2015),
https://doi.org/10.1016/j.socnet.2014.07.004
[44] Volkova, S., Bachrach, Y.: On predicting sociodemographic traits and emotions from
communications in social networks and their implications to online self-disclosure.
        </p>
        <p>
          Cyberpsy., Behavior, and Soc. Networking 18(12), 726–736 (2015)
[45] Wang, X., Bendersky, M., Metzler, D., Najork, M.: Learning to Rank with Selection Bias
in Personal Search. In: SIGIR, pp. 115–124, ACM (2016)
[46] Wiegmann, M., Stein, B., Potthast, M.: Celebrity Profiling. In: Korhonen, A., Màrquez, L.,
Traum, D. (eds.) 57th Annual Meeting of the Association for Computational Linguistics
(ACL 2019), pp. 2611–2618, Association for Computational Linguistics (Jul 2019), URL
https://www.aclweb.org/anthology/P19-1249
[47] Wiegmann, M., Stein, B., Potthast, M.: Overview of the Celebrity Profiling Task at PAN
2019. In: [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], URL http://ceur-ws.org/Vol-2380/
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Aletras</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chamberlain</surname>
            ,
            <given-names>B.P.</given-names>
          </string-name>
          :
          <article-title>Predicting twitter user socioeconomic attributes with network and language information</article-title>
          .
          <source>In: Proceedings of the 29th on Hypertext and Social Media</source>
          , pp.
          <fpage>20</fpage>
          -
          <lpage>24</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Alroobaea</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Almulihi</surname>
            ,
            <given-names>A.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alharithi</surname>
            ,
            <given-names>F.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mechti</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krichen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Belguith</surname>
            ,
            <given-names>L.H.</given-names>
          </string-name>
          :
          <article-title>A Deep learning Model to predict gender, age and occupation of the celebrities</article-title>
          .
          <source>In: [8]</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pennebaker</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Automatically Profiling the Author of an Anonymous Text</article-title>
          .
          <source>Commun. ACM</source>
          <volume>52</volume>
          (
          <issue>2</issue>
          ),
          <fpage>119</fpage>
          -
          <lpage>123</lpage>
          (Feb
          <year>2009</year>
          ), ISSN 0001-0782, https://doi.org/10.1145/1461928.1461959
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Bakerman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pazdernik</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilson</surname>
            ,
            <given-names>A.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fairchild</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bahran</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Twitter Geolocation: A Hybrid Approach</article-title>
          .
          <source>ACM Trans. Knowl. Discov. Data</source>
          <volume>12</volume>
          (
          <issue>3</issue>
          ) (
          <year>2018</year>
          ), https://doi.org/10.1145/3178112
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Baly</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karadzhov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>An</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kwak</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dinkov</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ali</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glass</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nakov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>What was written vs. who read it: News media profiling using text analysis and social media context</article-title>
          . arXiv preprint arXiv:2005.04518
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Bevendorff</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hagen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Heuristic Authorship Obfuscation</article-title>
          . In:
          <string-name>
            <surname>Korhonen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Màrquez</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Traum</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (eds.)
          <source>57th Annual Meeting of the Association for Computational Linguistics (ACL 2019)</source>
          , pp.
          <fpage>1098</fpage>
          -
          <lpage>1108</lpage>
          , Association for Computational Linguistics (Jul
          <year>2019</year>
          ), URL https://www.aclweb.org/anthology/P19-1104
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Burger</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Henderson</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zarrella</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Discriminating Gender on Twitter</article-title>
          .
          <source>In: Proceedings of the Conference on Empirical Methods in Natural Language Processing</source>
          , pp.
          <fpage>1301</fpage>
          -
          <lpage>1309</lpage>
          , ACM (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eickhoff</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Névéol</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (eds.):
          <source>CLEF 2020 Labs and Workshops, Notebook Papers</source>
          , CEUR Workshop Proceedings, CEUR-WS.org (Sep
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Losada</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müller</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          (eds.):
          <source>CLEF 2019 Labs and Workshops, Notebook Papers</source>
          , CEUR Workshop Proceedings, CEUR-WS.org (Sep
          <year>2019</year>
          ), URL http://ceur-ws.org/Vol-2380/
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Carmona</surname>
            ,
            <given-names>M.Á.Á.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guzmán-Falcón</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-y-Gómez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Escalante</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pineda</surname>
            ,
            <given-names>L.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reyes-Meza</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sulayes</surname>
            ,
            <given-names>A.R.</given-names>
          </string-name>
          :
          <article-title>Overview of MEX-A3T at IberEval 2018: Authorship and aggressiveness analysis in Mexican Spanish tweets</article-title>
          .
          <source>In: IberEval@SEPLN, CEUR Workshop Proceedings</source>
          , vol.
          <volume>2150</volume>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>96</lpage>
          , CEUR-WS.org (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Eisenstein</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Connor</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>N.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xing</surname>
            ,
            <given-names>E.P.</given-names>
          </string-name>
          :
          <article-title>Diffusion of lexical change in social media</article-title>
          .
          <source>PLOS ONE</source>
          <volume>9</volume>
          (
          <issue>11</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          (11
          <year>2014</year>
          ), https://doi.org/10.1371/journal.pone.0113114
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Farnadi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Cock</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moens</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          :
          <article-title>User profiling through deep multimodal fusion</article-title>
          .
          <source>In: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining</source>
          , pp.
          <fpage>171</fpage>
          -
          <lpage>179</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Fatima</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anwar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nawab</surname>
            ,
            <given-names>R.M.A.</given-names>
          </string-name>
          :
          <article-title>Multilingual author profiling on facebook</article-title>
          .
          <source>Inf. Process. Manage</source>
          .
          <volume>53</volume>
          (
          <issue>4</issue>
          ),
          <fpage>886</fpage>
          -
          <lpage>904</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Giachanou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Crestani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Leveraging emotional signals for credibility detection</article-title>
          .
          <source>In: Proceedings of the 42nd International ACM SIGIR</source>
          , pp.
          <fpage>877</fpage>
          -
          <lpage>880</lpage>
          , ACM (
          <year>2019</year>
          ), https://doi.org/10.1145/3331184.3331285
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Gjurkovic</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snajder</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Reddit: A gold mine for personality prediction</article-title>
          .
          <source>In: PEOPLES@NAACL-HLT</source>
          , pp.
          <fpage>87</fpage>
          -
          <lpage>97</lpage>
          , Association for Computational Linguistics (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Grover</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leskovec</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>node2vec: Scalable feature learning for networks</article-title>
          .
          <source>In: Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , pp.
          <fpage>855</fpage>
          -
          <lpage>864</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Kershaw</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rowe</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stacey</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Towards Modelling Language Innovation Acceptance in Online Social Networks</article-title>
          .
          <source>In: Proceedings of the Ninth ACM WSDM</source>
          , pp.
          <fpage>553</fpage>
          -
          <lpage>562</lpage>
          , ACM (
          <year>2016</year>
          ), https://doi.org/10.1145/2835776.2835784
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Koloski</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pollak</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Škrlj</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Know your Neighbors: Efficient Author Profiling via Follower Tweets</article-title>
          . In: [8]
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shimoni</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Automatically Categorizing Written Texts by Author Gender</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          <volume>17</volume>
          (
          <issue>4</issue>
          ),
          <fpage>401</fpage>
          -
          <lpage>412</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Kosinski</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stillwell</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graepel</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Private traits and attributes are predictable from digital records of human behavior</article-title>
          .
          <source>Proceedings of the National Academy of Sciences</source>
          <volume>110</volume>
          (
          <issue>15</issue>
          ),
          <fpage>5802</fpage>
          -
          <lpage>5805</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Lynn</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balasubramanian</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwartz</surname>
            ,
            <given-names>H.A.</given-names>
          </string-name>
          :
          <article-title>Hierarchical Modeling for User Personality Prediction: The Role of Message-Level Attention</article-title>
          . In:
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics (Jul
          <year>2020</year>
          ), https://doi.org/10.18653/v1/2020.acl-main.472, URL https://www.aclweb.org/anthology/2020.acl-main.472
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Mishra</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Del Tredici</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yannakoudakis</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shutova</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Author profiling for abuse detection</article-title>
          .
          <source>In: Proceedings of the 27th International Conference on Computational Linguistics</source>
          , pp.
          <fpage>1088</fpage>
          -
          <lpage>1098</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Mishra</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sinha</surname>
            ,
            <given-names>P.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sawhney</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mahata</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mathur</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shah</surname>
            ,
            <given-names>R.R.</given-names>
          </string-name>
          :
          <article-title>Snap-batnet: Cascading author profiling and social network graphs for suicide ideation detection on social media</article-title>
          .
          <source>In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop</source>
          , pp.
          <fpage>147</fpage>
          -
          <lpage>156</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhardwaj</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chieu</surname>
            ,
            <given-names>H.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Puay</surname>
            ,
            <given-names>N.Y.</given-names>
          </string-name>
          :
          <article-title>Twitter homophily: Network based prediction of user's occupation</article-title>
          . In:
          <string-name>
            <surname>Korhonen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Traum</surname>
            ,
            <given-names>D.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Màrquez</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (eds.)
          <source>Proceedings of the 57th ACL</source>
          , pp.
          <fpage>2633</fpage>
          -
          <lpage>2638</lpage>
          , Association for Computational Linguistics (
          <year>2019</year>
          ), https://doi.org/10.18653/v1/p19-1252
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Pavalanathan</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eisenstein</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Audience-modulated variation in online social media</article-title>
          .
          <source>American Speech</source>
          <volume>90</volume>
          (05
          <year>2015</year>
          ), https://doi.org/10.1215/00031283-3130324
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanderplas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cournapeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brucher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perrot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duchesnay</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          ,
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Peersman</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Vaerenbergh</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Predicting Age and Gender in Online Social Networks</article-title>
          . In:
          <source>Proceedings of the 3rd international workshop on Search and mining user-generated contents</source>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>44</lpage>
          , SMUC '11, ACM, New York, NY, USA (
          <year>2011</year>
          ), ISBN 978-1-4503-0949-3, https://doi.org/10.1145/2065023.2065035, URL http://doi.acm.org/10.1145/2065023.2065035
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <surname>Pennebaker</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mehl</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Niederhoffer</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Psychological aspects of natural language use: Our words, our selves</article-title>
          .
          <source>Annual Review of Psychology</source>
          <volume>54</volume>
          ,
          <fpage>547</fpage>
          -
          <lpage>577</lpage>
          (
          <year>2003</year>
          ), ISSN 0066-4308, https://doi.org/10.1146/annurev.psych.54.101601.145041
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>TIRA Integrated Research Architecture</article-title>
          . In:
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (eds.)
          <source>Information Retrieval Evaluation in a Changing World</source>
          ,
          <source>The Information Retrieval Series</source>
          , Springer (Sep
          <year>2019</year>
          ), ISBN 978-3-030-22948-1, https://doi.org/10.1007/978-3-030-22948-1_5
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <surname>Preotiuc-Pietro</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lampos</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aletras</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>An analysis of the user occupational class through Twitter content</article-title>
          . In:
          <source>ACL (1)</source>
          , pp.
          <fpage>1754</fpage>
          -
          <lpage>1764</lpage>
          , Association for Computational Linguistics (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <surname>Preotiuc-Pietro</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ungar</surname>
            ,
            <given-names>L.H.</given-names>
          </string-name>
          :
          <article-title>User-level race and ethnicity predictors from Twitter text</article-title>
          . In:
          <string-name>
            <surname>Bender</surname>
            ,
            <given-names>E.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Derczynski</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isabelle</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (eds.)
          <source>Proceedings of the 27th International Conference on Computational Linguistics</source>
          , COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018
          , pp.
          <fpage>1534</fpage>
          -
          <lpage>1545</lpage>
          , Association for Computational Linguistics (
          <year>2018</year>
          ), URL https://aclanthology.info/papers/C18-1130/c18-1130
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <surname>Price</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hodge</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Celebrity Profiling using Twitter Follower Feeds: Notebook for PAN at CLEF 2020</article-title>
          . In: [8]
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <surname>Ramos</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neto</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silva</surname>
            ,
            <given-names>B.B.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Monteiro</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paraboni</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dias</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Building a corpus for personality-dependent natural language understanding and generation</article-title>
          . In:
          <source>LREC</source>
          , European Language Resources Association (ELRA) (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Celli</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Overview of the 3rd Author Profiling Task at PAN 2015</article-title>
          . In:
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>San Juan</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (eds.)
          <source>CLEF 2015 Evaluation Labs and Workshop - Working Notes Papers</source>
          , 8-11 September, Toulouse, France, CEUR Workshop Proceedings, CEUR-WS.org (Sep
          <year>2015</year>
          ), ISSN 1613-0073, URL http://ceur-ws.org/Vol-1391
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giachanou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghanem</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Overview of the 8th Author Profiling Task at PAN 2020: Profiling Fake News Spreaders on Twitter</article-title>
          . In: [8]
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>