Celebrity Profiling with Transfer Learning
Notebook for PAN at CLEF 2019

Björn Pelzer
Swedish Defence Research Agency FOI, Stockholm, Sweden
bjorn.pelzer@foi.se

Copyright (c) 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

Abstract

In this approach to the Celebrity Profiling task we implemented a system that evaluates each tweet of an incoming feed using four classifiers, one for each trait: fame, occupation, gender and birthyear. The overall result for the feed of one celebrity is then determined by the majority of the individual tweet results for each trait. The classifiers were trained using transfer learning on a language model, which itself had been created by unsupervised learning on the raw text of all the tweets in the training data.

1 Introduction

This approach to the Celebrity Profiling task [9] of the PAN 2019 competition [1] is based on transfer learning: a large existing language model, trained on an extensive amount of raw text, is fine-tuned with labeled training samples for a specific task. By utilizing the word embeddings of the language model, the fine-tuning step can produce a classifier with state-of-the-art performance using relatively few labeled training samples.

The task organizers have provided the extensive Celebrity Profiling corpus [8] for training, comprising 48,335 anonymized Twitter user profiles from celebrities. Each such profile has been annotated with four traits: fame, occupation, gender and birthyear. Each profile also comes with on average 2,181 tweet texts (presumably) authored by the respective celebrity. Due to the special nature of the language used in tweets, we opted to first train a Twitter-specific language model from scratch. This model was then fine-tuned with the labeled training data provided by the competition, resulting in four classifiers, one for each trait to be detected. The system implemented for the competition then uses these four classifiers to evaluate each provided tweet, yielding an estimated fame, occupation, gender and birthyear for every single tweet. When all tweets of one person have been evaluated, the overall result for each trait of the person is determined by the majority of the individual tweet results.

This approach is relatively fine-grained: one classifier for each trait, and each tweet evaluated individually. One might instead consider one combined classifier for all four traits, and one could evaluate the entire set of tweets of one person as one long text. There are several reasons for our choices. Working with individual tweets and classifiers allows better tailoring of the training data for the traits: for example, a set of tweets balanced for gender might be imbalanced for birthyear, and our method allowed us to compose different training sets. Also, by splitting up the tweet sets of each person, we could ensure that each classifier was trained and validated on tweets from each person. Finally, beyond the competition we are interested in analysing individual tweets and texts in general, and getting an idea of the expected performance in such applications was important to us.
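To make the aggregation concrete, the following is a minimal sketch of the per-feed majority vote. The trait classifiers and their predict() interface are hypothetical stand-ins, not our actual implementation.

    from collections import Counter

    def profile_feed(tweets, classifiers):
        """Aggregate per-tweet predictions into one result per trait.

        tweets:      list of tweet texts from one author feed
        classifiers: dict mapping a trait name ('fame', 'occupation',
                     'gender', 'birthyear') to a classifier exposing a
                     predict() method (hypothetical interface)
        """
        results = {}
        for trait, clf in classifiers.items():
            votes = Counter(clf.predict(tweet) for tweet in tweets)
            # Relative majority: the most frequent class wins.
            results[trait] = votes.most_common(1)[0][0]
        return results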
2 Related Work

Classifying user attributes based on tweets has been a topic of research for approximately a decade at the time of this writing, with Rao et al. [7] being among the earliest in 2010. Machine learning has been employed for this objective since before its current resurgence in the form of deep learning, for example by Pennacchiotti and Popescu in 2011 [5]. Author profiling based on tweets has been a part of the PAN competitions since 2013 [4]. Yang et al. utilized transfer learning for tweet classification in 2017 [10].

3 Transfer Learning

Transfer learning on language models has become a highly successful approach for natural language processing (NLP) in the recent past. For example, Google BERT (Bidirectional Encoder Representations from Transformers, https://github.com/google-research/bert) [2] achieved new state-of-the-art results on eleven NLP tasks in 2018. Another successful transfer learning approach is ULMFiT (Universal Language Model Fine-tuning) [3] as implemented in the fast.ai framework (https://www.fast.ai). ULMFiT set new standards earlier in 2018 before being surpassed by BERT.

Both BERT and ULMFiT have significant hardware requirements. In the case of BERT these are so severe that training a new language model from scratch is not feasible on the hardware commonly available in academia. Instead, BERT users need to rely on the pre-trained model available from Google. ULMFiT is more manageable: a new language model can be trained on a computer with 128 GB RAM and an Nvidia GTX 1080 Ti GPU in less than a week. This has led to the emergence of a rich community of ULMFiT users who create language models for different languages and share their experiences. For this reason we chose ULMFiT for our implementation.

Unfortunately our testing revealed that running the classifiers on the intended data still has fairly demanding hardware requirements, regularly consuming 40 GB of RAM, with a high-performance GPU being almost non-optional. As this exceeds the capabilities of the TIRA virtual machines [6] used for the competition, we did not expect good – if any – results. Indeed, these concerns were proven correct, and our system only handled a fraction of the data in the competition time, leading to extrapolated and non-representative results. Nevertheless we describe our approach, as we gained useful experience for the future.

4 Language Model

ULMFiT is provided with a pre-trained model for English, based on Wikipedia. As the almost entirely encyclopedic language of this corpus may not be a good match when dealing with other types of texts, the authors of ULMFiT recommend pre-training a language model from scratch when needed, and provide some tools for this. Twitter texts tend to contain large numbers of emoticons, links, abbreviations, colloquialisms, spelling mistakes and bad grammar – a stark contrast to the language in Wikipedia. Therefore we chose to train our own Twitter language model.

The ULMFiT community recommends training on a corpus with approximately 100 million tokens. The tweets in the Celebrity Profiling training data are well above this amount, with more than 1.6 billion tokens in total. The recommendation of 100 million is a decent compromise between training time, hardware requirements during training and desired performance, but in our experience from earlier experiments with other languages, a model gets better with a larger corpus. We therefore trained our language model on all the tweets in the training set.

It should be noted that while the fast.ai framework of ULMFiT comes with tools for training a language model, several of the preprocessing steps do not scale well to larger corpus sizes, requiring more than the 128 GB of RAM we had available for this. Thus we reimplemented most of the preprocessing, including tokenization and vocabulary building, and only used ULMFiT for the actual training, which required approximately five days.
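For orientation, the following is a minimal sketch of what the training step looks like with the fastai v1 text API. It relies on the library's built-in preprocessing rather than our reimplemented pipeline, and the file name, column name and hyperparameters are illustrative assumptions.

    import pandas as pd
    from fastai.text import TextLMDataBunch, language_model_learner, AWD_LSTM

    # Hypothetical file with one tweet per row in a 'text' column.
    tweets = pd.read_csv('tweets.csv')
    valid = tweets.sample(frac=0.1, random_state=0)
    train = tweets.drop(valid.index)

    data_lm = TextLMDataBunch.from_df('.', train_df=train, valid_df=valid,
                                      text_cols='text')

    # Train an AWD-LSTM language model from scratch, i.e. without the
    # Wikipedia weights (pretrained=False), using the one-cycle policy.
    learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3,
                                   pretrained=False)
    learn.fit_one_cycle(10, 1e-2)
    learn.save_encoder('twitter_enc')  # reused later by the four classifiers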
5 Classifiers

Four copies of our language model were fine-tuned to become four separate classifiers, one for each trait used in the competition. The classifiers were trained on a per-tweet basis. In other words, the provided Celebrity Corpus, which collects all tweets from one author into a single feed, was broken down into individual tweets before training, and each tweet was annotated with the four trait labels from its source feed/author. From this we selected and downsampled four training sets, one per classifier, ensuring approximately balanced data for the respective trait. The resulting balance was often far from perfect: due to the sometimes extreme disproportions in the original data we made considerable trade-offs here to keep the training sets from becoming too small. All four classifiers were trained according to the same one-cycle policy recommended for ULMFiT classifier training by fast.ai. The accuracies achieved by the individual classifiers are stated in Table 1.

Table 1. Classifier accuracies

classifier  fame  occupation  gender  birthyear
accuracy    0.39  0.51        0.68    0.32

5.1 Fame

The classifier for the fame trait is arguably the worst, as its accuracy of 0.39 is not much better than randomly guessing one of the three classes of this trait. This may not be all that surprising, considering that the link between a person's tweet writing style and their actual fame seems tenuous, especially given that most celebrities are still famous for activities outside of Twitter. Nevertheless, given a large number of tweets the current accuracy should let the system tend towards the correct decision.

5.2 Occupation

With eight different occupation classes to choose from, this classifier does fairly well with an accuracy of 0.51. Celebrities are usually famous for the activities they perform in their given occupation, and it seems plausible that a celebrity would often write about such activities and use words that are clear indicators of the given field.

5.3 Gender

The class nonbinary in the gender trait occurs only in about 0.1 percent of the tweets, so downsampling to actual balance would have meant discarding a lot of useful training data for the two far more likely classes female and male. Instead we only ensured balance between the latter two. While the classifier was trained on nonbinary samples and will try to recognize this class, the accuracy of 0.68 thus only holds for data with a fairly realistic distribution, i.e. with hardly any nonbinary occurrences.

5.4 Birthyear

The birthyear trait has the highest number of possible classes, covering the years from 1940 to 2012, with a few gaps not represented by any author feeds. It seemed daunting to train a fine-grained classifier to distinguish between all these classes. The competition design also acknowledges this difficulty by accepting answers as correct as long as they fall inside a certain interval around the actually correct value. Following the interval computation formula in the competition evaluation script, we determined eight intervals to cover the entire range, and we then reclassified each training example to the "middle" year of the interval that contains the original year associated with that tweet.
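To illustrate how one of the four classifiers is derived from the language model, the following is a minimal sketch for the birthyear trait, continuing the fastai v1 sketch from Section 4. The interval boundaries shown are placeholders (the real eight intervals follow from the competition evaluation script), and the file name, column names, learning rates and unfreezing schedule are illustrative assumptions.

    import pandas as pd
    from fastai.text import TextClasDataBunch, text_classifier_learner, AWD_LSTM

    def to_interval_midpoint(year, intervals):
        """Map a birthyear to the middle year of the interval containing it."""
        for low, high in intervals:
            if low <= year <= high:
                return (low + high) // 2
        raise ValueError('year %d not covered by any interval' % year)

    # Placeholder boundaries only; the actual eight intervals were computed
    # from the competition evaluation script.
    intervals = [(1940, 1948), (1949, 1957), (1958, 1966), (1967, 1975),
                 (1976, 1984), (1985, 1993), (1994, 2002), (2003, 2012)]

    # Hypothetical per-tweet file with 'text' and 'birthyear' columns.
    tweets = pd.read_csv('tweets_with_labels.csv')
    valid = tweets.sample(frac=0.1, random_state=0)
    train = tweets.drop(valid.index)
    train['label'] = train['birthyear'].apply(lambda y: to_interval_midpoint(y, intervals))
    valid['label'] = valid['birthyear'].apply(lambda y: to_interval_midpoint(y, intervals))

    # data_lm is the language-model DataBunch from the Section 4 sketch;
    # the classifier has to share its vocabulary.
    data_clas = TextClasDataBunch.from_df('.', train_df=train, valid_df=valid,
                                          vocab=data_lm.vocab,
                                          text_cols='text', label_cols='label')

    learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
    learn.load_encoder('twitter_enc')  # weights from the Twitter language model

    # One-cycle training with gradual unfreezing, as recommended by fast.ai.
    learn.fit_one_cycle(1, 2e-2)
    learn.freeze_to(-2)
    learn.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))
    learn.unfreeze()
    learn.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3))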
6 Overall System

All four classifiers achieved better accuracy than random chance given the respective number of classes. As we expected the competition test feeds to be of a size comparable to the provided training feeds, i.e. many (hundreds of) tweets per author, we considered it viable to employ our classifiers in a majority-vote fashion: each single tweet t_a,1, ..., t_a,n of an author feed a is classified individually by all four classifiers. This results in the four estimates f(t_a,i), o(t_a,i), g(t_a,i), b(t_a,i) (fame, occupation, gender and birthyear) for a given tweet t_a,i. The overall result for the whole author feed is then, in each trait, determined as the class occurring most often (i.e. relative majority) among the individual tweet estimates of the feed.

7 Evaluation

As expected after initial tests, our system was too demanding for the competition computers, and none of the competition test sets could be evaluated in time. The competition organizers generously provided us with the smaller competition test dataset 2, so that we could run a classification on our own computer. We then forwarded our resulting classification labels to the organizers, who in turn evaluated our data and gave us the results. Naturally, the results of this are not to be considered as actual competition results. We thus present them in Table 2 without any direct comparison to the official results from other competitors, to avoid any misunderstanding in the matter. The cRank as per the competition rules is the harmonic mean of the F1-scores.

Table 2. Overall results

            F1     accuracy
cRank       0.499  n/a
mean        n/a    0.621
fame        0.46   0.556
occupation  0.48   0.704
gender      0.548  0.862
birthyear   0.518  0.364

We can see that the order of trait-based accuracies matches the one determined during the training of the individual classifiers as found in Table 1: the overall performance for gender was best, followed by occupation, fame and finally birthyear. The competitors in the official results largely follow the same order, so it seems plausible that this corresponds to an increasing difficulty among the four traits.
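Since the cRank is the harmonic mean of the four per-trait F1-scores, the value in Table 2 can be reproduced directly; a small sanity check:

    f1 = {'fame': 0.46, 'occupation': 0.48, 'gender': 0.548, 'birthyear': 0.518}
    c_rank = len(f1) / sum(1 / score for score in f1.values())
    print(round(c_rank, 3))  # 0.499, matching Table 2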
8 Conclusions and Future Work

We believe our approach to be promising, but it is too heavyweight to be competitive at this time. Systems based on language models tend to be demanding, and ours basically employs four language models simultaneously. Also, the fast.ai framework is still somewhat experimental, and the version we were working with showed spotty support for non-GPU computations. As this situation stabilizes, a system like ours may become more competitive. More effort could also be spent on optimizing our implementation and on using multiprocessing.

For now we have learnt some important lessons, and aspects of the system (in particular the gender and birthyear classifiers) may become useful in studies of other areas, outside of celebrities.

References

1. Daelemans, W., Kestemont, M., Manjavancas, E., Potthast, M., Rangel, F., Rosso, P., Specht, G., Stamatatos, E., Stein, B., Tschuggnall, M., Wiegmann, M., Zangerle, E.: Overview of PAN 2019: Author Profiling, Celebrity Profiling, Cross-domain Authorship Attribution and Style Change Detection. In: Crestani, F., Braschler, M., Savoy, J., Rauber, A., Müller, H., Losada, D., Heinatz, G., Cappellato, L., Ferro, N. (eds.) Proceedings of the Tenth International Conference of the CLEF Association (CLEF 2019). Springer (Sep 2019)
2. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
3. Howard, J., Ruder, S.: Fine-tuned language models for text classification. CoRR abs/1801.06146 (2018), http://arxiv.org/abs/1801.06146
4. Pardo, F.M.R., Rosso, P., Koppel, M., Stamatatos, E., Inches, G.: Overview of the author profiling task at PAN 2013. In: Forner, P., Navigli, R., Tufis, D., Ferro, N. (eds.) Working Notes for CLEF 2013 Conference, Valencia, Spain, September 23-26, 2013. CEUR Workshop Proceedings, vol. 1179. CEUR-WS.org (2013), http://ceur-ws.org/Vol-1179/CLEF2013wn-PAN-RangelEt2013.pdf
5. Pennacchiotti, M., Popescu, A.: A machine learning approach to Twitter user classification. In: Adamic, L.A., Baeza-Yates, R.A., Counts, S. (eds.) Proceedings of the Fifth International Conference on Weblogs and Social Media, Barcelona, Catalonia, Spain, July 17-21, 2011. The AAAI Press (2011), http://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/view/2886
6. Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research Architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF. Springer (2019)
7. Rao, D., Yarowsky, D., Shreevats, A., Gupta, M.: Classifying latent user attributes in Twitter. In: Proceedings of the 2nd International Workshop on Search and Mining User-generated Contents, pp. 37-44. SMUC '10, ACM, New York, NY, USA (2010), http://doi.acm.org/10.1145/1871985.1871993
8. Wiegmann, M., Stein, B., Potthast, M.: Celebrity Profiling. In: Proceedings of ACL 2019 (to appear) (2019)
9. Wiegmann, M., Stein, B., Potthast, M.: Overview of the Celebrity Profiling Task at PAN 2019. In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2019)
10. Yang, X., McCreadie, R., Macdonald, C., Ounis, I.: Transfer learning for multi-language Twitter election classification. In: Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017, pp. 341-348. ASONAM '17, ACM, New York, NY, USA (2017), http://doi.acm.org/10.1145/3110025.3110059