Personal-ITY: A Novel YouTube-based Corpus for Personality Prediction in Italian

Elisa Bassignana
Dipartimento di Informatica, University of Turin
elisa.bassignana@edu.unito.it

Malvina Nissim
CLCG, University of Groningen
m.nissim@rug.nl

Viviana Patti
Dipartimento di Informatica, University of Turin
viviana.patti@unito.it


Abstract

We present a novel corpus for personality prediction in Italian, containing a larger number of authors and a different genre compared to previously available resources. The corpus is built exploiting Distant Supervision, assigning Myers-Briggs Type Indicator (MBTI) labels to YouTube comments, and can lend itself to a variety of experiments. We report on preliminary experiments on Personal-ITY, which can serve as a baseline for future work, showing that some types are easier to predict than others, and discussing the perks of cross-dataset prediction.

1 Introduction

When faced with the same situation, different humans behave differently. This is, of course, due to different backgrounds, education paths, and life experiences, but according to psychologists there is another important aspect: personality (Snyder, 1983; Parks and Guay, 2009).

Human Personality is a psychological construct aimed at explaining the wide variety of human behaviours in terms of a few, stable and measurable individual characteristics (Vinciarelli and Mohammadi, 2014).

Such characteristics are formalised in Trait Models, and there are currently two of these models that are widely adopted: Big Five (John and Srivastava, 1999) and Myers-Briggs Type Indicator (MBTI) (Myers and Myers, 1995). The first examines five dimensions (OPENNESS TO EXPERIENCE, CONSCIENTIOUSNESS, EXTROVERSION, AGREEABLENESS and NEUROTICISM) and for each of them assigns a score in a range. The second one, instead, considers 16 fixed personality types, coming from the combination of the opposite poles of 4 main dimensions (EXTRAVERT-INTROVERT, INTUITIVE-SENSING, FEELING-THINKING, PERCEIVING-JUDGING). Examples of full personality types are therefore four-letter labels such as ENTJ or ISFP.

The tests used to detect prevalence of traits include human judgements regarding semantic similarity and relations between adjectives that people use to describe themselves and others. This is because language is believed to be a prime carrier of personality traits (Schwartz et al., 2013). This aspect, together with the progressive increase of available user-generated data on social media, has prompted the task of Personality Detection, i.e., the automatic prediction of personality from written texts (Youyou et al., 2015; Argamon et al., 2009; Litvinova et al., 2016; Whelan and Davies, 2006).

Personality detection can be useful in predicting life outcomes such as substance use, political attitudes and physical health. Other fields of application are marketing, politics and psychological and social assessment.

As a contribution to personality detection in Italian, we present Personal-ITY, a new corpus of YouTube comments annotated with MBTI personality traits, and some preliminary experiments to highlight its characteristics and test its potential. The corpus is made available to the community.¹

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
¹ https://github.com/elisabassignana/Personal-ITY

2 Related Work

There exist a few datasets annotated for personality traits. For the shared tasks organised within the Workshop on Computational Personality Recognition (Celli et al., 2013), two datasets annotated with the Big Five traits were released in 2013
(Essays (Pennebaker and King, 2000) and myPersonality²) and two in 2014 (YouTube Personality Dataset (Biel and Gatica-Perez, 2013) and Mobile Phones interactions (Staiano et al., 2012)).

For the 2015 PAN Author Profiling Shared Task (Pardo et al., 2015), personality was added to gender and age in the profiling task, with tweets in English, Spanish, Italian and Dutch. These are also annotated according to the Big Five model.

Still in the Big Five landscape, Schwartz et al. (2013) collected a dataset of Facebook comments (700 million words) written by 136,000 users who shared their status updates. Interesting correlations were observed between word usage and personality traits.

Looking at data labelled with the MBTI traits, we find a corpus of 1.2M English tweets annotated with personality and gender (Plank and Hovy, 2015), and the multilingual TwiSty (Verhoeven et al., 2016). The latter is a corpus of data collected from Twitter annotated with MBTI personality labels and gender for six languages (Dutch, German, French, Italian, Portuguese and Spanish) and a total of 18,168 authors. We are interested in the Italian portion of TwiSty.

Table 1 contains an overview of the available Italian corpora labelled with personality traits. We include our own, which is described in Section 3.

   Corpus          Model       # users    Avg.
   PAN2015         Big Five        38      1,258
   TwiSty          MBTI           490     21,343
   Personal-ITY    MBTI          1048     10,585

Table 1: Summary of Italian corpora with personality labels. Avg.: average tokens per user.

Regarding detection approaches, Mairesse et al. (2007) tested the usefulness of different sets of textual features, mostly making use of SVMs.

At the PAN 2015 challenge (see above) a variety of algorithms were tested (such as Random Forests, decision trees, logistic regression for classification, and also various regression models), but overall the most successful participants used SVMs. Regarding features, participants approached the task with combinations of style-based and content-based features, as well as their combination in n-gram models (Pardo et al., 2015).

Experiments on TwiSty were performed by the corpus creators themselves using a linear SVM with word (1-2) and character (3-4) n-grams. Their results (reported in Table 2 for the Italian portion of the dataset) are obtained through 10-fold cross-validation; the model is compared to a weighted random baseline (WRB) and a majority baseline (MAJ).

   Trait    WRB      MAJ      F-score
   EI       65.54    77.88    77.78
   NS       75.60    85.78    79.21
   FT       50.31    53.95    52.13
   PJ       50.19    53.05    47.01
   Avg      60.41    67.67    64.06

Table 2: TwiSty scores from the original paper. Note that all results are reported as micro-average F-score.

² http://mypersonality.org

3 Personal-ITY

First, we explain two major choices that we made in creating Personal-ITY, namely the source of the data and the trait model. Second, we describe in detail the procedure we followed to construct the corpus. Lastly, we provide a description of the resulting dataset.

Data. YouTube is the source of data for our corpus. The decision is grounded on the fact that, compared to the more commonly collected tweets, YouTube comments can be longer, so that users are freer to express themselves without limitations. Additionally, there is a substantial amount of available data on the YouTube platform, which is easy to access thanks to the free YouTube APIs.

Trait Model. Our model of choice is the MBTI. The first benefit of this decision is that this model is easy to use in association with a Distant Supervision approach (just checking if a message contains one of the 16 personality types; see Section 3.1). Another benefit is related to the existence of TwiSty. Since both TwiSty and Personal-ITY implement the MBTI model, analyses and experiments over personality detection can also be carried out in a cross-domain setting.

Ethics Statement

Personality profiling must be carefully evaluated from an ethical point of view. In particular, personality detection often involves ethical dilemmas regarding appropriate utilization and interpretation of the prediction outcomes (Weiner and Greene, 2017). Concerns have been raised regarding the inappropriate use of these tests with respect to invasion of privacy, cultural bias and confidentiality (Mehta et al., 2019).

   Comment                                                        User - MBTI label
   Io sono ENFJ!!!                                                User1 - ENFJ
   [en: "I am ENFJ!!!"]
   Ho sempre saputo di essere connessa con Lady Gaga! ISFP!       User2 - ISFP
   [en: "I always knew I was connected with Lady Gaga! ISFP!"]
Table 3: Examples of automatic associations user - MBTI personality type.

The data included in the Personal-ITY dataset were publicly available on the YouTube platform at the time of the collection. As we will explain in detail in this Section, the information collected consists of comments published under public videos on the YouTube platform by the authors themselves. To better protect user identities, the released corpus mentions only the YouTube usernames of the authors, which are not unique identifiers. The YouTube IDs of the corresponding channels, which are the real identifiers on the platform and would allow tracing the identity of the authors, are not released. Note also that the corpus was created for academic purposes and is not intended to be used for commercial deployment or applications.

3.1 Corpus Creation

The fact that users often self-disclose information about themselves on social media makes it possible to adopt Distant Supervision (DS) for the acquisition of training data. DS is a semi-supervised method that has been abundantly and successfully used in affective computing and profiling to assign silver labels to data on the basis of indicative proxies (Go et al., 2009; Pool and Nissim, 2016; Emmery et al., 2017).

Users left comments on some videos about the MBTI theory in which they stated their own personality type (e.g. Sono ENTJ...chi altro? [en: "I'm ENTJ...anyone else?"]). We exploited such comments to create Personal-ITY with the following procedure.

First, we searched for as many Italian YouTube videos about MBTI as possible, ending up with a selection of ten videos, each with a considerable number of comments like the one above.³

Second, we retrieved all the comments to these videos using an AJAX request, and built a list of authors and their associated MBTI label. A label was associated with a user if they included an MBTI combination in one of their comments. Table 3 shows some examples of such associations. The association process is an approximation typical of DS approaches. To assess its validity, we manually checked 300 random comments to see whether the mention of an MBTI label indeed referred to the author's own personality. We found that in 19 cases (6.3%) our method led to a wrong or unsure classification of the user's personality (e.g. O tutti gli INTJ del mondo stanno commentando questo video oppure le statistiche sono sbagliate :-) [en: "Either all the INTJs in the world are commenting on this video, or the statistics are wrong :-)"]). We can therefore assume that our dataset might contain about 6-7% of noisy labels.

Using the acquired list of authors, we aimed to obtain as many comments as possible written by them. The YouTube API, however, does not allow retrieving all comments by one user on the platform. In order to get around this problem we relied on video similarities, and tried to expand our video collection as much as possible. Therefore, as a third step, we retrieved the list of channels that feature our initial 10 videos, and then all of the videos within those channels.

Fourth, through a second AJAX request, we downloaded all comments appearing below all videos retrieved through the previous step.

Lastly, we filtered all comments, retaining those written by authors included in our original list. This obviously does not cover all comments by a relevant user, but it provided us with additional data per author.

³ Links to the 10 YouTube videos:
https://www.youtube.com/watch?v=VCo9RlDRpz0
https://www.youtube.com/watch?v=N4kC8iqUNyk
https://www.youtube.com/watch?v=Z8S8PgW8t2U
https://www.youtube.com/watch?v=wHZOG8k7nSw
https://www.youtube.com/watch?v=lO2z3_DINqs
https://www.youtube.com/watch?v=NaKPl_y1JXg
https://www.youtube.com/watch?v=8l4o4VBXlGY
https://www.youtube.com/watch?v=GK5J6PLj218
https://www.youtube.com/watch?v=9P95dkVLmps
https://www.youtube.com/watch?v=g0ZIFNgUmoE

3.2 Final Corpus Statistics

For the final dataset, we decided to keep only the authors with a sufficient amount of data. More specifically, we retained only users with at least five comments, each at least five tokens long.

Personal-ITY includes 96,815 comments by 1048 users, each annotated with an MBTI label. The average number of comments per user is 92 and each message has on average 115 tokens.

Figure 1: Distribution of the 16 personality types in the YouTube corpus and in the Italian section of TwiSty.
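The label assignment of Section 3.1 and the filtering step of Section 3.2 can be sketched as follows. This is a minimal illustration and not the code used to build the corpus: the function names, the case-sensitive regular expression and the whitespace tokenisation are assumptions made for the example, and the requirement of at least five comments of at least five tokens is read here as counting only comments with five or more tokens.

```python
import re
from collections import defaultdict
from itertools import product

# The 16 MBTI types: one letter per dimension (E/I, N/S, F/T, P/J).
MBTI_TYPES = {"".join(letters) for letters in product("EI", "NS", "FT", "PJ")}

# Assumption: a type counts only as a standalone upper-case token.
MBTI_RE = re.compile(
    r"(?<![A-Za-z])(" + "|".join(sorted(MBTI_TYPES)) + r")(?![A-Za-z])"
)


def assign_labels(comments):
    """Map each author to the first MBTI type found in their comments.

    `comments` is an iterable of (author, text) pairs.
    """
    labels = {}
    for author, text in comments:
        match = MBTI_RE.search(text)
        if match and author not in labels:
            labels[author] = match.group(1)
    return labels


def filter_authors(comments, labels, min_comments=5, min_tokens=5):
    """Keep labelled authors with enough sufficiently long comments.

    Tokens are approximated by whitespace splitting (an assumption).
    """
    kept = defaultdict(list)
    for author, text in comments:
        if author in labels and len(text.split()) >= min_tokens:
            kept[author].append(text)
    return {a: texts for a, texts in kept.items() if len(texts) >= min_comments}


# Toy input built from the examples in Table 3.
toy = [
    ("User1", "Io sono ENFJ!!!"),
    ("User2", "Ho sempre saputo di essere connessa con Lady Gaga! ISFP!"),
]
print(assign_labels(toy))
```

Under this rule a comment such as "O tutti gli INTJ del mondo stanno commentando questo video..." would also yield the label INTJ for its author, which is the kind of noisy label discussed in Section 3.1.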
The 16 personality types are not uniformly represented in the corpus. Figure 1 shows their distribution and also compares it with the one in TwiSty. The unbalanced distribution can be due to personality types not being uniformly distributed in the population, and to the fact that different personality types can make different choices about their online presence. Goby (2006), for example, observed that there is a significant correlation between online–offline choices and the MBTI dimension of EXTRAVERT-INTROVERT: extroverts are more likely to opt for offline modes of communication, while online communication is presumably easier for introverts. In Figure 1, we also see that the four most frequent types are introverts in both datasets. The conclusion is that, despite the different biases, collecting linguistic data in this way has the advantage that it reflects actual language use and allows large-scale analysis (Plank and Hovy, 2015).

Figure 2 shows in more detail, trait by trait, the distribution of the opposite poles among the users in Personal-ITY and in TwiSty. As we might have expected, in line with what is observed in Figure 1, the two datasets present very similar trends. Such similarities between Personal-ITY and TwiSty are a further confirmation of the reliability of the data we collected.
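The trait-by-trait comparison amounts to counting, for each of the four dimensions, how many users fall on each pole. A minimal sketch of that computation is given below; the function name and the toy labels are illustrative assumptions and do not reproduce corpus figures.

```python
from collections import Counter

# Position of each dimension inside a four-letter MBTI label.
DIMENSIONS = ("EI", "NS", "FT", "PJ")


def pole_shares(labels):
    """Return, for each dimension, the share of users at each pole."""
    if not labels:
        raise ValueError("at least one label is required")
    shares = {}
    for position, dimension in enumerate(DIMENSIONS):
        counts = Counter(label[position] for label in labels)
        total = sum(counts[pole] for pole in dimension)
        shares[dimension] = {pole: counts[pole] / total for pole in dimension}
    return shares


# Toy labels, not corpus data.
print(pole_shares(["INFP", "INTJ", "ENFP", "ISFP"]))
```

Running the same function on the label lists of two corpora gives directly comparable per-pole shares.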
Figure 2: Comparison of the distributions of the four MBTI traits between Personal-ITY and the Italian part of TwiSty. Panels: (a) Extravert - Introvert, (b) Sensing - Intuitive, (c) Thinking - Feeling, (d) Judging - Perceiving.

4 Preliminary Experiments

We ran a series of preliminary experiments on Personal-ITY which can also serve as a baseline for future work on this dataset. We pre-processed texts by replacing hashtags, URLs, usernames and emojis with four corresponding placeholders. We adopted the sklearn (Pedregosa et al., 2011) implementation of a linear SVM (the LinearSVC class), with standard parameters. We tested three types of features. At the lexical level, we experimented with word (1-2) and character (3-4) n-grams, both as raw counts and tf-idf weighted. Character n-grams were also tested with a word-boundary option. At a more stylistic level, we considered the use of emojis, hashtags, pronouns, punctuation and capitalisation. Lastly, we also experimented with embedding-based representations, using on the one hand YouTube-specific pre-trained models (Nieuwenhuis and Nissim, 2019) and on the other hand more generic embeddings, such as the Italian version of GloVe (Pennington et al., 2014), which is trained on the Italian Wikipedia.⁴ We looked for all the available embeddings of the words written by each author, and used their average as a feature. If a word appeared more than once in the string of comments, we counted it multiple times in the final average.

   Trait    MAJ      Lex      Sty      Emb      FL
   EI       40.55    51.85    40.46    40.55    51.65
   NS       44.34    51.92    44.34    44.34    49.04
   FT       35.01    50.67    36.27    35.01    50.86
   PJ       29.49    50.53    51.04    47.06    51.03
   Avg      37.35    51.24    43.03    41.74    50.65

Table 4: Results of the experiments on Personal-ITY. FL: prediction of the full MBTI label at once, with a character n-gram model.

            micro F            macro F
   Trait    MAJ      Lex       MAJ      Lex      Sty      Emb
   EI       77.75    79.18     43.69    55.23    43.69    43.69
   NS       85.92    85.92     46.15    46.15    46.15    46.15
   FT       53.67    55.31     34.79    52.98    35.34    34.70
   PJ       53.06    54.08     34.56    53.01    35.20    34.90
   Avg      67.6     68.62     39.80    51.84    40.09    39.86

Table 5: Results of our experiments on TwiSty.
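The pre-processing and the embedding averaging described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the experiments: the placeholder strings, the regular expressions (in particular the rough emoji ranges), the whitespace tokenisation and the dictionary-based embedding lookup are all hypothetical choices made for the example.

```python
import re

# Hypothetical patterns; the actual ones used in the experiments are not specified.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# Rough emoji ranges (pictographs, emoticons, misc symbols); not exhaustive.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def preprocess(text):
    """Replace URLs, usernames, hashtags and emojis with placeholders."""
    text = URL_RE.sub(" URL ", text)  # URLs first: they may contain '#' or '@'
    text = USER_RE.sub(" USERNAME ", text)
    text = HASHTAG_RE.sub(" HASHTAG ", text)
    text = EMOJI_RE.sub(" EMOJI ", text)
    return " ".join(text.split())  # normalise whitespace


def average_embedding(tokens, vectors):
    """Average the vectors of all in-vocabulary tokens.

    Repeated tokens contribute once per occurrence, as described above.
    Returns None if no token has an embedding.
    """
    found = [vectors[token] for token in tokens if token in vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]


# Toy two-dimensional vectors, not real embeddings.
toy_vectors = {"ciao": [1.0, 0.0], "mondo": [0.0, 1.0]}
print(average_embedding("ciao ciao mondo".split(), toy_vectors))
```

In this sketch a word with no available embedding is simply skipped, so the average is taken over the covered tokens only.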
We used 10-fold cross-validation, and assessed the models using macro F-score. Note that the original TwiSty paper uses micro F-score. Thus, for the sake of comparison, we also include micro-F in Table 5 for the MAJ baseline and our lexical n-gram model. Table 4 shows the results of our experiments with different feature types.⁵ Overall, lexical features (n-grams) perform best. Combining different feature types did not lead to any improvement. Classification was performed with four separate binary classifiers (one per dimension), and with one single classifier predicting all four dimensions, i.e., the whole MBTI label, at once. In the latter case, we observe that the results are quite high considering the increased difficulty of the task. Table 5 reports the scores of our models on TwiSty. As for Personal-ITY, the best results were achieved using lexical features (tf-idf n-grams); stylistic features and embeddings are just above the baseline. Our model outperforms the one in Verhoeven et al. (2016) for all traits (micro-F).

To test compatibility of resources and to assess model portability, we also ran cross-domain experiments on Personal-ITY and TwiSty. In the first setting, we tested the effect of merging the two datasets on the performance of models for personality detection, maintaining the 10-fold cross-validation setting and using the model that performs better on average for YouTube and Twitter data (a character n-gram model). Table 6 contains the results of these experiments.⁶ Scores are almost always lower than in the in-domain experiments (except for NS with respect to the Twitter scores reported in Table 5: 46.15 → 48.31), but considerably higher than the majority baseline.

   Trait    MAJ      Lex
   EI       41.64    50.57
   NS       44.93    48.31
   FT       35.04    51.31
   PJ       30.66    48.24
   Avg      38.07    49.61

Table 6: Merging Personal-ITY with TwiSty.

⁴ https://hlt.isti.cnr.it/wordembeddings
⁵ In Tables 4–5, we report the highest scores based on averages of the four traits. Considering the dimensions individually, better results can be obtained by using specific models.
⁶ Prediction of the full label at once.

In the second setting, instead, we divided both corpora into fixed training and test sets with a proportion of 80/20 and ran the models using lexical features, in order to run a cross-domain experiment. For direct comparison, we ran the model in-domain again using this split. Results are shown
in Table 7. Cross-domain scores are obtained with the best in-domain model.⁷ They drop substantially compared to in-domain, but remain above the baseline on average.

   Train    Personal-ITY                TwiSty
   Test     IN       CROSS              IN       CROSS
            Pers     MAJ      Twi       Twi      MAJ      Pers
   EI       58.94    44.94    49.33     55.66    44.59    44.59
   NS       52.88    47.87    47.31     47.87    45.31    45.31
   FT       49.20    37.58    47.09     65.26    39.13    51.04
   PJ       54.43    32.41    32.50     56.87    36.56    38.54
   Avg      53.86    40.70    44.06     56.42    41.40    44.87

Table 7: Results of the cross-domain experiments. MAJ = baseline on the cross-domain test set.

⁷ Better results can be obtained with other specific models.

5 Conclusions

The experiments show that there is no single best model for personality prediction, as the feature contribution depends on the dimension considered, and on the dataset. Lexical features perform best, but they tend to be strictly related to the context in which the model is trained and so to overfit.

The inherent difficulty of the task itself is confirmed and deserves further investigation, as assigning a definite personality is an extremely subjective and complex task, even for humans.

Personal-ITY is made available to further investigate the above and other issues related to personality detection in Italian. The corpus can lend itself to a psychological analysis of the linguistic cues for the MBTI personality traits. Along this line, it is interesting to investigate the presence of evidence linking linguistic features with psychological theories about the four considered dimensions (EXTRAVERT-INTROVERT, INTUITIVE-SENSING, FEELING-THINKING, PERCEIVING-JUDGING). First results in this direction are presented in Bassignana et al. (2020).

Acknowledgments

The work of Elisa Bassignana was partially carried out at the University of Groningen within the framework of the Erasmus+ program 2019/20.

References

Shlomo Argamon, Moshe Koppel, James W. Pennebaker, and Jonathan Schler. 2009. Automatically profiling the author of an anonymous text. Commun. ACM, 52(2):119–123, February.

Elisa Bassignana, Malvina Nissim, and Viviana Patti. 2020. Personal-ITY: a YouTube Comments Corpus for Personality Profiling in Italian Social Media. In Viviana Patti, Malvina Nissim, and Barbara Plank, editors, Proceedings of the Third Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media, (PEOPLES@COLING 2020). Association for Computational Linguistics.

Joan-Isaac Biel and Daniel Gatica-Perez. 2013. The youtube lens: Crowdsourced personality impressions and audiovisual analysis of vlogs. Multimedia, IEEE Transactions on, 15(1):41–55.

Fabio Celli, Fabio Pianesi, David Stillwell, and Michal Kosinski. 2013. Workshop on computational personality recognition: Shared task. In Seventh International AAAI Conference on Weblogs and Social Media.

Chris Emmery, Grzegorz Chrupała, and Walter Daelemans. 2017. Simple queries as distant labels for predicting gender on Twitter. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 50–55, Copenhagen, Denmark, September. Association for Computational Linguistics.

Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using distant supervision. CS224N project report, Stanford, 1(12):2009.

Valerie Goby. 2006. Personality and Online/Offline Choices: MBTI Profiles and Favored Communication Modes in a Singapore Study. Cyberpsychology & behavior : the impact of the Internet, multimedia and virtual reality on behavior and society, 9:5–13, 03.

Oliver P. John and Sanjay Srivastava. 1999. The big five trait taxonomy: History, measurement, and theoretical perspectives. In L. A. Pervin and O. P. John, editors, Handbook of personality: Theory and research, pages 102–138. Guilford Press.

Tatiana Litvinova, P. Seredin, Olga Litvinova, and Olga Zagorovskaya. 2016. Profiling a set of personality traits of text author: What our words reveal about us. Research in Language, 14, 12.

François Mairesse, Marilyn A. Walker, Matthias R. Mehl, and Roger K. Moore. 2007. Using linguistic cues for the automatic recognition of personality in conversation and text. Journal of Artificial Intelligence Research, 30:457–500, September.

Yash Mehta, Navonil Majumder, Alexander Gelbukh, and Erik Cambria. 2019. Recent trends in deep learning based personality detection. Artificial Intelligence Review, pages 1–27.
I.B. Myers and P.B. Myers. 1995. Gifts Differing: Understanding Personality Type. Mobius.

Moniek Nieuwenhuis and Malvina Nissim. 2019. The Contribution of Embeddings to Sentiment Analysis on YouTube. In Proceedings of the Sixth Italian Conference on Computational Linguistics, Bari, Italy, November 13-15, 2019, volume 2481 of CEUR Workshop Proceedings. CEUR-WS.org.

Francisco M. Rangel Pardo, Fabio Celli, Paolo Rosso, Martin Potthast, Benno Stein, and Walter Daelemans. 2015. Overview of the 3rd Author Profiling Task at PAN 2015. In Working Notes of CLEF 2015 - Conference and Labs of the Evaluation forum, Toulouse, France, September 8-11, 2015, volume 1391 of CEUR Workshop Proceedings. CEUR-WS.org.

Laura Parks and Russell P Guay. 2009. Personality, values, and motivation. Personality and individual differences, 47(7):675–684.

Mark Snyder. 1983. The influence of individuals on situations: Implications for understanding the links between personality and social behavior. Journal of personality, 51(3):497–516.

Jacopo Staiano, Bruno Lepri, Nadav Aharony, Fabio Pianesi, Nicu Sebe, and Alex Pentland. 2012. Friends don’t lie - inferring personality traits from social network structure. In UbiComp’12 - Proceedings of the 2012 ACM Conference on Ubiquitous Computing, pages 321–330, 09.

Ben Verhoeven, Walter Daelemans, and Barbara Plank. 2016. TwiSty: A multilingual Twitter stylometry corpus for gender and personality profiling. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 1632–1637, Portorož, Slovenia, May. European Language Resources Association (ELRA).

Alessandro Vinciarelli and Gelareh Mohammadi. 2014. A survey of personality computing. IEEE
                                                           Transactions on Affective Computing, 5(3):273–291.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel,
   B. Thirion, O. Grisel, M. Blondel, P. Pretten-        Irving B. Weiner and Roger L. Greene, 2017. Ethi-
   hofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Pas-      cal Considerations In Personality Assessment, chap-
   sos, D. Cournapeau, M. Brucher, M. Perrot, and           ter 4, pages 59–74. Wiley.
   E. Duchesnay. 2011. Scikit-learn: Machine learn-
   ing in Python. Journal of Machine Learning Re-        Susan Whelan and Gary Davies. 2006. Profiling con-
   search, 12:2825–2830.                                   sumers of own brands and national brands using hu-
                                                           man personality. Journal of Retailing and Consumer
James Pennebaker and Laura King. 2000. Linguis-            Services, 13(6):393–402.
  tic styles: Language use as an individual differ-
                                                         Wu Youyou, Michal Kosinski, and David Stillwell.
  ence. Journal of personality and social psychology,
                                                          2015. Computer-based personality judgments are
  77:1296–312, 01.
                                                          more accurate than those made by humans. Pro-
Jeffrey Pennington, Richard Socher, and Christo-          ceedings of the National Academy of Sciences,
   pher D. Manning. 2014. Glove: Global vectors for       112(4):1036–1040.
   word representation. In Empirical Methods in Nat-
   ural Language Processing (EMNLP), pages 1532–
   1543.

Barbara Plank and Dirk Hovy. 2015. Personality traits
  on Twitter—or—How to get 1,500 personality tests
  in a week. In Proceedings of the 6th Workshop
  on Computational Approaches to Subjectivity, Senti-
  ment and Social Media Analysis, pages 92–98, Lis-
  boa, Portugal, September. Association for Computa-
  tional Linguistics.

Chris Pool and Malvina Nissim. 2016. Distant su-
  pervision for emotion detection using Facebook re-
  actions. In Proceedings of the Workshop on Com-
  putational Modeling of People’s Opinions, Person-
  ality, and Emotions in Social Media (PEOPLES),
  pages 30–39, Osaka, Japan, December. The COL-
  ING 2016 Organizing Committee.

H. Andrew Schwartz, Johannes C. Eichstaedt, Mar-
  garet L. Kern, Lukasz Dziurzynski, Stephanie M.
  Ramones, Megha Agrawal, Achal Shah, Michal
  Kosinski, David Stillwell, Martin E.P. Seligman,
  et al. 2013. Personality, gender, and age in the
  language of social media: The open-vocabulary ap-
  proach. PloS one, 8(9):e73791.