Evaluation Metrics for Inferring Personality from Text

David N. Chin
University of Hawai‘i at Mānoa
Dept. of Information and Computer Sciences
1680 East-West Road, POST 317
Honolulu, HI 96822 USA
chin@hawaii.edu

William R. Wright
University of Hawai‘i at Mānoa
Dept. of Information and Computer Sciences
1680 East-West Road, POST 317
Honolulu, HI 96822 USA
wrightwr@hawaii.edu

1. INTRODUCTION
There have been a rich variety of studies of the relationship between speaker or author language usage and human personality. With each additional effort, the hope is that we will better identify text features consistently predictive of personality across domains, and identify the appropriate modeling techniques to convert those features into predictions about personality. However, because each study uses different data from often very different domains, it is impossible to directly compare personality prediction algorithms.

1.1 Features
The community has examined a broad variety of text features: LIWC categories [5, 9], MRC categories (which place words in emotion, perception, cognition, and communication categories), POS n-grams [1], proper noun marking [8], word frequency (bag-of-words), word n-grams, and various hybrid features that combine these to form meaningful structures, not to speak of the variety of personality tests employed, from a basic 10-item questionnaire to those with numerous items, as well as observer reports of personality.

What remains is (I) how to identify relevant features predictive of personality across the many different contexts, including different localities, time periods, and writing purposes, and (II) how to evaluate predictive models built on some combination of such features. Progress on these two tasks will promote models of increasing utility to practitioners even when their subjects differ from the typical research study participant. A community corpus will be quite helpful in allowing researchers to compare how well their choices of language features allow their algorithms to predict personality.

2. CORPORA
The ideal corpora for evaluating different techniques for inferring personality from text would include large amounts of text from many different contexts, including different localities, time periods, and writing purposes (e.g., emails, text messages, blogs, essays, tweets, fiction, technical writing, etc.). These texts would be associated with personality profiles, preferably with scores from the prevailing Five Factor Model, which are useful for comparing individuals, but also, when available, with the Myers-Briggs Type Indicator, which is useful for other purposes. Text from multiple different contexts is important because text from a single context would likely have some coincidental correlations with personality, influenced by current events, that would not be found in text from other contexts. For example, Pennebaker's student essays [9] show strong correlations of the word "hurricane": positively with Conscientiousness and Agreeableness and inversely with Openness. These correlations are likely an artifact of the fact that the essays were written soon after Hurricane Katrina had hit nearby, and would likely not appear in other contexts. Scores should be standardized to eliminate units, or else our evaluation metrics will differ wildly between researchers depending on the personality test used.

2.1 Features of interest
Some studies focus on building a classifier, but not on identifying which features were useful for classification. They run the model-building tool like a black box, but what is really interesting is what is inside. Announcing which language features are most predictive of personality for their dataset would be more interesting than, say, the classification accuracy they obtain.

For the good of the broader community, care should be taken to identify and announce the features f believed to be associated with personality, accompanied by the frequency mean m, Pearson's correlation coefficient ρ_x, and the p-value p_x expressing h, the probability of the null hypothesis in the presence of the current feature. Of course, given the large number of features that serve as candidates for personality prediction, and the oft-witnessed sparsity of text data, some features that initially look promising will just turn out to be noise, regardless of one's filtering method. This is just the right moment for comparison with prior research. By looking up or computing the aforementioned statistics from preexisting corpora, researchers can adjust h to account for prior appearances of the features f. Authors should address what to infer, if anything, from a feature's absence in any of the corpora.

This task is distinct from feature selection, which some modeling techniques require as a preprocessing step. The feature selection process often arbitrarily selects a single instance from a group of collinear features, discarding the rest. In that way relevant features are excluded from consideration by an arbitrary ordering imposed by a selection algorithm, an ordering that may be effectively random: perhaps a result of a randomizer seed, computer hardware differences, or idiosyncrasies of implementation. Should researchers report the resulting feature list without explicit allowances made for these issues, some confusion may result as to why some features are present, and why others (perhaps present in other studies) are missing.
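To make this reporting format concrete, the following sketch computes the suggested per-feature statistics (frequency mean m, Pearson's correlation coefficient, and p-value) against a single trait. It is a minimal illustration only: the function name, the documents-by-features frequency matrix, and the standardized trait-score vector are our assumptions for the example, not artifacts of any particular study.

import numpy as np
from scipy import stats

def feature_report(freqs, trait, feature_names):
    """Per-feature statistics for one personality trait.

    freqs:         documents x features matrix of relative frequencies (assumed layout)
    trait:         standardized trait score for each document's author
    feature_names: one name per column of freqs
    """
    rows = []
    for j, name in enumerate(feature_names):
        x = freqs[:, j]
        m = float(np.mean(x))               # frequency mean m
        r, p = stats.pearsonr(x, trait)     # correlation with the trait, p-value under the null
        rows.append({"feature": name, "mean": m, "r": float(r), "p": float(p)})
    # Sorting by p-value is only for presentation; every candidate feature is still
    # reported, so later studies can adjust for prior appearances of the same features.
    return sorted(rows, key=lambda row: row["p"])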
2.2 Predictive modeling
The corpora should be pre-divided into multiple training and test sets to make it easier to compare different classification or score prediction algorithms. The purpose of these sets is as follows:

• Training set. Feature selection, data for training regression or classification models, checking their accuracy and tuning the algorithms, tuning parameters, and re-checking. How the training set might be further subdivided into feature selection, training, and validation subsets is left to the discretion of individual researchers.

• Test set. For a last and final test of the model created from the training set. None of the activities mentioned above should take place after this event. Nothing from the test set should be used for training. Most importantly, no feature selection should be performed using the testing set.^1

^1 Those who disregard this and choose to perform automated feature selection and train classifiers on the same observations should consider that massive overfitting will likely occur. A classifier trained on the "best" 300 features drawn from 600,000 may be overfitting most of those features, undermining external validity.

The division into training and test sets should not be purely random: care should be taken that common text-analysis features are approximately evenly distributed across the two sets and that the relative distribution of personality traits is also approximately even. That is, the test set should be representative of the corpora in its distribution of language features, personality, and any other demographic markers like gender, age, and location. Also, multiple such pairs of training and test sets should be prepared and ordered, so that researchers without enough time or resources to repeat their analyses for all training/test partitions can compare results for the designated first partition with all other researchers. We would recommend 5 to 10 different divisions into training and test sets. There is also a question about the relative sizes of the training vs. test sets. With a large enough corpus, we believe a 75% training and 25% test split would be a good division.
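As a sketch of how such a pre-division might be produced, the code below draws several fixed train/test partitions, stratifying on a coarse high/low discretization of the five trait scores so that the personality distribution is roughly even across the two sets. The 75%/25% sizes and the 5-partition default follow the recommendation above; the data layout, the discretization key, and the use of scikit-learn's train_test_split are our own illustrative choices rather than a prescribed procedure.

import numpy as np
from sklearn.model_selection import train_test_split

def make_partitions(texts, trait_scores, n_partitions=5, test_size=0.25, seed=0):
    """Draw several fixed train/test partitions of an authors x 5 trait-score matrix.

    Stratifies on a coarse high/low key per author (assumes each key value occurs
    at least twice); balancing of language features would still need to be checked
    separately after splitting."""
    trait_scores = np.asarray(trait_scores)
    means = trait_scores.mean(axis=0)
    # e.g. "10110" = above the mean on the 1st, 3rd, and 4th traits
    keys = ["".join("1" if s >= m else "0" for s, m in zip(row, means))
            for row in trait_scores]
    partitions = []
    for k in range(n_partitions):
        train_idx, test_idx = train_test_split(
            np.arange(len(texts)), test_size=test_size,
            random_state=seed + k, stratify=keys)
        partitions.append((train_idx, test_idx))
    return partitions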
Needless to say, the corpora should be anonymized to protect the privacy of the writers. It may be useful to include gender, time period, and broad geographical location tags with the corpora. Further steps to protect against de-anonymization attacks might include replacing all names with generic markers like NAME1, PLACE2, etc. via named entity recognition.

3. METRICS
For personality prediction, some applications will require classifying users as high/low (above/below the mean) on the five personality dimensions. For other applications, it may be more useful to predict 3 classes: high (one standard deviation above the mean), low (one standard deviation below the mean), and medium (between high and low). Finally, regression models are useful for applications that require more fine-grained prediction of personality values.

Binary and 3-class classification algorithms should report percentage accuracy for each personality trait. Also, for each class within each personality trait, report precision and recall rates. When researchers are able to analyze multiple training/test partitions, the average and standard deviation over all partitions should be reported in addition to the individual partition results. Regression models should report both root mean squared error (RMSE) and mean absolute error (MAE), since MAE is often less sensitive than RMSE to infrequent occurrences of very large errors.

For binary and 3-class classification algorithms, [11] recommend testing the significance of improvements in classification accuracy with the Binomial test. They argue that a t-test is simply the wrong test for comparing classifiers because the t-test assumes that the test sets for each "treatment" (each algorithm) are independent, and when two algorithms are compared on the same data set, the test sets are obviously not independent. Instead they recommend using the Binomial test to compare the number of examples that algorithm A got right and algorithm B got wrong versus the number of examples that algorithm A got wrong and algorithm B got right, ignoring examples that both got right or both got wrong. To apply the Binomial test, researchers should report in an online form which entries in each test set were classified correctly/incorrectly, to allow for proper significance comparisons with future/past classification algorithms.

An alternative approach to the statistical significance of classification algorithm differences is given by [2]. They recommend averaging the t-values from a paired Student t-test of each training/test partition and converting this average to a significance value. This allows discounting of the effects of a single partition versus multiple partitions. To allow comparisons with past/future classification algorithms, researchers should report t-values for each training and test partition of the corpus.

For regression models, [3] recommend pairwise comparisons of RMSE using a test proposed by [6]. To allow comparisons with past/future regression models, researchers should report RMSE for each entry in their test set(s).
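To make the recommended classifier comparison concrete, the sketch below applies the Binomial (sign) test to the per-example correct/incorrect records of two algorithms evaluated on the same test set, counting only the discordant examples as [11] describe. The boolean record format is our assumption about what the shared online form would contain, and scipy.stats.binomtest is simply one readily available implementation of the test.

from scipy.stats import binomtest

def binomial_comparison(correct_a, correct_b):
    """Two-sided Binomial (sign) test on the examples where exactly one classifier is right.

    correct_a, correct_b: sequences of booleans, one entry per test example,
    as would be reported in the shared online form described above."""
    a_only = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    b_only = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = a_only + b_only          # examples both algorithms got right or wrong are ignored
    if n == 0:
        return 1.0               # no discordant examples, hence no evidence of a difference
    # Under the null hypothesis the discordant examples split 50/50 between the algorithms.
    return binomtest(a_only, n=n, p=0.5, alternative="two-sided").pvalue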
4. CONCLUSION
An established corpus with detailed reporting requirements will allow researchers to much more easily compare their algorithms for inferring personality from text. However, there will always be a need to extend the corpora to increase the coverage of different types of writing, time periods, and localities. If the shared corpus is viewed merely as a benchmark set, we risk overfitting the benchmark. Therefore we recommend a series of corpora, perhaps one every few years, to keep adding new data to the community.

5. REFERENCES
[1] Shlomo Argamon, Casey Whitelaw, Paul Chase, Sobhan Raj Hota, Navendu Garg, and Shlomo Levitan. Stylistic text classification using functional lexical features. Journal of the American Society for Information Science and Technology, 58(6):802–822, 2007.
[2] Jeffrey P. Bradford and Carla E. Brodley. The effect of instance-space partition on significance. Machine Learning, 42(3):269–286, 2001.
[3] A. Feelders and W. Verkooijen. On the statistical comparison of inductive learning methods. In D. Fisher and H.-J. Lenz (Eds.), Learning from Data: Artificial Intelligence and Statistics V, pages 271–279. Springer-Verlag, 1996.
[4] Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. The Journal of Machine Learning Research, 4:933–969, 2003.
[5] J. Golbeck, C. Robles, M. Edmondson, and K. Turner. Predicting personality from Twitter. In Privacy, Security, Risk and Trust (PASSAT), 2011 IEEE Third International Conference on, and 2011 IEEE Third International Conference on Social Computing (SocialCom), pages 149–156. IEEE, 2011.
[6] Y. Hochberg and A. C. Tamhane. Multiple Comparison Procedures. John Wiley & Sons, Inc., New York, NY, USA, 1987.
[7] F. Mairesse, M. A. Walker, M. R. Mehl, and R. K. Moore. Using linguistic cues for the automatic recognition of personality in conversation and text. Journal of Artificial Intelligence Research, 30(1):457–500, 2007.
[8] J. Oberlander and S. Nowson. Whose thumb is it anyway? Classifying author personality from weblog text. In Proceedings of the COLING/ACL Main Conference Poster Sessions, pages 627–634. Association for Computational Linguistics, 2006.
[9] J. W. Pennebaker and L. A. King. Linguistic styles: Language use as an individual difference. Journal of Personality and Social Psychology, 77(6):1296, 1999.
[10] A. Roshchina, J. Cardiff, and P. Rosso. User profile construction in the TWIN personality-based recommender system. Sentiment Analysis where AI meets Psychology (SAAIP), page 73, 2011.
[11] Steven L. Salzberg and Usama Fayyad. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, pages 317–328, 1997.
[12] Robert E. Schapire. A brief introduction to boosting. In IJCAI, volume 99, pages 1401–1406, 1999.