Evaluation Metrics for Inferring Personality from Text

David N. Chin
University of Hawai‘i at Mānoa
Dept. of Information and Computer Sciences
1680 East-West Road, POST 317
Honolulu, HI 96822 USA
chin@hawaii.edu

William R. Wright
University of Hawai‘i at Mānoa
Dept. of Information and Computer Sciences
1680 East-West Road, POST 317
Honolulu, HI 96822 USA
wrightwr@hawaii.edu

1. INTRODUCTION
There have been a rich variety of studies of the relationship between speaker or author language usage and human personality. With each additional effort, the hope is that we will better identify text features consistently predictive of personality across domains, and identify the appropriate modeling techniques to convert those features into predictions about personality. However, because each study uses different data from often very different domains, it is impossible to directly compare personality prediction algorithms.

1.1 Features
The community has examined a broad variety of text features: LIWC categories [5, 9], MRC categories (which place words in emotion, perception, cognition, and communication categories), POS n-grams [1], proper noun marking [8], word frequency (bag-of-words), word n-grams, and various hybrid features that combine these to form meaningful structures, not to speak of the variety of personality tests employed, from a basic 10-item questionnaire to those with numerous items, as well as observer reports of personality.

What remains is (I) how to identify relevant features predictive of personality across the many different contexts, including different localities, time periods, and writing purposes, and (II) how to evaluate predictive models built on some combination of such features. Progress on these two tasks will promote models of increasing utility to practitioners even when their subjects differ from the typical research study participant. A community corpus will be quite helpful in allowing researchers to compare how well their choices of language features allow their algorithms to predict personality.

2. CORPORA
The ideal corpora for evaluating different techniques for inferring personality from text would include large amounts of text from many different contexts, including different localities, time periods, and writing purposes (e.g., emails, text messages, blogs, essays, tweets, fiction, technical writing, etc.). These texts would be associated with personality profiles, preferably with scores from the prevailing Five Factor Model, which are useful for comparing individuals, but also, when available, with the Myers-Briggs Type Indicator, which is useful for other purposes. Text from multiple different contexts is important because text from a single context would likely have some coincidental correlations with personality, influenced by current events, that would not be found in text from other contexts. For example, Pennebaker's student essays [9] show strong correlations of the word "hurricane": positively with Conscientiousness and Agreeableness and inversely with Openness. These correlations are likely an artifact of the fact that the essays were written soon after Hurricane Katrina had hit nearby, and would likely not appear in other contexts. Scores should be standardized to eliminate units, or else our evaluation metrics will differ wildly between researchers depending on the personality test used.

2.1 Features of interest
Some studies focus on building a classifier, but not on identifying which features were useful for classification. They run the model-building tool like a black box, but what is really interesting is what is inside. Announcing which language features are most predictive of personality for their dataset would be more interesting than, say, the classification accuracy they obtain.

For the good of the broader community, care should be taken to identify and announce the features f believed to be associated with personality, accompanied by the frequency mean m, Pearson's correlation coefficient ρ_x, and the p-value p_x expressing h, the probability of the null hypothesis in the presence of the current feature. Of course, given the large number of features that serve as candidates for personality prediction, and the oft-witnessed sparsity of text data, some features that initially look promising will just turn out to be noise, regardless of one's filtering method. This is just the right moment for comparison with prior research. By looking up or computing the aforementioned statistics from preexisting corpora, researchers can adjust h to account for prior appearances of the features f. Authors should address what to infer, if anything, from a feature's absence in any of the corpora.

This task is distinct from feature selection, which some modeling techniques require as a preprocessing step. The feature selection process often arbitrarily selects a single instance from a group of collinear features, discarding the rest. In that way relevant features are excluded from consideration by an arbitrary ordering imposed by a selection algorithm, an ordering that may be effectively random: perhaps a result of a randomizer seed, computer hardware differences, or idiosyncrasies of implementation. Should researchers report the resulting feature list without explicit allowances made for these issues, some confusion may result as to why some features are present, and why others (perhaps present in other studies) are missing.
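To make this reporting format concrete, the following sketch computes the suggested per-feature statistics (frequency mean m, Pearson's correlation coefficient, and p-value) against a single trait. It is a minimal illustration only: the function name, the documents-by-features frequency matrix, and the standardized trait-score vector are our assumptions for the example, not artifacts of any particular study.

import numpy as np
from scipy import stats

def feature_report(freqs, trait, feature_names):
    """Per-feature statistics for one personality trait.

    freqs:         documents x features matrix of relative frequencies (assumed layout)
    trait:         standardized trait score for each document's author
    feature_names: one name per column of freqs
    """
    rows = []
    for j, name in enumerate(feature_names):
        x = freqs[:, j]
        m = float(np.mean(x))               # frequency mean m
        r, p = stats.pearsonr(x, trait)     # correlation with the trait, p-value under the null
        rows.append({"feature": name, "mean": m, "r": float(r), "p": float(p)})
    # Sorting by p-value is only for presentation; every candidate feature is still
    # reported, so later studies can adjust for prior appearances of the same features.
    return sorted(rows, key=lambda row: row["p"])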
2.2 Predictive modeling
The corpora should be pre-divided into multiple training and test sets to make it easier to compare different classification or score prediction algorithms. The purpose of these sets is as follows:

• Training set. Feature selection, data for training regression or classification models, checking their accuracy and tuning the algorithms, tuning parameters, and re-checking. How the training set might be further subdivided into feature selection, training, and validation subsets is left to the discretion of individual researchers.

• Test set. For a last and final test of the model created from the training set. None of the activities mentioned above should take place after this event. Nothing from the test set should be used for training. Most importantly, no feature selection should be performed using the testing set.^1

^1 Those who disregard this and choose to perform automated feature selection and train classifiers on the same observations should consider that massive overfitting will likely occur. A classifier trained on the "best" 300 features drawn from 600,000 may be overfitting most of those features, undermining external validity.

The division into training and test sets should not be purely random: care should be taken that common text-analysis features are approximately evenly distributed across the two sets and that the relative distribution of personality traits is also approximately even. That is, the test set should be representative of the corpora in its distribution of language features, personality, and any other demographic markers like gender, age, and location. Also, multiple such pairs of training and test sets should be prepared and ordered, so that researchers without enough time or resources to repeat their analyses for all training/test partitions can compare results for the designated first partition with all other researchers. We would recommend 5 to 10 different divisions into training and test sets. There is also a question about the relative sizes of the training vs. test sets. With a large enough corpus, we believe a 75% training and 25% test split would be a good division.
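As a sketch of how such a pre-division might be produced, the code below draws several fixed train/test partitions, stratifying on a coarse high/low discretization of the five trait scores so that the personality distribution is roughly even across the two sets. The 75%/25% sizes and the 5-partition default follow the recommendation above; the data layout, the discretization key, and the use of scikit-learn's train_test_split are our own illustrative choices rather than a prescribed procedure.

import numpy as np
from sklearn.model_selection import train_test_split

def make_partitions(texts, trait_scores, n_partitions=5, test_size=0.25, seed=0):
    """Draw several fixed train/test partitions of an authors x 5 trait-score matrix.

    Stratifies on a coarse high/low key per author (assumes each key value occurs
    at least twice); balancing of language features would still need to be checked
    separately after splitting."""
    trait_scores = np.asarray(trait_scores)
    means = trait_scores.mean(axis=0)
    # e.g. "10110" = above the mean on the 1st, 3rd, and 4th traits
    keys = ["".join("1" if s >= m else "0" for s, m in zip(row, means))
            for row in trait_scores]
    partitions = []
    for k in range(n_partitions):
        train_idx, test_idx = train_test_split(
            np.arange(len(texts)), test_size=test_size,
            random_state=seed + k, stratify=keys)
        partitions.append((train_idx, test_idx))
    return partitions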
Needless to say, the corpora should be anonymized to protect the privacy of the writers. It may be useful to include gender, time period, and broad geographical location tags with the corpora. Further steps to protect against de-anonymization attacks might include replacing all names with generic markers like NAME1, PLACE2, etc. via named entity recognition.

3. METRICS
For personality prediction, some applications will require classifying users as high/low (above/below the mean) on the five personality dimensions. For other applications, it may be more useful to predict 3 classes: high (one standard deviation above the mean), low (one standard deviation below the mean), and medium (between high and low). Finally, regression models are useful for applications that require more fine-grained prediction of personality values.

Binary and 3-class classification algorithms should report percentage accuracy for each personality trait. Also, for each class within each personality trait, report precision and recall rates. When researchers are able to analyze multiple training/test partitions, the average and standard deviation over all partitions should be reported in addition to the individual partition results. Regression models should report both root mean squared error (RMSE) and mean absolute error (MAE), since MAE is often less sensitive than RMSE to infrequent occurrences of very large errors.

For binary and 3-class classification algorithms, [11] recommend testing the significance of improvements in classification accuracy with the Binomial test. They argue that a t-test is simply the wrong test for comparing classifiers because the t-test assumes that the test sets for each "treatment" (each algorithm) are independent, and when two algorithms are compared on the same data set, the test sets are obviously not independent. Instead they recommend using the Binomial test to compare the number of examples that algorithm A got right and algorithm B got wrong versus the number of examples that algorithm A got wrong and algorithm B got right, ignoring examples that both got right or both got wrong. To apply the Binomial test, researchers should report in an online form which entries in each test set were classified correctly/incorrectly, to allow for proper significance comparisons with future/past classification algorithms.

An alternative approach to the statistical significance of classification algorithm differences is given by [2]. They recommend averaging the t-values from a paired Student t-test of each training/test partition and converting this average to a significance value. This allows discounting of the effects of a single partition versus multiple partitions. To allow comparisons with past/future classification algorithms, researchers should report t-values for each training and test partition of the corpus.

For regression models, [3] recommend pairwise comparisons of RMSE using a test proposed by [6]. To allow comparisons with past/future regression models, researchers should report RMSE for each entry in their test set(s).
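To make the recommended classifier comparison concrete, the sketch below applies the Binomial (sign) test to the per-example correct/incorrect records of two algorithms evaluated on the same test set, counting only the discordant examples as [11] describe. The boolean record format is our assumption about what the shared online form would contain, and scipy.stats.binomtest is simply one readily available implementation of the test.

from scipy.stats import binomtest

def binomial_comparison(correct_a, correct_b):
    """Two-sided Binomial (sign) test on the examples where exactly one classifier is right.

    correct_a, correct_b: sequences of booleans, one entry per test example,
    as would be reported in the shared online form described above."""
    a_only = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    b_only = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = a_only + b_only          # examples both algorithms got right or wrong are ignored
    if n == 0:
        return 1.0               # no discordant examples, hence no evidence of a difference
    # Under the null hypothesis the discordant examples split 50/50 between the algorithms.
    return binomtest(a_only, n=n, p=0.5, alternative="two-sided").pvalue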
4. CONCLUSION
An established corpus with detailed reporting requirements will allow researchers to much more easily compare their algorithms for inferring personality from text. However, there will always be a need to extend the corpora to increase the coverage of different types of writing, time periods, and localities. If the shared corpus is viewed merely as a benchmark set, we risk overfitting the benchmark. Therefore we recommend a series of corpora, perhaps one every few years, to keep adding new data to the community.

5. REFERENCES
[1] Shlomo Argamon, Casey Whitelaw, Paul Chase, Sobhan Raj Hota, Navendu Garg, and Shlomo Levitan. Stylistic text classification using functional lexical features. Journal of the American Society for Information Science and Technology, 58(6):802–822, 2007.
[2] Jeffrey P. Bradford and Carla E. Brodley. The effect of instance-space partition on significance. Machine Learning, 42(3):269–286, 2001.
[3] A. Feelders and W. Verkooijen. On the statistical comparison of inductive learning methods. In D. Fisher and H.-J. Lenz (Eds.), Learning from Data: Artificial Intelligence and Statistics V, pages 271–279. Springer-Verlag, 1996.
[4] Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. The Journal of Machine Learning Research, 4:933–969, 2003.
[5] J. Golbeck, C. Robles, M. Edmondson, and K. Turner. Predicting personality from Twitter. In Privacy, Security, Risk and Trust (PASSAT), 2011 IEEE Third International Conference on, and 2011 IEEE Third International Conference on Social Computing (SocialCom), pages 149–156. IEEE, 2011.
[6] Y. Hochberg and A. C. Tamhane. Multiple Comparison Procedures. John Wiley & Sons, Inc., New York, NY, USA, 1987.
[7] F. Mairesse, M. A. Walker, M. R. Mehl, and R. K. Moore. Using linguistic cues for the automatic recognition of personality in conversation and text. Journal of Artificial Intelligence Research, 30(1):457–500, 2007.
[8] J. Oberlander and S. Nowson. Whose thumb is it anyway? Classifying author personality from weblog text. In Proceedings of the COLING/ACL Main Conference Poster Sessions, pages 627–634. Association for Computational Linguistics, 2006.
[9] J. W. Pennebaker and L. A. King. Linguistic styles: Language use as an individual difference. Journal of Personality and Social Psychology, 77(6):1296, 1999.
[10] A. Roshchina, J. Cardiff, and P. Rosso. User profile construction in the TWIN personality-based recommender system. Sentiment Analysis where AI meets Psychology (SAAIP), page 73, 2011.
[11] Steven L. Salzberg and Usama Fayyad. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, pages 317–328, 1997.
[12] Robert E. Schapire. A brief introduction to boosting. In IJCAI, volume 99, pages 1401–1406, 1999.