=Paper=
{{Paper
|id=Vol-1399/paper7
|storemode=property
|title=Personality Mining from Biographical Data with the "Adjectival Marker" Technique
|pdfUrl=https://ceur-ws.org/Vol-1399/paper7.pdf
|volume=Vol-1399
|dblpUrl=https://dblp.org/rec/conf/bd/PoddarKS15
}}
==Personality Mining from Biographical Data with the "Adjectival Marker" Technique==
Personality Mining from Biographical Data with the “Adjectival Marker” Technique
Shivani Poddar, VenuMadhav Kattagoni and Navjyoti Singh
Center for Exact Humanities, IIIT Hyderabad
shivani.poddar92@gmail.com, venumadhav.kattagoni@gmail.com, singh.navjyoti@gmail.com
Abstract
The last decade has witnessed significant work in personality mining from lexical cues in social media data. Little work has yet been undertaken on extracting these lexical cues from the biographical data that populates social media. Most existing work relies on dictionary-based approaches such as LIWC, which focus primarily on function words. In this paper we introduce a novel method of personality mining from social media data, the “Adjectival Marker Technique”. The method extracts lexical features from descriptive texts (e.g. biographical data) to train a learning model that predicts the personality traits of the subject. Conceptually, it draws heavily on the last 78 years of work in lexical psychology and on the Big Five personality test. It is not only a computational variant of the foundational theories of lexical psychology, but also predicts personality with substantial accuracy, measured as agreement with the results of psychometric tests. We propose a variant of the Lexical Hypothesis from psychology, which is validated by the computational results of personality prediction achieved by the Adjectival Marker Technique discussed below. The paper also offers some insights illustrating the coherence of people's judgments about the subject's personality (virtual personality). The average prediction accuracy (i.e. agreement with the results of psychometric tests for the Big 5) was approximately: Extraversion 82.82%, Agreeableness 89.62%, Conscientiousness 92.48% and Imaginativeness/Intellect 81.67%.
Keywords: Social Computing, Psychology, User Personality Determination, Natural Language Processing, Machine Learning
1. Introduction
1.1. Motivation
Social media has become the most abundantly used means of communicating and propagating information online. Most information here is extensively descriptive of the users who channel themselves through it. It is not only the user who gives away information about himself (Goldbeck et al, 2011), but also his peers (Staiano et al, 2012). This paper mainly shows that the latter source is a highly accurate predictor of certain personality traits. The judgements of not only peers but also of people who know us remotely over time can be an important window into solving the labyrinth of our personalities. The future of social media will witness individuals choosing workplaces, friends, books, movies, products etc. in synchrony with their own personalities. The advertising industry of tomorrow will witness a transformation from “spammers” to “personalized suggestors”. This has also been noted in various discussions wherein advertisers are advised to study personalities instead of demographics (documented in the paper “Personalized Persuasion” (Hirsh et al, 2012)). The aforementioned applications are just the tip of the iceberg. Relationships have been discovered between personality and psychological disorders, job performance (Digman et al, 1990) and satisfaction (John et al, 1990), and even romantic success. An extremely dynamic field of study which also benefits from research in the area of Human Computer Interaction (HCI) is interface design. Many interface design projects revolve around modelling interfaces based on people's personality-oriented preferences. This study thus aims to help bridge the gap between biographical data and personality research. We also attempt to expedite the process of personality prediction, making it more automated instead of relying heavily on psychometric tests written by the subject.
1.2. The Big Five Personality Model
There have been several personality models (The Big Three, The Big Five, The Alternative Five, etc.) that claim to encapsulate the traits needed to effectively predict user personalities from social media data. However, out of all these models, the most robust and tested one, which has been consistent for the last few decades, is the Big Five (Big5) model (Goldberg et al, 1992). This personality model, being one of the most supported in lexical psychology research, stood out as the most resilient for research on biographical social media resources (Saucier et al, 1996). Another of the instrumental personality theories that has shaped the landscape of personality models is the set proposed by Carl Jung (Myers-Briggs Type Indicator (MBTI), Socionics, Kiersey et al, 1921). Given the paucity of data (for evaluating our model) available for personality determination via reliable psychometric tests for the Big 5 model, we decided to refer to a publicly published research dataset (available at http://www.celebritytypes.com) that abundantly provided us with the MBTI personalities for people. To bridge the gap between the MBTI (the personalities used for evaluation) and the Big 5 (the personalities predicted by our model), we used the correlations shown in Tables 1 and 2 (Capraro et al (2002), Furnham et al (1996), McCrae et al (1989)). Thus, one of the major motivations of this paper is also to draw the most effective traits (namely: Extraversion, Agreeableness, Conscientiousness and Imaginativeness) from the intersection of these two instrumental
paradigms of personality qualifiers. Hence, in the scope of this study, the traits we predict are Extraversion, Agreeableness, Conscientiousness and Imaginativeness/Intellect.
1.3. Motivation for using Biographical Data
This research builds on the confluence of two major domains: the foundational theories of the lexical hypothesis and the recent computational techniques of data modeling. Allport's personality trait names (Allport et al, 1936) led to the creation of Goldberg's adjective markers (Goldberg et al, 1992) and have ignited various studies. Goldberg et al (1990), Digman et al (1990), John et al (1990) and Ostendoff et al (1990) built on the same foundation. All of these converge on a single point: a “descriptive”, “adjectival” lexicon is the key to a person's personality. Social media today is littered with biographical or descriptive content about its more than 1.4 billion users. By tapping this reservoir of content with the principles and techniques discussed below, the paper aims to unveil a substantial part of this personality-descriptive content.
1.4. Proposed modification in the “Lexical Hypothesis of Psychology”
The theories of psychology were influenced by various revolutionary concepts, for instance the “trait”, a theoretical construct which describes a basic dimension of a person's personality (Allport, 1937). The idea of the trait gave birth to the “Lexical Hypothesis of Psychology”. The initial direction of this paper was solely governed by this exact hypothesis (worked upon by Klages, 1926/1932; Cattell, 1943; Norman, 1963; Goldberg, 1982):
“Those individual differences that are most salient and socially relevant in people's lives will eventually become encoded into their language; the more important such a difference, the more likely is it to become expressed as a single word.”
The Lexical Hypothesis has been used in its entirety in author personality prediction systems, like the one for the Greek language described by Kermanidis et al, (2012). Motivated by the same inspiration, we too expected to extract authors' personality traits from the text they wrote. This involved mobilizing huge datasets of web blogs and essays and extracting “names” from them to determine the authors' personalities. However, in the course of our study, we found that this was not as effective as the initial hypothesis proposed (Goldberg et al, 1982). The average accuracy of the initial experimentation was less than 50%, which was no better than a randomly predicted personality set.
Thus, we propose a modification of the Lexical Hypothesis in psychology which suggests that the personality of a person can be predicted from the cumulative judgements of various authors about him/her. These judgements are indicative of the respective traits of the person described, along the lines of the Big5 personality model. The “Adjectival Marker” Technique helps us unravel these judgements, and is derived from the adjectival markers of Big5 personality traits as discussed by Goldberg & Saucier (1996). The modified Lexical Hypothesis of Psychology proposed and verified in this paper is as follows:
“Those individual differences that are most salient and socially relevant in people's lives will eventually (over time) become encoded into their language as well as that of people who describe them (via the knowledge they have of them, these people could be peers, associates, friends, family members, followers etc.); the more important such a difference, the more likely is it to become expressed as a single word”.
The “Adjectival Marker Technique” introduced in this paper is most accurate when it is used to analyze the personality of the subject whom the social media resource describes, and not that of the author. We also observed that the views of different people describing the subject are coherent amongst themselves and also with the results of the psychometric tests. The average accuracy of the traits, based on the proposed hypothesis, for a series of data spread temporally and spatially in social media (as compared to the results obtained by psychometric tests) is discussed below.
1.5. Structure of the Paper
We begin in Section 2 by presenting a brief background on the Lexical Psychology theories of personality determination and related work on personality in conjunction with social media. We then present our dataset in Sections 3 and 4, and the methodology for analyzing, quantifying and modelling the biographical data content of 574 personalities in Section 5. The study goes on to describe the adjectival features used along with the machine learning techniques for classification, and demonstrates the significant improvements that the model was able to achieve over baseline classification on each personality factor. The paper presents the results in Section 6 and, in Section 7, an analysis of the study together with the methods which were instrumental in raising the accuracy of the model for each of the traits discussed earlier. We finally wrap up the paper in Section 8 with a brief discussion of the future work sparked by this study.
2. Related Work
The last few years have witnessed a considerable escalation in studies directed at mining user personalities from social media data. Those related to this work can be grouped into two categories: (i) studies which are based on lexical cues to mine the author's personality, and (ii) studies which have used social media based features to study the personality of the user.
The former category includes work by Tausczik and Pennebaker (2010), wherein they mined author personality via LIWC (Linguistic Inquiry and Word Count) approaches. Another such study used linguistic features such as function words, deictics, appraisal expressions and modal verbs to classify two of the Big Five traits, namely neuroticism and extraversion (Argamon et al, 2005). Oberlander & Nowson (2006) classified extraversion, stability, agreeableness and conscientiousness of blog authors using n-grams as features and Naive Bayes algorithms. Mairesse et al, (2007) reported a long list of correlations between Big5 personality
traits. They obtained those correlations from psychological factor analysis on a corpus of essays and audio cues (Pennebaker & King 1999) to develop a supervised system for personality recognition. Luyckx et al, (2008) built a corpus for stylometry and personality prediction from text in Dutch using n-grams of Part-Of-Speech (POS) tags and chunks as features. They used the MBTI schema in place of the Big5 (it includes 4 binary personality traits, see Briggs & Myers (1980)). Along the same lines, Iacobelli et al, (2011) used as features word n-grams extracted from a large corpus of blogs, testing different extraction settings, such as the presence/absence of stop words or inverse document frequency. They found that bi-grams, treated as Boolean features and keeping stop words, gave substantial results using Support Vector Machines (SVM) as the learning algorithm. Kermanidis (2012) followed Mairesse et al, (2007) and developed a supervised system for POS tagging in Modern Greek, based on low level linguistic features, such as Part-of-Speech tags, and psychological features, such as words associated with psychological states as in LIWC. Kermanidis (2012) also operated somewhat along the lines of the Lexical Hypothesis by mining author personalities via KMeans clustering algorithms.
Personality analysis in social media is a recent phenomenon. Substantial work here was done by Goldbeck et al, (2011), who predicted the personality of 279 Facebook users using either linguistic or social network specific features. Quercia et al, (2011) used network features to predict the personality of 335 Twitter users, using M5 rules as the learning algorithm. Various means of evaluation have been used by the above researchers, ranging from accuracy to AUC (Area Under the Curve) values, so as to establish the relative accuracies of models against each other. The above have been discussed and captured very effectively by Celli et al, (2013). One important observation which surfaces while analyzing the relevant literature is that none of the studies so far have exploited the foundational lexical hypothesis and the 'adjectival traits' suggested by Saucier et al, (1996). The work presented in this paper carves out a very different niche for itself by computing this very approach of personality adjectives, compressing the last 80 years of psychological research on the lexical front and merging it with the latest computational techniques. This confluence has yielded encouraging results, predicting traits that match those predicted by a psychometric test.
3. Datasets
3.1. Biographical Data Mobilization
The data for this study was collected by means of a Python-based crawler. We first used a simple web crawler to get a list of web pages, with the name of the respective “person” as the argument keyword to the crawler. These web pages were then filtered based on their meta-tags. To boost true positives, we only considered the pages which specified their content as “biographical” in the meta-tag descriptors. This resulted in the mobilization of a few Wikipedia resources, blog mentions and, for the most part, some very descriptive biographical websites. We then manually cleaned the noisy data to ensure entity disambiguation and remove irrelevant mentions. The same has been illustrated by means of Figure 1.
Figure 1: Data Mobilization
3.2. Personality Traits Data
The Jungian personality functions of 574 personalities were extracted from the resource for eventual evaluation (http://www.celebritytypes.com, wherein extensive cognitive functions have been used to derive the psychology of the given personalities). Since this was one of the most authentic reserves of personality listings we found, we considered it the most effective for evaluating the personalities our own model predicts. The “Adjectival Markers” that the paper is based on (as described below) are a proven indicator of the Big5 traits of personality. Thus, to evaluate our predictive model against the personalities listed for the respective subjects by this source, we scaled the Jungian Typology type to the closest traits of the Big5 using correlation factors as shown in Table 2 (Hall et al. 2009, Capraro et al. 2002, Furnham et al. 1996, McCrae et al. 1989). Table 1 shows the supporting notations of the personality systems.
Big5/Global5 | Jung/MBTI/Kiersey | Strength of Correlation
Extraversion | Introvert/Extrovert | High
Emotional Stability | Feeling/Thinking | Very Low
Conscientiousness | Judging/Perceiving | High
Accommodation/Agreeableness | Feeling/Thinking | Medium
Intellect | Sensing/Intuition | Medium-High
Table 1: Notations for Personality Models
As illustrated, 4 final personality traits were scaled (each of which had medium to high correlation with the MBTI
types) namely - Agreeableness (Accommodation - A/E), Extraversion (R/S), Conscientiousness (Orderliness - O/U) and Intellect (N/I).
Semi-Correlating Descriptions
Jung/MBTI/Kiersey | Global 5
INFP | RCUAI, RLUAI
INTP | RCUEI, RLUEI
INFJ | RCOAI, RLOAI
INTJ | RCOEI, RLOEI
ISTJ | RCOEN, RLOEN
ISFJ | RCOAN, RLOAN
ISTP | RCUEN, RLUEN
ISFP | RCUAN, RLUAN
ENFP | SCUAI, SLUAI
ENTP | SCUEI, SLUEI
ENFJ | SCOAI, SLOAI
ENTJ | SCOEI, SLOEI
ESTJ | SCOEN, SLOEN
ESFJ | SCOAN, SLOAN
ESTP | SCUEN, SLUEN
ESFP | SCUAN, SLUAN
Table 2: Correlations between Personality traits
3.3. Adjectival Marker Training Set
The adjectives mined from the biographical data were refined to extract the adjectival markers, i.e. specific adjectives descriptive of the subject of the biographical data. These adjectival markers were used as features in the final LASSO logistic regression model. The adjectival markers extracted are based on the work of Saucier & Goldberg, (1996). Table 3 provides the factor loadings of a few of the 435 adjectives (Saucier et al, 1996) on each of the five factors as discussed in their work. The order reflects the relative size (variance) of the factors (e.g. Factor II is the highest), and the sign reflects the relative size of the item subsets at each pole of the factor (e.g. the negative pole of Factor IV has more items). As a part of our study, we have condensed this table to indicate solely whether or not the adjective is descriptive of a particular trait, so as to obtain a binary matrix (for the respective 4 of the Big 5 traits mentioned above). The binary equivalent of Table 3 is shown in Table 4.
Adjectives* | II | I | IV | V | III
Sympathetic | 0.62 | 0.02 | 0.07 | 0.03 | -0.05
Kind | 0.60 | 0.07 | 0.02 | 0.00 | 0.06
Sensitive | 0.46 | -0.10 | 0.35 | 0.23 | 0.00
Rude | -0.50 | 0.08 | 0.00 | 0.06 | -0.15
Adventurous | 0.00 | 0.38 | -0.19 | 0.10 | -0.04
Table 3: Factor Loadings of 5 of the 435 adjectives presented by Saucier et al (1996). (Factor I - Extraversion, Factor II - Agreeableness, Factor III - Conscientiousness, Factor IV - Emotional Stability, Factor V - Intellect/Imagination)
Adjectives | Agreeableness (Decimal*, Binary) | Conscientiousness (Decimal*, Binary) | Extraversion (Decimal*, Binary) | Imaginative (Decimal*, Binary)
Sympathetic | 0.62, 1 | -0.05, 0 | 0.02, 1 | 0.03, 1
Kind | 0.60, 1 | 0.06, 1 | 0.07, 1 | 0.00, 0
Sensitive | 0.46, 1 | 0.00, 0 | -0.10, 0 | 0.23, 1
Rude | -0.50, 0 | -0.15, 0 | 0.08, 0 | 0.06, 0
Adventurous | 0.00, 0 | -0.04, 0 | 0.38, 1 | 0.10, 1
Table 4: Adjectival Marker samples for various traits. Samples with values > 0 in the Saucier Goldberg table have been given a binary count of 1, while those lower than 0 have been given 0. (*Decimal Values taken from Saucier et al (1996)).
4. Biographical Data
Biographical data was mined for 574 personalities from online resources as discussed in Section 3.1. This data was divided into 2 categories: testing data and training data. Users without substantial data (fewer than 100 words) were discarded from the analysis for now. The data mining undertaken to acquire these datasets is spread across various social media resources including Wikipedia articles, blog posts, social Q&A sites and community media sites (sharing biographical book excerpts, for building datasets of word count >10,000).
4.1. Training Data
The training data set, used to mine adjectival markers, comprised the biographic data content of 283 personalities. The word count of the dataset ranged from 500 to 10,000 words. The ratio of the number of adjectives to the total number of words in the dataset ranged from 0 to 0.005.
This data content was mined by means of a Python-based web crawler, which parsed biographic websites, Wikipedia, and social media mentions.
4.2. Biographical Testing Data
The testing dataset comprised the biographic data content of a different set of 291 personalities than the ones used for training. These were mined from social media reserves like Wikimedia articles, blog posts about the respective personalities, social Q&A sites etc. The word count ranged from 100 to 10,000 words, and the ratio of the number of adjectives to the total number of words ranged from 0.0001 to 0.003.
5. Methodology
The training data (283 users) was mined for adjectival markers according to Saucier's adjectival marker list (Saucier et al, 1996). Personality traits and their adjectival markers were represented as a sparse User-Trait Adjective Matrix for each of the 4 adjectival traits to be predicted. The entries of the respective trait (say T) matrix were set to 1 if the adjectival marker occurred in the user's descriptive biographical data and 0 if it did not. Thus, each personality trait was contained in a matrix M whose rows corresponded to the adjectival features and whose columns corresponded to the users. The matrix entry Mij was a binary number which was 1 if the adjectival marker in the ith row indicated the presence of the trait T in the personality of the subject contained in the jth column of M.
To predict the binary score of a given personality feature, we then performed a LASSO logistic regression (Tibshirani et al., 1996, Meier et al., 2008) analysis in Weka (Hall et al., 2009). A variety of regression algorithms were tested, each with a 10-fold cross-validation with 10 iterations. The best result out of all algorithms was obtained using a binary classifier with Lasso regression (with 10-fold cross-validation).
Using the LASSO technique ensured that there was no overfitting because of extra adjectival features for certain traits.
Figure 2: Descriptive of the methodology
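The feature construction and classifier described in the Methodology can be illustrated with a minimal sketch. It is not the authors' implementation: the paper ran LASSO logistic regression in Weka, whereas the sketch below substitutes a small pure-Python L1-penalised logistic regression trained by proximal gradient descent. The function names, the toy biographies, their labels and the hyperparameters (`lam`, `lr`, `steps`) are all hypothetical; only the three factor loadings are copied from Table 3, and the `> 0` binarisation rule is taken from the Table 4 caption. The matrix is built with one row per user, i.e. transposed relative to the description in the text.

```python
import math
import re

# Loadings for three adjectives, copied from Table 3 (Saucier & Goldberg, 1996).
LOADINGS = {
    "sympathetic": {"agreeableness": 0.62, "extraversion": 0.02},
    "kind": {"agreeableness": 0.60, "extraversion": 0.07},
    "adventurous": {"agreeableness": 0.00, "extraversion": 0.38},
}

def binarize(loadings, trait):
    """Assumed rule (Table 4 caption): loading > 0 maps to 1, otherwise 0."""
    return {adj: int(vals[trait] > 0) for adj, vals in loadings.items()}

def marker_row(text, markers):
    """Binary features: 1 if the marker occurs in the biography, else 0."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return [int(m in tokens) for m in markers]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(v, t):
    """Proximal operator of the L1 penalty; shrinks small weights to exactly 0."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def fit_l1_logistic(X, y, lam=0.01, lr=0.5, steps=500):
    """L1-penalised logistic regression via proximal gradient descent.
    The bias term is left unpenalised."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        p = [sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) for row in X]
        err = [pi - yi for pi, yi in zip(p, y)]
        grad_w = [sum(e * row[j] for e, row in zip(err, X)) / n for j in range(d)]
        grad_b = sum(err) / n
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(X, w, b):
    return [int(sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) >= 0.5) for row in X]

# Hypothetical toy biographies and Agreeableness labels, for illustration only.
bios = [
    "A kind and sympathetic leader.",
    "She was kind to everyone.",
    "An adventurous explorer.",
    "He wrote about travel.",
]
labels = [1, 1, 0, 0]

markers = [adj for adj, bit in binarize(LOADINGS, "agreeableness").items() if bit == 1]
X = [marker_row(text, markers) for text in bios]
w, b = fit_l1_logistic(X, labels)
preds = predict(X, w, b)
```

A larger `lam` drives more marker weights to exactly zero, which is the property of the LASSO penalty that the text relies on to limit overfitting from surplus adjectival features.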
Since there was only a single source where the traits of major personalities are classified (i.e. celebritytypes.com), we used it to evaluate our model. We used the remaining 291 personalities for evaluation of the model. The testing biographical data was mined for adjectival trait markers and the respective traits were predicted. The results of this evaluation are discussed in detail in the next section. Figure 2, which can be found above, also illustrates the procedure defined above.
6. Results
The results of the method illustrated above are elaborated in this section. The average accuracies, compared to the personalities obtained via psychometric tests (discussed in more detail in the following section), for the four considered Big 5 traits were: Extraversion - 82.82%, Agreeableness - 89.62%, Conscientiousness - 92.48% and Imaginativeness/Intellect - 81.67%. These readings do not necessarily demonstrate the accuracy of predicting the innate personality of a person, but rather the rate of agreement with the psychometric tests. They also compare well with other techniques predicting the same; for instance, the work of Iacobelli et al, (2011), which attempted to decipher the personalities of bloggers, has an average personality prediction accuracy of around 62.5%.
Thus, this paper proposes a technique which illustrates a manifold elevation in the overall accuracy of personality prediction (as indicated by psychometric tests) via social media.
Figure 3: Average accuracy percentage of the personality traits by adjectival marker analysis
7. Discussion
The results obtained illustrate that this method is competent at predicting the personality of a person in coherence with other people's judgments about him/her. It gives substantial accuracies in predicting a person's personality, matching those obtained via psychometric tests. As an essential part of this study, we have also attempted to capture the variation in accuracy with the change in various factors, namely the word count of the corpus and the ratio of the number of adjectives to the total number of words. (Please note that the accuracies discussed here are the accuracy of the prediction as evaluated against the results of psychometric tests for the Big 5, and should not be confused with accuracies used for predicting the baseline of the universal personality of a person.) These are mainly intended to explore a threshold for word count and the adjective distribution (for the given
technique) in the document set so as to get substantial results from the Adjectival Marker Technique. The following deductions can be made for each trait:
7.1. Collective Observations
A few collective observations can be drawn from the gathered data. As indicated in Figure 4, the accuracy in predicting the traits increases with an increase in the word count of the data. We also compared the accuracy in predicting the respective traits for varying distributions of adjectives in the training dataset (Figure 5). The accuracy in predicting the traits is relatively low when the ratio of adjective count to total word count (AC/TWC) is low, and increases with a subsequent increase in the AC/TWC ratio.
Figure 4: Accuracy variation over word count of testing data
7.2. Agreeableness
The accuracy in predicting Agreeableness is relatively low (73.33%) for data with word count < 5000 words, and rises to 99.11% for big data reserves (>20,000 words). We also compared the accuracy of predicting “Agreeableness” for varying distributions of adjectives in the training dataset.
The prediction accuracy for the “Agreeableness” trait is relatively low when the ratio of the adjectival count to the total word count is low. The accuracy is 84.00% when the ratio is less than 0.001, improving to 94.18% when the ratio is between 0.001 and 0.002. Finally it rises to 95.62% when the ratio is greater than 0.003 (Figure 5). As expected, there is a consistent increase in accuracy with an increase in word count and in the ratio AC/TWC.
7.3. Conscientiousness
The accuracy in predicting Conscientiousness starts at 86.66% when the word count of the data reserves is less than 5000 words, and subsequently increases with the number of words as shown in Figure 4.
We also varied the adjective distribution along with the word count so as to obtain the respective accuracies for the same model. The accuracy varies from 88.00% when the ratio is less than 0.001, improving to 93.60% when the ratio is between 0.001 and 0.002, and finally to 95.44% when the ratio is greater than 0.003 (Figure 5). As expected, there is a consistent increase in accuracy with an increase in word count and in the ratio AC/TWC.
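The per-trait breakdowns above (accuracy by word-count range and by AC/TWC range) can be computed along the following lines. This is only a sketch under stated assumptions: the record fields, the helper name `accuracy_by_bin`, the bin edges and the four toy records are hypothetical and are not the paper's data or code.

```python
from bisect import bisect_right

def accuracy_by_bin(records, key, edges):
    """Group records by the interval of `edges` that rec[key] falls into and
    return {bin_index: accuracy} for the non-empty bins.
    Bin 0 holds values below edges[0]; bin len(edges) holds values >= edges[-1]."""
    hits, totals = {}, {}
    for rec in records:
        b = bisect_right(edges, rec[key])
        totals[b] = totals.get(b, 0) + 1
        hits[b] = hits.get(b, 0) + int(rec["predicted"] == rec["actual"])
    return {b: hits[b] / totals[b] for b in sorted(totals)}

# Hypothetical per-subject results for one trait (not taken from the paper).
records = [
    {"words": 3000, "adjectives": 2, "predicted": 1, "actual": 0},
    {"words": 4000, "adjectives": 6, "predicted": 1, "actual": 1},
    {"words": 8000, "adjectives": 20, "predicted": 1, "actual": 1},
    {"words": 9000, "adjectives": 30, "predicted": 0, "actual": 0},
]
for rec in records:
    # AC/TWC: adjective count divided by total word count.
    rec["ac_twc"] = rec["adjectives"] / rec["words"]

by_words = accuracy_by_bin(records, "words", [5000])
by_ratio = accuracy_by_bin(records, "ac_twc", [0.001, 0.002, 0.003])
```

Here `by_words` separates subjects below and above 5000 words, and `by_ratio` uses the 0.001, 0.002 and 0.003 cut points mentioned in the text; other edges can be passed in the same way.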
Figure 5: Accuracy variation over adjective distribution (AC/TWC) in testing dataset
7.4. Imaginative
The accuracy in predicting Imaginativeness varies from 93.33% at word counts lower than 5000 words up to 99.88% for big data reserves (Figure 4).
The peaks observed in the variation of accuracy for the “Imaginative” trait over the distribution of adjectives (AC/TWC) range from 85.71% accuracy for AC/TWC = 0.001, through 90.69% accuracy for AC/TWC = 0.002, to 98.42% for AC/TWC >= 0.003 (Figure 5).
7.5. Extraversion
The accuracy varies from 97.70% for word count < 5000 words and subsequently increases to 99.88% as shown in Figure 4.
The accuracy for this trait varied from 85.71% for AC/TWC = 0.001, increasing to 99.68% for AC/TWC = 0.002 and then to 99.70% for AC/TWC >= 0.003.
The correlation coefficients of word count with accuracy, and of AC/TWC with accuracy, computed for each of the above traits imply that, for “Adjectival Markers”, these quantities are highly correlated with one another. This can also be validated by the graph in Figure 6.
8. Conclusion & Future Work
By means of this study we propose a simpler yet effective method to facilitate personality extraction of people in social media. To achieve this we have also reworked some long-standing theories of Lexical Psychology and modified them with the newer concepts of machine learning models. This technique brings novelty to the widespread lexical concepts and techniques used to understand user personality in biographical data reserves. It is a significant contribution to the field of Computer Human Interaction, since it is not just based on the modern model training techniques of artificial intelligence, but also finds solid ground in the foundational theories of human psychology. One major drawback of this study is that it is (as of now) most optimized and accurate when tested on bigger data samples. This research is thus intended to pave the way for extending the technique to smaller data reserves and microblogs. We intend to apply the same technique not just to adjectives but to various other parts of speech (POS) in the near future. There are various studies which discuss the role of a person's personality in the development of diseases (Friedman et al, 1987). Thus, another goal that
this research aims to achieve is that, in the very near future, it would be able to facilitate personality analysis for a wide range of people with varied handicaps which render them incapable of self-analysis, in order to effectively predict their personalities. Statistics say that 11% of children 4-17 years of age (6.4 million) (Friedman et al, 1987) in the United States alone have been diagnosed with Attention-Deficit/Hyperactivity Disorder (the number increasing by 3% this year). With valuable feedback from friends and family, this model can help in designing better technology for them and various other such people. Building upon this research and extending it to cover other POS would enable us to predict personalities from scanty as well as large datasets with good accuracy. The vision of this research is to train our next generation computers to understand people not only in terms of their choices, but also in terms of the innate personalities which lead them to make those choices (leading to smart suggestive advertising systems etc). The future work of this research will also include combining this technique with pre-existing ones (e.g. LIWC, etc.) so as to increase the personality prediction accuracy to match that achieved by psychometric tests. We also intend to work on a lexical personality ontology, which analyzes the relationship of personality (both direct and indirect) with the various parts of speech (POS), i.e. extending it from solely adjectival markers to various other POS. We would soon be graduating from solely Big5 trait prediction to evolving various mental states which can be predicted from the abundant lexical resources available online, thus graduating the singly dimensioned Big5 model to a multi-dimensional graphical ontology tree of a person.
Figure 6: Variation of correlation coefficient based on distribution of adjectives in testing dataset
References
Boele de Raad. 2000. The Big Five personality factors: The psycholexical approach to personality. Hogrefe & Huber.
Briggs Myers I, Myers P B. 1980. Gifts Differing: Understanding Personality Type. Davies-Black Publishing, Mountain View, CA.
Capraro RM. 2002. Myers-Briggs Type Indicator Score Reliability Across Studies: A Meta-Analytic Reliability Generalization Study. Educational and Psychological Measurement.
Cattell R B. 1943. The description of personality: basic traits resolved into clusters. The Journal of Abnormal and Social Psychology, 38(4), 476–506.
Celli F, Polonio L. 2013. Relationships between Personality and Interactions in Facebook. Social Networking: Recent Trends, Emerging Issues and Future Outlook, pages 41–54. Nova Science Publishers.
Digman J. 1990. Personality structure: Emergence of the five-factor model. Annual Review of Psychology, 41(1):417–440.
Friedman H S, Booth-Kewley S. 1987. The disease-prone personality: A meta-analytic view of the construct. American Psychologist, 42(6), 539–555.
Furnham A. 1996. The big five versus the big four: the relationship between the Myers-Briggs Type Indicator (MBTI) and NEO-PI five factor model of personality. Personality and Individual Differences.
Goldbeck J, Robles C, Turner K. 2011. Predicting Personality with Social Media. In Proceedings of the annual conference extended abstracts on Human factors in computing systems.
Goldberg L R. 1990. An alternative description of personality: The Big-Five factor structure. Journal of Personality and Social Psychology, 59(6):1216–1229.
Goldberg L R. 1992. The Development of Markers for the Big Five Factor Structure. Psychological Assessment, 4(1):26–42.
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten I. 2009. The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter, 11(1):10–18.
Iacobelli F, Gill A J, Nowson S, Oberlander J. 2011. Large scale personality classification of bloggers. Lecture Notes in Computer Science, 6975.
Jacob B Hirsh, Sonia K Kang, Galen V Bodenhausen. 2012. Personalized Persuasion: Tailoring Persuasive Appeals to Recipients' Personality Traits. Psychological Science.
Jacopo S, Bruno L, Nadav A, Fabio P, Nicu S, Alex P 2012.
Friends dont Lie - Inferring Personality Traits from So-
cial Network Structure. In Proceedings of UbiComp.
9. References 180–185, 5 Sep - 8 Sep, Pittsburgh, USA, ACM, 978-
Allport. 1936. Traitnames. A psycho-lexical study, Psy- 1-4503-1224-0/12/09.
chological Monographs, 47. Jill G, Midian K. 1982. Operational efficiency and the
Argamon S, Dhawle S, Koppel M, Pennebaker J W. 2005. growth of short-term memory span. Journal of Experi-
Lexical Predictors of Personality Type. In Proceedings mental Child Psychology, 33(3), 386–404.
of Joint Annual Meeting of the Interface and the Classi- John O E. 1990. The Big Fivefactor taxonomy: Di-
fication Society of North America. mensions of personality in the natural language and in
46
questionnaires. Handbook of personality theory and re- Text Analysis Methods. Journal of Language and Social
search, L.A. pervin, pages 66–100 Guilford Press, New Psychology, 29(1):24–54.
York. http://www.celebritytypes.com/.
Karl Jung. 1921. Psychological Types.
Klages L. 1926. Die Grundlagen der Chat-akterkunde
[111e science of character]. Lcipzig: Barth.
Lukas M, Sara van de Geer and Peter B. 2008. The
group lasso for logistic regression.J. R. Statist Soc. Ei-
dgenssische Technische Hochschule, Zrich, Switzerland,
70(1):53–71.
Kermanidis K L. 2012. Mining Authors Personality Traits
from Modern Greek Spontaneous Text. 4th International
Workshop on Corpora for Research on Emotion Senti-
ment & Social Signals, in conjunction with LREC12.
Luyckx K, Daelemans, W Personae 2008. A corpus for
author and personality prediction from text. In Proceed-
ings of LREC- 2008, the Sixth International Language
Resources and Evaluation Conference.
Mairesse F, Walker, M Personage. 2007. Personality Gen-
eration for Dialogue. In Proceedings of the 45th Annual
Meeting of the Association for Computational Linguis-
tics ACL.
Mairesse F, Walker M A, Mehl M R, Moore R K. 2007.
Using Linguistic Cues for the Automatic Recognition of
Personality in Conversation and Text. Journal of Artifi-
cial intelligence Research, 30.
McCrae R, Costa P. 1989. Reinterpreting the Myers-
Briggs Type Indicator From the Perspective of the Five-
Factor Model of Personality. Journal of Personality, Na-
tional Center on Birth Defects and Developmental Dis-
abilities Division of Human Development and Disabili-
ties, USA.
Norman, Warren T. 1963. Toward an adequate taxonomy
of personality attributes: Replicated factor structure in
peer nomination personality ratings. The Journal of Ab-
normal and Social Psychology, 66(6), 574–583.
Oberlander J, Nowson S. 2006. Whose thumb is it any-
way? classifying author personality from weblog text.
In Proceedings of the 44th Annual Meeting of the Asso-
ciation for Computational Linguistics ACL.
Ostendoff E. 1990. prache und Personlichkeitsstruktur:
Zur Validitat des Funf-Faktoren-Modeiis der Person-
lichkeit [Language and personality structure: On the va-
lidity of the five-factor model of personality]. Regens-
burg, Federal Republic of Germany : Roderer Verlag.
Pennebaker J W, King L A. 1999. Linguistic styles: Lan-
guage use as an individual difference. Journal of Person-
ality and Social Psychology, 77.
Quercia D, Kosinski M, Stillwell D, Crowcroft J. 2011.
Our Twitter Proles, Our Selves: Predicting Personality
with Twitter. In Proceedings of SocialCom. 180–185.
Robert T. 1996. Regression Shrinkage and Selection via
the Lasso. Journal of the Royal Statistical Society. Series
B (Methodological), 58(1):267–288.
Saucier G, Goldberg L R. 1996. Evidence for the Big Five
in analysis of familiar English Personality adjectives Eu-
ropean Journal of Personality, 10:61–77.
Yla R Tausczik, James W Pennebaker. 2010. The Psy-
chological Meaning of Words: LIWC and Computerized
47