=Paper=
{{Paper
|id=Vol-1399/paper7
|storemode=property
|title=Personality Mining from Biographical Data with the "Adjectival Marker" Technique
|pdfUrl=https://ceur-ws.org/Vol-1399/paper7.pdf
|volume=Vol-1399
|dblpUrl=https://dblp.org/rec/conf/bd/PoddarKS15
}}
==Personality Mining from Biographical Data with the "Adjectival Marker" Technique==
<pdf width="1500px">https://ceur-ws.org/Vol-1399/paper7.pdf</pdf>
<pre>
    Personality Mining from Biographical Data with the “Adjectival Marker”
                                  Technique
                          Shivani Poddar, VenuMadhav Kattagoni and Navjyoti Singh
                                     Center for Exact Humanities, IIIT Hyderabad
               shivani.poddar92@gmail.com, venumadhav.kattagoni@gmail.com, singh.navjyoti@gmail.com

                                                                  Abstract
The last decade has witnessed significant work in personality mining from lexical cues in social media data. Not much work has yet been
undertaken in extracting these lexical cues from biographical data populating social media. Most of this work involves a large crowd of
researchers leveraging dictionary-based approaches such as LIWC (which primarily focus on function words). By means of this paper
we intend to introduce a novel method of personality mining from social media data called “Adjectival-marker Technique”. This method
involves extracting lexical features from descriptive texts (e.g. biographical data) to train a learning model, so as to predict the respective
personality traits of the subject. Conceptually, it draws heavily from the last 78 years of work in lexical psychology and the Big Five
personality test. However, it is not only a computational variant of the primordial theories of lexical psychology, but is also competent
in conferring a substantial accuracy of personality prediction, matching that obtained by psychometric tests. In this study, we propose
a variant of the Lexical Hypothesis from psychology. This modified hypothesis is validated by the computational results of personality
prediction achieved by the Adjectival Marker Technique discussed below. The paper also discusses some insights illustrating the
coherence of people's judgments about the subject's personality (virtual personality). The average accuracy (i.e. matching that achieved
by psychometric tests for Big 5) for prediction approximated to Extraversion - 82.82%, Agreeableness - 89.62%, Conscientiousness -
92.48% and Imaginativeness/Intellect - 81.67%.

Keywords: Social Computing, Psychology, User Personality Determination, Natural Language Processing, Machine Learning


                     1.    Introduction                                    more automated instead of relying heavily on psychometric
1.1. Motivation                                                            tests written by the subject.
Social Media has become the most abundantly used means                     1.2. The Big Five Personality Model
of communicating and propagating information online.                       There have been several personality models (The Big
Most information here is extensively descriptive of the                    Three, The Big Five, The Alternative Five, etc.) that claim
users who channel themselves through it. It is not only the                to encapsulate the traits that need to be summoned so as
user who gives away information about himself (Goldbeck                    to effectively predict user personalities from social media
et al, 2011), but also his peers (Staiano et al, 2012). This               data. However, out of all these models, the most robust
paper mainly unravels how the latter approach is nearly an                 and tested model, which has been consistent for the last
absolutely accurate predictor of certain personality traits.               few decades, is the Big Five (Big5) model (Goldberg et
The judgements of not only peers but of people who know                    al, 1992). This personality model, being one of the most
us remotely over time can be an important window into                      supported in lexical psychology research, stood out as be-
solving the labyrinth of our personalities. The future of so-              ing most resilient to carry out research of biographical so-
cial media will witness individuals choosing workplaces,                   cial media resources (Saucier et al, 1996). Another one
friends, books, movies, products etc, in synchrony with                    of the instrumental personality theories that has spawned
their own personalities. The tomorrow of the advertising                   the landscape of personality models is the set proposed
industry will witness a transformation from “spammers” to                  by Carl Jung (Myers-Briggs Type Indicator (MBTI), So-
“personalized suggestors”. This has also been cited in var-                cionics, Keirsey et al, 1921). Following the paucity of
ious discussions wherein advertisers are advised to study                  data (for evaluating our model) available for personality
personalities instead of demographics (documented in the                   determination via reliable psychometric tests for the Big 5
paper personalized persuasion, (Jacob, 2012)). The afore-                  model, we decided to refer to a publicly published research
mentioned applications are just a tip of the iceberg. Re-                  dataset,1 that abundantly provided us with the MBTI per-
lationships have been discovered between personality and                   sonalities for people. So as to bridge this gap between the
psychological disorders, job performance (Digman et al,                    MBTI (for personalities which needed to be used for eval-
1990) and satisfaction (John et al, 1990), and even roman-                 uation) and Big 5 (the personalities which were being pre-
tic success. An extremely dynamic field of study which also                dicted by our model), (Capraro et al (2002), Furnham et al
benefits from the research in the area of Human Computer                   (1996), McCrae et al (1989)) we used correlations shown
Interaction (HCI) is interface design. Many interface de-                  in Tables 1 and 2. Thus, one of the major motivations of
signing projects revolve around modelling interfaces based                 this paper is also to draw the most effective traits (namely:
on people's personality oriented preferences. This study,                  Extraversion, Agreeableness, Conscientiousness and Imag-
thus, aims to contribute to bridge this gap between bio-                   inativeness) from the intersection of these two instrumental
graphical data and personality research. We also attempt
to expedite the process of personality prediction, making it                  1
                                                                                  The dataset can be found at http://www.celebritytypes.com.

paradigms of personality qualifiers. Hence, in scope of this                       “Those individual differences that are most
study, the traits we predict are Extraversion, Agreeableness,                  salient and socially relevant in people's lives will
Conscientiousness and Imaginativeness/Intellect.                               eventually (over time) become encoded into their
                                                                                language as well as that of people who describe
1.3. Motivation for using Biographical Data                                    them (via the knowledge they have of them, these
This research builds on the confluence of two major do-                        people could be peers, associates, friends, family
mains, the primordial theories of the lexical hypothesis and                      members, followers etc.); the more important
the recent computational techniques of data modeling. All-                      such a difference, the more likely is it to become
port's personality trait names (Allport et al, 1936) led to                               expressed as a single word”.
the creation of Goldberg's adjective marker (Goldberg et
al, 1992) and have ignited various studies. Goldberg et al              The “Adjectival Marker Technique” introduced in this pa-
(1990), Digman et al (1990), John et al (1990), Ostendorf               per is most accurate when it is used to analyze the personal-
et al (1990) built on the same foundation. All of these con-            ity of the subject who the social media resource is descrip-
verge at a single point that cites a “descriptive”, “adjectival”        tive of and not the author himself. We also inferred an inter-
lexicon to be the key into a person's personality. Social me-           esting observation that suggested that the views of different
dia today is littered with biographical or descriptive content          people describing the subject are coherent amongst them-
of its over 1.4 billion users. Tapping this reservoir of con-           selves and also with the results of the psychometric tests.
tent by the principles and techniques discussed below, the              The average accuracy of the traits, based on the proposed
paper aims at unveiling a substantial part of this personality          hypothesis, for a series of data spread temporally and spa-
descriptive content.                                                    tially (as compared to the results obtained by psychometric
                                                                        tests) in social media came out as discussed below.
1.4.   Proposed modification in the “Lexical Hypothesis
       of Psychology”                                                   1.5.    Structure of the Paper
The theories of psychology were influenced by various rev-              We begin by presenting a brief background on the Lexical
olutionary concepts, for instance, “trait” - a theoretical con-         Psychology theories of personality determination and re-
struct which describes a basic dimension of a person's per-             lated work on personality in conjunction with social media
sonality (Allport, 1937). The idea of trait gave birth to the           in Section 2. We then present our datasets in Sections 3 & 4
“Lexical Hypothesis of Psychology”. The initial direction               and methodology for analyzing, quantifying and modelling
of this paper was solely governed by this exact hypothesis              biographical data content for 574 personalities in Section
(worked upon by Klages, 1926/1932; Cattell, 1943; Nor-                  5. The study proceeds on to describe the adjectival features
man, 1963; Goldberg, 1982) -                                            used along with the machine learning techniques for classi-
          “Those individual differences that are most                   fication and demonstrate significant improvements that the
       salient and socially relevant in people's lives will             model was able to achieve over baseline classification on
       eventually become encoded into their language;                   each personality factor. In subsequent sections, the paper
        the more important such a difference, the more                  presents the results in Section 6 and analysis of the study,
           likely is it to become expressed as a single                 and discusses the methods we incorporated which were in-
                               word.”                                   strumental in escalating the accuracy of the model for each
                                                                        of the traits discussed earlier in Section 7. We finally wrap
The Lexical Hypothesis has been used in its entirety in au-             up the paper with brief discussions about the future work,
thor's personality prediction systems, like the one for Greek           sparked by this study in Section 8.
Language described by Kermanidis et al, (2012). Motivated
by the same inspiration, we too expected to extract author's
personality traits from the text they wrote. This involved                                  2.    Related Work
mobilizing huge datasets of web blogs and essays and ex-                The last few years have witnessed a considerable escala-
tracting “names” from them to determine the author's per-               tion in studies which are directed at mining user person-
sonalities. However, by the course of our study, we found               alities from social media data. Those which are related to
out that this was not as effective as the initial hypothesis            this work fall mainly into 2 sections: (i) Studies
proposed (Goldberg et al, 1982). The average accuracy of                which are based on lexical cues to mine author's personal-
the initial experimentation was less than 50%, which was                ity, (ii) Studies which have used social media based features
as good as a randomly predicted personality set.                        to study the personality of the user.
Thus, we propose a modification of the Lexical Hypothe-                 The former section includes work by Tausczik and Pen-
sis in psychology which suggests that the personality of a              nebaker (2010) wherein they mined author personality via
person is predicted based on cumulative judgements of var-              LIWC (Linguistic Inquiry and word count) approaches.
ious authors about him/her. These judgements are indica-                Another such study used linguistic features such as func-
tive of the respective traits of the person described along             tion words, deictics, appraisal expressions and modal verbs
the lines of the Big5 personality Model. The “Adjectival                to classify 2 of the Big Five traits namely neuroticism and
Marker” Technique helps us unravel these judgements, and                extraversion (Argamon et al, 2005). Oberlander & Nowson
is derived from the adjectival markers of Big5 personality              (2006) classified extraversion, stability, agreeableness and
traits as discussed by Goldberg & Saucier (1996). Thus, the             conscientiousness of blog authors using n-grams as fea-
modified Lexical Hypothesis of Psychology proposed and                  tures and Naive Bayes algorithms. Mairesse et al, (2007)
verified in this paper is as follows:                                   reported a long list of correlations between Big5 personality

traits. They obtained those correlations from psychological             mentions. The same has been illustrated by means of Fig-
factor analysis on a corpus of Essays and audio cues (Pen-              ure 1.
nebaker & King 1999) to develop a supervised system for
personality recognition. Luyckx et al, (2008) built a corpus
for stylometry and personality prediction from text in Dutch
using n-grams of Part-Of-Speech (POS) and chunks as fea-
tures. They used the MBTI schema in place of the Big5
(it includes 4 binary personality traits, see Briggs & Myers
(1980)). Along the same lines, Iacobelli et al, (2011) used
as features, word n-grams extracted from a large corpus of
blogs, testing different extraction settings, such as the pres-
ence/absence of stop words or inverse document frequency.
They found that bi-grams, treated as Boolean features and
keeping stop words, gave substantial results using Support
Vector Machines (SVM) as learning algorithm. Kermanidis
(2012) followed Mairesse et al, (2007) and developed a su-
pervised system for POS tagging in Modern Greek, based
on low level linguistic features, such as Part-of-Speech tags,
and psychological features, like words associated to psy-
chological states like in LIWC. Kermanidis (2012) also
somewhat operated along the lines of Lexical Hypothesis
by mining author personalities via KMeans clustering al-
gorithms.                                                                                Figure 1: Data Mobilization
Personality analysis in social media is a recent line of
study. Herein, some substantial work was
done by Goldbeck et al, (2011) wherein the authors pre-                 3.2. Personality Traits Data
dicted the personality of 279 users from Facebook, using                The Jungian Personality functions of 574 personalities were
either linguistic or social network specific features. Quercia          extracted from the resource for eventual evaluation.2 Since
et al, (2011) used network features to predict the personality          this was one of the most authentic reserves we found con-
of 335 Twitter users, using M5 rules as learning algorithm.             sisting of personality listings (so as to evaluate the ones our
Various means of evaluation have been used by the above                 model predicts) we found it the most effective to be used
researchers, ranging from accuracy to AUC (Area Under                   for evaluating our own model. The “Adjectival Markers”
the Curve) values so as to establish relative accuracies of             that the paper is based on (as described below) are a proven
models against each other. The above have been discussed                indicator to reflect the Big5 traits of personality. Thus, to
and captured very effectively by Celli et al, (2013). One im-           evaluate our computed predictive model via personalities
portant observation which comes to surface while analyz-                for the respective subjects by an exclusively listed source,
ing relevant literature is that, none of the studies so far have        we scaled the Jungian Typology type to the closest traits of
exploited the primordial lexical hypothesis and 'adjectival             the Big5 using correlation factors as shown in Table 2 (Hall
traits' suggested by Saucier et al, (1996). Our work pre-               et al. 2009, Capraro et al. 2002, Furnham et al. 1996, Mc-
sented in this paper carves a very different niche for itself           Crae et al. 1989). Table 1 shows the supporting notations
by computing this very approach of personality adjectives,              of the personality systems.
compressing the last 80 years of psychological research in
the lexical front and merging it with the latest computa-                      Big5/ Global5     Jung/MBTI/Keirsey         Strength of
tional techniques. This confluence has yielded encouraging                                                                 Correlation
results, predicting traits matching those predicted by a psy-
chometric test.                                                             Extraversion          Introvert/Extrovert         High
                                                                         Emotional Stability       Feeling/Thinking         Very Low
                       3.   Datasets                                     Conscientiousness        Judging/Perceiving          High
3.1. Biographical Data Mobilization                                       Accommodation /
                                                                           Agreeableness           Feeling/Thinking         Medium
The data collected as a part of this study was by means of a
                                                                              Intellect            Sensing/Intuition      Medium-High
Python-based crawler. We first used a simple web crawler
to get a list of web-pages with the name of the respective                        Table 1: Notations for Personality Models
“person” as the argument keyword to the crawler. These
web pages were then filtered based on their meta-tags. To
boost true positives, we only considered the pages which                As illustrated, 4 final personality traits were scaled (each
specified their content as “biographical” in the meta-tag de-           of which had medium to high correlation with the MBTI
scriptors. This resulted in mobilization of a few Wikipedia
resources, blog mentions and majorly some very descrip-                    2
                                                                            http://www.celebritytypes.com, wherein extensive cognitive
tive biographical websites. We then manually cleaned the                functions have been used to derive the psychology of the given
noisy data to assure entity disambiguation and irrelevant               personalities.

               Semi-Correlating Descriptions                             4.1. Training Data
          Jung/MBTI/Keirsey         Global 5                             The training data set, used to mine adjectival markers, com-
                 INFP               RCUAI, RLUAI                         prised biographic data content of 283 personalities. The
                 INTP               RCUEI, RLUEI                         word count of the dataset ranged from 500 - 10,000 words.
                 INFJ               RCOAI, RLOAI                         The ratio of the number of adjectives to the total number of
                 INTJ               RCOEI, RLOEI                         words in the dataset ranged from 0 to 0.005.
                 ISTJ              RCOEN, RLOEN                          This data content was mined by means of a Python-based
                  ISFJ             RCOAN, RLOAN                          web crawler, which parsed biographic websites, Wikipedia,
                 ISTP              RCUEN, RLUEN                          and social media mentions.
                 ISFP              RCUAN, RLUAN                          4.2. Biographical Testing Data
                 ENFP               SCUAI, SLUAI                         The testing dataset comprised biographic data content
                 ENTP               SCUEI, SLUEI                         of a different set of 291 personalities than the ones used
                 ENFJ               SCOAI, SLOAI                         for training. These were mined from social media reserves
                 ENTJ               SCOEI, SLOEI                         like Wikimedia articles, blog posts about the respective
                 ESTJ              SCOEN, SLOEN                          personalities, social Q&A sites etc. The word count ranged
                 ESFJ              SCOAN, SLOAN                          from 100 to 10,000 words, and the ratio of the number of
                 ESTP              SCUEN, SLUEN                          adjectives to the total number of words ranged from 0.0001
                 ESFP              SCUAN, SLUAN                          to 0.003.
       Table 2: Correlations between Personality traits                     Adjectives*       II        I       IV      V       III
                                                                            Sympathetic      0.62     0.02     0.07    0.03    -0.05
                                                                               Kind          0.60     0.07     0.02    0.00     0.06
types) namely - Agreeableness (Accommodation - A/E),                         Sensitive       0.46    -0.10     0.35    0.23     0.00
Extraversion (R/S), Conscientiousness (Orderliness - O/U)                      Rude          -0.50    0.08     0.00    0.06    -0.15
and Intellect (N/I).                                                        Adventurous      0.00     0.38    -0.19    0.10    -0.04

3.3.   Adjectival Marker Training Set                                    Table 3: Factor Loadings of 5 of the 435 adjectives pre-
                                                                         sented by Saucier et al (1996). (Factor I - Extraver-
The adjectives mined from the biographical data were re-                 sion, Factor II - Agreeableness, Factor III - Conscientious-
fined to extract the adjectival markers i.e. specific adjec-             ness, Factor IV - Emotional Stability, Factor V - Intel-
tives descriptive of the subject of the biographical data.               lect/Imagination)
These adjectival markers were used as features in the final
LASSO logistic regression model. The adjectival markers
extracted are based on the work of Saucier & Goldberg,
(1996). Table 3 provides the factor loadings of a few of the                                5.     Methodology
435 adjectives (Saucier et al, 1996) on each of the five fac-            The training data (283 users) was mined for adjecti-
tors as discussed in their work. The order reflects the rela-            val markers according to Saucier's adjectival marker list
tive size (variance) of the factors (e.g. Factor II is the high-         (Saucier et al, 1996). Personality traits and their adjectival
est), and the sign reflects the relative size of the item subsets        markers were represented as a sparse User-Trait Adjective
at each pole of the factor (e.g. the negative pole of Factor             Matrix for each of the 4 adjectival traits to be predicted. The
IV has more items). We have, as a part of our study, con-                entries of the respective Trait (say T) matrix were set to 1 if
densed this table to solely indicate whether or not the adjective        there existed an adjectival marker in the user's descriptive
is descriptive of a particular trait, so as to achieve a binary          biographical data and 0 if the respective adjectival marker
matrix for them (for the respective 4 of the Big 5 traits men-           was not there. Thus, each personality trait was contained
tioned above). The binary equivalent for Table 3 is shown                in a matrix wherein the Row of the matrix M, consisted
in Table 4.                                                              of adjectival-features and the corresponding column entry
                                                                         consisted of the User-trait. The matrix entity Mij was a bi-
                4.    Biographical Data                                  nary number which was 1 if the adjectival marker in the ith
                                                                         row indicated the presence of the trait T in the personality
Biographical data was mined for 574 personalities from on-               of the subject contained in the jth column of the Matrix M.
line resources as discussed in the former Section 3.1. This              To predict the binary score of a given personality feature,
data was divided into 2 categories: testing data and train-              we then performed a LASSO logistic regression (Tibshi-
ing data. Users with no substantial data (<100 words) were               rani et al., 1996, Meier et al., 2008) analysis in Weka (Hall
discarded from the analysis for now. The data mining un-                 et al., 2009). A variety of regression algorithms were tested,
dertaken for acquiring these datasets is spread across var-              each with a 10-fold cross-validation with 10 iterations. The
ious social media resources including Wikipedia articles,                best result out of all algorithms was using a binary classifier
blog posts, social Q & A sites and community media sites                 with Lasso regression (with 10 fold cross validation).
(sharing biographical book excerpts, for building datasets               Using the LASSO technique limited the risk of
of word count >10,000).                                                  overfitting because of extra adjectival features for certain

           Adjectives          Agreeableness      Conscientiousness             Extraversion               Imaginative
                             Decimal* Binary      Decimal* Binary             Decimal* Binary           Decimal* Binary
          Sympathetic          0.62        1        -0.05             0          0.02            1        0.03           1
             Kind              0.60        1         0.06             1          0.07            1        0.00           0
           Sensitive           0.46        1         0.00             0         -0.10            0        0.23           1
             Rude             -0.50        0        -0.15             0          0.08            0        0.06           0
          Adventurous          0.00        0        -0.04             0          0.38            1        0.10           1

Table 4: Adjectival Marker samples for various traits. Samples with values > 0 in the Saucier Goldberg table have been
given a binary count of 1, while those at or below 0 have been given 0. (*Decimal Values taken from Saucier et al (1996)).
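
The binarization rule stated in the Table 4 caption can be sketched in Python as follows.
This is only a minimal illustration, not the authors' code: the function name binarize and
the dictionary layout are assumptions, and only three of the sample rows of Table 4 are
reproduced. Loadings of exactly 0.00 are mapped to 0, as in the table.

```python
# Trait order follows the column order of Table 4.
TRAITS = ["Agreeableness", "Conscientiousness", "Extraversion", "Imaginative"]

# Decimal loadings copied from three sample rows of Table 4.
LOADINGS = {
    "Sympathetic": [0.62, -0.05, 0.02, 0.03],
    "Kind":        [0.60, 0.06, 0.07, 0.00],
    "Adventurous": [0.00, -0.04, 0.38, 0.10],
}


def binarize(loadings):
    """Map each loading to 1 if it is strictly greater than 0, else to 0."""
    return {
        adjective: [1 if value > 0 else 0 for value in values]
        for adjective, values in loadings.items()
    }


BINARY = binarize(LOADINGS)
for adjective, row in BINARY.items():
    print(adjective, dict(zip(TRAITS, row)))
```

Each resulting row states, per trait, whether the adjective is treated as a marker of that trait.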


                                          Figure 2: Descriptive of the methodology


traits.
Since there was only a single source where traits of ma-
jor personalities are classified (i.e. celebritytypes.com) we
used it to evaluate our model. We used the remaining 291
personalities for evaluation of the model. The testing bi-
ographical data was mined for adjectival trait markers and
their respective traits were predicted. The results of this
evaluation have been discussed elaborately in the next sec-
tion. Figure 2, which can be found above, also illustrates
the procedure defined above.
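
The encoding and classification steps of this section can be sketched as follows. This is a
hedged illustration under several assumptions, not the Weka setup used in the study: the
four-word MARKERS list, the toy snippets and their labels are made up, each subject is stored
as one row (the transpose of the matrix M described above), and the L1-penalised (LASSO)
logistic regression is fitted with a small pure-Python proximal gradient loop whose
hyperparameters (lam, lr, epochs) are arbitrary.

```python
import math

# Hypothetical marker list; the study draws on Saucier's 435 adjectives.
MARKERS = ["kind", "sympathetic", "rude", "adventurous"]


def encode(text, markers=MARKERS):
    """Binary presence vector: 1 if a marker occurs in the text, else 0."""
    words = set(text.lower().split())
    return [1 if m in words else 0 for m in markers]


def soft_threshold(value, amount):
    """Proximal step for the L1 penalty: shrink value toward 0 by amount."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_l1_logistic(X, y, lam=0.05, lr=0.5, epochs=500):
    """Proximal gradient descent on mean log-loss + lam * L1(weights)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        # Gradient step followed by shrinkage; the intercept is not penalised.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    return 1 if b + sum(wj * xj for wj, xj in zip(w, x)) > 0 else 0


# Toy biographical snippets with made-up Agreeableness labels (1 = present).
DOCS = [
    ("a kind and sympathetic mentor", 1),
    ("kind to everyone she met", 1),
    ("sympathetic listener and adventurous traveller", 1),
    ("rude to his staff", 0),
    ("an adventurous but rude rival", 0),
    ("rude and dismissive", 0),
]
X = [encode(text) for text, _ in DOCS]
y = [label for _, label in DOCS]
w, b = fit_l1_logistic(X, y)
print([predict(w, b, x) for x in X])
print(dict(zip(MARKERS, w)))
```

The shrinkage step is what distinguishes the LASSO fit from plain logistic regression: weights
of markers that carry little information about the trait are pulled toward zero.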


                                                                      Figure 3: Average accuracy percentage of the personality
                        6.    Results
                                                                      traits by adjectival marker analysis

The results by the above illustrated method are elaborated
in this section. The average accuracies compared to the                                     7.       Discussion
personalities obtained via psychometric tests (discussed in           The results obtained illustrate that this method is compe-
more detail in the following section) for the considered four of      tent for predicting the personalities of a person in coher-
the Big 5 traits were: Extraversion - 82.82%, Agreeableness           ence with other people's judgments about him/her. It gives
- 89.62%, Conscientiousness - 92.48% and Imaginative-                 substantial accuracies in the prediction of a person's person-
ness/Intellect - 81.67%. These readings do not necessarily            ality matching with those obtained via psychometric tests.
demonstrate the prediction accuracy of the innate personal-           As an essential part of this study, we have also attempted to
ity of a person but match that predicted by the psychometric          capture the variation in accuracy with the change in var-
tests with the given accuracies. They are also in league with         ious factors, namely, word count of the corpus, and the
a few other techniques predicting the same; for instance, the         ratio of the number of adjectives to the total number of
work of Iacobelli et al, (2011), which attempted to decipher the      words.3 These are mainly intended to explore a threshold
personalities of bloggers, has an average personality predic-         for word count and the adjective distribution (for the given
tion accuracy of around 62.5%.
Thus, this paper proposes a technique which illustrates                   3
                                                                            Please note that the accuracies discussed here are the accu-
mainfold elevation in the overall accuracy of personality             racy of the prediction as evaluated by the results via psychometric
prediction (as indicated by psychometric tests) via social            tests for Big 5 and should not be confused with accuracies used for
media.                                                                predicting the baseline of the universal personality of a person.

                                                                 43
                                 Figure 4: Accuracy variation over word count of testing data


technique) in the document set so as to get substantial re-             when the ratio of the adjectival count versus total word
sults from the Adjectival Marker Technique. The following               count is low. It illustrates an accuracy of 84.00% when the
deductions can be made respective to each trait:                        ratio is less than 0.001, improving to 94.18% when the ra-
                                                                        tio is between 0.001-0.002. Finally it escalates to 95.62%
7.1.   Collective Observations                                          when increased to be greater than 0.003 (Figure 5). As ex-
Few collective observations can be drawn from the gath-                 pected there is a consistent increase in accuracy with in-
ered data. As indicated in Figure 4, the accuracy in pre-               crease in word count and the ratio AC/TWC.
dicting the traits increases with an increase in the data word
count. We also compared the accuracy results in predicting
the respective traits on the basis of varying distribution of           7.3.   Conscientiousness
adjectives in the training dataset (Figure 5). The accuracy
in predicting the traits is relatively low when the ratio of the        The accuracy in predicting Conscientiousness varies from
AC/TWC is low and increases with a subsequent increase                  86.66% when the word count of the data reserves is less
in the AC/TWC ratio.                                                    than 5000 words, and subsequently increases with the in-
                                                                        crease in the number of words as shown in Figure 4.
7.2.   Agreeableness                                                    We also varied the adjective distribution with the word
The accuracy in predicting Agreeableness is relatively low              count so as to obtain respective accuracies for the same
(73.33%) for data with word count < 5000 words, and                     model. It varies from an accuracy of 88.00% when the ra-
escalates up to 99.11% for big data reserves (>20,000                   tio is less than 0.001, improving to 93.60% when the ratio is
words). We also compared the accuracy results of predict-               between 0.001-0.002, and finally to 95.44% when increased
ing “Agreeableness” on the basis of varying distribution of             to be greater than 0.003 (Figure 5). As expected there is a
adjectives in the training dataset.                                     consistent increase in accuracy with increase in word count
The prediction of the “Agreeableness” trait is relatively low           and the ratio AC/TWC.
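
The AC/TWC bucketing used in Sections 7.2 and 7.3 can be sketched as follows. This is a minimal
Python illustration, not the authors' implementation: the marker set MARKERS, the bucket edges and
labels, and the helper names are all assumptions made for the example. The text does not report a
bucket between 0.002 and 0.003, so that bucket is included here only to keep the ranges contiguous.
Accuracy means agreement between the predicted trait label and the psychometric-test label.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical marker list; the study's actual adjectival markers are not reproduced here.
MARKERS = {"talkative", "kind", "organized", "creative"}

# Bucket edges follow the ratios quoted in the text; the labels are illustrative assumptions.
EDGES = (0.001, 0.002, 0.003)
LABELS = ("<0.001", "0.001-0.002", "0.002-0.003", ">=0.003")


def ac_twc_ratio(text, markers=MARKERS):
    """Return adjective count (AC) divided by total word count (TWC)."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]  # drop tokens that were pure punctuation
    if not words:
        return 0.0
    adjective_count = sum(1 for w in words if w in markers)
    return adjective_count / len(words)


def ratio_bucket(ratio):
    """Map an AC/TWC ratio to one of the bucket labels."""
    return LABELS[bisect_right(EDGES, ratio)]


def accuracy_by_bucket(records):
    """Percentage agreement with psychometric labels, per AC/TWC bucket.

    records: iterable of (ratio, predicted_label, psychometric_label) tuples.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for ratio, predicted, reference in records:
        bucket = ratio_bucket(ratio)
        totals[bucket] += 1
        hits[bucket] += int(predicted == reference)
    return {bucket: 100.0 * hits[bucket] / totals[bucket] for bucket in totals}
```

For example, accuracy_by_bucket([(0.0005, "high", "high"), (0.0005, "high", "low")]) reports 50.0
for the "<0.001" bucket, since one of the two predictions agrees with the psychometric label.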

Figure 5: Accuracy variation over adjective distribution (AC/TWC) in testing dataset

7.4.   Imaginative
The accuracy in predicting Imaginativeness varies from 93.33% at word counts lower than 5000 words
up to 99.88% for big data reserves (Figure 4).
The peaks observed in the variation of accuracy for the “Imaginative” trait over the distribution of
adjectives (AC/TWC) range from 85.71% accuracy for AC/TWC = 0.001 and 90.69% accuracy for
AC/TWC = 0.002 to 98.42% for AC/TWC >= 0.003 (Figure 5).

7.5.   Extraversion
The accuracy varies from 97.70% for a word count < 5000 words and subsequently increases to 99.88%,
as shown in Figure 4.
The accuracy for this trait varied from 85.71% for AC/TWC = 0.001, increasing to 99.68% for
AC/TWC = 0.002 and then to 99.70% for AC/TWC >= 0.003.

The correlation coefficients of word count with accuracy and of AC/TWC with accuracy, for each of
the traits mentioned above, imply that for “Adjectival Markers” these quantities are highly
correlated with one another. This can also be validated by the graph in Figure 6.

                        8.    Conclusion & Future Work

By means of this study we propose a simpler yet effective method to facilitate personality
extraction of people in social media. In order to achieve this we have also reworked some perennial
theories of Lexical Psychology and modified them with the newer concepts of machine learning models.
This technique brings about a wave of novelty in the widespread lexical concepts and techniques used
to achieve user personality understanding in biographical data reserves. It is a significant
contribution in the field of Computer Human interaction, since it is not just based on the modern
model training techniques of artificial intelligence, but also finds solid ground in the
foundational theories of human psychology. One major drawback of this study is that it is (as of
now) most optimized and accurate when tested on bigger data samples. This research is thus intended
to pave the way for extrapolating itself to smaller data reserves and microblogs. We intend to apply
the same technique on not just adjectives but various other parts of speech (POS) in the near
future. There are various studies which discuss the role of a person's personality in the
development of diseases (Friedman et al, 1987). Thus, another goal that
this research aims to achieve is that in the very near future it would be able to facilitate
personality analysis for a wide range of people with varied handicaps which render them incapable of
self-analysis, in order to effectively predict their personalities. Statistics say that 11% of
children 4-17 years of age (6.4 million) (Friedman et al, 1987) in the United States itself have
been diagnosed with Attention-Deficit / Hyperactivity disorder (the number increasing by 3% this
year). With valuable feedback from friends and family this model can help in designing better
technology for them and various other such people. Building upon this research and extending it to
cover other POS would enable us to predict personalities from scanty as well as large datasets with
good accuracy. The vision of this research is to train our next generation computers to not only
understand people in terms of their choices, but also the innate personalities which lead them to
make those choices (leading to smart suggestive advertising systems etc). The future work of this
research will also include combining this technique with pre-existing ones (e.g. LIWC, etc.) so as
to increase the personality prediction accuracy to match that achieved by psychometric tests. We
also intend to work on a lexical personality ontology, which analyzes the relationship of
personality (both direct and indirect) with the various parts of speech (POS), i.e. extending it
from being solely adjectival markers to various other POS. We would soon be graduating from solely
Big5 trait prediction to evolving various mental states which can be predicted from the abundant
lexical resources available online, thus graduating the singly dimensioned Big5 model to a
multi-dimensional graphical ontology tree of a person.

Figure 6: Variation of correlation coefficient based on distribution of adjectives in testing dataset

                        9.    References

Allport. 1936. Trait-names: A psycho-lexical study. Psychological Monographs, 47.
Argamon S, Dhawle S, Koppel M, Pennebaker J W. 2005. Lexical Predictors of Personality Type. In
  Proceedings of Joint Annual Meeting of the Interface and the Classification Society of North
  America.
Boele de Raad. 2000. The Big Five personality factors: The psycholexical approach to personality.
  Hogrefe & Huber.
Briggs, I Myers, P B. 1980. Gifts differing: Understanding personality type. Davies-Black
  Publishing, Mountain View, CA.
Capraro RM. 2002. Myers-Briggs Type Indicator Score Reliability Across Studies: A Meta-Analytic
  Reliability Generalization Study. Educational and Psychological Measurement.
Cattell R B. 1943. The description of personality: basic traits resolved into clusters. The Journal
  of Abnormal and Social Psychology, 38(4), 476–506.
Celli F, Polonio L. 2013. Relationships between Personality and Interactions in Facebook. Social
  Networking: Recent Trends, Emerging Issues and Future Outlook, pages 41–54. Nova Science
  Publishers.
Digman J. 1990. Personality structure: Emergence of the five-factor model. Annual Review of
  Psychology, 4(1):417–440.
Friedman, Howard S, Booth-Kewley, Stephanie. 1987. The disease-prone personality: A meta-analytic
  view of the construct. American Psychologist, 42(6), 539–555.
Furnham A. 1996. The big five versus the big four: the relationship between the Myers-Briggs Type
  Indicator (MBTI) and NEO-PI five factor model of personality. Personality and Individual
  Differences.
Goldbeck J, Robles C, Turner K. 2011. Predicting Personality with Social Media. In Proceedings of
  the annual conference extended abstracts on Human factors in computing systems.
Goldberg L R. 1990. An alternative description of personality: The Big-Five factor structure.
  Journal of Personality and Social Psychology, 59(6):1216–1229.
Goldberg L R. 1992. The Development of Markers for the Big Five Factor Structure. Psychological
  Assessment, 4(1):26–42.
Hall M, E Frank, G Holmes, B Pfahringer, P Reutemann, I Witten. 2009. The WEKA data mining software:
  An update. ACM SIGKDD Explorations Newsletter, 11(1):10–18.
Iacobelli F, Gill A J, Nowson S, Oberlander J. 2011. Large scale personality classification of
  bloggers. Lecture Notes in Computer Science, 6975.
Jacob B Hirsh, Sonia K Kang, Galen V. Bodenhausen. 2012. Personalized Persuasion: Tailoring
  Persuasive Appeals to Recipients' Personality Traits. Psychological Science.
Jacopo S, Bruno L, Nadav A, Fabio P, Nicu S, Alex P. 2012. Friends don't Lie - Inferring Personality
  Traits from Social Network Structure. In Proceedings of UbiComp. 180–185, 5 Sep - 8 Sep,
  Pittsburgh, USA, ACM, 978-1-4503-1224-0/12/09.
Jill G, Midian K. 1982. Operational efficiency and the growth of short-term memory span. Journal of
  Experimental Child Psychology, 33(3), 386–404.
John O E. 1990. The Big Five factor taxonomy: Dimensions of personality in the natural language and in
  questionnaires. Handbook of personality theory and research, L. A. Pervin, pages 66–100. Guilford
  Press, New York.
Karl Jung. 1921. Psychological Types.
Klages L. 1926. Die Grundlagen der Charakterkunde [The science of character]. Leipzig: Barth.
Lukas M, Sara van de Geer and Peter B. 2008. The group lasso for logistic regression. J. R. Statist.
  Soc. Eidgenössische Technische Hochschule, Zürich, Switzerland, 70(1):53–71.
Kermanidis K L. 2012. Mining Authors' Personality Traits from Modern Greek Spontaneous Text. 4th
  International Workshop on Corpora for Research on Emotion Sentiment & Social Signals, in
  conjunction with LREC'12.
Luyckx K, Daelemans W. 2008. Personae: A corpus for author and personality prediction from text. In
  Proceedings of LREC-2008, the Sixth International Language Resources and Evaluation Conference.
Mairesse F, Walker M. 2007. Personage: Personality Generation for Dialogue. In Proceedings of the
  45th Annual Meeting of the Association for Computational Linguistics ACL.
Mairesse F, Walker M A, Mehl M R, Moore R K. 2007. Using Linguistic Cues for the Automatic
  Recognition of Personality in Conversation and Text. Journal of Artificial Intelligence Research,
  30.
McCrae R, Costa P. 1989. Reinterpreting the Myers-Briggs Type Indicator From the Perspective of the
  Five-Factor Model of Personality. Journal of Personality, National Center on Birth Defects and
  Developmental Disabilities Division of Human Development and Disabilities, USA.
Norman, Warren T. 1963. Toward an adequate taxonomy of personality attributes: Replicated factor
  structure in peer nomination personality ratings. The Journal of Abnormal and Social Psychology,
  66(6), 574–583.
Oberlander J, Nowson S. 2006. Whose thumb is it anyway? Classifying author personality from weblog
  text. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics
  ACL.
Ostendoff E. 1990. Sprache und Persönlichkeitsstruktur: Zur Validität des Fünf-Faktoren-Modells der
  Persönlichkeit [Language and personality structure: On the validity of the five-factor model of
  personality]. Regensburg, Federal Republic of Germany: Roderer Verlag.
Pennebaker J W, King L A. 1999. Linguistic styles: Language use as an individual difference. Journal
  of Personality and Social Psychology, 77.
Quercia D, Kosinski M, Stillwell D, Crowcroft J. 2011. Our Twitter Profiles, Our Selves: Predicting
  Personality with Twitter. In Proceedings of SocialCom. 180–185.
Robert T. 1996. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical
  Society. Series B (Methodological), 58(1):267–288.
Saucier G, Goldberg L R. 1996. Evidence for the Big Five in analysis of familiar English Personality
  adjectives. European Journal of Personality, 10:61–77.
Yla R Tausczik, James W Pennebaker. 2010. The Psychological Meaning of Words: LIWC and Computerized
  Text Analysis Methods. Journal of Language and Social Psychology, 29(1):24–54.
http://www.celebritytypes.com/.

</pre>