=Paper= {{Paper |id=Vol-1391/48-CR |storemode=property |title=Automatic Profiling of Twitter Users Based on Their Tweets: Notebook for PAN at CLEF 2015 |pdfUrl=https://ceur-ws.org/Vol-1391/48-CR.pdf |volume=Vol-1391 |dblpUrl=https://dblp.org/rec/conf/clef/SuleaD15 }} ==Automatic Profiling of Twitter Users Based on Their Tweets: Notebook for PAN at CLEF 2015== https://ceur-ws.org/Vol-1391/48-CR.pdf
    Automatic Profiling of Twitter Users Based on Their
                           Tweets
                        Notebook for PAN at CLEF 2015

                       Octavia-Maria Șulea1,2 and Daniel Dichiu1
                                 1 Bitdefender Romania
               2 Center for Computational Linguistics, University of Bucharest
            mary.octavia@gmail.com, ddichiu@bitdefender.com



       Abstract In this paper we go through our approach to solving the PAN Author
       Profiling task. We introduce a novel way of computing the type/token ratio of an
       author and show that, although strong correlations have been observed between
       high extroversion and low type/token ratios in the past, this ratio is not neces-
       sarily a strong indicator of extroversion. Since the text of a person is influenced
       by all 7 features (gender, age, and big five personality traits) that are required
       to be automatically identified in this task, we used this ratio, along with Term
       frequency-Inverse document frequency (tf-idf ) matrices, in all 7 subtasks and all
       4 corpora and obtained good results.


1   Introduction

While the importance of age and gender is a more familiar notion in user or author
profiling, automatic personality detection is a relatively new task [10]. Since many cor-
relations between personality traits and consumer preferences have been reported ([5],
[9]), a natural interest arose in the automatic detection of personality on social media
networks in the last few years, especially on the micro-blogging site, Twitter, where the
privacy setting for its users’ posts and activity is by default public ([7], [3]). Since the
main activity of Twitter users involves language (tweets), and since many correlations
have been identified between linguistic features of a text and personality traits of its
author [4], the idea of automatically detecting the personality of Twitter users based on
their tweets is only natural. In what follows, we will describe our approach to PAN’s
third Author Profiling task [8], discuss our cross-validation results and briefly compare
them with the results obtained after the final testing.


2   Our Approach

For all datasets and subtasks, the estimators, the parameter search function, the cross-
validation strategy, and some of the feature extractors we used came from the scikit-learn
module for Python [6]. To process the remaining features, we also used the nltk
module [1]. We chose these Python modules because they allow fast prototyping.
The two classification tasks (gender and age) were carried out using LinearSVC() [2],
and the 5 regression tasks (the personality traits) using Ridge(). In order to have
balanced classes during cross-validation, we used StratifiedKFold() with the number of
folds set to 5. The best parameters for the estimators were found using
RandomizedSearchCV().
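
The setup described above can be sketched roughly as follows. This is a minimal illustration, not our original code: the synthetic data, the search ranges for C and alpha, the number of search iterations, and the use of plain KFold for the continuous targets are all assumptions made for the example, and the import paths are those of current scikit-learn releases.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, RandomizedSearchCV, StratifiedKFold
from sklearn.svm import LinearSVC

# Synthetic stand-ins for the real feature matrices (assumption).
X_clf, y_clf = make_classification(n_samples=100, n_features=20, random_state=0)
X_reg, y_reg = make_regression(n_samples=100, n_features=20, noise=0.1, random_state=0)

# Classification (gender, age): LinearSVC with 5 stratified folds.
clf_search = RandomizedSearchCV(
    LinearSVC(),
    param_distributions={"C": loguniform(1e-3, 1e2)},  # hypothetical range
    n_iter=5,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
clf_search.fit(X_clf, y_clf)

# Regression (personality traits): Ridge. Stratification needs discrete
# labels, so plain KFold is used here as an assumption.
reg_search = RandomizedSearchCV(
    Ridge(),
    param_distributions={"alpha": loguniform(1e-3, 1e2)},  # hypothetical range
    n_iter=5,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",
    random_state=0,
)
reg_search.fit(X_reg, y_reg)

print(clf_search.best_params_, clf_search.best_score_)
print(reg_search.best_params_, reg_search.best_score_)
```

Both estimators accept sparse input, so the same calls work unchanged when the feature matrix is a sparse tf-idf matrix.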
     For features, we tried several approaches, but eventually settled on using two: first,
the tf-idf matrix at character level, with various n-gram ranges and parameter tuning,
depending on the language and subtask, and second, the type/token ratio of a user, which
we call the verbosity ratio. These two features were combined using scikit-learn’s
FeatureUnion().
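
One possible way to combine the two features is sketched below. The VerbosityRatio class is a hypothetical helper written for this example, and it is deliberately simplified: it splits on whitespace and applies neither stemming nor stop word removal. The n-gram range and the toy documents are likewise assumptions.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class VerbosityRatio(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one column with unique words / total words."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        ratios = []
        for doc in X:
            words = doc.lower().split()
            ratios.append(len(set(words)) / len(words) if words else 0.0)
        return np.array(ratios).reshape(-1, 1)


features = FeatureUnion([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(1, 3))),
    ("verbosity", VerbosityRatio()),
])

docs = ["the cat sat on the mat", "a dog and a dog and a dog"]
X = features.fit_transform(docs)

# The verbosity ratio ends up as the last column of the combined matrix.
last_column = X[:, -1].toarray().ravel()
print(X.shape, last_column)
```

FeatureUnion stacks the outputs of its transformers side by side, so the estimator sees the tf-idf columns followed by a single verbosity column.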
     The tf-idf scores were extracted using scikit-learn’s TfidfVectorizer(). This vector-
izer was applied either to all tweets of one user put together, or to each tweet pertaining
to one user. More precisely, in the sparse matrix created by the TfidfVectorizer(), the
columns represented, in both cases, all the character n-grams extracted from all the
tweets in one of the four datasets, while each row represented either all tweets of one
user concatenated, or one tweet of a user. Our cross-validation results, presented in
Section 3, showed that the former method was consistently more appropriate
for the classification tasks and the latter for regression. An intuitive explanation is
that gender- and age-specific features change less often, while personality traits may
influence each tweet.
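
The two row layouts can be illustrated with a small sketch. The toy tweets, the user names, and the vectorizer parameters are assumptions. The text does not state how per-tweet rows were tied back to users, so the tweet_owner list below is only one plausible bookkeeping choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data (assumption): user id -> list of tweets.
tweets_by_user = {
    "user_a": ["good morning everyone", "coffee time"],
    "user_b": ["late night coding"],
}

# Layout 1: one row per user, all of the user's tweets concatenated.
user_docs = [" ".join(tweets) for tweets in tweets_by_user.values()]

# Layout 2: one row per tweet; tweet_owner records which user wrote it.
tweet_docs = []
tweet_owner = []
for user, tweets in tweets_by_user.items():
    for tweet in tweets:
        tweet_docs.append(tweet)
        tweet_owner.append(user)

# Character n-grams in both cases; each layout gets its own vectorizer.
X_users = TfidfVectorizer(analyzer="char", ngram_range=(2, 6)).fit_transform(user_docs)
X_tweets = TfidfVectorizer(analyzer="char", ngram_range=(2, 6)).fit_transform(tweet_docs)

print(X_users.shape)   # rows = number of users
print(X_tweets.shape)  # rows = number of tweets
```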
     The verbosity ratio of a user was only inspired by the type/token ratio and is not
one per se, since distinguishing between a linguistic manifestation of a conceptual type
(bicycle in sentence 1.a), and its token (bicycle in sentence 1.b), implies deep semantic
analysis which is far from trivial with today’s tools in Natural Language Processing.

  (1)   Type/token distinction
        a. The bicycle is more popular now.                                           Type
        b. The bicycle is in the garage.                                            Token

What we did to echo the idea of a type/token ratio was to compute, for each user, the
ratio between the total number of unique stems and the total number of words used
after applying stemming. From this ratio we excluded stop words. Stemming was done
using the nltk implementation of the Snowball algorithm since it offered a version for
each of the four languages present in this year’s task. For stop word lists, we used
nltk.corpus.stopwords. The motivation for using this feature was the often observed
correlation between extroversion and type/token ratio [4].
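
A minimal sketch of this computation is given below. The tokenization rule, the order of stop word removal and stemming, and the toy stemmer and stop word set are assumptions chosen so that the example runs without downloading nltk data. With nltk installed, stem could be nltk.stem.SnowballStemmer("english").stem and stop_words could be set(nltk.corpus.stopwords.words("english")).

```python
import re


def verbosity_ratio(tweets, stem, stop_words):
    """Unique stems divided by total stems over a user's tweets.

    Stop words are dropped before stemming (an assumption about the order).
    """
    stems = []
    for tweet in tweets:
        for word in re.findall(r"\w+", tweet.lower()):
            if word not in stop_words:
                stems.append(stem(word))
    return len(set(stems)) / len(stems) if stems else 0.0


# Toy stand-ins (hypothetical) for the Snowball stemmer and nltk stop words.
def toy_stem(word):
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


toy_stop_words = {"the", "is", "in", "a", "my"}

tweets = ["The bikes are in the garage", "My bike is fast", "Bikes, bikes, bikes"]
ratio = verbosity_ratio(tweets, toy_stem, toy_stop_words)
print(ratio)  # 4 unique stems out of 8 stems in total
```

A user who keeps repeating the same words gets a low ratio, while a user with a varied vocabulary gets a ratio close to 1.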
    However, our preliminary analysis, in which we computed both Pearson and Spearman
correlation coefficients between verbosity ratios and personality scores, showed no
clear-cut linear relationships. The Spearman coefficient was larger in magnitude than the
Pearson coefficient, which suggests that the relationship is monotonic rather than
linear. Below are the top three statistically significant correlation scores
across all corpora and all personality scores, computed with Python’s scipy.stats package.
The plots were drawn using the seaborn Python module.
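
The difference between the two coefficients can be illustrated on synthetic data. The values below are invented for the example and are not taken from any of the corpora. They follow a strictly decreasing but nonlinear curve, so the rank correlation is perfect while the linear correlation is weaker.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic verbosity ratios and scores (assumption, not corpus data).
verbosity = np.linspace(0.55, 0.80, 20)
score = np.exp(-12 * verbosity)  # strictly decreasing, not linear

r, r_pvalue = pearsonr(verbosity, score)
rho, rho_pvalue = spearmanr(verbosity, score)

print(f"Pearson r = {r:.3f} (p = {r_pvalue:.3g})")
print(f"Spearman rho = {rho:.3f} (p = {rho_pvalue:.3g})")
```

Pearson's r measures how well a straight line fits the data, whereas Spearman's rho is Pearson's r computed on ranks, so it rewards any monotonic trend.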
    For the Dutch corpus (figure 1 on page 3), we found that there was a -0.46 Pearson
correlation with a p-value < 0.001 and -0.49 Spearman rank correlation with a p-value
< 0.001 between verbosity ratios and openness scores. Also, given a verbosity ratio,
males tended to have higher openness scores than females.
               Figure 1: A somewhat negative correlation between a Dutch
               person’s openness score and his/her verbosity ratio.




    Also for the Dutch corpus (figure 2 on page 4), we found that there was a -0.34
Pearson correlation with a p-value < 0.001 and -0.45 Spearman rank correlation with
a p-value < 0.001 between verbosity ratios and neuroticism (stable) scores. Regarding
gender separation, males tended to be more stable at a given verbosity ratio.
     For the Italian corpus (figure 3 on page 4), we found that there was a -0.33 Pearson
correlation with a p-value < 0.001 and -0.40 Spearman rank correlation with a p-value
< 0.001 between verbosity ratios and agreeableness scores. Apparently, on average,
Italian females were more agreeable than Italian males at a given verbosity ratio.
   We also present the results of verbosity ratios for the classification tasks.
    In the English training corpus (table 1 on page 3), across all age groups, males had
a slightly higher verbosity ratio than females. We also observed that verbosity ratios
increased slightly with age, across both genders.



                           Table 1: Verbosity ratios on English

                       Group                 Median  Mean    Std
                       female (all ages)     0.6731  0.6584  0.0887
                       male (all ages)       0.683   0.68    0.083
                       18-24 (both genders)  0.6763  0.6650  0.0782
                       25-34 (both genders)  0.6864  0.6704  0.0921
                       35-49 (both genders)  0.6756  0.6737  0.0903
                       50-xx (both genders)  0.6927  0.6769  0.0989

               Figure 2: A somewhat negative correlation between a Dutch
               person’s neuroticism score and his/her verbosity ratio.


               Figure 3: A somewhat negative correlation between an Italian
               person’s agreeableness score and her/his verbosity ratio.

     The difference between male and female verbosity ratios was minimal on the Span-
ish training corpus (table 2 on page 5). However, we observed a larger difference when
it came to age groups, with the highest verbosity ratio being for age group 25-34 (with
a median of 0.70) and the lowest for age group 35-49 (with a median of 0.67).


                          Table 2: Verbosity ratios on Spanish

                       Group                 Median  Mean    Std
                       female (all ages)     0.6937  0.6817  0.0483
                       male (all ages)       0.6901  0.6855  0.0623
                       18-24 (both genders)  0.6844  0.6858  0.0645
                       25-34 (both genders)  0.7016  0.6873  0.0514
                       35-49 (both genders)  0.6693  0.6787  0.0528
                       50-xx (both genders)  0.6859  0.6715  0.0614



   We also observed a noticeable difference on the Italian corpus (table 3 on page 5),
this time between genders. Female users tended to have a lower verbosity ratio (with a
median of 0.69), while males had a median verbosity ratio of 0.72.


                           Table 3: Verbosity ratios on Italian

                             Gender  Median  Mean    Std
                             female  0.6870  0.6891  0.0328
                             male    0.7223  0.7117  0.0874



   As for the Dutch training corpus (table 4 on page 5), the difference between male
and female verbosity ratios was again minimal, with a difference between medians of
under 0.02.


                           Table 4: Verbosity ratios on Dutch

                             Gender  Median  Mean    Std
                             female  0.6995  0.6967  0.0502
                             male    0.6822  0.6934  0.0694



   Given these inconclusive findings, we decided to use a combination of tf-idf on
character n-grams with verbosity scores, which improved cross-validation results over
models based on the same features taken separately.


3   Cross-Validation Results


                   Table 5: TfidfVectorizer parameters and results on English

        Subtask        Range  Max-df  Min-df  Sublinear tf  Vocab.  CV result  Result
        gender         1, 3   0.75    0.17    TRUE          3211    78.94%     76.76%
        age            3, 5   0.98    0.14    FALSE         13677   75.65%     78.87%
        stable         2, 6   N/A     N/A     TRUE          773075  0.1825     0.1951
        agreeable      2, 6   N/A     N/A     FALSE         773075  0.1411     0.1396
        extroverted    2, 6   N/A     N/A     TRUE          773075  0.1359     0.1318
        conscientious  2, 6   N/A     N/A     TRUE          773075  0.131      0.1297
        open           2, 6   N/A     N/A     TRUE          773075  0.1193     0.1246




                   Table 6: TfidfVectorizer parameters and results on Spanish

        Subtask        Range  Max-df  Min-df  Sublinear tf  Vocab.  CV result  Result
        gender         2, 6   0.85    0.15    FALSE         20649   88%        87.5%
        age            1, 3   0.82    0.07    FALSE         6540    73%        75%
        stable         2, 6   N/A     N/A     TRUE          563605  0.1812     0.1816
        agreeable      2, 6   N/A     N/A     FALSE         563605  0.1478     0.1501
        extroverted    2, 6   N/A     N/A     TRUE          563605  0.1517     0.1703
        conscientious  1, 3   0.94    0.07    TRUE          431     0.1137     0.1559
        open           2, 6   N/A     N/A     FALSE         563605  0.1421     0.1417



    Comparing the cross-validation results to the final test results, we can see signs of
overfitting only in some of the cases in which we used relatively more features. Overall,
our models generalized well when the number of features was smaller. LinearSVC() and
Ridge() allowed us to use sparse matrices, which meant we did not have to transform to
dense matrices (which would have occupied too much memory) or reduce dimensions
(which is a computationally expensive operation).
    As we stated before, we concatenated each user’s tweets for the classification tasks,
while for the regression tasks we used each individual tweet. This led, on average, to a
smaller vocabulary for the classification tasks.
    On the English corpus (table 5 on page 6), our system over-fitted slightly on the
gender, stable and open tasks. On the Spanish corpus (table 6 on page 6), our system
over-fitted slightly on extroverted and conscientious tasks. On the Dutch corpus (table
7 on page 7), our system over-fitted slightly on extroverted and conscientious tasks.
    The biggest difference between cross-validation and test results was on the Italian
corpus (table 8 on page 7), where our system over-fitted on all tasks except agreeable
and extroverted, for which the test results were slightly better than the cross-validation
results. The biggest overfit was for the gender task, with a difference of 15 percentage
points between cross-validation results and test corpus results.
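
The gap between cross-validation and test results can be recomputed directly from the table values, as in the sketch below for the Italian corpus (table 8). It assumes that the gender scores are accuracies (higher is better) and that the personality scores are errors (lower is better); the function name is hypothetical.

```python
# (cross-validation result, test result) copied from table 8 (Italian).
italian = {
    "gender": (78.94, 63.89),          # accuracy in percent
    "stable": (0.1502, 0.1913),        # regression scores, assumed to be errors
    "agreeable": (0.1227, 0.122),
    "extroverted": (0.1191, 0.1141),
    "conscientious": (0.101, 0.114),
    "open": (0.1298, 0.1438),
}


def generalization_gap(task, cv, test):
    """Positive values mean the test result is worse than the CV result."""
    if task == "gender":
        return cv - test  # accuracy dropped on the test set
    return test - cv      # error grew on the test set


gaps = {
    task: round(generalization_gap(task, cv, test), 4)
    for task, (cv, test) in italian.items()
}

for task, gap in gaps.items():
    print(f"{task:14s} {gap:+.4f}")
```

The gender gap is expressed in percentage points, while the gaps for the personality traits are in the units of the regression score, so the two kinds of gap are not directly comparable.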

                   Table 7: TfidfVectorizer parameters and results on Dutch

        Subtask        Range  Max-df  Min-df  Sublinear tf  Vocab.  CV result  Result
        gender         1, 3   0.95    0.17    FALSE         3663    76.47%     84.38%
        stable         1, 3   N/A     N/A     TRUE          17015   0.1592     0.1405
        agreeable      2, 6   N/A     N/A     FALSE         237795  0.123      0.1114
        extroverted    2, 6   N/A     N/A     TRUE          237795  0.1075     0.131
        conscientious  2, 6   N/A     N/A     FALSE         237795  0.0964     0.1147
        open           3, 5   N/A     N/A     TRUE          127649  0.0915     0.0846

                   Table 8: TfidfVectorizer parameters and results on Italian

        Subtask        Range  Max-df  Min-df  Sublinear tf  Vocab.  CV result  Result
        gender         5, 7   0.72    0.1     TRUE          31745   78.94%     63.89%
        stable         3, 5   N/A     N/A     TRUE          165628  0.1502     0.1913
        agreeable      3, 5   N/A     N/A     FALSE         165628  0.1227     0.122
        extroverted    2, 4   N/A     N/A     FALSE         74719   0.1191     0.1141
        conscientious  3, 5   N/A     N/A     FALSE         165628  0.101      0.114
        open           2, 6   N/A     N/A     FALSE         307144  0.1298     0.1438


4   Conclusions
Based on our results, we conclude that a combination of simple features like tf-idf and
verbosity ratios obtains reasonable results that generalize well. Comparing our approach
across all corpora, we found that this solution worked best as a regressor for the Dutch
corpus and as a classifier for the Spanish corpus. We found that the best tf-idf fea-
tures are character-level n-grams, with n-gram ranges of up to (2, 6). Above this
threshold, the system seemed to overfit. We also found that there is at best a monotonic
relationship between verbosity ratios and personality scores. Nevertheless, combining
them with other, many-dimensional features, like tf-idf matrices, improves results and
generalizes well.


References
 1. Bird, S., Loper, E., Klein, E.: Natural Language Processing with Python. O’Reilly Media
    Inc. (2009)
 2. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: Liblinear: A library for large
    linear classification. Journal of Machine Learning Research 9, 1871–1874 (June 2008)
 3. Golbeck, J., Robles, C., Edmondson, M., Turner, K.: Predicting personality from Twitter. In:
    SocialCom/PASSAT. pp. 149–156. IEEE (2011),
    http://dblp.uni-trier.de/db/conf/socialcom/socialcom2011.html#GolbeckRET11
 4. Mairesse, F., Walker, M.A., Mehl, M.R., Moore, R.K.: Using linguistic cues for the
    automatic recognition of personality in conversation and text. Journal of Artificial
    Intelligence Research (JAIR) pp. 457–500 (2007)
 5. McCrae, R.R., Costa, P.T.: Personality in Adulthood: A Five-Factor Theory Perspective
    (2nd ed.). New York: Guilford (2003)
 6. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M.,
    Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D.,
    Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal
    of Machine Learning Research 12, 2825–2830 (Oct 2011)
 7. Quercia, D., Kosinski, M., Stillwell, D., Crowcroft, J.: Our Twitter profiles, our selves:
    Predicting personality with Twitter. In: Proceedings of the Third International Conference
    on Social Computing (SocialCom) and the Third International Conference on Privacy,
    Security, Risk and Trust (PASSAT). pp. 180–185. IEEE (Oct 2011),
    http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6113111&tag=1
 8. Rangel, F., Rosso, P., Potthast, M., Stein, B., Daelemans, W.: Overview of the 3rd author
    profiling task at PAN 2015. In: Cappellato, L., Ferro, N., Jones, G., San Juan, E. (eds.) CLEF
    2015 Labs and Workshops, Notebook Papers. CEUR-WS.org (2015)
 9. Roozmand, O., Ghasem-Aghaee, N., Nematbakhsh, M., Baraani, A., Hofstede, G.:
    Computational modeling of uncertainty avoidance in consumer behavior. International
    Journal of Research and Reviews in Computer Science pp. 18–26 (April 2011)
10. Vinciarelli, A., Mohammadi, G.: A survey of personality computing. T. Affective
    Computing 5(3), 273–291 (2014),
    http://dx.doi.org/10.1109/TAFFC.2014.2330816