A Random Forest Approach for Authorship Profiling

       Alonso Palomino-Garibay1 , Adolfo T. Camacho-González1 , Ricardo A.
Fierro-Villaneda2 , Irazú Hernández-Farias3 , Davide Buscaldi4 , and Ivan V. Meza-Ruiz2
                                        1
                                             Facultad de Ciencias
          2
              Instituto de Investigaciones en Matematicas Aplicadas y en Sistemas (IIMAS)
                           Universidad Nacional Autonoma de Mexico (UNAM)
                                         Ciudad de Mexico, Mexico
                         3
                           Pattern Recognition and Human Language Technology,
                                    Universitat Politécnica de Valencia
                                               Valencia, Spain
                     4
                       Laboratoire d’Informatique de Paris Nord, CNRS (UMR 7030)
                       Universite Paris 13, Sorbonne Paris Cité, Villetaneuse, France


         Abstract In this paper we present our approach to extract profile information from
         anonymized tweets for the author profiling task at PAN 2015 [10]. Particularly we
         explore the versatility of random forest classifiers for the genre and age groups
         information and random forest regressions to score important aspects of the
         personality of a user. Furthermore we propose a set of features tailored for this
         task based on characteristics of the twitters. In particular, our approach relies on
         previous proposed features for sentiment analysis tasks.


Keywords: Author Profiling, Random forest, Random Forest Regression, NLP, Machine
Learning.


1     Introduction

Authorship profiling exploits the sociolinguistic observations of particular spoken and
written language that different groups of people use. However to extract important infor-
mation about an author (e.g. demographics, personality and cultural background) just
by analyzing raw text has a high potential number of applications from market research
to forensics. From a marketing perspective recommendation systems which are vital part
of today’s Web can benefit of extract the profile dimensions of potential costumers to
improve the way recommendations are performed. Moreover large corporations may be
attracted to know what type of people like or dislike their products, based on analysis of
blogs and online product reviews. From a forensic point of view authorship profiling can
help to identify characteristics of crime perpetrators when there are many or few specific
suspects to consider [1]
    In this edition of the PAN 2015 Author Profiling, the task was formally defined as
follows1 :
 1
     As described in the official website of the competition http://pan.webis.de/(2015).
        This task is about predicting an author’s demographics from her writing.
    Participants will be provided with Twitter tweets in English and Spanish to
    predict age, gender and personality traits. Moreover, they will be provided also
    with tweets in Italian and Dutch and asked to predict the gender and personality.
    Our approach proposes to use classifiers for the age and gender information and a set
of regressors for the personality traits: extroverted, stable, agreeable, conscientious
and open. In particular these traits are specified by a score. In this work we explore the
use of Random Forest for both aspects of the task, classification and regression [3].
    Our approach heavily depends on tailored features for the task. We have three types
of features: lexical, twitter statistics and word list based . The lexical corresponds to
features extracted over the whole vocabulary of the tweets. Statistic of the tweets count
different aspects of the typical format of tweets; for instance the use of for mention
of other users, or # for the marking of the topic of the tweet. The word list features
correspond to total scores or frequencies of the use of terms within a tweet. For this type
of feature we only consider specific terms from different word lists. An important part
of these word lists is based on previous research on sentiment analysis. We explore the
used of terms which determine degrees of polarity, irony or affect.
    This paper is organized as follows: In the second section we give a complete de-
scription of the designed features for this task. In the third section we describe our
methodology for authorship profiling. In the fourth section we describe the corpora pro-
vided by the PAN workshop 2015. In the fifth section we show the results, in particular
we evaluate the performance of the system with accuracy metric.

2   Feature Engineering
Text representation is fundamental and indispensable for automatic information process-
ing, in our approach we extract a set of tailored features from a collection of tweets
of a particular user. Although different speech communities might tend to write about
different topics and in different ways, there are two types of features used for authorship
profiling: content-based and style-based. The following list presents the used features:
 1. BOW/TF-IDF:
    Based on the Vector Space Model, tweets are represented as a vector where each
    component is associated with a particular word from the corpus vocabulary. Typ-
    ically, each component value is assigned using the information retrieval measure
    tf-idf this technique has been extensively used in text mining, information retrieval
    and NLP to classify text.

 2. POS (Parts of speech):
    Unigram and bigrams of sequences of POS tags. These were obtained using the
    Core NLP Standford POS tagger (English and Spanish) [7], and the Tree Tagger
    (Italian and Dutch) [12].

 3. Irony detection words list [11]:
    Irony is difficult to be defined, generally humor denotes this rhetorical device, struc-
    tural ambiguity can be represented by the dispersion in the number of combinations
    among the words that constitute humor examples [11]. For this feature, frequency
    and total score of words in tweets from an irony detection counter which uses a
    predefined word list where essential to match this event. Two dimension of the list
    use the counter factuality and the temporal compression.

 4. Sentiment polarity word list [8]:
    For this feature we extracted the total score of positive and negative terms in tweets
    from predefined word list, all the occurrences were represented as a frequency vector.

 5. Sentiword word list [2]: For this feature we use SENTIWORDNET 3.0, a well
    studied lexical resource to model the semantic orientation of sentiment classification
    and opinion mining applications, The total score of positive and negative terms in
    tweets from SENTIWORDNET 3.0. that are in users tweets are counted, for positive
    and negative instances. Translation for Spanish and Italian language support where
    crucial.

 6. Affect word list [14]:
    The total score of affect terms in tweets from a word list. All the words from the
    user tweets that occurred in the list and have a greater or lower score of affect terms
    are counted into a matrix. This can purvey evidence of the personality of the user.

 7. Taboo word list:
    Frequency of taboo words used in predefined list. Slang words are frequent in
    younger age groups, particularly this can be a remarkable feature that may show the
    type of personality of an author.

 8. Emoticons:
    Frequency of emoticons used from predefined list. This feature can provide the type
    of personality as well as the age group of a user. All the occurrences of the terms of
    that match in the profiles are represented as a feature vector.

 9. Punctuation:
    Frequency of punctuation signs from a predefined list. This can catch the type of
    discourse structure and semantics of a user.

10. Links:
    A frequency of domain links is helpful to match sites that contain interesting topics
    for the different demographic dimensions, if the tweet is repeated several times with
    a link this can be considered as a primary source of information.

11. Tweets statistics:
    This feature extract diverse types of statistics from tweets. Number of words, letters,
    capital letters, capital letter in initial position, numbers, lower cases, sentences. RT
    for retweets, for citations of usernames, and # for self defined topic of the tweet.
    Stylometric analysis is useful to identify gender and age groups [5].
Besides the previous engineered features we also tested with positive and negative
frequency terms from [6] and a histogram of the Jaccard similarity coefficients among
users tweets. Empirically we found that none of these features helped the for the task,
since our metrics fall after being evaluated with this features.
    Table 1 shows the final configuration of the features per language.


              Feature    English       Spanish        Italian       Dutch
              1           tfidf           tfidf          tfidf         tfidf
              2          1gram          Bigram         Bigram        Bigram
              3        Freq/Score                    Freq/Score
              4      Score pos/neg Score pos/neg Score pos/neg Score pos/neg
              5      Score pos/neg Score pos/neg Score pos/neg Score pos/neg
              6          Socre           Score          Score         Score
              7                                                        Freq
              8           Freq           Freq           Freq           Freq
              9           Freq           Freq           Freq           Freq
              10          Freq           Freq           Freq           Freq
              11          Freq           Freq            Stat          Freq
                    Table 1. Features and configuration used per language


3   Approach

Our approach to authorship profiling relies in applying machine learning techniques
to map text into categories. First we take the lexical corpora provided by PAN-2015
and labeled according to a category in function of a profile or user. For instance, for
author gender analysis we labeled as male or female each set of tweets. From the above
proposed features we yield a document-term matrix, this means that each tweet was
represented as a numerical vector in order to abstract features.
    Then a supervised method computes classifiers and regressors based on the random
forest algorithm, to the training examples. Finally the predictive ability of both (classifi-
cation and regression) is tested on the testing data. We built two classifiers for English
and Spanish (gender and age) and one for Italian and Dutch (gender). Additionally
we created five regressors one per personality trait per language. Each classifier and
regressor was independent from each other. Random forests have outstanding in recent
years since the classification accuracy of this type of algorithms have outperformed
SVMs and other machine learning algorithms in other knowledge areas for instance
bio-informatics and computational biology creating classification methods for cancer
diagnosis based on micro-array data [13]. We assume that this type of ensemble methods
hold true for NLP tasks. The goal of ensemble methods is to combine the predictions of
several base estimators built with a given learning algorithm in order to improve general-
izability/robustness over a single estimator [9]. For this task we focused in averaging
methods, which are learning algorithms that yield several estimators independently and
then average their outcomes. Intuitively the averaging estimator is better than any single
base estimators, as a result of reduced variance.
    Briefly in Random Forests (both, regression and classification), each estimator in
the ensemble is built from a bootstrap sample from the training set. When the algorithm
splits a node during the generation of the decision tree, the chosen split is no longer
the best split of all the features. Rather, the split that is selected is the best split of a
random subset of the features. Due this randomness, the bias of the forest usually slightly
increases but, due to averaging, its variance decreases, usually more than compensating
for the increase in bias, finally this produces a better model [9].
    The training was performed with Scikit-Learn, a library that provides a compre-
hensive suite of machine learning tools for Python. It extends this general-purpose
programming language with machine learning operations: learning algorithms, pre-
processing tools, model selection procedures and composition mechanisms to create
complex machine learning work-flows [9].


3.1    Parameters


For both, regression and classification n_estimatorswhich is the number of trees
in the forest, if n_estimators is larger accuracy will increase, however this will
increase the complexity to compute an prediction output. By the other hand if a lower
amount of estimators is used the variance will reduce, but it will increase de bias of
the model. Empirically we found that a good set up for classification of genre was:
n_estimators = 2000.


4     Corpora

The corpora consists of tweets in four languages: English, Spanish, Italian, and Dutch ev-
ery language has a collection of tweets from different users. The tweets were anonymized
by removing the username information from the author and the mention to other user-
names. The tweets as expected contain orthographic and typographic errors, colloqui-
alisms, jargon and meta information such as re-tweets and link information. Not all the
tweets were written by the author, for instance re-tweets and some tweets produced
by automatic systems associated to the user. Both gender and age demographics were
provided by the users answering an online test, however the personality trait scores were
extracted using a personality test.2
    The gender variable can take two values: male and female. The age variable four:
18-24, 25-34, 35-49 and 50-xx. While the five personality traits are assessed by a score
which goes from −0.5 to 0.5. Table 2 presents the sizes and number of tweets per user
available in the training corpora provided by the organizers of the task [10].

 2
     Based on website: http://your-personality-test.com/
                                 Language Number Tweets
                                            of users per user
                                  English     152      100
                                  Spanish     100      100
                                   Italian     38      100
                                   Dutch       34      100
                            Table 2. Length of corpus per language


5    Results
Using a cross validation setting over the corpora we evaluate the performance of our
system as follows. For gender and age we report F1-score and root mean square error
(RMSE) for the personalities traits.


                             Trait      English Spanish Italian Dutch
                           Gender    0.706      0.750    0.773 0.765
                          Age groups 0.612      0.465    N/A N/A
                          Extroverted 0.023 0.024 0.018 0.014
                             Stable    0.041 0.036 0.025 0.027
                             Open      0.018 0.025 0.025 0.014
                         Conscientious 0.021 0.023 0.013 0.012
                           Agreeable 0.021 0.020 0.023 0.020
Table 3. Performance in training/development set. F-score for gender and age classification, and
RMSE scores for personality traits.


                                       pan15-author-profiling-test
Language GLOBAL RMSE Age Agreeable Both Conscientious Extroverted Gender Open Stable
Dutch    0.6703    0.1595 NA        0.1598      NA     0.1787        0.1604   0.5000 0.1055 0.1928
English 0.5217     0.1749 0.4085 0.1572         0.2183 0.1526        0.1676   0.5000 0.1582 0.2392
Italian  0.6682    0.1636 NA        0.1463      NA     0.1553        0.1336   0.5000 0.1831 0.1997
Spanish 0.6215     0.1660 0.5114 0.1536         0.4091 0.1473        0.1729   0.8295 0.1530 0.2035
              Table 4. Final results on test produced by the TIRA system [4].


6    Conclusions
In this paper we described our methodology for authorship profiling with PAN-2015
corpora. Author profiling has growing importance for national security, criminal investi-
gations, and marketing research [1]. Our methodology uses random forests model for
classification and regression. For this work we build a baseline system for the author
profiling task that uses set of general features.
    Our system presented some failures with the classification of the gender class which
affected our performance. Additionally, we believe that the training of models was
over-fitted by the number of estimators in both classification and regression Random
Forest models.
    For further research we plan to perform a better feature engineering by adding
more specific features of content and style for the authorship and to implement a hyper-
parameter optimization to tune the models.

Acknowledgments We acknowledge Rodrigo Sanabria contributions to the source code
in the early stages of the project.


References
 1. Argamon, S., Koppel, M., Pennebaker, J.W., Schler, J.: Automatically profiling the author of
    an anonymous text. Communications of the ACM 52(2), 119–123 (2009)
 2. Baccianella, S., Esuli, A., Sebastiani, F.: Sentiwordnet 3.0: An enhanced lexical resource for
    sentiment analysis and opinion mining.
 3. Breiman, L.: Random forests. Machine learning 45(1), 5–32 (2001)
 4. Gollub, T., Stein, B., Burrows, S.: Ousting Ivory Tower Research: Towards a Web Framework
    for Providing Experiments as a Service. In: Hersh, B., Callan, J., Maarek, Y., Sanderson, M.
    (eds.) 35th International ACM Conference on Research and Development in Information
    Retrieval (SIGIR 12). pp. 1125–1126. ACM (Aug 2012)
 5. Goswami, S., Sarkar, S., Rustagi, M.: Stylometric analysis of bloggers’ age and gender. In:
    Third International AAAI Conference on Weblogs and Social Media (2009)
 6. Hu, M., Liu, B.: Mining and summarizing customer reviews. In: Proceedings of the tenth ACM
    SIGKDD international conference on Knowledge discovery and data mining. pp. 168–177.
    ACM (2004)
 7. Manning, C.D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S.J., McClosky, D.: The stanford
    corenlp natural language processing toolkit. In: Proceedings of 52nd Annual Meeting of the
    Association for Computational Linguistics: System Demonstrations. pp. 55–60 (2014)
 8. Nielsen, F.Å.: A new anew: Evaluation of a word list for sentiment analysis in microblogs.
    arXiv preprint arXiv:1103.2903 (2011)
 9. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M.,
    Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher,
    M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine
    Learning Research 12, 2825–2830 (2011)
10. Rangel, F., Rosso, P., Potthast, M., Stein, B., Daelemans, W.: In: Cappellato L., Ferro N.,
    Gareth J. and San Juan E. (Eds). (Eds.) CLEF 2015 Labs and Workshops, Notebook Papers.
    CEUR-WS.org, (2015).
11. Reyes, A., Rosso, P., Veale, T.: A multidimensional approach for detecting irony in twitter.
    Language resources and evaluation 47(1), 239–268 (2013)
12. Schmid, H.: Improvements in part-of-speech tagging with an application t german. In: In
    Proceedings of the ACL SIGDAT-Workshop. Citeseer (1995)
13. Statnikov, A., Aliferis, C.F.: Are random forests better than support vector machines for
    microarray-based cancer classification? In: AMIA annual symposium proceedings. vol. 2007,
    p. 686. American Medical Informatics Association (2007)
14. Whissell, C., Fournier, M., Pelland, R., Weir, D., Makarec, K.: A dictionary of affect in
    language: Iv. reliability, validity, and applications. Perceptual and Motor Skills 62(3), 875–
    888 (1986)