      Author profiling using stylometric and structural
                     feature groupings
                        Notebook for PAN at CLEF 2015

           Andreas Grivas, Anastasia Krithara, and George Giannakopoulos

     Institute of Informatics and Telecommunications, NCSR Demokritos, Athens, Greece
                          {agrv, akrithara, ggianna}@iit.demokritos.gr



       Abstract In this paper we present an approach to the task of author profiling.
       We propose a coherent grouping of features, combined with appropriate prepro-
       cessing steps for each group. The two groups we used were stylometric and structural,
       featuring, among others, trigrams and counts of Twitter-specific characteristics. We
       address gender and age prediction as classification tasks and personality predic-
       tion as a regression problem, using Support Vector Machines and Support Vector
       Machine Regression respectively on documents created by joining each user's
       tweets.


1   Introduction
PAN, held as part of the CLEF conference, is an evaluation lab on uncovering plagia-
rism, authorship, and social software misuse. In 2015, PAN featured three tasks: plagiarism
detection, author identification, and author profiling.
    The 2015 Author Profiling task challenged participants to predict gender, age, and
five personality traits (extroversion, stability, openness, agreeableness, conscientiousness)
in four languages (English, Spanish, Italian, and Dutch).
    It featured quite a few novelties compared to the 2014 task: the addition of five per-
sonality traits to be estimated, a change from five to four classes in the age esti-
mation task, as well as a reduction in the size of the training dataset from 306 instances
to 152 instances (user profiles).
    In this paper we present an approach for tackling the author profiling task. In the
next section the different steps of our approach are presented in detail, while in Section
3 the evaluation of the method is discussed.


2   Approach
For the author profiling task we proposed a coherent grouping of features combined
with appropriate preprocessing steps for each group. The idea was to create an easily
comprehensible, extensible and parameterizable framework for testing many different
feature and preprocessing combinations.
    We mainly focused on the gender and age subtasks, as can be seen from the general
approach taken towards the personality traits, where we used the same features for all
five traits.
     The architecture of the system we developed is portrayed in Figure 1. We only
sketch the outline of the system here; we go into more detail in the following sections.
     The layers that can be seen correspond to the data structuring, preprocessing, feature
extraction and classification steps that are carried out for the training and test cases. We
follow a different preprocessing pipeline depending on the group of features we want to
extract. We then combine the two groups, apply normalization and feature scaling and
move on to the classification step where we train our model.
     In the data structuring part of the system we create a document for each user by
joining all of their tweets from the dataset.
     This document is then preprocessed for stylometric feature extraction.
We initially remove all HTML tags found in the document, and then we clear all Twitter-
specific characteristics and tokens, such as hashtags, @replies, and URLs, from the
text. Using this cleaned form, we then check for exact duplicate tweets and discard any
that are found.
     We then extract structural features from the unprocessed document and stylometric
features from the processed edition of the document. After concatenating these features
together, we normalize and scale their values, in order to avoid complications that can
arise in the classification stage when features differ greatly in numeric range.
     The last step is the classification stage, where we train a Support Vector Machine
or a Support Vector Machine Regression model, depending on the subtask.


2.1   Features

In the tasks of Author Profiling and Author Identification, many different types of fea-
tures have been deemed important discriminative factors. In the same spirit as [5], we
tried to group features together in a coherent way, such that we could perform suitable
preprocessing steps for each group. Also, by grouping features in such a way, it
becomes easier later on to split the task into separate classification subtasks and use
a voting scheme to obtain a final result.
    In this work, we created two groups of features, namely stylometric and struc-
tural features. The structural group aimed to trace characteristics of the text that were
interdependent with the use of the Twitter platform, such as counts of @mentions,
hashtags, and URLs in a user's tweets.
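
    Such counts can be obtained with simple pattern matching; the regular expressions
in the sketch below are illustrative assumptions, not the exact patterns used in our
system.

```python
import re

def structural_features(raw_tweets):
    """Counts of Twitter-specific markers over a user's raw tweets.
    The regexes are illustrative assumptions, not the system's exact patterns."""
    text = "\n".join(raw_tweets)
    return {
        "mentions": len(re.findall(r"@\w+", text)),
        "hashtags": len(re.findall(r"#\w+", text)),
        "urls": len(re.findall(r"https?://\S+", text)),
    }

feats = structural_features(["Hi @bob, see https://t.co/x #pan15",
                             "no markers here"])
# feats == {"mentions": 1, "hashtags": 1, "urls": 1}
```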
    The stylometric group of features tried to capture characteristics of the content that a
user generates in a non-automatic way. Different features were tested, such as tf-idf of
ngrams, bag of ngrams, ngram graphs [1], bag of words, tf-idf of words, bag of smileys
(emoticons), counts of words written entirely in capital letters, and counts of words of
length 1–20.
    Table 1 summarizes which of the features mentioned above were used for each
subtask.
    We based the stylometric aspect of our approach on trigrams, since they capture
stylometric features well and generalize better to unseen text when a small training
set has been used, compared to a bag-of-words approach.
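
    The text does not state whether the trigrams are character- or word-level; assuming
character trigrams, a common choice for stylometry, the extractor can be sketched with
scikit-learn as follows:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Character trigrams fire on sub-word patterns (suffixes, punctuation
# habits), so they still produce overlapping features for words that
# never appeared in the small training set.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))

docs = ["she was walking slowly", "he walks quickly"]
X = vectorizer.fit_transform(docs)

# "wal" is shared by "walking" and "walks", although a bag of words
# would treat the two words as unrelated.
assert "wal" in vectorizer.vocabulary_
```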
   [Pipeline diagram: raw tweets pass through clean html, detwittify, and remove
   duplicates to produce clean tweets; structural features are extracted from the raw
   tweets and stylometric features from the clean tweets; the extracted features are
   concatenated as [X1|X2|...|Xn], then normalized and scaled, and finally passed to
   classification.]
                          Figure 1: Architecture of the system


2.2   Preprocessing
Preprocessing is an important step which cannot be disregarded in this task. As the texts
are tweets, they contain platform-specific information entangled in the text (hashtags,
@replies, and URL links). Therefore, an important decision is how to correctly deal
with this bias.
    Tweets also contain a large amount of quotations and repeated robot text, which
may be structurally important but should be stylometrically insignificant.
    In our approach, a different preprocessing pipeline was applied to each group of fea-
tures, as described above. No preprocessing was done for structural features. Sty-
lometric feature preprocessing encompassed removing any HTML tags found in the tweets,
removing Twitter bias such as @mentions, hashtags, and URLs, and then removing exact
duplicate tweets once the Twitter-specific text was gone. To elaborate on removing
Twitter bias, @usernames and URLs were deleted, while hashtags were stripped
of the hashtag character #.
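
    These cleaning steps can be sketched as follows; the regular expressions are
illustrative assumptions rather than the exact ones used.

```python
import re

def detwittify(tweet):
    """Remove Twitter bias: delete @usernames and URLs, strip the '#'
    from hashtags but keep the word. Illustrative regexes."""
    tweet = re.sub(r"https?://\S+", "", tweet)
    tweet = re.sub(r"@\w+", "", tweet)
    tweet = tweet.replace("#", "")
    return " ".join(tweet.split())  # collapse leftover whitespace

def remove_duplicates(tweets):
    """Discard exact duplicate tweets after cleaning, keeping order."""
    seen, kept = set(), []
    for t in map(detwittify, tweets):
        if t not in seen:
            seen.add(t)
            kept.append(t)
    return kept

print(detwittify("@ann loving #CLEF2015 https://t.co/abc"))
# prints "loving CLEF2015"
```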
    In some approaches [2] that use tweets as a text source for classification, tweets are
joined in order to create larger documents of text. For this task we joined all tweets
for each user; however, it may make sense to join fewer tweets per document, creating
more samples for each user, and then classify the user according to the label that
receives the majority of the predictions.

                            Table 1: Features used for each subtask
                   Subtask             Group        Feature
                   gender              Stylometry   tf-idf trigrams
                                       Structural   –
                   age                 Stylometry   tf-idf trigrams, count word length
                                       Structural   count @, count hashtags, count URLs
                   personality traits  Stylometry   tf-idf trigrams, count word length,
                                                    count caps words
                                       Structural   count @, count hashtags, count URLs


       [Diagram: profiling features split into two groups: Structural (number of
       hashtags, number of links, number of mentions) and Stylometry (tf-idf of
       ngrams, bag of smileys, ngram graphs, word length, number of uppercase words)]

                                  Figure 2: Groups of features

2.3   Classifiers
Regarding classification and regression, we used a Support Vector Machine (SVM) with
an RBF kernel and an SVM with a linear kernel for the age and gender subtasks respec-
tively. For the age subtask, we also employed class weights inversely
proportional to class frequencies, since the distribution of instances over the classes was
skewed. We used the scikit-learn [3] implementations of the aforementioned machine
learning algorithms.
    For the personality traits subtask, Support Vector Machine Regression (SVR)
with a linear kernel was used.
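    A sketch of the three models with scikit-learn; C and gamma are not reported here,
so library defaults are assumed, and the training data below is made up.

```python
from sklearn.svm import SVC, SVR

# Kernels and class weighting follow the text; other hyperparameters
# are scikit-learn defaults (an assumption).
gender_clf = SVC(kernel="linear")
age_clf = SVC(kernel="rbf", class_weight="balanced")  # weights inversely
# proportional to class frequencies, for the skewed age distribution
trait_reg = SVR(kernel="linear")  # fit once per personality trait

# Tiny illustrative fit on made-up feature vectors
X = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]]
y = ["F", "F", "M", "M"]
gender_clf.fit(X, y)
print(gender_clf.predict([[0.95, 0.05]]))  # prints ['M']
```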
    For each subtask the features were concatenated and then scaled and normal-
ized. Scaling was performed per feature so that the values had zero mean and unit
variance. Normalization was performed along instances so that each row had unit norm.
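    This scale-then-normalize step corresponds to scikit-learn's StandardScaler (per
feature) followed by Normalizer (per instance); a minimal sketch on made-up data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer

# Made-up feature matrix: 3 users, 2 features on very different scales
X = np.array([[1.0, 200.0],
              [3.0, 500.0],
              [5.0, 600.0]])

X_scaled = StandardScaler().fit_transform(X)          # per column: mean 0, variance 1
X_normed = Normalizer(norm="l2").transform(X_scaled)  # per row: unit L2 norm

assert np.allclose(X_scaled.mean(axis=0), 0.0)
assert np.allclose(np.linalg.norm(X_normed, axis=1), 1.0)
```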
   The above classifiers and combination of features were used for all languages of the
challenge, namely English, Spanish, Dutch and Italian.


3     Evaluation
3.1        Dataset
The PAN 2015 dataset featured fewer instances for training (152 users) than the earlier
author profiling tasks. The distribution of age and gender over the instances of the
training set can be seen in Figure 3.


       [Bar charts of training label counts: gender (F, M) on the left and age
       (18-24, 25-34, 35-49, 50-XX) on the right]


                     Figure 3: Age and gender distribution over training samples




3.2        Performance Measures
In the context of PAN 2015, systems were evaluated using accuracy for the gender and
age subtasks and average Root Mean Squared Error (RMSE) for the personality subtask.
In order to obtain a global ranking, the following formula was used:

                   score = ((1 − RMSE) + joint accuracy) / 2                    (1)
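
As a worked example of Equation 1 (the input values below are hypothetical):

```python
def global_score(rmse, joint_accuracy):
    """Equation (1): the mean of (1 - RMSE) and the joint accuracy."""
    return ((1.0 - rmse) + joint_accuracy) / 2.0

# e.g. an average RMSE of 0.2 over the traits and a joint accuracy of 0.6
score = global_score(0.2, 0.6)  # = (0.8 + 0.6) / 2 ≈ 0.7
```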

3.3        Results
Our approach was among the top two based on accuracy in the gender classification
subtask in all languages, as can be seen in Figure 4. This hints that
trigrams can capture gender information regardless of language and generalize well for
datasets of this size.
    However, the results in Figure 5 show that our system performed worse in the
case of age classification, where more features that were considered helpful were used.
    Using the scoring procedure described in Equation 1, our system scored 3rd overall
in the profiling task. An overview of the approaches and results for the author
profiling task can be found in [4].
3.4                  Future Work

In the context of our approach, we will further evaluate the features used for the age
classification subtask, in order to examine which of them are useful and which ac-
tually deteriorate the performance on the test set. We will also develop a
more sophisticated approach for personality trait identification, considering more spe-
cific features and preprocessing for each personality trait separately. Finally, we will
attempt to create more documents for each user by joining fewer tweets per docu-
ment, and then arrive at a conclusion by using the average decision over all of the user's
documents. It will be interesting to see the impact of this approach on the results for
each user.


       [Four bar charts of gender prediction accuracy, one per language (English,
       Spanish, Dutch, Italian), each listing the top ten participants; grivas15 ranks
       second in English, Spanish, and Italian, and first in Dutch.]


        Figure 4: Top ten participants regarding gender prediction accuracy in all languages




4                  Acknowledgments
This work was supported by the REVEAL project (http://revealproject.eu/), which has re-
ceived funding from the European Union's 7th Framework Programme for research, technol-
ogy development and demonstration under Grant Agreement No. FP7-610928.
       [Two bar charts of age prediction accuracy (English, Spanish), each listing the
       top ten participants; grivas15 ranks fifth in English and seventh in Spanish.]



               Figure 5: Top ten participants regarding age prediction accuracy in all languages


References
1. Giannakopoulos, G., Karkaletsis, V., Vouros, G., Stamatopoulos, P.: Summarization system
   evaluation revisited: N-gram graphs. ACM Trans. Speech Lang. Process. 5(3), 5:1–5:39 (Oct
   2008), http://doi.acm.org/10.1145/1410358.1410359
2. Mikros, G., Perifanos, K.: Authorship attribution in greek tweets using author’s multilevel
   n-gram profiles (2013), https://www.aaai.org/ocs/index.php/SSS/SSS13/paper/view/5714
3. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M.,
   Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher,
   M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine
   Learning Research 12, 2825–2830 (2011)
4. Rangel, F., Rosso, P., Potthast, M., Stein, B., Daelemans, W.: Overview of the 3rd author
   profiling task at PAN 2015. In: Cappellato, L., Ferro, N., Gareth, J., San Juan, E. (eds.)
   CLEF 2015 Labs and Workshops, Notebook Papers. CEUR-WS.org (2015)
5. Stamatatos, E.: A survey of modern authorship attribution methods. J. Am. Soc. Inf. Sci.
   Technol. 60(3), 538–556 (Mar 2009), http://dx.doi.org/10.1002/asi.v60:3