Using Simple Content Features for the
                      Author Profiling Task
                       Notebook for PAN at CLEF 2013

          Edson R. D. Weren, Viviane P. Moreira, and José P. M. de Oliveira

                    Institute of Informatics UFRGS - Porto Alegre - Brazil
                           {erdweren,viviane,palazzo}@inf.ufrgs.br


       Abstract This paper describes the methods we have employed to solve the author
       profiling task at PAN-2013. Our goal was to use simple features to identify
       the age group and the gender of the author of a given text. We introduce the
       features, detail how the classifiers were trained, and how the experiments were run.


1   Introduction
Author profiling deals with the problem of finding as much information as possible
about an author, just by analysing a text produced by that author. It has growing
importance in applications such as forensics, marketing and security [1].
     This paper reports on the participation of the INF-UFRGS team in the author
profiling task, which ran for the first time at CLEF 2013. In short, the task requires that
participating teams come up with approaches that take a given text as input and identify
the gender (male/female) and the age group (10s, 20s, 30s) of its author.
     As our first attempt at solving the author profiling task, our aim was to design a
simple approach in which we exploit features extracted from the contents of the texts.
The idea was to identify discriminative features and use them in a classifier that
predicts the gender and the age group of the author.

2   Identifying Author Profiles
Our underlying assumption was that authors of the same gender or age group tend
to use similar terms and that the distribution of these terms would differ across
genders and age groups. To implement this notion, all conversations were indexed using
an Information Retrieval engine, and the conversation we wish to classify was then treated
as a query. The idea is that the conversations retrieved (i.e., those most similar
to the query) will be the ones from the same gender and age group.
    The training dataset was composed of conversations (XML files) about various topics
grouped by author. Conversations were in English and Spanish and were annotated with
the gender and the age group of the author. For a complete description of the dataset,
please refer to [5]. Each conversation was represented by a set of features, namely:
−FeatureSet 1: Cosine
Cosine_10s, Cosine_20s, Cosine_30s, Cosine_female, Cosine_male.
Number of times a conversation from each gender/age group appeared in the top-k ranks
for the query composed of the keywords in the conversation. For this featureset, queries
and conversations were compared using the cosine similarity (Eq. 1). For example,
if we retrieve 10 conversations in response to a query composed of the keywords in
conversation q, and 5 of the retrieved conversations are in the 10s age group, then the
value for Cosine_10s is 5.

\[
  \mathrm{cosine}(c, q) = \frac{\vec{c} \cdot \vec{q}}{|\vec{c}|\,|\vec{q}|} \tag{1}
\]

where $\vec{c}$ and $\vec{q}$ are the vectors for the conversation and the query, respectively.
The vectors are composed of $tf_{i,c} \times IDF_i$ weights, where $tf_{i,c}$ is the frequency of
term $i$ in conversation $c$, and $IDF_i = \log \frac{N}{n(i)}$, where $N$ is the total number of
conversations in the collection and $n(i)$ is the number of conversations containing $i$.
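
As an illustration of FeatureSet 1, the following is a minimal sketch of Eq. 1 and of the top-k counting step, assuming conversations are already tokenised into lists of terms. The function names (tfidf_vectors, cosine, topk_label_counts) are hypothetical; in our experiments the ranking was produced by the retrieval engine, whose exact weighting may differ from this sketch.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return tf-idf vectors (term -> weight) and the IDF table for tokenised docs."""
    n_docs = len(docs)
    # n(i): number of conversations containing term i
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts (Eq. 1)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def topk_label_counts(query_tokens, docs, labels, k=10):
    """Count how often each class label occurs among the k most similar docs."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms unseen in the collection get weight 0 (an assumption of this sketch).
    q_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query_tokens).items()}
    ranked = sorted(range(len(docs)),
                    key=lambda i: cosine(vectors[i], q_vec),
                    reverse=True)
    return Counter(labels[i] for i in ranked[:k])


if __name__ == "__main__":
    docs = [["hello", "school", "homework"],
            ["school", "homework", "teacher"],
            ["mortgage", "office", "meeting"],
            ["office", "meeting", "salary"]]
    ages = ["10s", "10s", "30s", "30s"]
    counts = topk_label_counts(["homework", "school"], docs, ages, k=2)
    print(counts["10s"], counts["30s"])  # values for Cosine_10s and Cosine_30s
```

Calling topk_label_counts once with the age labels and once with the gender labels yields the five Cosine_* values for a conversation.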

−FeatureSet 2: Okapi
Okapi_10s, Okapi_20s, Okapi_30s, Okapi_female, Okapi_male
Similar to the previous featureset, this is the number of times a conversation from each
gender/age group appeared in the top-k ranks for the query composed by the keywords
in the conversation. For this featureset, queries and conversations were compared using
the Okapi BM25 score (Eq. 2).
\[
  \mathrm{BM25}(c, q) = \sum_{i=1}^{n} IDF_i \,
    \frac{tf_{i,c} \cdot (k_1 + 1)}{tf_{i,c} + k_1 \left(1 - b + b \, \frac{|D|}{avgdl}\right)} \tag{2}
\]

where $tf_{i,c}$ and $IDF_i$ are as in Eq. 1, $n$ is the number of terms in the query,
$|D|$ is the length (in words) of conversation $c$, $avgdl$ is the average conversation
length in the collection, and $k_1$ and $b$ are parameters that tune the importance of the
presence of each term in the query and of the length of the conversations. In our
experiments, we used $k_1 = 1.2$ and $b = 0.75$.
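
A minimal sketch of Eq. 2 is given below, assuming tokenised conversations and a precomputed document-frequency table. The function name bm25_score and its arguments are hypothetical, and summing over distinct query terms is an assumption of this sketch; the retrieval engine's own BM25 implementation may differ in details such as the IDF variant.

```python
import math
from collections import Counter


def bm25_score(query_tokens, doc_tokens, doc_freq, n_docs, avgdl, k1=1.2, b=0.75):
    """Score one conversation against a query following Eq. 2.

    doc_freq maps each term i to n(i), the number of conversations containing it.
    """
    tf = Counter(doc_tokens)
    # Length normalisation: 1 - b + b * |D| / avgdl
    length_norm = 1.0 - b + b * len(doc_tokens) / avgdl
    score = 0.0
    for term in set(query_tokens):  # assumption: each distinct query term counts once
        if tf[term] == 0 or doc_freq.get(term, 0) == 0:
            continue  # a term absent from the conversation contributes nothing
        idf = math.log(n_docs / doc_freq[term])  # IDF as defined for Eq. 1
        score += idf * tf[term] * (k1 + 1.0) / (tf[term] + k1 * length_norm)
    return score


if __name__ == "__main__":
    doc = ["b", "a", "b"]
    print(bm25_score(["b"], doc, {"a": 1, "b": 1}, n_docs=2, avgdl=3.0))
```

With b = 0.75, conversations longer than avgdl are penalised, while k1 = 1.2 limits how much repeated occurrences of a term can add to the score.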

−FeatureSet 3: Flesch-Kincaid readability tests
There are two tests that indicate the comprehension difficulty of a text: Flesch Reading
Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) [4]. They are given by Eqs. 3
and 4. Higher FRE scores indicate material that is easier to read. For example, a
text with an FRE score between 90 and 100 could be easily read by an 11-year-old,
while texts with scores below 30 would be best understood by undergraduates. FKGL
scores indicate a grade level. An FKGL of 7 indicates that the text is understandable by
a 7th grade student. Thus, the higher the FKGL score, the higher the number of years
in education required to understand the text. The idea of using these scores is to help
distinguish the age of the author. Younger authors are expected to use shorter words and
thus have a lower FKGL and a higher FRE.
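
\[
  FRE = 206.835 - 1.015 \left( \frac{\#words}{\#sentences} \right)
        - 84.6 \left( \frac{\#syllables}{\#words} \right) \tag{3}
\]

\[
  FKGL = 0.39 \left( \frac{\#words}{\#sentences} \right)
         + 11.8 \left( \frac{\#syllables}{\#words} \right) - 15.59 \tag{4}
\]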
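
Eqs. 3 and 4 can be sketched as follows. The syllable counter below is a crude vowel-group heuristic introduced only for illustration (it is not the counting method of the code we used), and the sentence and word splitting rules are likewise assumptions; all function names are hypothetical.

```python
import re


def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Eq. 3: higher values indicate text that is easier to read."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)


def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Eq. 4: approximate school grade needed to understand the text."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59


def count_syllables(word):
    """Heuristic (assumption): one syllable per run of vowels, at least one."""
    return max(1, len(re.findall(r"[aeiouyáéíóúü]+", word.lower())))


def readability(text):
    """Return (FRE, FKGL) for a text, using simple regex-based counting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters, including accented ones
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return (flesch_reading_ease(len(words), len(sentences), syllables),
            flesch_kincaid_grade(len(words), len(sentences), syllables))


if __name__ == "__main__":
    fre, fkgl = readability("The cat sat. The dog ran.")
    print(round(fre, 2), round(fkgl, 2))
```

Short sentences made of one-syllable words give a very high FRE and a very low FKGL, which is the pattern we expected from younger authors.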

Training the Classifiers: Four classifiers are necessary, since there are two languages
and two dimensions in each (age and gender). We employed a decision-tree classifier.
In all cases, the attributes were selected using the BestFirst method.
−Gender/Spanish
Training was done on 3K randomly selected conversations. The attributes used were
Cosine_female, Okapi_female, and Okapi_male.
−Age/Spanish
Since the number of conversations for the 10s age group was much smaller than the
number for the other two classes, and classifiers are known to perform better when the
number of instances in each class is balanced, we used a method known as random
oversampling. The method selects and replicates random instances from the
minority class. According to [2], this approach performs as well as more sophisticated
heuristic methods. The attributes used were Okapi_10s, Okapi_30s, FRE, and
FKGL.
−Gender/English
Analysing our attributes, we noticed that none of them was a good discriminator for
gender in English texts. The attributes used were Cosine_female, Cosine_male,
Okapi_female, Okapi_male, FRE, and FKGL.
−Age/English
The attributes used were the same as for Spanish. Since the 10s class had fewer
instances, random oversampling was applied.
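
Random oversampling, as applied to the age classifiers, can be sketched as below. The function name random_oversample is hypothetical, and replicating every smaller class up to the size of the largest class is an assumption of this sketch rather than a description of the exact ratio we used.

```python
import random
from collections import Counter


def random_oversample(instances, labels, seed=0):
    """Replicate randomly chosen instances of smaller classes until classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(instances, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(items) for items in by_class.values())
    out_x, out_y = [], []
    for y, items in by_class.items():
        # Draw (with replacement) as many extra copies as the class is missing.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for x in items + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


if __name__ == "__main__":
    feats = [[1, 0], [2, 1], [3, 1], [4, 2], [5, 2], [6, 3]]
    ages = ["20s", "20s", "20s", "30s", "30s", "10s"]
    new_feats, new_ages = random_oversample(feats, ages)
    print(Counter(new_ages))
```

Oversampling should be applied to training instances only, so that replicated conversations do not leak into the data used for evaluation.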

3     Experiments
The steps taken to process the datasets and run our experiments were the following:
1) Pre-process the conversations in the training data to tokenise and remove tags (no
stemming or stopword removal was performed).
2) Randomly choose 10% of the conversations to be used as queries.
3) Index the remaining 90% of the pre-processed conversations with a retrieval engine.
The system we used was Zettair¹, a compact and fast search engine developed
by RMIT University (Australia). It performs a series of IR tasks such as indexing and
matching. Zettair implements several methods for ranking documents in response to
queries, including cosine and Okapi BM25.
4) Compute FeatureSets 1 and 2 using the results from the queries submitted to Zettair.
The top 10 scoring conversations were retrieved.
5) Calculate FRE and FKGL for the conversations used as queries, using publicly
available code².
6) Train the classifiers and generate the decision tree model. Weka [3] was used to build
the classification models. It implements several decision tree classification algorithms;
we chose J48.
7) Use the trained classifiers to predict the classes of the conversations used as queries.
    Once the classifiers are trained, we can use them to predict the classes for new
conversations for which we do not know the age and the gender of the authors. Thus,
the conversations from the test data were treated as queries and went through steps 1,
4, 5, and 7.
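
Steps 2 and 3 rely on a random split between query conversations and indexed conversations. A minimal sketch of such a split is shown below; the function name split_queries and the fixed seed are illustrative assumptions, not a description of the scripts we ran.

```python
import random


def split_queries(conversation_ids, query_fraction=0.10, seed=0):
    """Randomly split ids into (queries, indexed) with about query_fraction used as queries."""
    rng = random.Random(seed)
    ids = list(conversation_ids)
    rng.shuffle(ids)
    n_queries = round(len(ids) * query_fraction)
    return ids[:n_queries], ids[n_queries:]


if __name__ == "__main__":
    queries, indexed = split_queries(range(100))
    print(len(queries), len(indexed))
```

Keeping the two sets disjoint ensures that a query conversation never retrieves itself from the index when FeatureSets 1 and 2 are computed.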
    Table 1 shows our results on the training data. Our best scores were for gender in
Spanish (90% correct classification), while our worst results were for gender in English
(51% correct classification). We attribute this to the fact that in Spanish, most
adjectives need to agree with the gender of the author. Thus a woman would say that
she is "cansada" while a man would say that he is "cansado". In English, both would
say "tired". For age, we also scored better in Spanish. When we look at the test data,
however, the scores for Spanish decrease significantly. The most noticeable reduction
was for gender, for which only 53% of the conversations were accurately classified.
The scores for English remained similar across training and test data. We speculate that
this happened because fewer instances were used to generate the Spanish classification
models, and they may not have been comprehensive enough to account for all aspects
in the data.

    Table 1. Results for the training dataset (10-fold cross-validation) and for the test dataset

                                        Gender/ES   Age/ES   Gender/EN   Age/EN
                 Correctly Classified     0.91       0.77      0.51       0.55
                 Precision                0.92       0.76      0.52       0.54
                 F-measure                0.90       0.77      0.45       0.53
                 Accuracy - Test Data     0.53       0.46      0.50       0.51

 ¹ http://www.seg.rmit.edu.au/zettair/
 ² http://tikalon.com/blog/2012/readability.c

4    Conclusion
This paper described our experiments for the author profiling task at PAN-2013. We
employed four classifiers which exploit simple features to identify the age group and
the gender of authors. This was a preliminary investigation and we plan to continue
searching for improvements. Analysing the results from all participating groups, we
see that there is still a lot of room for improvement. As future work, we will investigate
the use of other features.
Acknowledgements: This work has been partially supported by CNPq-Brazil (478979/2012-6).
We thank Martin Potthast and the PAN organising team for their help in getting our software to
run.


References
1. Argamon, S., Koppel, M., Pennebaker, J.W., Schler, J.: Automatically profiling the author of
   an anonymous text. Commun. ACM 52(2), 119–123 (Feb 2009)
2. Batista, G.E.A.P.A., Prati, R.C., Monard, M.C.: A study of the behavior of several methods
   for balancing machine learning training data. SIGKDD Explor. Newsl. 6(1), 20–29 (Jun
   2004)
3. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA
   data mining software: an update. SIGKDD Explor. Newsl. 11(1), 10–18 (Nov 2009)
4. Kincaid, J.P., Fishburne, R.P., Rogers, R.L., Chissom, B.S.: Derivation of New Readability
   Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for
   Navy Enlisted Personnel. Tech. rep. (Feb 1975)
5. Potthast, M., Gollub, T., Hagen, M., Tippmann, M., Kiesel, J., Stamatatos, E., Rosso, P.,
   Stein, B.: Overview of the 5th international competition on plagiarism detection. In: CLEF
   2013 Evaluation Labs and Workshops - Working Notes Papers (Sept 2013)