Exploring Information Retrieval features for
Author Profiling
Notebook for PAN at CLEF 2014
Edson R. D. Weren, Viviane P. Moreira, and José P. M. de Oliveira
Institute of Informatics UFRGS - Porto Alegre - Brazil
{erdweren,viviane,palazzo}@inf.ufrgs.br
Abstract This paper describes the methods we have employed to solve the au-
thor profiling task at PAN-2014. Our goal was to rely mainly on features from
Information Retrieval to identify the age group and the gender of the author of
a given text. We describe the features, the classification algorithms employed,
and how the experiments were run. Also, we provide an analysis of our results
compared to other groups.
1 Introduction
Author profiling deals with the problem of finding as much information as possible
about an author, just by analysing a text produced by that author. This is a challenging
task which has applications in forensics, marketing, and security [1].
This paper reports on the participation of the INF-UFRGS team at the second edition
of the author profiling task, organised in the scope of the PAN Workshop series, which
is co-located with CLEF 2014. More details about the task and the workshop can be
found in [2,5]. The task requires participating teams to come up with approaches that
take a text as input and predict the gender (male/female) and the age group (18-24,
25-34, 35-49, 50-64, or 65+) of its author.
2 Features
The texts from each author, or documents, were represented by a set of 64 features (or
attributes), which were divided into five groups. Next, we explain each of these groups.
Length These are simple features that calculate the absolute length of the text.
– Number of Characters;
– Number of Words; and
– Number of Sentences.
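The three length features can be sketched as follows. This is a minimal illustration, not the original implementation: the function name `length_features` is hypothetical, and the rules used here (words split on whitespace, sentences ending at a run of '.', '!' or '?') are assumptions, since the text does not specify the tokenisation.

```python
import re


def length_features(text: str) -> dict:
    """Return character, word, and sentence counts for one text."""
    # Assumption: words are whitespace-separated tokens.
    words = text.split()
    # Assumption: a sentence ends at a run of '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "number_of_characters": len(text),
        "number_of_words": len(words),
        "number_of_sentences": len(sentences),
    }
```

For instance, `length_features("Hi there. How are you?")` counts 22 characters, 5 words, and 2 sentences.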
Information Retrieval This group of features encodes our assumption that
authors of the same gender or age group tend to use similar terms and that the
distribution of these terms differs across genders and age groups. The process
here was the same as in [6]. The complete set of texts is indexed by an Information
Retrieval (IR) System. Then, the text that we wish to classify is used as a query and the
k most similar texts are retrieved. The ranking is given by the cosine or Okapi metrics
as explained below. We employ a total of 30 IR-based features.
– Cosine
female_cosine_sum, male_cosine_sum, female_cosine_count,
male_cosine_count, female_cosine_avg, male_cosine_avg,
18-24_cosine_sum, 25-34_cosine_sum, 35-49_cosine_sum,
50-64_cosine_sum, 65-xx_cosine_sum, 18-24_cosine_count,
25-34_cosine_count, 35-49_cosine_count, 50-64_cosine_count,
65-xx_cosine_count, 18-24_cosine_avg, 25-34_cosine_avg,
35-49_cosine_avg, 50-64_cosine_avg, 65-xx_cosine_avg.
These features are computed as an aggregation function over the top-k results for
each age/gender group obtained in response to a query composed of the keywords
in the text that we wish to classify. We tested three types of aggregation
functions, namely count, sum, and average. For this feature set, queries and
documents were compared using the cosine similarity (Eq. 1). For example, if we
retrieve 100 documents in response to a query composed of the keywords in q, and
50 of the retrieved documents are in the 18-24 age group, then the value of
18-24_cosine_avg is the average of the 50 cosine scores for this class.
Similarly, 18-24_cosine_sum is the sum of those scores, and
18-24_cosine_count simply counts how many retrieved documents fall into
the 18-24 category (50 in this example).
\[ \mathrm{cosine}(c, q) = \frac{\vec{c} \cdot \vec{q}}{|\vec{c}|\,|\vec{q}|} \qquad (1) \]
where $\vec{c}$ and $\vec{q}$ are the vectors for the document and the query, respectively. The
vectors are composed of $tf_{i,c} \times IDF_i$ weights, where $tf_{i,c}$ is the frequency of term $i$
in document $c$, and $IDF_i = \log \frac{N}{n(i)}$, where $N$ is the total number of documents in
the collection and $n(i)$ is the number of documents containing $i$.
– Okapi BM25
female_okapi_sum, male_okapi_sum, female_okapi_count,
male_okapi_count, female_okapi_avg, male_okapi_avg,
18-24_okapi_sum, 25-34_okapi_sum, 35-49_okapi_sum,
50-64_okapi_sum, 65-xx_okapi_sum, 18-24_okapi_count,
25-34_okapi_count, 35-49_okapi_count, 50-64_okapi_count,
65-xx_okapi_count, 18-24_okapi_avg, 25-34_okapi_avg,
35-49_okapi_avg, 50-64_okapi_avg, 65-xx_okapi_avg .
Similar to the previous feature set, these features compute an aggregation function (average,
sum, and count) over the retrieved results from each gender/age group that appeared in
the top-k ranks for the query composed of the keywords in the document. For this feature set,
queries and documents were compared using the Okapi BM25 score (Eq. 2).
\[ \mathrm{BM25}(c, q) = \sum_{i=1}^{n} IDF_i \, \frac{tf_{i,c} \cdot (k_1 + 1)}{tf_{i,c} + k_1 \left(1 - b + b \, \frac{|D|}{avgdl}\right)} \qquad (2) \]
where $tf_{i,c}$ and $IDF_i$ are as in Eq. 1, $|D|$ is the length (in words) of document $c$, $avgdl$ is the
average document length in the collection, and $k_1$ and $b$ are parameters that tune the importance
of the presence of each term in the query and the length of the text. In our experiments, we
used $k_1 = 1.2$ and $b = 0.75$.
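The IR-based features can be illustrated with a small, self-contained sketch. It is not the IR system used in the experiments: the function names (`idf_table`, `cosine`, `bm25`, `group_features`) are hypothetical, documents are assumed to be already tokenised, BM25 is summed over the distinct query terms, and only documents with a positive score are treated as retrieved. The scoring follows Eqs. 1 and 2, and the feature names follow the `<group>_<metric>_<aggregation>` pattern listed above.

```python
import math
from collections import Counter, defaultdict


def idf_table(docs):
    """IDF_i = log(N / n(i)) for every term in a list of tokenised documents."""
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(len(docs) / n) for term, n in df.items()}


def cosine(doc, query, idf):
    """Eq. 1 on tf x IDF vectors; terms unseen in the collection get weight 0."""
    c = {t: tf * idf.get(t, 0.0) for t, tf in Counter(doc).items()}
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query).items()}
    dot = sum(w * q.get(t, 0.0) for t, w in c.items())
    norm = math.sqrt(sum(w * w for w in c.values())) * math.sqrt(sum(w * w for w in q.values()))
    return dot / norm if norm else 0.0


def bm25(doc, query, idf, avgdl, k1=1.2, b=0.75):
    """Eq. 2, summed over the distinct query terms (an assumption of this sketch)."""
    tf = Counter(doc)
    length_norm = k1 * (1 - b + b * len(doc) / avgdl)
    return sum(idf.get(t, 0.0) * tf[t] * (k1 + 1) / (tf[t] + length_norm) for t in set(query))


def group_features(query, docs, labels, metric="cosine", k=100):
    """Sum, count and average of the top-k scores, per group label."""
    idf = idf_table(docs)
    avgdl = sum(len(d) for d in docs) / len(docs)
    if metric == "cosine":
        scores = [cosine(d, query, idf) for d in docs]
    elif metric == "okapi":
        scores = [bm25(d, query, idf, avgdl) for d in docs]
    else:
        raise ValueError(f"unknown metric: {metric}")
    # Keep only documents that matched the query, best first, cut at rank k.
    ranked = sorted(((s, g) for s, g in zip(scores, labels) if s > 0),
                    key=lambda pair: pair[0], reverse=True)[:k]
    by_group = defaultdict(list)
    for score, group in ranked:
        by_group[group].append(score)
    feats = {}
    for group in sorted(set(labels)):
        hits = by_group[group]
        feats[f"{group}_{metric}_sum"] = sum(hits)
        feats[f"{group}_{metric}_count"] = len(hits)
        feats[f"{group}_{metric}_avg"] = sum(hits) / len(hits) if hits else 0.0
    return feats
```

Calling `group_features(tokens, docs, genders, metric="okapi")` would yield the six gender features for one text, and passing age-group labels instead would yield the fifteen age features. When the text being classified is itself part of the indexed collection, it should presumably be left out of `docs` so that it does not retrieve itself.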
Readability Readability tests indicate the comprehension difficulty of a text.
– Flesch-Kincaid readability tests
We employ two tests that indicate the comprehension difficulty of a text: Flesch
Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) [4]. They are given
by Eqs. 3 and 4. Higher FRE scores indicate material that is easier to read. For
example, a text with an FRE score between 90 and 100 could be easily read by
an 11-year-old, while texts with scores below 30 would be best understood by
undergraduates. FKGL scores indicate a grade level. An FKGL of 7 indicates that the
text is understandable by a 7th grade student. Thus, the higher the FKGL score, the
more years of education are required to understand the text. The idea
of using these scores is to help distinguish the age of the author. Younger authors
are expected to use shorter words and thus have a lower FKGL and a higher FRE.
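\[ \mathrm{FRE} = 206.835 - 1.015 \, \frac{\#\text{words}}{\#\text{sentences}} - 84.6 \, \frac{\#\text{syllables}}{\#\text{words}} \qquad (3) \]
\[ \mathrm{FKGL} = 0.39 \, \frac{\#\text{words}}{\#\text{sentences}} + 11.8 \, \frac{\#\text{syllables}}{\#\text{words}} - 15.59 \qquad (4) \]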
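Eqs. 3 and 4 translate directly into code once words, sentences, and syllables are counted. The sketch below is only illustrative: the text does not say how syllables were counted, so `count_syllables` uses a rough vowel-run heuristic that is an assumption, as are the function names, the sentence-splitting rule, and the choice to return 0.0 for empty input.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): one syllable per run of vowels, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def readability(text: str) -> dict:
    """Flesch Reading Ease (Eq. 3) and Flesch-Kincaid Grade Level (Eq. 4)."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return {"FRE": 0.0, "FKGL": 0.0}  # assumption: neutral value for empty text
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    return {
        "FRE": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        "FKGL": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
    }
```

A text made of short sentences with one-syllable words, such as "The cat sat.", gets a very high FRE and a very low FKGL under this heuristic, which matches the direction expected for simpler writing.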
Correctness This group of features aims at capturing the correctness of the text.
– Words in the dictionary: ratio between the words from the text found in
the OpenOffice US dictionary¹ and the total number of words in the text.
– Cleanliness: ratio between the number of characters in the preprocessed text
and the number of characters in the raw text. The idea is to assess how "clean" the
original text is.
– Repeated Vowels: in some cases, authors use words with repeated vowels for
emphasis, e.g., "I am soo tired". This group of features counts the number of
repeated vowels (a, e, i, o, and u) in sequence within a word.
– Repeated Punctuation: these features compute the number of repeated
punctuation marks (i.e., commas, semi-colons, full stops, question marks, and
exclamation marks) in sequence in the text.
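A possible reading of the correctness features is sketched below. Several details are assumptions: the function name `correctness_features` is hypothetical, the dictionary is passed in as a plain set of lowercase words, and each run of two or more identical vowels or punctuation marks is counted once. The feature names mirror those that appear in the tables (e.g. `repeated_e`, `repeated_exclamation`).

```python
import re

VOWELS = "aeiou"
PUNCTUATION = {"comma": ",", "semicolon": ";", "fullstop": ".",
               "interrogation": "?", "exclamation": "!"}


def correctness_features(raw: str, preprocessed: str, dictionary: set) -> dict:
    """Dictionary ratio, cleanliness, and repeated vowel/punctuation counts."""
    lowered = preprocessed.lower()
    words = re.findall(r"[a-z']+", lowered)
    feats = {
        # Share of words found in the dictionary (a set of lowercase words).
        "words_in_dictionary": sum(w in dictionary for w in words) / len(words) if words else 0.0,
        # Characters left after preprocessing, relative to the raw text.
        "cleanliness": len(preprocessed) / len(raw) if raw else 0.0,
    }
    for vowel in VOWELS:
        # Assumption: each run of 2+ identical vowels counts once.
        feats[f"repeated_{vowel}"] = len(re.findall(f"{vowel}{{2,}}", lowered))
    for name, mark in PUNCTUATION.items():
        # Assumption: each run of 2+ identical marks counts once.
        feats[f"repeated_{name}"] = len(re.findall(re.escape(mark) + "{2,}", preprocessed))
    return feats
```

Note that this simple rule also counts legitimate double vowels (as in "too" or "see"), so the counts measure repetition in general rather than emphasis alone.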
Style
– HTML tags: this feature consists in counting the number of HTML tags that
indicate line breaks, images, and links.
– Diversity: this feature is calculated as the ratio between the distinct words in the
text and the total number of words in the text.
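The two style features could be computed along these lines. This is a hedged sketch: the concrete tag names (`br`, `img`, `a`) are assumptions standing in for "line breaks, images, and links", the output keys `html_br`, `html_img`, and `html_a` are hypothetical names, and diversity is computed on lowercase word tokens after tags are stripped.

```python
import re


def style_features(raw: str) -> dict:
    """Counts of selected HTML tags and the distinct-to-total word ratio."""
    # Assumption: <br>, <img> and <a> mark line breaks, images and links.
    tag_counts = {
        name: len(re.findall(rf"<{name}\b[^>]*>", raw, flags=re.IGNORECASE))
        for name in ("br", "img", "a")
    }
    # Strip all tags before tokenising, so markup does not count as words.
    words = re.findall(r"\w+", re.sub(r"<[^>]+>", " ", raw).lower())
    diversity = len(set(words)) / len(words) if words else 0.0
    return {
        "html_br": tag_counts["br"],
        "html_img": tag_counts["img"],
        "html_a": tag_counts["a"],
        "diversity": diversity,
    }
```

A diversity of 1.0 means that no word is repeated, while values close to 0 indicate a highly repetitive text.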
¹ http://extensions.openoffice.org/en/project/english-dictionaries-apache-openoffice
Table 1. Top 5 features in terms of Information Gain (IG)
Corpus       Lang  Age: top 5 features      IG     Type         Gender: top 5 features   IG     Type
Twitter      EN    18-24_okapi_sum          0.083  IR           male_okapi_avg           0.160  IR
                   50-64_cosine_sum         0.083  IR           25-34_okapi_avg          0.154  IR
                   25-34_okapi_sum          0.081  IR           male_okapi_sum           0.153  IR
                   25-34_cosine_sum         0.077  IR           35-49_okapi_avg          0.152  IR
                   18-24_cosine_sum         0.075  IR           female_okapi_avg         0.140  IR
Twitter      ES    [missing]                0.140  Style        number of words          0.183  Length
                   25-34_okapi_count        0.136  IR           words in the dictionary  0.157  Correctness
                   25-34_cosine_sum         0.129  IR           male_okapi_sum           0.155  IR
                   25-34_cosine_count       0.123  IR           diversity                0.149  Style
                   50-64_cosine_sum         0.114  IR           male_cosine_sum          0.148  IR
Blog         EN    diversity                0.000  Style        female_cosine_sum        0.156  IR
                   male_okapi_sum           0.000  IR           male_okapi_count         0.146  IR
                   male_okapi_count         0.000  IR           female_okapi_count       0.137  IR
                   female_okapi_count       0.000  IR           female_cosine_count      0.118  IR
                   female_okapi_sum         0.000  IR           cleanliness              0.114  Correctness
Blog         ES    25-34_cosine_sum         0.260  IR           number of words          0.251  Length
                   words in the dictionary  0.231  Correctness  words in the dictionary  0.226  Correctness
                   50-64_okapi_avg          0.224  IR           repeated_e               0.206  Correctness
                   50-64_okapi_sum          0.224  IR           50-64_okapi_avg          0.200  IR
                   25-34_cosine_count       0.223  IR           male_okapi_sum           0.194  IR
SocialMedia  EN    50-64_cosine_sum         0.122  IR           female_cosine_count      0.008  IR
                   50-64_cosine_count       0.122  IR           female_cosine_sum        0.007  IR
                   35-49_cosine_count       0.117  IR           female_okapi_count       0.007  IR
                   18-24_cosine_count       0.116  IR           male_okapi_count         0.007  IR
                   35-49_cosine_sum         0.114  IR           male_cosine_count        0.006  IR
SocialMedia  ES    18-24_okapi_count        0.200  IR           female_cosine_count      0.081  IR
                   50-64_okapi_count        0.200  IR           female_cosine_sum        0.079  IR
                   18-24_cosine_count       0.193  IR           male_cosine_count        0.071  IR
                   35-49_cosine_count       0.191  IR           25-34_cosine_avg         0.053  IR
                   18-24_cosine_sum         0.189  IR           female_okapi_count       0.052  IR
Reviews      EN    65-XX_cosine_sum         0.098  IR           female_okapi_count       0.106  IR
                   25-34_okapi_count        0.098  IR           male_okapi_count         0.106  IR
                   25-34_cosine_count       0.087  IR           female_cosine_count      0.079  IR
                   65-XX_cosine_count       0.083  IR           male_cosine_count        0.079  IR
                   65-XX_okapi_count        0.082  IR           female_cosine_sum        0.072  IR
3 Usefulness of the Features
To evaluate how discriminative each of the 64 features described in Section 2 is,
we calculated its information gain with respect to the class. The five highest-ranking
features for each corpus and each class are shown in Table 1. The vast majority of the
most discriminative features are from the IR group. Style, length, and correctness
features also appear, but much less frequently. For Age-Blogs-EN, none of our features
had a good information gain score (the top five all score 0.000). Interestingly, this is
the corpus for which our test results compared most favourably with those of other groups.
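For reference, information gain is the reduction in class entropy obtained by knowing the value of a feature, IG = H(class) - H(class | feature). The sketch below computes it for a feature that has already been discretised; how the numeric features were discretised is not stated in the text, so binning is left to the caller, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """IG = H(class) - H(class | feature) for one discrete feature."""
    partitions = defaultdict(list)
    for value, label in zip(values, labels):
        partitions[value].append(label)
    # Entropy of the class inside each feature value, weighted by its frequency.
    conditional = sum(len(p) / len(labels) * entropy(p) for p in partitions.values())
    return entropy(labels) - conditional
```

A feature that separates two balanced classes perfectly has an information gain of 1 bit, while a feature that takes the same value for every instance has an information gain of 0.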
Information gain evaluates each feature independently of the others. However,
when selecting the best group of features, we wish to avoid redundancy by keeping
features that have both a high correlation with the class and a low
intercorrelation. With this aim, we used Weka’s [3] subset evaluators to select good
subsets of features. These subsets are shown in Table 2. The number of attributes in these
Table 2. Best subset of features for each corpus
Corpus Lang Age Gender
18-24_cosine_sum
18-24_cosine_count
male_okapi_count male_okapi_sum
Twitter EN
35-49_okapi_count
repeated_e
repeated_exclamation
50-64_cosine_sum
65-XX_cosine_count
25-34_okapi_sum
male_cosine_sum
25-34_okapi_count
male_cosine_count
Twitter ES
words_in_dictionary
words_in_dictionary
repeated_exclamation
number_of_characters
repeated_e
repeated_semicolon
male_cosine_avg
50-64_okapi_count female_cosine_sum
Blog EN
male_cosine_count
repeated_exclamation female_okapi_count
repeated_interrogation
65-XX_cosine_count
repeated_e
Blog ES 65-XX_cosine_avg
repeated_exclamation
25-34_okapi_sum
female_cosine_avg
male_cosine_count
male_cosine_avg
18-24_cosine_sum
25-34_cosine_avg
35-49_cosine_count
35-49_cosine_avg
female_okapi_count
SocialMedia EN 18-24_okapi_count
FRE
65-XX_okapi_avg
FKGL
repeated_exclamation
repeated_i
repeated_interrogation
repeated_fullstop
50-64_cosine_sum
18-24_cosine_count female_cosine_sum
female_okapi_sum male_cosine_avg
male_okapi_count male_okapi_count
18-24_okapi_sum 18-24_okapi_count
SocialMedia ES 18-24_okapi_count FKGL
18-24_okapi_avg repeated_a
repeated_i
number_of_characters repeated_u
repeated_a repeated_exclamation
repeated_ponto
female_cosine_avg
18-24_cosine_sum
65-XX_cosine_sum
65-XX_cosine_count
65-XX_okapi_sum
25-34_okapi_count
female_cosine_sum
65-XX_okapi_count
50-64_okapi_count
FKGL
Reviews EN 65-XX_okapi_count
number_of_characters
diversity
repeated_i
repeated_semicolon
repeated_o
repeated_comma
repeated_semicolon
repeated_exclamation
cleanliness
diversity
subsets varied considerably, from one (Gender-Twitter-EN) to 16 (Age-Reviews-EN). Again,
we observed that most features in the subsets are IR-based. Surprisingly, readability
features (namely FKGL) appear in only two subsets for Age. Style and correctness
attributes also appear in the chosen subsets. We also noticed that some features that were
intended for age were selected as useful for gender, and vice versa.
4 Official Experiments
We treated gender and age classification separately. Thus, the features described in
Section 2 were used to train one gender classifier and one age classifier for each of
the seven corpora, resulting in 14 classifiers. We used Weka [3] to build the machine
learning models. A number of algorithms were tested, namely BayesNet, Logistic,
MultilayerPerceptron, SimpleLogistic, LogitBoost, RotationForest, and MetaMultiClass.
For each classifier, we chose the algorithm that obtained the best result on the
training data using 10-fold cross-validation. To make this choice, we analysed the
results in two scenarios: using all 64 attributes and using just the attributes in the
best subset.
The preprocessing consisted basically of tokenisation and the removal of tags and escape
characters. No stemming or stopword removal was performed. All training instances
were used to generate the model. No attempt was made to remove noise.
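The selection procedure (pick the configuration with the best 10-fold cross-validation accuracy, comparing all attributes against a subset) can be sketched generically. This does not reproduce Weka or any of the algorithms named above: `majority_class` and `one_nearest_neighbour` are hypothetical stand-in learners, all function names are illustrative, and the folds here are shuffled but not stratified.

```python
import random
from collections import Counter


def k_fold_accuracy(train_fn, X, y, k=10, seed=0):
    """Accuracy pooled over k held-out folds; train_fn(X, y) returns predict(x)."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        predict = train_fn([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(X[i]) == y[i] for i in fold)
    return correct / len(X)


def majority_class(X, y):
    """Stand-in learner: always predicts the most frequent training label."""
    label = Counter(y).most_common(1)[0][0]
    return lambda x: label


def one_nearest_neighbour(X, y):
    """Stand-in learner: label of the closest training instance (squared Euclidean)."""
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in X]
        return y[dists.index(min(dists))]
    return predict


def with_columns(train_fn, cols):
    """Wrap train_fn so that it only sees the attributes at the given indices."""
    def wrapped(X, y):
        predict = train_fn([[row[c] for c in cols] for row in X], y)
        return lambda x: predict([x[c] for c in cols])
    return wrapped


def select_best(candidates, X, y, k=10):
    """Return the name of the best-scoring candidate and all scores."""
    scores = {name: k_fold_accuracy(fn, X, y, k) for name, fn in candidates.items()}
    return max(scores, key=scores.get), scores
```

Each entry in `candidates` corresponds to one (algorithm, attribute set) pair, for example `{"1nn_subset": with_columns(one_nearest_neighbour, [0]), "majority_all": majority_class}`, and the winner is the pair that would be used to build the final model on all training instances.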
Table 3 shows our official results for both training and test corpora in terms of accu-
racy. It also shows which classification algorithm was used and whether all attributes or
just a subset were used. Most classifiers (11 out of 14) used just the subset of attributes,
as their results on the training data outperformed (or got very close to) the results using
all attributes.
As expected, results on the training corpora were generally superior to the results on
the test corpora. The biggest drop was for Age-Blog-ES, where accuracy fell by more
than half (from 0.5455 to 0.2500). Interestingly, the results for three corpora were
better on the test data (Age-Twitter-ES, Age-Blogs-EN, and Gender-Twitter-ES). We still
need to investigate these differences further.
Table 3. Official Results
Age
Corpus       Lang  Training  Test    Classifier            Attributes
Twitter      EN    0.5261    0.3312  LogitBoost            Subset
Twitter      ES    0.5056    0.5222  RotationForest        Subset
Blog         EN    0.4558    0.4615  MultiClassClassifier  Subset
Blog         ES    0.5455    0.2500  LogitBoost            Subset
SocialMedia  EN    0.4251    0.3489  Logistic              All
SocialMedia  ES    0.4866    0.4382  Logistic              Subset
Reviews      EN    0.3762    0.3343  Logistic              Subset
Gender
Corpus       Lang  Training  Test    Classifier            Attributes
Twitter      EN    0.7876    0.5714  Logistic              Subset
Twitter      ES    0.4494    0.5333  Logistic              All
Blog         EN    0.8299    0.6410  MultilayerPerceptron  Subset
Blog         ES    0.7955    0.5357  RotationForest        Subset
SocialMedia  EN    0.5704    0.5361  SimpleLogistic        Subset
SocialMedia  ES    0.7020    0.6307  SimpleLogistic        All
Reviews      EN    0.7103    0.6778  SimpleLogistic        Subset
[Figure 1: bar chart with one bar for age and one for gender per corpus (English: Twitter, SocialMedia, Blogs, Reviews; Spanish: Twitter, SocialMedia, Blogs); vertical axis from -0.2 to 0.15]
Figure 1. Comparison against the mean results of all participants
We also compared our results against the mean of all participants. The comparison
is shown in Figure 1. In 9 out of 14 cases, our results were above the mean. The
biggest gain was for Age-Blogs-EN, where our advantage was 31%. In 5
runs, our results were at or below the mean. Our worst score was for Age-Blogs-ES,
where our loss was nearly 66%. Adding up all gains and losses, we get a net
gain of 10% in relation to the average.
5 Conclusion
This paper describes our participation in the Author Profiling task run in PAN 2014.
We used the training data to build classifiers using several machine learning algorithms.
Our focus was on exploring Information Retrieval-based features. The official results
show that our scores were above the mean for all participants in most cases (9 times out
of 14).
Author profiling is a challenging task. Consequently, there are many possibilities
for future work. As a first step, once the test data is released, we will further investigate
the cases in which our system fails or succeeds in the classification. The goal is to
establish patterns. We are also interested in testing methods for instance selection
to improve our classification models. In addition, we have treated gender and age
classification as separate, independent problems. However, since some attributes meant to
discriminate gender were found useful for age (and vice versa), we wish to explore the
influence of the two types of classification on each other.
Acknowledgements: This work has been partially supported by CNPq-Brazil (478979/2012-6).
We thank Anderson Kauer for his help in revising this paper. We thank Martin Potthast, Francisco
Rangel, and other members of the PAN organising team for their help in getting our software to
run.
References
1. Argamon, S., Koppel, M., Pennebaker, J.W., Schler, J.: Automatically profiling the author of
an anonymous text. Commun. ACM 52(2), 119–123 (Feb 2009)
2. Gollub, T., Potthast, M., Beyer, A., Busse, M., Pardo, F.M.R., Rosso, P., Stamatatos, E., Stein,
B.: Recent trends in digital text forensics and its evaluation - plagiarism detection, author
identification, and author profiling. In: Forner, P., Müller, H., Paredes, R., Rosso, P., Stein, B.
(eds.) CLEF. Lecture Notes in Computer Science, vol. 8138, pp. 282–302. Springer (2013)
3. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA
data mining software: an update. SIGKDD Explor. Newsl. 11(1), 10–18 (Nov 2009)
4. Kincaid, J.P., Fishburne, R.P., Rogers, R.L., Chissom, B.S.: Derivation of New Readability
Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for
Navy Enlisted Personnel. Tech. rep., National Technical Information Service, Springfield,
Virginia (Feb 1975)
5. Rangel, F., Rosso, P., Koppel, M., Stamatatos, E., Inches, G.: Overview of the author
profiling task at PAN 2013. In: Notebook Papers of CLEF 2013 LABs and Workshops,
CLEF-2013, Valencia, Spain, September. pp. 23–26 (2013)
6. Weren, E.R.D., Kauer, A.U., Mizusaki, L., Moreira, V.P., Oliveira, J.P.M.D., Wives, L.:
Examining multiple features for author profiling. Journal of Information and Data
Management (JIDM) 5(1) (October 2014), to appear.