=Paper=
{{Paper
|id=Vol-1866/invited_paper_11
|storemode=property
|title=Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter
|pdfUrl=https://ceur-ws.org/Vol-1866/invited_paper_11.pdf
|volume=Vol-1866
|authors=Francisco Manuel Rangel Pardo,Paolo Rosso,Martin Potthast,Benno Stein
|dblpUrl=https://dblp.org/rec/conf/clef/PardoRPS17
}}
==Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter==
Francisco Rangel¹,² Paolo Rosso² Martin Potthast³ Benno Stein³
¹ Autoritas Consulting, S.A., Spain
² PRHLT Research Center, Universitat Politècnica de València, Spain
³ Web Technology & Information Systems, Bauhaus-Universität Weimar, Germany
pan@webis.de http://pan.webis.de
Abstract This overview presents the framework and the results of the Author Profiling task at PAN 2017. This year's objective is to address gender and language variety identification. For this purpose, a corpus collected from Twitter has been provided in four languages: Arabic, English, Portuguese, and Spanish. Altogether, the approaches of 22 participants are evaluated.
1 Introduction
The rise of social media provides new models of communication and social relationships. These media allow users to hide their real profile while interacting and generating information. Hence, the possibility of knowing social media users' traits on the basis of what they share is a field of growing interest named author profiling. Inferring an author's gender, age, native language, dialect, or personality opens a world of possibilities from the point of view of marketing, forensics, or security. For example, from a security viewpoint, determining the linguistic profile of the person who wrote a suspicious or threatening text may provide valuable background information to evaluate the context (and possible reach) of the threat. Moreover, knowing the demographics of the author, such as her/his age and gender, or her/his cultural and social context (e.g., native language and/or dialect), may help to profile potential terrorists [54].
In the Author Profiling task at PAN 2013¹ [48], the identification of age and gender relied on a large corpus collected from social media, both in English and Spanish. In PAN 2014² [49], we continued focusing on age and gender but, in addition, compiled a corpus of four different genres, namely social media, blogs, Twitter, and hotel reviews. Except for the hotel review subcorpus, which was available in English only, all documents were provided in both English and Spanish. Note that most of the existing research in computational linguistics [8] and social psychology [43] focuses on the English language, and the question is whether the observed relations pertain to other languages and genres as well. In this vein, in PAN 2015³ [50], we included two new languages, Italian and Dutch, besides a new subtask on personality recognition. In PAN 2016⁴ [52], we investigated the effect of cross-genre evaluation, that is, training the models on one genre, namely Twitter, and evaluating them on genres other than Twitter.
¹ http://webis.de/research/events/pan-13/pan13-web/author-profiling.html
² http://webis.de/research/events/pan-14/pan14-web/author-profiling.html
³ http://pan.webis.de/clef15/pan15-web/author-profiling.html
In PAN 2017⁵ we introduce two novelties: (1) language variety identification together with the gender dimension; and (2) the Arabic and Portuguese languages (besides English and Spanish).
The remainder of this paper is organised as follows. Section 2 covers the state of the art, Section 3 describes the corpus and the evaluation measures, and Section 4 presents the approaches submitted by the participants. Sections 5 and 6 discuss the results and draw conclusions, respectively.
2 Related Work
Pennebaker [44] investigated from a psycholinguistic viewpoint how the use of language varies depending on personal traits such as the author's gender. Concretely, they found that, at least for English, women use the first person singular more than men because they are more self-conscious, whereas men use more determiners because they speak about concrete things. On the basis of their findings, the authors built LIWC (Linguistic Inquiry and Word Count) [43], one of the most widely used tools in author profiling. Pioneering researchers such as Argamon et al. [8], Holmes & Meyerhoff [23], Burger et al. [12], Koppel et al. [30] and Schler et al. [57] focused mainly on formal texts and blogs, reporting accuracies of 75%-80% in most cases. Nowadays, however, research focuses especially on social media such as Twitter or Facebook. In this regard, it is worth mentioning the second-order representation based on relationships between documents and profiles, used by the best performing team in three editions of PAN [32, 33, 7]. Recently, the EmoGraph graph-based approach [47] tried to capture how users convey verbal emotions in the morphosyntactic structure of the discourse, obtaining results competitive with the best performing systems at PAN 2013 and demonstrating its robustness across genres and languages at PAN 2014 [46]. Moreover, the authors in [60] investigated a wide variety of features on the PAN-AP-2013 dataset and showed the contribution of information-retrieval-based features to age and gender identification, and in [36] the authors approached the task with 3 million features in a MapReduce configuration, obtaining high accuracies with a fraction of the processing time. With respect to gender identification in languages other than English and Spanish, the following investigations are worth mentioning. Estival et al. [16] focused on Arabic emails and reported accuracies of 72.10%. Alsmearat et al. [5] reported an accuracy of 86.4% on Arabic newsletters, and an increase in accuracy up to 94% in an extension of their work [4]. AlSukhni & Alequr [6] reported accuracies of 99.50% by augmenting a bag-of-words model with the authors' names in Arabic tweets.
This is the first time we have included language variety identification in the author profiling task. There are a number of investigations for different languages, such as English [35], South-Slavic [31], Chinese [24], Persian and Dari [38], or Malay and Indonesian [9]. With respect to Spanish, the authors in [37] investigated the discrimination among the Argentinian, Chilean, Colombian, Mexican, and Spanish varieties in Twitter, reporting accuracies of about 60-70% with combinations of n-grams and language models. Also for Spanish, the authors in [51] collected the HispaBlogs⁶ corpus with five varieties of Spanish: Argentinian, Chilean, Mexican, Peruvian, and Spanish. They reported an accuracy of 71.1% with a low-dimensionality representation, in comparison to the 72.2% and 70.8% obtained with Skip-grams and Sentence Vectors [17]. Focusing on Portuguese, the authors in [61] collected 1,000 articles from well-known Brazilian⁷ and Portuguese⁸ newsletters. They combined character and word n-grams and reported accuracies of 99.6% with word unigrams, 91.2% with word bigrams, and 99.8% with character 4-grams. With regard to Arabic, Sadat et al. [55] reported an accuracy of 98% with n-grams on 6 Arabic dialects: Egyptian, Iraqi, Gulf, Maghreb, Levantine, and Sudanese. Elfardy & Diab [15] reported an accuracy of 85.5% discriminating between Egyptian and Modern Standard Arabic with combinations of content and style-based features. The increasing interest in Arabic dialect identification is attested by the eighteen teams participating in the Arabic subtask of the third DSL track [2]⁹, and by the 22 participants in this year's Author Profiling shared task at PAN.
⁴ http://pan.webis.de/clef16/pan16-web/author-profiling.html
⁵ http://pan.webis.de/clef17/pan17-web/author-profiling.html
3 Evaluation Framework
In this section we outline the construction of the corpus, covering its particular properties, challenges, and novelties. Finally, the evaluation measures are described.
3.1 Corpus
The focus of this year's task is gender and language variety identification in Twitter. Besides English and Spanish, this year we have included Arabic and Portuguese for the first time. To create the corpus, we followed the seven steps described next.
Step 1. Languages and Varieties Selection
The following languages and varieties have been selected:
– Arabic: Egypt, Gulf, Levantine, Maghrebi.¹⁰
– English: Australia, Canada, Great Britain, Ireland, New Zealand, United States.
– Portuguese: Brazil, Portugal.
– Spanish: Argentina, Chile, Colombia, Mexico, Peru, Spain, Venezuela.
⁶ https://github.com/autoritas/RD-Lab/tree/master/data/HispaBlogs
⁷ http://www.folha.uol.com.br
⁸ http://www.dn.pt
⁹ Its difficulty is backed up by the obtained accuracies of about 50%.
¹⁰ The selection of these varieties follows previous work [55]. Iraqi was also selected but discarded due to the lack of enough tweets.
Step 2. Tweets per Region Retrieval
For each variety, we selected the capital (or most populated cities) of the regions where the variety is used. The list of cities per variety is shown in Table 1. Tweets within a radius of 10 kilometers from each city center have been retrieved.
Table 1. Cities selected as representative of the language varieties.
Language Variety City
Arabic Egypt Cairo
Gulf Abu Dhabi, Doha, Kuwait, Manama, Muscat, Riyadh, Sana'a
Levantine Amman, Beirut, Damascus, Jerusalem
Maghrebi Algiers, Rabat, Tripoli, Tunis
English Australia Canberra, Sydney
Canada Toronto, Vancouver
Great Britain London, Edinburgh, Cardiff
Ireland Dublin
New Zealand Wellington
United States Washington
Portuguese Brazil Brasilia
Portugal Lisbon
Spanish Argentina Buenos Aires
Chile Santiago
Colombia Bogota
Mexico Mexico City
Peru Lima
Spain Madrid
Venezuela Caracas
Step 3. Unique Authors Identification
Unique authors have been identified from the previous dataset. For each author, her/his
timeline has been retrieved. The timeline provides meta-data such as:
– Full name.
– Location, as a textual description or toponyms.
– Language identified by the author in her/his profile (this language may not correspond to the language used by the author).
Step 4. Authors Selection
For each author, we ensure that she/he has at least 100 tweets satisfying the following conditions:
– Tweets are not retweets.
– Tweets are written in the corresponding language.
Authors who do not fulfil these conditions are discarded.
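Assuming each author's timeline is a list of tweet records with hypothetical `text` and `lang` fields (the paper does not specify a data format, and the `"RT "`-prefix test for retweets is a heuristic assumption), the selection criterion above can be sketched as:

```python
def keep_author(tweets, language, min_tweets=100):
    """Step 4: keep an author only if she/he has at least `min_tweets`
    tweets that are not retweets and are written in the target language."""
    valid = [t for t in tweets
             if not t["text"].startswith("RT ")   # discard retweets (heuristic)
             and t["lang"] == language]           # keep the target language only
    return len(valid) >= min_tweets
```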
Step 5. Language Variety Annotation
An author is annotated with the corresponding language variety if:
– she/he has been retrieved in the corresponding region, and
– at least 80% of the locations provided as meta-data of her/his tweets coincide with some of the toponyms for the corresponding region.
Table 2. Languages and varieties. There are 500 authors per variety and gender, 300 for training and 200 for test. Each author contains 100 tweets.
(AR) Arabic (EN) English (ES) Spanish (PT) Portuguese
Egypt Australia Argentina Brazil
Gulf Canada Chile Portugal
Levantine Great Britain Colombia
Maghrebi Ireland Mexico
New Zealand Peru
United States Spain
Venezuela
Total authors: 4,000 6,000 7,000 2,000
The main assumption is that a person who lives in a region uses that region's variety. This implies two further assumptions:
– We assume that a person lives in a region when her/his location throughout her/his timeline reflects this location. The timeline comprises up to 3,200 tweets per author, which in most cases spans at least a couple of years, so the assumption is reasonable.
– We assume that social media language is dynamic and easily influenced, as opposed to more formal genres such as newsletters. This means that it reflects everyday language and captures basic social and personality processes of the authors who use it. In this sense, a high number of immigrants or tourists in a region may influence the regional use of the language, and this may be a valuable clue to detect the possible location of a person.
Step 6. Gender Annotation
Gender annotation has been done in two steps:
– Automatically, with the help of a dictionary of proper nouns (ambiguous nouns
have been discarded).
– Manually, by visiting each profile and looking at the photo, description, etc.
Step 7. Corpus Construction
The final dataset is balanced in the number of authors per variety and gender, and in the number of tweets per author:
– 500 authors per gender and variety.
– 100 tweets per author.
The dataset is divided into training/test in a 60/40 proportion, with 300 authors for
training and 200 authors for test. The corresponding languages and varieties are shown
in Table 2 along with the total number of authors for each subtask.
3.2 Performance Measures
For evaluation purposes, the accuracy for variety, gender, and joint identification is calculated per language. Then, the results are averaged over the four languages (Eq. 1).
gender = (gender_AR + gender_EN + gender_ES + gender_PT) / 4
variety = (variety_AR + variety_EN + variety_ES + variety_PT) / 4     (1)
joint = (joint_AR + joint_EN + joint_ES + joint_PT) / 4
The final ranking is calculated as the average of the above values:
ranking = (gender + variety + joint) / 3     (2)
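Equations 1 and 2 are straightforward averages and can be computed as in the following sketch (the dictionary keys and accuracy values are illustrative, not taken from the results tables):

```python
def average_measure(per_language):
    """Eq. 1: average a per-language accuracy over the four languages."""
    return sum(per_language.values()) / len(per_language)

def final_ranking(gender, variety, joint):
    """Eq. 2: the final ranking is the mean of the three averaged measures."""
    return (gender + variety + joint) / 3

# Illustrative values only:
gender = average_measure({"AR": 0.72, "EN": 0.76, "ES": 0.76, "PT": 0.78})
```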
3.3 Baselines
To understand the complexity of the subtasks per language, and to put the participants' performances into perspective, we propose the following baselines:
– BASELINE-stat. A statistical baseline that emulates random choice. The baseline
depends on the number of classes: two in case of gender identification, and from
two to seven in case of variety identification.
– BASELINE-bow. This method represents documents as a bag-of-words with the 1,000 most common words in the training set, weighted by their absolute frequency of occurrence. The texts are preprocessed as follows: lowercasing, removal of punctuation signs and numbers, and removal of stop words for the corresponding language.
– BASELINE-LDR [51]. This method represents documents on the basis of the probability distribution of occurrence of their words in the different classes. The key concept of LDR is a weight representing the probability of a term belonging to one of the different categories: for gender (female vs. male), and for variety depending on the language (e.g., Brazil vs. Portugal). The distribution of weights for a given document should be closer to the weights of its corresponding category. LDR takes advantage of the whole vocabulary.
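The BASELINE-bow representation can be sketched as follows. The paper does not specify the tokenizer or the stop-word lists, so the letters-only regular expression and the `stopwords` argument here are assumptions:

```python
import re
from collections import Counter

TOKEN = re.compile(r"[^\W\d_]+")   # runs of letters: drops punctuation and numbers

def bow_vocabulary(train_texts, stopwords, size=1000):
    """The `size` most common words in the training set, after lowercasing
    and removing punctuation, numbers, and stop words."""
    counts = Counter()
    for text in train_texts:
        counts.update(t for t in TOKEN.findall(text.lower()) if t not in stopwords)
    return [word for word, _ in counts.most_common(size)]

def bow_vector(text, vocabulary):
    """Represent a document by the absolute frequency of each vocabulary word."""
    counts = Counter(TOKEN.findall(text.lower()))
    return [counts[word] for word in vocabulary]
```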
3.4 Software Submissions
We asked for software submissions (as opposed to run submissions): participants submit executables of their author profiling software instead of just its output (a so-called "run") on a given test set. Our rationale is to increase the sustainability of our shared task and to allow for the re-evaluation of approaches to author profiling later on, in particular on future evaluation corpora. To facilitate software submissions, we developed the TIRA experimentation platform [20, 21], which renders the handling of software submissions at scale as simple as handling run submissions. Using TIRA, participants deploy their software on virtual machines at our site, which allows us to keep them in a running state [22].
4 Overview of the Submitted Approaches
22 teams participated in the Author Profiling shared task, and 20 of them submitted a notebook paper.¹¹ We analyse their approaches from three perspectives: preprocessing, the features used to represent the authors' texts, and the classification approaches.
4.1 Preprocessing
Various participants cleaned the contents to obtain plain text [26, 40, 53]. Most of them removed or normalised Twitter-specific elements such as URLs, user mentions, or hashtags [18, 1, 27, 29, 39, 41, 53, 56]. Some participants also lowercased the words [18, 27, 29, 41], although in the case of [41] the authors did not lowercase the characters. The authors in [1] expanded contractions, and the authors in [27, 40] removed stop words. The authors in [53, 40] removed punctuation signs, as did the authors in [56], who also removed numbers and out-of-alphabet words per language. Finally, the authors in [27] removed short tweets.
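The normalisation steps reported most often above, removing or replacing URLs, user mentions, and hashtags, and lowercasing, can be sketched as follows (the placeholder tokens are an assumption; each team made its own choices):

```python
import re

def normalize_tweet(text):
    """Replace Twitter-specific elements with placeholder tokens and
    lowercase the result -- one common recipe among the participants."""
    text = re.sub(r"https?://\S+", "<url>", text)   # URLs
    text = re.sub(r"@\w+", "<user>", text)          # user mentions
    text = re.sub(r"#(\w+)", r"\1", text)           # keep the hashtag word, drop '#'
    return text.lower()
```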
4.2 Features
Traditionally, in previous editions of the author profiling task at PAN as well as in the cited literature, the features used for representing documents have been classified into content- and style-based. This year, however, for the first time, more participants have employed deep learning techniques. It is therefore interesting to differentiate between traditional features and these new methods in order to test their performance on the author profiling task. In this regard, the authors in [25, 29, 58, 45] represented documents with word embeddings, whereas in [18] character embeddings have been used. The authors in [41] mixed both word and character embeddings, and the authors in [45] also used traditional features such as tf-idf n-grams.
As traditional features, character and word n-grams have been widely used by the participants [40, 3, 27, 39, 53, 42, 56, 14]. For example, the authors in [3] used character n-grams with n between 2 and 7, and the authors in [27] used word n-grams with n between 1 and 3. The authors in [39] combined character and word n-grams with values of 3-4 for typed characters, 3-7 for untyped characters, and 2-3 for words, respectively. Similarly, the authors in [14] combined character and word n-grams with values of n of 1-6 and 1-2, respectively. In the case of [19], words with frequency between 2 and 25 have been used, whereas in [11] the authors combined the most discriminative words per class with slang words, locations, brands, and stylistic patterns.
¹¹ The Ganesh [19] and Bouazizi [11] teams did not submit a notebook paper, but sent us a brief description of their approaches.
In [45, 56], tf-idf n-grams have been combined with word embeddings and with beginning and ending character 2-grams, respectively. Finally, the authors in [42] used high-order character n-grams.
With respect to content features, the most commonly used have been the top n terms by gain ratio [28], bag-of-words [1, 59], the 100 most discriminant words per class from a list of 500 topic words [26], LSA [27], and specific lists of words for language variety [40]. Style features have also been used by some participants, for example, ratios of links, hashtags, or user mentions [3, 53], character flooding [3, 40, 53], and emoticons and/or laughter expressions [1, 40]. Finally, the authors in [39] combined domain names with different kinds of n-grams.
Emotional features have been used by the authors in [1], who combined emotions, appraisal, admiration, positive/negative emoticons, and positive/negative words, and by the authors in [40], who used emojis and sentiment words. The authors in [34] used a variation of their second-order representation [33] based on user-document relationships. Finally, the authors in [10], who obtained the best overall result, used a combination of character n-grams (with n between 3 and 5) and tf-idf word n-grams (with n between 1 and 2).
4.3 Classification Approaches
Most of the participants approached the task with traditional machine learning algorithms such as logistic regression [25, 40, 45, 42], SVMs [3, 27, 34, 39, 59, 10, 53, 14, 19], and Random Forest [11]. While most participants used these algorithms alone, the authors in [45, 14] ensembled different configurations. The authors in [27] used SVMs for variety identification and Naive Bayes for gender identification. Three teams used distance-based methods [1, 26, 28].
With respect to deep learning methods, the authors in [29] applied Recurrent Neural Networks (RNN), whereas the authors in [56, 58] used Convolutional Neural Networks (CNN). The authors in [41] explored both approaches (RNN and CNN), besides an attention mechanism, a max-pooling layer, and a fully-connected layer. The authors in [45] combined traditional logistic regression with a Gaussian process trained with word embeddings. Finally, the authors in [18] applied Deep Averaging Networks.
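As an illustration of the dominant recipe, combined character and word n-grams fed to an SVM, here is a minimal scikit-learn sketch. The n-gram ranges follow the description of the best overall system [10]; everything else (tf-idf weighting of the character n-grams, the linear SVM variant, default hyperparameters) is an assumption, not that team's actual configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

model = Pipeline([
    ("features", FeatureUnion([
        # character n-grams with n between 3 and 5
        ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 5))),
        # tf-idf word n-grams with n between 1 and 2
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ])),
    ("svm", LinearSVC()),
])
# Usage: model.fit(train_texts, train_labels); model.predict(test_texts)
```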
5 Evaluation and Discussion of the Submitted Approaches
We divided the evaluation into two phases, providing an early bird option for those participants who wanted to receive early feedback. There were 22 submissions for the final evaluation. We show results separately for each subtask (gender, language variety, and joint identification), and we analyse the special case of the English language with a coarse-grained grouping.
5.1 Gender Evaluation
In this section we analyse the results for the gender identification subtask. As can be seen in Table 3, the best results have been obtained for the Portuguese language, with a maximum accuracy of 87% [41] and an average result of 78%, about 7 and 3 percentage points above the rest. However, the results for the four languages are very similar in terms of their distribution, as can be seen in Figure 1. The average values range between 72.10% for Arabic and 78% for Portuguese. The most similar results among authors have been obtained in English, although there are two outliers with lower results than the rest: the authors in [1], with an accuracy of 54.13%, obtained with combinations of content and style-based features learned with a distance-based method, and Bouazizi's team, which achieved 61.21% with Random Forest and combinations of discriminative words, slang, locations, brands, and stylistic patterns. For Portuguese there is also an outlier with lower accuracy than the rest, corresponding to the authors in [26] with an accuracy of 61%, obtained with the top 100 most discriminant words per class and a distance-based algorithm.
The best results per language have been obtained by the following teams. In Arabic, the authors in [40] approached the task with combinations of character, word, and POS n-grams together with emojis, character flooding, sentiment words, and specific lists of words per variety, training their models with logistic regression and obtaining an accuracy of 80.31%. In English, the best result of 82.33% has been obtained by the authors in [10], who approached the task with combinations of character and tf-idf word n-grams trained with an SVM. With respect to Portuguese, the best result has been obtained by the authors in [41]. They achieved 87% accuracy with a deep learning approach combining word and character embeddings with a CNN, an RNN, an attention mechanism, a max-pooling layer, and a fully-connected layer. In Spanish, the authors who obtained the best result of 83.21% are the same who obtained the best result in English.
With respect to the provided baselines, especially BOW and LDR, which both exploit information from the contents of the documents, we can observe that their results are below the mean. As seen in previous editions of the task and according to previous investigations [44, 47], gender discrimination is more related to how things are said than to what is said. In this sense, the best performing approaches took advantage of both style and contents with combinations of n-grams and other content and style-based features, as well as, in the case of Portuguese, deep representations.
Figure 1. Distribution of results for gender identification in the different languages.
Table 3. Gender results.
Ranking Team Arabic English Portuguese Spanish Average
1 Basile et al. 0.8006 0.8233 0.8450 0.8321 0.8253
2 Martinc et al. 0.8031 0.8071 0.8600 0.8193 0.8224
3 Miura et al. 0.7644 0.8046 0.8700 0.8118 0.8127
4 Tellez et al. 0.7838 0.8054 0.8538 0.7957 0.8097
5 Lopez-Monroy et al. 0.7763 0.8171 0.8238 0.8014 0.8047
6 Poulston et al. 0.7738 0.7829 0.8388 0.7939 0.7974
7 Markov et al. 0.7719 0.8133 0.7863 0.8114 0.7957
8 Ogaltsov & Romanov 0.7213 0.7875 0.7988 0.7600 0.7669
9 Franco-Salvador et al. 0.7300 0.7958 0.7688 0.7721 0.7667
10 Sierra et al. 0.6819 0.7821 0.8225 0.7700 0.7641
11 Kodiyan et al. 0.7150 0.7888 0.7813 0.7271 0.7531
12 Ciobanu et al. 0.7131 0.7642 0.7713 0.7529 0.7504
13 Ganesh 0.6794 0.7829 0.7538 0.7207 0.7342
LDR-baseline 0.7044 0.7220 0.7863 0.7171 0.7325
14 Schaetti 0.6769 0.7483 0.7425 0.7150 0.7207
15 Kocher & Savoy 0.6913 0.7163 0.7788 0.6846 0.7178
16 Kheng et al. 0.6856 0.7546 0.6638 0.6968 0.7002
17 Ignatov et al. 0.6425 0.7446 0.6850 0.6946 0.6917
BOW-baseline 0.5300 0.7075 0.7812 0.6864 0.6763
18 Khan 0.5863 0.6692 0.6100 0.6354 0.6252
STAT-baseline 0.5000 0.5000 0.5000 0.5000 0.5000
19 Ribeiro-Oliveira et al. 0.7013 - 0.7650 - 0.3666
20 Alrifai et al. 0.7225 - - - 0.1806
21 Bouazizi - 0.6121 - - 0.1530
22 Adame et al. - 0.5413 - - 0.1353
Min 0.5863 0.5413 0.6100 0.6354 0.1353
Q1 0.6847 0.7474 0.7594 0.7164 0.6938
Median 0.7181 0.7829 0.7813 0.7650 0.7518
Mean 0.7210 0.7571 0.7800 0.7553 0.6588
SDev 0.0560 0.0729 0.0690 0.0554 0.2260
Q3 0.7724 0.8048 0.8313 0.8000 0.7970
Max 0.8031 0.8233 0.8700 0.8321 0.8253
5.2 Language Variety Evaluation
In this section we analyse the results for the language variety identification subtask. As
can be seen in Table 4, the best results have been obtained for the Portuguese language
with a maximum accuracy of 98.50% [59] and an average result of 97.31%.
The best results per language have been obtained by the following teams. In Arabic and Spanish, the authors in [10] obtained accuracies of 83.13% and 96.21%, respectively, approaching the task with combinations of character and tf-idf word n-grams and an SVM. For English and Portuguese, the authors in [59] obtained 90.04% and 98.50% with an SVM.
The provided LDR baseline obtained nearly the best result in every language and the best overall result. Taking into account that this representation basically measures the use of words per class, we can conclude that the identification of language varieties is highly dependent on word usage. This is also supported by the approaches used by the best performing teams.
Table 4. Language variety results.
Ranking Team Arabic English Portuguese Spanish Average
LDR-baseline 0.8250 0.8996 0.9875 0.9625 0.9187
1 Basile et al. 0.8313 0.8988 0.9813 0.9621 0.9184
2 Tellez et al. 0.8275 0.9004 0.9850 0.9554 0.9171
3 Martinc et al. 0.8288 0.8688 0.9838 0.9525 0.9085
4 Markov et al. 0.8169 0.8767 0.9850 0.9439 0.9056
5 Lopez-Monroy et al. 0.8119 0.8567 0.9825 0.9432 0.8986
6 Miura et al. 0.8125 0.8717 0.9813 0.9271 0.8982
7 Sierra et al. 0.7950 0.8392 0.9850 0.9450 0.8911
8 Schaetti 0.8131 0.8150 0.9838 0.9336 0.8864
9 Poulston et al. 0.7975 0.8038 0.9763 0.9368 0.8786
10 Ogaltsov & Romanov 0.7556 0.8092 0.9725 0.8989 0.8591
11 Ciobanu et al. 0.7569 0.7746 0.9788 0.8993 0.8524
12 Kodiyan et al. 0.7688 0.7908 0.9350 0.9143 0.8522
13 Kheng et al. 0.7544 0.7588 0.9750 0.9168 0.8513
14 Franco-Salvador et al. 0.7656 0.7588 0.9788 0.9000 0.8508
15 Kocher & Savoy 0.7188 0.6521 0.9725 0.7211 0.7661
16 Ganesh 0.7144 0.6021 0.9650 0.7689 0.7626
17 Ignatov et al. 0.4488 0.5813 0.9763 0.8032 0.7024
BOW-baseline 0.3394 0.6592 0.9712 0.7929 0.6907
18 Khan 0.5844 0.2779 0.9063 0.3496 0.5296
19 Ribeiro-Oliveira et al. 0.6713 - 0.9850 - 0.4141
STAT-baseline 0.2500 0.1667 0.5000 0.1429 0.2649
20 Alrifai et al. 0.7550 - - - 0.1888
21 Bouazizi - 0.3725 - - 0.0931
22 Adame et al. - 0.1904 - - 0.0476
Min 0.4488 0.1904 0.9063 0.3496 0.0476
Q1 0.7455 0.6396 0.9738 0.8990 0.7175
Median 0.7672 0.7973 0.9788 0.9220 0.8523
Mean 0.7514 0.7150 0.9731 0.8707 0.7215
SDev 0.0936 0.2098 0.0198 0.1466 0.2798
Q3 0.8126 0.8597 0.9838 0.9437 0.8964
Max 0.8313 0.9004 0.9850 0.9621 0.9184
As can be seen in Figure 2, the distribution of the results varies significantly with the language. In the case of Portuguese, most approaches obtained very similar results, with a difference of only 1% in the interquartile range. The distributions of results for Arabic and Spanish are similar, with interquartile ranges of 6.71% and 4.47%, respectively, although the results for Spanish are higher (an average of 87.07% versus 75.14%). The most spread-out distribution occurs for English, where the interquartile range embraces 22.01%, with the lowest average accuracy of 71.50%. However, this low average may be caused by the outlier [1], who obtained an accuracy of 19.04% with a distance-based approach combining bag-of-words and emotional features.
Figure 2. Distribution of results for language variety identification in the different languages.
5.3 Confusion Among Language Varieties
In this section the confusion among varieties of the same language is analysed. All participants' results have been analysed together; Figures 3, 4, 5, and 6 show the confusion matrices for Arabic, English, Portuguese, and Spanish, respectively.
Figure 3. Confusion matrix for Arabic varieties.
In Figure 3 the confusion matrix for Arabic languages is shown: the overall confu-
sion among languages is at most 10.79% in case of Egypt to Maghrebi, and 10.34% in
case of Gulf to Egypt. The rest of the errors are below 10%, with the lowest value for
Egypt to Levantine (3.54%), or Levantine to Egypt (4.83%). The highest accuracy was
obtained for the Levantine variety (79.54%) that, together with the previous insights,
show us that this variety seems to be the less difficult to be identified. On the contrary,
the Gulf variety is the most difficult one, with an overall accuracy of 70.12% and the
highest confusions to other varieties.
Figure 4 shows the confusion matrix for the English varieties. The highest confusion is from the United States to Canada (11.80%), Ireland to Great Britain (11.81%), and Australia to New Zealand (10.61%). These errors correspond to geographically close varieties. The New Zealand variety is the least difficult to identify (82.04%), whereas the most difficult is Australia (65.48%). It is noteworthy that both varieties are geographically close, but the error from New Zealand to Australia is very low (3.46%). The lowest confusion is from New Zealand to Ireland (1.7%), the United States to Ireland (2.24%), Ireland to Australia (2.50%), and Canada to Ireland (2.59%). We can conclude that the geographically closer two varieties are, the higher the confusion between them.
Figure 4. Confusion matrix for English varieties.
The confusion between the two varieties of Portuguese is shown in Figure 5. The Brazilian variety is easier to identify than the Portuguese one: the error from Portugal to Brazil is 4.45%, whereas the error in the other direction is lower than 1%.
Figure 5. Confusion matrix for Portuguese varieties.
Figure 6. Confusion matrix for Spanish varieties.
For Spanish, the confusion matrix among varieties is shown in Figure 6. It can be observed that the varieties from Argentina and Spain are the least difficult to identify, with accuracies of 92.15% and 92.33%, respectively. On the contrary, the most difficult variety is Peru, where the accuracy drops to 81.5%, followed by Chile (84.79%). The highest confusion occurs from Peru, Venezuela, and Mexico to Colombia, with errors of 5.86%, 5.28%, and 4.32%, respectively, followed by Peru to Argentina (4.01%), Chile to Colombia (3.74%), and Chile to Argentina (3.25%). Colombia is the destination of the majority of the confusions and, again, the geographical distribution of the varieties bears some relationship to the confusion between them.
5.4 Coarse-Grained Evaluation of English Varieties
Due to the confusion among English varieties that share geographical proximity, we have combined some of them in order to analyse how the participants' approaches perform at a coarser granularity. In particular, we formed the following groups:
– American English: United States and Canada.
– European English: Great Britain and Ireland.
– Oceanic English: Australia and New Zealand.
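This grouping amounts to relabelling the fine-grained predictions before scoring; a minimal sketch (the label strings are illustrative):

```python
COARSE = {
    "united states": "american", "canada": "american",
    "great britain": "european", "ireland": "european",
    "australia": "oceanic", "new zealand": "oceanic",
}

def coarse_accuracy(gold, predicted):
    """Accuracy after mapping fine-grained varieties to their coarse groups."""
    hits = sum(COARSE[g] == COARSE[p] for g, p in zip(gold, predicted))
    return hits / len(gold)
```

For example, a prediction of "united states" for a "canada" author counts as correct under the coarse-grained evaluation, since both map to the American English group.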
Table 5. Coarse-grained English variety identification results.
Ranking Team Coarse-Grained Fine-Grained Difference
1 Basile et al. 0.9429 0.8988 0.0441
2 Tellez et al. 0.9379 0.9004 0.0375
3 Markov et al. 0.9292 0.8767 0.0525
4 Miura et al. 0.9279 0.8717 0.0562
5 Martinc et al. 0.9238 0.8688 0.0550
6 Lopez-Monroy et al. 0.9167 0.8567 0.0600
7 Sierra et al. 0.9004 0.8392 0.0612
8 Schaetti 0.8863 0.8150 0.0713
9 Ogaltsov & Romanov 0.8754 0.8092 0.0662
10 Poulston et al. 0.8746 0.8038 0.0708
11 Kodiyan et al. 0.8663 0.7908 0.0755
12 Franco-Salvador et al. 0.8654 0.7588 0.1066
13 Ciobanu et al. 0.8504 0.7746 0.0758
14 Kheng et al. 0.8388 0.7588 0.0800
15 Kocher & Savoy 0.7696 0.6521 0.1175
16 Ignatov et al. 0.7296 0.5813 0.1483
17 Ganesh 0.7238 0.6021 0.1217
18 Bouazizi 0.5217 0.3725 0.1492
19 Khan 0.4533 0.2779 0.1754
20 Adame et al. 0.3583 0.1904 0.1679
In Table 5 the results for this grouping (coarse-grained) are shown in comparison to the
original varieties (fine-grained). As can be seen, results are better under the coarse-
grained evaluation, although the improvement is larger for the systems that performed
worst in the fine-grained evaluation. For example, in the case of Khan [26] the increase
in accuracy is 17.54%, whereas for Tellez et al. [59] the difference is only 3.75%.
Although the statistical test indicates that the differences are significant, they are not
very high for the best performing teams, with improvements of around 5%.
5.5 The Impact of Gender on Language Variety Identification
In this section we analyse the impact of gender on language variety identification,
depending on the language. To carry out this analysis, we have aggregated the
results obtained by all participants (excluding the baselines). Table 6 shows the accuracies
obtained in language variety identification depending on the gender. As can be seen,
except in the case of Spanish, it is easier to identify the variety for female authors,
although only for Arabic and Portuguese is this difference statistically significant, at
about 7% and 2% respectively. For English and Spanish, the difference in accuracy
between genders is below 1% (about 0.2% and 0.5% respectively).
Table 6. Language variety identification accuracies per gender (* indicates a significant difference
according to the Student t-test).
Language Female Male Difference
Arabic* 0.7909 0.7203 0.0706
English 0.7190 0.7168 0.0022
Portuguese* 0.9829 0.9633 0.0196
Spanish 0.8680 0.8733 -0.0053
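The significance marks in Table 6 come from a Student t-test over the aggregated participant results. One common variant, the paired t statistic over per-team accuracies, can be sketched as follows; whether the organisers paired the samples this way is our assumption, and the per-team accuracies below are invented for illustration:

```python
import math
from statistics import mean, stdev

def paired_t(female_acc, male_acc):
    """Paired Student t statistic over per-team accuracy pairs:
    t = mean(d) / (stdev(d) / sqrt(n)), with d the per-team differences."""
    diffs = [f - m for f, m in zip(female_acc, male_acc)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Hypothetical per-team variety accuracies for one language.
female = [0.80, 0.78, 0.75, 0.82, 0.79]
male   = [0.72, 0.71, 0.70, 0.74, 0.73]
print(round(paired_t(female, male), 2))
```

With n - 1 degrees of freedom, the resulting t value is compared against the critical value for the chosen significance level.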
5.6 The Difficulty of Gender Identification Depending on the Language Variety
In this section, the difficulty of identifying the author’s gender depending on the lan-
guage variety is analysed. We have followed the same methodology by accumulating
the results of all participants without the baselines. The results are shown in Table 7,
where statistical significance is marked with an asterisk.
In the case of Arabic, females are easier to identify in all varieties except Maghrebi,
where in addition the difference between genders is the highest, at more than 15 per-
centage points. Only for Levantine is the difference not statistically significant, at
about 2%. With respect to English, neither gender is consistently easier: it depends on
the variety. However, only for Australia, New Zealand and the United States are these
differences significant, with values of about 5%, 6% and a noteworthy 14%, respectively.
Something similar happens with Spanish, where the differences depend on the variety
without a predominantly easier gender. In this language, four out of seven varieties show
significant differences. The highest differences occur with Argentina (14%) and
Venezuela (10%), whereas with Chile and Spain these differences are smaller (about 6%
and 7% respectively). In the case of Portuguese there is a clear and significant difference
between genders in favour of females, who are easier to distinguish independently of the
variety, with an increase in accuracy of more than 12% (Brazil) and 14% (Portugal).
Table 7. Gender identification accuracies per language variety (* indicates a significant difference
according to a Student t-test).
Language Variety Female Male Average Difference
Arabic Egypt* 0.7426 0.6847 0.7137 0.0579
Gulf* 0.7932 0.6937 0.7435 0.0995
Levantine 0.7324 0.7108 0.7216 0.0216
Maghrebi* 0.6345 0.7847 0.7096 -0.1502
English Australia* 0.6871 0.7367 0.7119 -0.0496
Canada 0.7452 0.7412 0.7432 0.0040
Great Britain 0.7859 0.8063 0.7961 -0.0204
Ireland 0.8025 0.8015 0.8020 0.0010
New Zealand* 0.7514 0.8102 0.7808 -0.0588
United States* 0.6581 0.7962 0.7272 -0.1381
Portuguese Brazil* 0.8037 0.6834 0.7436 0.1203
Portugal* 0.8895 0.7432 0.8164 0.1463
Spanish Argentina* 0.8525 0.7097 0.7811 0.1428
Chile* 0.7528 0.8153 0.7841 -0.0625
Colombia 0.7836 0.7708 0.7772 0.0128
Mexico 0.7028 0.7242 0.7135 -0.0214
Peru 0.7900 0.7794 0.7847 0.0106
Spain* 0.7003 0.7697 0.7350 -0.0694
Venezuela* 0.6603 0.7625 0.7114 -0.1022
5.7 Gender and Language Variety Joint Evaluation
In this section a summary of the joint results per language is shown. Table 8 shows the
performance when both gender and language variety are properly detected, per
language. We can observe that the best results were achieved in Portuguese (85.75%),
followed by Spanish (80.36%), English (74.29%), and Arabic (68.31%). The difference
in accuracy among languages is considerable. Most of the participants obtained
better results than two of the baselines (BOW and statistical), although only 8 partici-
pants outperformed the LDR baseline. Furthermore, in the case of Portuguese only 9
teams outperformed the bag-of-words baseline, showing the power of simple words to
discriminate among varieties and genders in that language. On the contrary, this baseline
shows its inefficiency in the case of Arabic, where its accuracy drops to values close to
the statistical baseline.
Looking at Figure 7, it can be observed that there are a few outliers with lower
results. For Arabic, two participants obtained significantly worse results than
the rest. The authors in [25] approached the task with word embeddings and logistic
regression, obtaining 28.75% accuracy. The authors in [26] obtained 36.50% with the
100 most discriminative words per class and a distance-based algorithm; this approach
also produced the outlier results in Portuguese (54.88%) and Spanish (21.89%). With
regard to English, the authors in [1] obtained 10.17% with combinations of bag-of-
words and a series of emotional features learned with a distance-based approach.
Table 8. Joint results.
Ranking Team Arabic English Portuguese Spanish Average
1 Basile et al. 0.6831 0.7429 0.8288 0.8036 0.7646
2 Martinc et al. 0.6825 0.7042 0.8463 0.7850 0.7545
3 Tellez et al. 0.6713 0.7267 0.8425 0.7621 0.7507
4 Miura et al. 0.6419 0.6992 0.8575 0.7518 0.7376
5 Lopez-Monroy et al. 0.6475 0.7029 0.8100 0.7604 0.7302
6 Markov et al. 0.6525 0.7125 0.7750 0.7704 0.7276
7 Poulston et al. 0.6356 0.6254 0.8188 0.7471 0.7067
8 Sierra et al. 0.5694 0.6567 0.8113 0.7279 0.6913
LDR-baseline 0.5888 0.6357 0.7763 0.6943 0.6738
9 Ogaltsov & Romanov 0.5731 0.6450 0.7775 0.6846 0.6701
10 Franco-Salvador et al. 0.5688 0.6046 0.7525 0.7021 0.6570
11 Kodiyan et al. 0.5688 0.6263 0.7300 0.6646 0.6474
12 Ciobanu et al. 0.5619 0.5904 0.7575 0.6764 0.6466
13 Schaetti 0.5681 0.6150 0.7300 0.6718 0.6462
14 Kheng et al. 0.5475 0.5704 0.6475 0.6400 0.6014
15 Ganesh 0.5075 0.4713 0.7300 0.5614 0.5676
16 Kocher & Savoy 0.5206 0.4650 0.7575 0.4971 0.5601
BOW-baseline 0.1794 0.4713 0.7588 0.5561 0.4914
17 Ignatov et al. 0.2875 0.4333 0.6675 0.5593 0.4869
18 Khan 0.3650 0.1900 0.5488 0.2189 0.3307
19 Ribeiro-Oliveira et al. 0.4831 - 0.7538 - 0.3092
20 Alrifai et al. 0.5638 - - - 0.1410
STAT-baseline 0.1250 0.0833 0.2500 0.0714 0.1324
21 Bouazizi - 0.2479 - - 0.0620
22 Adame et al. - 0.1017 - - 0.0254
Min 0.2875 0.1017 0.5488 0.2189 0.0254
Q1 0.5408 0.4697 0.7300 0.6462 0.5052
Median 0.5688 0.6202 0.7575 0.6934 0.6470
Mean 0.5650 0.5566 0.7601 0.6658 0.5552
SDev 0.1010 0.1854 0.0768 0.1400 0.2308
Q3 0.6433 0.7001 0.8151 0.7582 0.7224
Max 0.6831 0.7429 0.8575 0.8036 0.7646
Figure 7. Distribution of results for joint identification in the different languages.
5.8 Final Ranking and Best Results
In Table 9 the final ranking is shown. The values in each column correspond to
the average of the accuracies for that subtask across the different languages. The rank-
ing shows that the best overall result (85.99%) was obtained by Basile et al. [10],
who used an SVM trained with combinations of character and tf-idf n-grams, followed
by Martinc et al. [40] (85.31%), who used logistic regression with combinations of
character, word, and POS n-grams, emojis, sentiments, character flooding, and lists of
words per variety. The third best result was obtained by Tellez et al. [59] (85.09%)
with an SVM. These three best results are not significantly different.
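As an illustration of the kind of model behind the top-ranked systems, the sketch below combines character and word tf-idf n-grams and feeds them to a linear SVM with scikit-learn. The toy tweets, labels, and n-gram ranges are our own assumptions, not the actual settings of Basile et al. [10]:

```python
# Minimal sketch of an SVM over combined character and word tf-idf
# n-grams for English variety identification (illustrative data only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

model = Pipeline([
    ("features", FeatureUnion([
        # Character n-grams capture sub-word cues (spelling, morphology).
        ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
        # Word n-grams capture lexical choices typical of a variety.
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ])),
    ("svm", LinearSVC()),
])

docs = ["mate the footy was great", "the subway downtown was packed",
        "heaps good arvo at the beach", "gotten a new truck last fall"]
labels = ["australia", "united states", "australia", "united states"]
model.fit(docs, labels)
print(model.predict(["what a great arvo mate"])[0])
```

In practice such a model would be trained per language on the full Twitter corpus, with the n-gram ranges and the SVM regularisation tuned on held-out data.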
Table 9. Global ranking as average of each language average for subtask.
Ranking Team Gender Variety Joint Average
1 Basile et al. 0.8253 0.9184 0.8361 0.8599
2 Martinc et al. 0.8224 0.9085 0.8285 0.8531
3 Tellez et al. 0.8097 0.9171 0.8258 0.8509
4 Miura et al. 0.8127 0.8982 0.8162 0.8424
5 Lopez-Monroy et al. 0.8047 0.8986 0.8111 0.8381
6 Markov et al. 0.7957 0.9056 0.8097 0.8370
7 Poulston et al. 0.7974 0.8786 0.7942 0.8234
8 Sierra et al. 0.7641 0.8911 0.7822 0.8125
LDR-baseline 0.7325 0.9187 0.7750 0.8087
9 Ogaltsov & Romanov 0.7669 0.8591 0.7653 0.7971
10 Franco-Salvador et al. 0.7667 0.8508 0.7582 0.7919
11 Schaetti 0.7207 0.8864 0.7511 0.7861
12 Kodiyan et al. 0.7531 0.8522 0.7509 0.7854
13 Ciobanu et al. 0.7504 0.8524 0.7498 0.7842
14 Kheng et al. 0.7002 0.8513 0.7176 0.7564
15 Ganesh 0.7342 0.7626 0.6881 0.7283
16 Kocher & Savoy 0.7178 0.7661 0.6813 0.7217
17 Ignatov et al. 0.6917 0.7024 0.6270 0.6737
BOW-baseline 0.6763 0.6907 0.6195 0.6622
18 Khan 0.6252 0.5296 0.4952 0.5500
19 Ribeiro-Oliveira et al. 0.3666 0.4141 0.3092 0.3633
STAT-baseline 0.5000 0.2649 0.2991 0.3547
20 Alrifai et al. 0.1806 0.1888 0.1701 0.1798
21 Bouazizi 0.1530 0.0931 0.1027 0.1163
22 Adame et al. 0.1353 0.0476 0.0695 0.0841
Min 0.1353 0.0476 0.0695 0.0841
Q1 0.6938 0.7175 0.6406 0.6857
Median 0.7518 0.8523 0.7510 0.7857
Mean 0.6588 0.7215 0.6427 0.6743
SDev 0.2260 0.2798 0.2472 0.2502
Q3 0.7970 0.8964 0.8058 0.8336
Max 0.8253 0.9184 0.8361 0.8599
As can be seen in Figure 8, there are several outliers with results lower than the
rest. These outliers correspond, in almost all tasks, to the same authors, who applied
SVMs trained with style-based features such as character flooding or different ratios of
hashtags and mentions [53, 3], or a distance-based method with a large variety of emo-
tional features combined with bag-of-words [1].
Figure 8. Distribution of results in the different tasks.
In Table 10 the best results per language and task are shown. We can observe that
for both the gender and variety subtasks, the best results were achieved in Portuguese,
followed by Spanish, English, and Arabic. For gender identification, the accuracies
range from 80.31% for Arabic to 87% for Portuguese, whereas the spread is wider for
language variety identification: the worst result, 83.13%, was obtained in Arabic
(4 varieties), against 98.38% in Portuguese (2 varieties). The result for Spanish
(7 varieties, 96.21%) is close to Portuguese, while for English (6 varieties) it falls to
89.88%.
Table 10. Best results per language and task.
Language Joint Gender Variety
Arabic 0.6831 0.8031 0.8313
English 0.7429 0.8233 0.8988
Spanish 0.8036 0.8321 0.9621
Portuguese 0.8575 0.8700 0.9838
6 Conclusion
In this paper we presented the results of the 5th International Author Profiling Shared
Task at PAN 2017 within CLEF 2017. The participants had to identify gender and lan-
guage variety from Twitter authors collected in four different languages: Arabic, En-
glish, Portuguese, and Spanish.
The participants used different feature sets to approach the problem: content-based
features (among others: bag of words, word n-grams, term vectors, named entities,
dictionary words, slang words, contractions, sentiment words) and style-based features
(among others: frequencies, punctuation, POS, Twitter-specific elements, readability
measures). For the first time, deep learning approaches were used: recurrent neural
networks and convolutional neural networks, as well as word and character embeddings.
It is difficult to single out the contribution of each particular feature since the
participants combined many of them. However, the deep learning approaches did not
obtain the best results.
In both subtasks as well as in the joint identification, the best results have been
obtained for Portuguese. In case of gender identification, average results are quite sim-
ilar among languages, with the lowest result for Arabic (72.10%), followed by Span-
ish (75.53%), English (75.71%), and Portuguese (78.00%). In case of language variety
the worst average result has been obtained for English (71.50%), followed by Arabic
(75.14%), Spanish (87.07%), and Portuguese (97.31%). However, in this case the dif-
ference between the worst and the best results is much higher (25.81% vs. 5.9%).
By analysing the errors made when discriminating among varieties, we found interest-
ing facts. For example, in Arabic, the most difficult variety to identify is Gulf,
whereas the easiest one is Levantine. In the case of English, the highest confusion
occurs among varieties that share the same geographical location: America, Europe, or
Oceania. In the case of Portuguese, where only two varieties are analysed, the asymmetry
of the confusion matrix is noteworthy: most errors occur when the Portugal variety is
confused with the Brazilian one (4.45%), rather than the other way around (0.93%).
In the case of Spanish, most confusions point to Colombia, although the most confused
variety is the one from Peru. On the contrary, the easiest varieties to identify are those
from Argentina and Spain.
Due to the geographical impact on the identification of English varieties, we have car-
ried out a coarse-grained evaluation by combining varieties per continent: America,
Europe, and Oceania. Although results increase with statistical significance, the differ-
ences are not very high for the best performing approaches (3.75%).
We have analysed the impact of gender on language variety identification, showing
that in Arabic and Portuguese the difference between genders is statistically signifi-
cant. Similarly, we have analysed the difficulty of gender identification depending on
the language variety, obtaining different insights with respect to both the easier gender
to identify and the significance of these results. For example, for most Arabic varieties
females are easier to identify (except for Maghrebi), as is also the case for Portuguese.
For Spanish and English, which gender is easier to identify depends on the variety.
The joint results show a distribution similar to that of the final evaluation. In this regard,
Basile et al. [10], Martinc et al. [40], and Tellez et al. [59] obtained the best results, with-
out significant differences among them. They approached the task, respectively, with an
SVM trained on combinations of character and tf-idf n-grams; with logistic regression
on combinations of character, word, and POS n-grams, emojis, sentiments, character
flooding, and lists of words per variety; and with an SVM.
Acknowledgements
Our special thanks go to all of PAN’s participants, and to MeaningCloud
(http://www.meaningcloud.com/) for sponsoring also this edition of the author profil-
ing shared task award. The first author acknowledges the SomEMBED
TIN2015-71147-C2-1-P MINECO research project. The work
on the data in Arabic as well as this publication were made possible by NPRP grant
#9-175-1-033 from the Qatar National Research Fund (a member of Qatar Foundation).
The statements made herein are solely the responsibility of the first two authors.
Bibliography
[1] Yaritza Adame, Daniel Castro, Reynier Ortega, and Rafael Muñoz. Author
profiling, instance-based similarity classification. In Cappellato et al. [13].
[2] Ahmed Ali, Peter Bell, and Steve Renals. Automatic dialect detection in arabic
broadcast speech. In Interspeech, 2015.
[3] Khaled Alrifai, Ghaida Rebdawi, and Nada Ghneim. Arabic tweeps gender and
dialect prediction. In Cappellato et al. [13].
[4] Kholoud Alsmearat, Mahmoud Al-Ayyoub, and Riyad Al-Shalabi. An extensive
study of the bag-of-words approach for gender identification of arabic articles. In
2014 IEEE/ACS 11th International Conference on Computer Systems and
Applications (AICCSA), pages 601–608. IEEE, 2014.
[5] Kholoud Alsmearat, Mohammed Shehab, Mahmoud Al-Ayyoub, Riyad
Al-Shalabi, and Ghassan Kanaan. Emotion analysis of arabic articles and its
impact on identifying the author’s gender. In Computer Systems and Applications
(AICCSA), 2015 IEEE/ACS 12th International Conference on, 2015.
[6] Emad AlSukhni and Qasem Alequr. Investigating the use of machine learning
algorithms in detecting gender of the arabic tweet author. International Journal
of Advanced Computer Science & Applications, 1(7):319–328, 2016.
[7] Miguel-Angel Álvarez-Carmona, A.-Pastor López-Monroy, Manuel
Montes-Y-Gómez, Luis Villaseñor-Pineda, and Hugo Jair-Escalante. Inaoe’s
participation at pan’15: author profiling task—notebook for pan at clef 2015.
2015.
[8] Shlomo Argamon, Moshe Koppel, Jonathan Fine, and Anat Rachel Shimoni.
Gender, genre, and writing style in formal written texts. TEXT, 23:321–346,
2003.
[9] Ranaivo-Malançon Bali. Automatic identification of close languages–case study:
malay and indonesian. ECTI Transaction on Computer and Information
Technology, 2(2):126–133, 2006.
[10] Angelo Basile, Gareth Dwyer, Maria Medvedeva, Josine Rawee, Hessel
Haagsma, and Malvina Nissim. Is there life beyond n-grams? a simple
svm-based author profiling system. In Cappellato et al. [13].
[11] Mondher Bouazizi and Ohtsuki Tomoaki. Participation at the author profiling
shared task at pan at clef’17
http://pan.webis.de/clef17/pan17-web/author-profiling.html. 2017.
[12] John D. Burger, John Henderson, George Kim, and Guido Zarrella.
Discriminating gender on twitter. In Proceedings of the Conference on Empirical
Methods in Natural Language Processing, EMNLP ’11, pages 1301–1309,
Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
[13] Linda Cappellato, Nicola Ferro, Lorraine Goeuriot, and Thomas Mandl, editors.
CLEF 2017 Working Notes. CEUR Workshop Proceedings (CEUR-WS.org),
ISSN 1613-0073, http://ceur-ws.org/Vol-/, 2017. CLEF and CEUR-WS.org.
[14] Alina Maria Ciobanu, Marcos Zampieri, Shervin Malmasi, and Liviu P. Dinu.
Including dialects and language varieties in author profiling. In Cappellato et al.
[13].
[15] Heba Elfardy and Mona T Diab. Sentence level dialect identification in arabic. In
ACL (2), pages 456–461, 2013.
[16] Dominique Estival, Tanja Gaustad, Ben Hutchinson, Son Bao Pham, and Will
Radford. Author profiling for english and arabic emails. 2008.
[17] Marc Franco-Salvador, Francisco Rangel, Paolo Rosso, Mariona Taulé, and
M Antònia Martí. Language variety identification using distributed
representations of words and documents. In Experimental IR meets
multilinguality, multimodality, and interaction, pages 28–40. Springer, 2015.
[18] Marc Franco-Salvador, Nataliia Plotnikova, Neha Pawar, and Yassine Benajiba.
Subword-based deep averaging networks for author profiling in social media. In
Cappellato et al. [13].
[19] Barathi Ganesh and Anand Kumar. Participation at the author profiling shared
task at pan at clef (http://pan.webis.de/clef17/pan17-web/author-profiling.html).
2017.
[20] Tim Gollub, Benno Stein, and Steven Burrows. Ousting ivory tower research:
towards a web framework for providing experiments as a service. In Bill Hersh,
Jamie Callan, Yoelle Maarek, and Mark Sanderson, editors, 35th International
ACM Conference on Research and Development in Information Retrieval (SIGIR
12), pages 1125–1126. ACM, August 2012. ISBN 978-1-4503-1472-5.
[21] Tim Gollub, Benno Stein, Steven Burrows, and Dennis Hoppe. TIRA:
Configuring, executing, and disseminating information retrieval experiments. In
A Min Tjoa, Stephen Liddle, Klaus-Dieter Schewe, and Xiaofang Zhou, editors,
9th International Workshop on Text-based Information Retrieval (TIR 12) at
DEXA, pages 151–155, Los Alamitos, California, September 2012. IEEE. ISBN
978-1-4673-2621-6.
[22] Tim Gollub, Martin Potthast, Anna Beyer, Matthias Busse, Francisco Rangel,
Paolo Rosso, Efstathios Stamatatos, and Benno Stein. Recent trends in digital
text forensics and its evaluation. In Pamela Forner, Henning Müller, Roberto
Paredes, Paolo Rosso, and Benno Stein, editors, Information Access Evaluation
meets Multilinguality, Multimodality, and Visualization. 4th International
Conference of the CLEF Initiative (CLEF 13), pages 282–302, Berlin Heidelberg
New York, September 2013. Springer. ISBN 978-3-642-40801-4.
[23] Janet Holmes and Miriam Meyerhoff. The handbook of language and gender.
Blackwell Handbooks in Linguistics. Wiley, 2003.
[24] Chu-Ren Huang and Lung-Hao Lee. Contrastive approach towards text source
classification based on top-bag-of-word similarity. In PACLIC, pages 404–410,
2008.
[25] Andrey Ignatov, Liliya Akhtyamova, and John Cardiff. Twitter author profiling
using word embeddings and logistic regression. In Cappellato et al. [13].
[26] Jamal Ahmad Khan. Author profile prediction using trend and word frequency
based analysis in text. In Cappellato et al. [13].
[27] Guillaume Kheng, Léa Laporte, and Michael Granitzer. Insa lyon and uni pasau’s
participation at pan@clef’17: Author profiling task. In Cappellato et al. [13].
[28] Mirco Kocher and Jacques Savoy. Unine at clef 2017: Author profiling
reasoning. In Cappellato et al. [13].
[29] Don Kodiyan, Florin Hardegger, Mark Cieliebak, and Stephan Neuhaus. Author
profiling with bidirectional rnns using attention with grus. In Cappellato et al.
[13].
[30] Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically
categorizing written texts by author gender. literary and linguistic computing
17(4), 2002.
[31] Nikola Ljubesic, Nives Mikelic, and Damir Boras. Language identification:
how to distinguish similar languages? In 2007 29th International Conference on
Information Technology Interfaces, pages 541–546. IEEE, 2007.
[32] A. Pastor Lopez-Monroy, Manuel Montes-Y-Gomez, Hugo Jair Escalante, Luis
Villasenor-Pineda, and Esau Villatoro-Tello. INAOE’s participation at PAN’13:
author profiling task—Notebook for PAN at CLEF 2013. In Pamela Forner,
Roberto Navigli, and Dan Tufis, editors, CLEF 2013 Evaluation Labs and
Workshop – Working Notes Papers, 23-26 September, Valencia, Spain, September
2013.
[33] A. Pastor López-Monroy, Manuel Montes y Gómez, Hugo Jair-Escalante, and
Luis Villaseñor-Pineda. Using intra-profile information for author
profiling—Notebook for PAN at CLEF 2014. In L. Cappellato, N. Ferro,
M. Halvey, and W. Kraaij, editors, CLEF 2014 Evaluation Labs and Workshop –
Working Notes Papers, 15-18 September, Sheffield, UK, September 2014.
[34] A. Pastor López-Monroy, Manuel Montes y Gómez, Hugo Jair-Escalante,
Luis Villaseñor-Pineda, and Thamar Solorio. Uh-inaoe participation at pan17:
Author profiling. In Cappellato et al. [13].
[35] Marco Lui and Paul Cook. Classifying english documents by national dialect. In
Proceedings of the Australasian Language Technology Association Workshop,
pages 5–15. Citeseer, 2013.
[36] Suraj Maharjan, Prasha Shrestha, Thamar Solorio, and Ragib Hasan. A
straightforward author profiling approach in mapreduce. In Advances in Artificial
Intelligence. Iberamia, pages 95–107, 2014.
[37] Wolfgang Maier and Carlos Gómez-Rodríguez. Language variety identification
in spanish tweets. LT4CloseLang 2014, page 25, 2014.
[38] Shervin Malmasi, Mark Dras, et al. Automatic language identification for persian
and dari texts. In Proceedings of PACLING, pages 59–64, 2015.
[39] Ilia Markov, Helena Gómez-Adorno, and Grigori Sidorov. Language- and
subtask-dependent feature selection and classifier parameter tuning for author
profiling. In Cappellato et al. [13].
[40] Matej Martinc, Iza Škrjanec, Katja Zupan, and Senja Pollak. Pan 2017: Author
profiling - gender and language variety prediction. In Cappellato et al. [13].
[41] Yasuhide Miura, Tomoki Taniguchi, Motoki Taniguchi, and Tomoko Ohkuma.
Author profiling with word+character neural attention network. In Cappellato
et al. [13].
[42] Alexander Ogaltsov and Alexey Romanov. Language variety and gender
classification for author profiling in pan 2017. In Cappellato et al. [13].
[43] James W. Pennebaker. The secret life of pronouns: what our words say about us.
Bloomsbury USA, 2013.
[44] James W. Pennebaker, Mathias R. Mehl, and Kate G. Niederhoffer.
Psychological aspects of natural language use: our words, our selves. Annual
review of psychology, 54(1):547–577, 2003.
[45] Adam Poulston, Zeerak Waseem, and Mark Stevenson. Using tf-idf n-gram and
word embedding cluster ensembles for author profiling. In Cappellato et al. [13].
[46] Francisco Rangel and Paolo Rosso. On the multilingual and genre robustness of
emographs for author profiling in social media. In 6th international conference
of CLEF on experimental IR meets multilinguality, multimodality, and
interaction, pages 274–280. Springer-Verlag, LNCS(9283), 2015.
[47] Francisco Rangel and Paolo Rosso. On the impact of emotions on author
profiling. Information processing & management, 52(1):73–92, 2016.
[48] Francisco Rangel, Paolo Rosso, Moshe Koppel, Efstathios Stamatatos,
and Giacomo Inches. Overview of the author profiling task at pan 2013. In
Forner P., Navigli R., Tufis D. (Eds.), CLEF 2013 labs and workshops, notebook
papers. CEUR-WS.org, vol. 1179, 2013.
[49] Francisco Rangel, Paolo Rosso, Irina Chugur, Martin Potthast, Martin
Trenkmann, Benno Stein, Ben Verhoeven, and Walter Daelemans. Overview of
the 2nd author profiling task at pan 2014. In Cappellato L., Ferro N., Halvey M.,
Kraaij W. (Eds.) CLEF 2014 labs and workshops, notebook papers.
CEUR-WS.org, vol. 1180, 2014.
[50] Francisco Rangel, Paolo Rosso, Martin Potthast, Benno Stein, and Walter
Daelemans. Overview of the 3rd author profiling task at pan 2015. In Cappellato
L., Ferro N., Jones G., San Juan E. (Eds.) CLEF 2015 labs and workshops,
notebook papers. CEUR Workshop Proceedings. CEUR-WS.org, vol. 1391, 2015.
[51] Francisco Rangel, Paolo Rosso, and Marc Franco-Salvador. A low
dimensionality representation for language variety identification. In 17th
International Conference on Intelligent Text Processing and Computational
Linguistics, CICLing. Springer-Verlag, LNCS, arXiv:1705.10754, 2016.
[52] Francisco Rangel, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin
Potthast, and Benno Stein. Overview of the 4th author profiling task at PAN
2016: cross-genre evaluations. In Working Notes Papers of the CLEF 2016
Evaluation Labs, CEUR Workshop Proceedings. CLEF and CEUR-WS.org,
September 2016.
[53] Rodrigo Ribeiro-Oliveira and Rosalvo Ferreira de Oliveira-Neto. Using character
n-grams and style features for gender and language variety identification. In
Cappellato et al. [13].
[54] Charles A Russell and Bowman H Miller. Profile of a terrorist. Studies in
Conflict & Terrorism, 1(1):17–34, 1977.
[55] Fatiha Sadat, Farnazeh Kazemi, and Atefeh Farzindar. Automatic identification
of arabic language varieties and dialects in social media. Proceedings of
SocialNLP, page 22, 2014.
[56] Nils Schaetti. Unine at clef 2017: Tf-idf and deep-learning for author profiling.
In Cappellato et al. [13].
[57] Jonathan Schler, Moshe Koppel, Shlomo Argamon, and James W. Pennebaker.
Effects of age and gender on blogging. In AAAI Spring Symposium:
Computational Approaches to Analyzing Weblogs, pages 199–205. AAAI, 2006.
[58] Sebastian Sierra, Manuel Montes y Gómez, Thamar Solorio, and Fabio A.
González. Convolutional neural networks for author profiling in pan 2017. In
Cappellato et al. [13].
[59] Eric S. Tellez, Sabino Miranda-Jiménez, Mario Graff, and Daniela Moctezuma.
Gender and language variety identification with microtc. In Cappellato et al. [13].
[60] Edson Weren, Anderson Kauer, Lucas Mizusaki, Viviane Moreira, Palazzo
de Oliveira, and Leandro Wives. Examining multiple features for author
profiling. In Journal of Information and Data Management, pages 266–279,
2014.
[61] Marcos Zampieri and Binyam Gebrekidan Gebre. Automatic identification of
language varieties: the case of portuguese. In The 11th conference on natural
language processing (KONVENS), pages 233–237. Österreichischen Gesellschaft
für Artificial Intelligende (ÖGAI), 2012.