<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the 6th Author Profiling Task at PAN 2018: Multimodal Gender Identification in Twitter</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Francisco Rangel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Rosso</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Potthast</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuel Montes-y-Gómez</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benno Stein</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Autoritas Consulting</institution>
          ,
          <addr-line>S.A.</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>INAOE</institution>
          ,
          <country country="MX">Mexico</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Leipzig University</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>PRHLT Research Center, Universitat Politècnica de València</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Web Technology &amp; Information Systems, Bauhaus-Universität Weimar</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This overview presents the framework and the results of the Author Profiling shared task at PAN 2018. The objective of this year's task is to address gender identification from a multimodal perspective, where not only texts but also images are given. For this purpose a corpus with Twitter data has been provided, covering the languages Arabic, English, and Spanish. Altogether, the approaches of 23 participants are evaluated.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Author profiling is the analysis of shared content in order to predict different attributes
of authors such as gender, age, personality, native language, or political orientation.
Supported by the huge amount of information that is available on social media
platforms, author profiling has gained a lot of interest. Being able to infer an author’s
gender, age, native language, dialects, or personality opens a world of possibilities—among
others in marketing, where companies may analyze online reviews to improve targeted
advertising, or in forensics, where the profile of authors could be used as valuable
additional evidence in criminal investigations, and in security, where knowing the
demographics of social media users (age and gender), as well as cultural and social context
such as native language and dialects, may help to identify potential terrorists [51].</p>
      <p>In the following we provide a historical outline of previous editions of this task.
In the Author Profiling task at PAN 2013<sup>1</sup> [45], the identification of age and gender
relied on a large corpus collected from social media, both for English and Spanish. In
PAN 2014<sup>2</sup> [46], we continued focusing on age and gender aspects but, in addition,
compiled a corpus of four different genres, namely social media, blogs, Twitter, and
hotel reviews. Except for the hotel review subcorpus, which was available for English
only, all documents were provided in both English and Spanish. Note that most of the
existing research in computational linguistics [6] and social psychology [40] focuses
on the English language, and the question is whether the observed relations pertain to
other languages and genres as well. In this vein, in PAN 2015<sup>3</sup> [47], we included two
new languages, Italian and Dutch, besides a new subtask on personality recognition in
Twitter. In PAN 2016<sup>4</sup> [50], we investigated the effect of cross-genre information: the
models are trained on one genre (here: Twitter) and evaluated on a genre
other than Twitter. In PAN 2017<sup>5</sup> [19], we considered language variety
identification together with the gender dimension. We evaluated this new subtask in four
languages: Arabic, English, Portuguese and Spanish.
<fn id="fn1"><label>1</label><p>http://webis.de/research/events/pan-13/pan13-web/author-profiling.html</p></fn>
<fn id="fn2"><label>2</label><p>http://webis.de/research/events/pan-14/pan14-web/author-profiling.html</p></fn></p>
      <p>Social media data cover a wide range of modalities such as text, images, audio, and
video, all of which contain useful information that can be exploited to extract valuable
insights about users. Consequently, the objective of this year’s evaluation<sup>6</sup> is to address
gender identification from a multimodal perspective: not only texts but also images
are given. For this purpose a corpus with Twitter data has been provided, covering the
languages Arabic, English, and Spanish.</p>
      <p>The remainder of this paper is organized as follows. Section 2 covers the state of the
art, Section 3 describes the corpus and the evaluation measures, and Section 4 presents
the approaches submitted by the participants. Sections 5 and 6 discuss results and draw
conclusions respectively.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>The relationship between personal traits and the use of language has been widely
studied by the psycholinguist Pennebaker [41], who analysed how the use of language
varies depending on personal traits. For example, with regard to the authors’ gender,
he found that in English women use more negations and the first person, because they
are more self-conscious, whereas men use more prepositions in order to describe
their environment. These findings are the basis of LIWC (Linguistic Inquiry and Word
Count) [40], which is one of the most widely used tools in author profiling.</p>
      <p>Initial investigations in author profiling [6, 26, 13, 28, 54] focused mainly on
formal texts and blogs, with reported accuracies ranging from 75% to 80%. Nowadays,
however, researchers focus mainly on social media, where the language is more
spontaneous and less formal. The contributions of researchers who used the PAN
datasets deserve to be highlighted. For example, the authors in [35] showed how
to deal with a large dataset such as PAN-AP-2013, with 3 million features, using a
MapReduce configuration. With the same dataset, the authors in [67] showed the
contribution of information retrieval-based features. Following Pennebaker’s findings about the
relationship between emotions and gender, the authors in [44] proposed EmoGraph, a
graph-based approach that captures how users convey verbal emotions in the
morphosyntactic structure of the discourse. They showed results competitive with the best
performing systems at PAN 2013 and demonstrated the robustness of the approach across
genres and languages at PAN 2014 [43]. Recently, Bayot and Gonçalves [10] used the
PAN-AP-2016 dataset to show that word embeddings worked better than TF-IDF for gender
identification. Finally, it is worth mentioning the second order
representation based on relationships between documents and profiles used by the best performing
team in three editions of PAN [30, 31, 4], as well as the performance of the combination
of n-grams as shown by the authors [9] of the best performing team at PAN 2017.
<fn id="fn3"><label>3</label><p>http://pan.webis.de/clef15/pan15-web/author-profiling.html</p></fn>
<fn id="fn4"><label>4</label><p>http://pan.webis.de/clef16/pan16-web/author-profiling.html</p></fn>
<fn id="fn5"><label>5</label><p>http://pan.webis.de/clef17/pan17-web/author-profiling.html</p></fn>
<fn id="fn6"><label>6</label><p>https://pan.webis.de/clef18/pan18-web/author-profiling.html</p></fn></p>
      <p>Research on Arabic is scarcer, and most of it has focused on
genres other than social media. For example, Estival et al. [18] focused on Arabic emails and
reported an accuracy of 72.10%. Similarly, Alsmearat et al. [2] focused on Arabic
newsletters. They initially reported an accuracy of 86.4%, which was increased to 94% in
an extension of their work [1]. With respect to social media, AlSukhni &amp; Alequr [3]
focused on Arabic tweets and reported an accuracy of 99.50%. They improved a
bag-of-words model with the use of the Twitter authors’ names.</p>
      <p>The use of visual features for author profiling has been less studied. A common
approach for gender identification is the use of frontal facial images [37, 60, 17]. The
authors in [37] trained an SVM with 1,755 low resolution thumbnail faces (21x12 pixels)
from the FERET face database<sup>7</sup>, obtaining an error of 3.4%. The authors in [60] used
Principal Component Analysis to represent each image in a lower-dimensional space,
reducing the error from 17.7% to 11.3% with a neural network. The authors in [17]
experimented with 120 combinations of automatic face detection, face alignment and
gender classification. They found that automatic face alignment did not increase
the gender classification rates, whereas manual alignment did. The authors
evaluated several machine learning algorithms, obtaining the best results with SVM. They
also observed that the classification did not depend on the size of the images. Recently, user
annotated data have been used more and more. For example, Twitter has been used as
a repository to learn and evaluate gender identification systems. In this sense, the authors
in [34] used automatic image annotations and the authors in [56] proposed a Multi-task
Bilinear Model to combine the visual concept detector with the feature extractor to
predict gender in Twitter. Similarly, the authors in [8] used 56 image aesthetic features for
gender identification on 24,000 images provided by 120 Flickr users, obtaining an accuracy
of 82.50%.</p>
    </sec>
    <sec id="sec-3">
      <title>Evaluation Framework</title>
      <p>The purpose of this section is to introduce the technical background. We outline the
construction of the corpus, introduce the performance measures and baselines, and
describe the idea of so-called software submissions.</p>
      <sec id="sec-3-1">
        <title>Corpus</title>
        <p>The focus of this year’s task is on gender identification in Twitter from a multimodal
perspective: besides textual information, the participants are also provided with images.
The task is framed as a multilingual task, covering the languages Arabic, English, and
Spanish.
<fn id="fn7"><label>7</label><p>https://www.nist.gov/programs-projects/face-recognition-technology-feret</p></fn></p>
        <p>The PAN-AP-2018 corpus is based on the PAN-AP-2017 corpus [49], extended by
images that have been shared in the respective Twitter timelines. More specifically,
PAN-AP-2018 contains those authors from the PAN-AP-2017 corpus who still have a
Twitter account and who have shared at least 10 images. Table 1 overviews the key
figures of the corpus. Moreover, the corpus is balanced with regard to gender and it
contains 100 tweets per author.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Performance Measures</title>
        <p>The participants were asked to submit three predictions per author, according to the
following modalities: a) text-based, b) image-based, and c) a combination of both.
Participants were allowed to approach the task in a favoured language and a favoured modality;
however, we encouraged them to participate in all languages and all modalities.<sup>8</sup></p>
        <p>
          For each language and for each modality the accuracy was computed. Note that the
accuracy of the combined approach has been chosen as overall accuracy for the given
language; if only the textual approach was submitted, its accuracy has been used. The
final ranking has been calculated as the average accuracy per language as defined by
the following equation:
          <disp-formula id="eq1">
            <label>(1)</label>
            <tex-math>\mathit{ranking} = \frac{acc_{ar} + acc_{en} + acc_{es}}{3}</tex-math>
          </disp-formula>
        </p>
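        <p>The ranking rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the organisers' evaluation script: the function names and the example accuracies are hypothetical, and the handling of participants who skipped a language is not specified in the text and is therefore left out.</p>
        <preformat>
```python
from statistics import mean

LANGUAGES = ("ar", "en", "es")


def overall_accuracy(text_acc, combined_acc=None):
    """Per-language accuracy used for ranking: the combined (text + image)
    accuracy if it was submitted, otherwise the text-only accuracy."""
    return combined_acc if combined_acc is not None else text_acc


def ranking_score(acc_by_language):
    """Average of the per-language accuracies for Arabic, English, Spanish."""
    return mean(acc_by_language[lang] for lang in LANGUAGES)


# Hypothetical accuracies for one participant (not taken from the results).
scores = {
    "ar": overall_accuracy(text_acc=0.78, combined_acc=0.79),
    "en": overall_accuracy(text_acc=0.80, combined_acc=0.82),
    "es": overall_accuracy(text_acc=0.76),  # only the text run was submitted
}
print(round(ranking_score(scores), 4))
```
        </preformat>
        <p>Keeping the fallback to the text-only accuracy in its own function makes the rule explicit and keeps the averaging step identical for all participants.</p>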
      </sec>
      <sec id="sec-3-3">
        <title>Baselines</title>
        <p>In order to assess the complexity of the subtasks per language and to compare the
performances of the participants’ approaches, we propose the following baselines:
          <list list-type="bullet">
            <list-item><p>BASELINE-stat. A statistical baseline that emulates random choice. As there are
two classes and the number of instances is balanced, the random choice baseline is
50% accuracy. This baseline applies to both modalities, images and texts.</p></list-item>
            <list-item><p>BASELINE-bow. To approach the textual modality, we have represented the
documents under a bag-of-words model with the 5,000 most common words in the
training set, weighted by absolute frequency. The texts are preprocessed as
follows: lowercase words, removal of punctuation signs and numbers, and removal of
stop words for the corresponding language.</p></list-item>
            <list-item><p>BASELINE-rgb. To approach the image modality, we represent the photos as
follows. For each author, we obtain the RGB color for each pixel in his/her photos.
We represent the author with the following descriptive statistics of the RGB values:
minimum, maximum, mean, median, and standard deviation.</p></list-item>
          </list>
          <fn id="fn8"><label>8</label><p>Of the 23 participants, 22 participated in the Arabic and Spanish tasks, and all of them in
the English tasks. All of them approached the task with text features; 12 participants
also used images.</p></fn>
        </p>
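        <p>The feature extraction behind the two non-trivial baselines might look roughly as follows. This is a sketch under stated assumptions rather than the organisers' code: the tiny stop word list, the ASCII-only punctuation handling, the per-channel computation of the RGB statistics, and the use of the population standard deviation are all choices made here for illustration; the text does not specify them.</p>
        <preformat>
```python
import re
import string
from collections import Counter
from statistics import mean, median, pstdev

# Tiny illustrative stop word list; a real run would use a full list per language.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def tokenize(text):
    """Lowercase, drop numbers and ASCII punctuation, then drop stop words."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def build_vocabulary(training_docs, size=5000):
    """The `size` most common words in the training set."""
    counts = Counter(tok for doc in training_docs for tok in tokenize(doc))
    return [word for word, _ in counts.most_common(size)]


def bow_vector(doc, vocabulary):
    """Absolute-frequency bag-of-words vector over the vocabulary."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocabulary]


def rgb_statistics(pixels):
    """Min, max, mean, median and standard deviation per colour channel.

    `pixels` is a list of (r, g, b) tuples pooled over all photos of one
    author; computing the statistics per channel is an assumption.
    """
    features = []
    for channel in zip(*pixels):
        features.extend([min(channel), max(channel), mean(channel),
                         median(channel), pstdev(channel)])
    return features


train = ["The cat and the dog", "A dog in the park 2018!"]
vocab = build_vocabulary(train)
print(vocab, bow_vector("Dog, dog and cat", vocab))
```
        </preformat>
        <p>Both representations are fixed-length vectors per author, so either can be passed to any standard classifier.</p>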
      </sec>
      <sec id="sec-3-4">
        <title>Software Submissions</title>
        <p>We asked for software submissions (as opposed to run submissions). Within software
submissions, participants submit executables of their author profiling software instead
of just the output (also called “run”) of their software on a given test set. Our
rationale to do so is to increase the sustainability of our shared task and to allow for the
re-evaluation of approaches to Author Profiling later on, and, in particular, on future
evaluation corpora. To facilitate software submissions, we developed the TIRA
experimentation platform [21, 22], which renders the handling of software submissions at
scale as simple as handling run submissions. Using TIRA, participants deploy their
software on virtual machines at our site, which allows us to keep them in a running
state [23].</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Overview of the Submitted Approaches</title>
      <p>This year, 23 teams participated in the Author Profiling shared task and 22 of them
submitted a notebook paper.<sup>9</sup> We analyse their approaches from three perspectives:
preprocessing, features to represent the authors’ texts, and classification approaches.</p>
      <sec id="sec-4-1">
        <title>Preprocessing</title>
        <p>Various participants cleaned the textual contents to obtain plain text. Most of them
removed or normalised Twitter-specific elements such as URLs, user mentions, or
hashtags [15, 61, 59, 42, 53, 24, 66, 36, 65, 38, 29]. Some participants also lowercased the
words [66, 65, 38, 11, 29, 59, 53, 24]. The authors in [15, 59, 24, 65] removed
punctuation signs; character flooding has been removed by the authors in [15, 42]. Stopwords
have been removed by the authors in [15, 42, 24, 65], and contractions and
abbreviations have been expanded by the authors in [59, 42]. The authors in [15] applied specific
preprocessing to Arabic texts, such as normalisation and diacritics removal.</p>
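        <p>A typical cleaning pipeline of the kind described above could be sketched as follows. It is an illustrative composite, not the code of any particular participant: the placeholder tokens, the decision to keep the hashtag word, and the rule of shortening character floods to two repetitions are assumptions made here.</p>
        <preformat>
```python
import re
import string

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
FLOOD_RE = re.compile(r"(.)\1{2,}")  # three or more repeats of one character


def preprocess_tweet(text):
    """Normalise Twitter-specific elements and clean a tweet."""
    text = URL_RE.sub(" URL ", text)        # replace links by a placeholder
    text = MENTION_RE.sub(" USER ", text)   # replace user mentions
    text = HASHTAG_RE.sub(r" \1 ", text)    # keep the hashtag word, drop '#'
    text = text.lower()                     # placeholders become lowercase too
    text = FLOOD_RE.sub(r"\1\1", text)      # "sooooo" becomes "soo"
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())           # collapse whitespace


print(preprocess_tweet("Sooooo happy!!! @friend check https://t.co/abc #MondayMood"))
# prints: soo happy user check url mondaymood
```
        </preformat>
        <p>Language-specific steps such as stop word removal or Arabic diacritics removal would be added as further stages after this language-independent core.</p>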
        <p>Only three participants preprocessed images. The authors in [61] applied direct
resizing and resizing with cropping, as well as normalisation by subtracting the average
RGB value per language. The authors in [36] rescaled all images to 64x64 and used only
those containing human faces, while the authors in [57] rescaled all images to 224 pixel
width, maintaining the aspect ratio.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Features</title>
        <p>In previous editions of the author profiling task at PAN as well as in the cited
literature, features used for representing text documents have been distinguished as either
content-based or style-based. However, this year several participants have employed
deep learning techniques. It is interesting to differentiate between traditional features and
these new methods in order to compare their performance in the author profiling task.
While the authors in [36, 65, 11, 33, 61] represented documents with word embeddings,
the authors in [53] used character embeddings. Moreover, the authors in [59, 52, 33]
also used traditional features such as character, word, and/or POS n-grams. The authors
in [39] combined word embeddings with stylistic features for English; however,
for Spanish and Arabic they used LSA instead of word embeddings.
<fn id="fn9"><label>9</label><p>Hacohen-Kerner et al. described in their working note the participation of two teams.</p></fn></p>
        <p>Traditional features such as character and word n-grams have been widely used
[66, 62, 38, 29, 16, 24, 59, 15]. Style features have also been used by some
participants [39, 27, 24]. For example, the authors in [39] used the counts of stopwords,
punctuation marks, emoticons, and slang words (only for English). The authors in [27]
combined POS-tag n-grams with syntactic dependencies to model the use of
amplifiers, verbal constructions, pronouns, subjects and objects, types of adverbials, as well
as the use of interjections and profanity. The authors in [24] counted the average
number of characters and the average number of words per tweet. The authors in [66] also
used emojis, whereas the authors in [20] used only the skewness calculated from a
variation of the Low Dimensionality Statistical Embedding (LDSE) [48]. The authors in [5]
combined ensembles of word and character n-grams with bag-of-terms and second
order features [30, 31, 32], which relate documents to authors’ profiles.</p>
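        <p>As a reminder of what character and word n-gram features look like in practice, the following sketch counts both kinds of n-grams over the tweets of one author. The n-gram ranges and the feature prefixes are hypothetical choices made for illustration; the participants used a variety of lengths and weighting schemes.</p>
        <preformat>
```python
from collections import Counter


def char_ngrams(text, n):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(tokens, n):
    """All overlapping word n-grams, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_profile(tweets, char_range=(3, 5), word_range=(1, 2)):
    """Count character and word n-grams over all tweets of one author."""
    profile = Counter()
    for tweet in tweets:
        tokens = tweet.split()
        for n in range(char_range[0], char_range[1] + 1):
            profile.update("c:" + g for g in char_ngrams(tweet, n))
        for n in range(word_range[0], word_range[1] + 1):
            profile.update("w:" + g for g in word_ngrams(tokens, n))
    return profile


profile = ngram_profile(["hi there", "hi again"])
print(profile["w:hi"], profile["c:hi "])
```
        </preformat>
        <p>The resulting counts can be used directly (term frequency), binarised, or reweighted with tf-idf before training a linear classifier.</p>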
        <p>With respect to the representation of images, several approaches have been
presented. For example, some participants tried to detect faces in images [59, 15, 65].
In this regard, the authors in [65] used face vectors from images that contained only
faces. Besides faces, the authors in [15] also detected objects and quantified local
binary patterns and color histograms. Other authors used image resources, such as [39],
who applied an image captioning system [64]. Similarly, the authors in [38] used a
known image feature extraction tool [7] to obtain features about the number of faces in
the images, as well as the expressed emotions or their gender. The authors in [5] used
ImageNet [58] to obtain VGG16<sup>10</sup> features, and the authors in [53] built a
language-independent model with TorchVision.<sup>11</sup> The authors in [61] also used a pre-trained
Convolutional Neural Network (CNN) based on VGG16. Other participants approached the
task with their own set of features, such as the authors in [24] who combined three sets
of characteristics: Shift, RGB histogram, and VGG. The authors in [62] designed a
variant of the Bag-of-Visual-Words (BoVW) by using the DAISY [63] feature descriptor
and encoded the images by the set of visual words.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Classification Approaches</title>
        <p>Regarding the deep learning approaches, the authors with the overall highest
accuracy [61] used Recurrent Neural Networks (RNN) for texts and CNN for images. CNNs
have also been used by the authors in [5, 53, 55, 36], while RNNs have also been used
by the authors in [11]. Interestingly, the authors in [53] used CNN only for texts and
ResNet18 [25] for images. In the same vein, the authors in [65] approached the images
with SVM but used Bi-LSTM for texts. The authors in [59] used CNN for images and
an ensemble of Naive Bayes and RNN for texts. Finally, the authors in [42] approached
the task with dense neural networks.
<fn id="fn10"><label>10</label><p>Visual Geometry Group: http://www.robots.ox.ac.uk/~vgg/research/very_deep</p></fn>
<fn id="fn11"><label>11</label><p>https://pytorch.org/docs/stable/torchvision/index.html</p></fn></p>
        <p>Some participants still used traditional machine learning algorithms such as logistic
regression [52, 24, 66, 38], SVMs [33, 5, 15, 39, 62, 65], multilayer perceptron [24],
a basic feed-forward network [29], and distance-based methods [62, 27]. It is worth
mentioning the approach in [20], which used a simple IF condition on only one
feature, allowing the system to process the whole dataset in seconds while achieving a
decent performance.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Evaluation and Discussion of the Submitted Approaches</title>
      <p>Although we encouraged participants to consider both modalities, some approached
the problem with text features only. We present the results separately to account for this
fact.</p>
      <sec id="sec-5-1">
        <title>Gender Identification with Text Features</title>
        <p>As can be seen in Table 2, the best results were obtained for English (82.21%) [16] and
Spanish (82%) [16], although being only slightly better than for Arabic (81.70%) [62].
This similarity is also reflected by the mean accuracies, which are 74.85% for Arabic,
76.93% for English, and 75.46% for Spanish. A closer look at the distributions
(Figure 1) reveals a different characteristic for English: the median is higher and
approximately equal to the Q3 of the other languages, while the interquartile range is
smaller. The similarity of the mean values is due to two outliers for English (55.21% [27] and
66.58% [52]). This fact is highlighted in the density chart (Figure 2), where the curve
for the English language is more skewed to the right and the kurtosis is higher since
there are more results concentrated around 80%.</p>
        <p>The best result for Arabic (81.70%) is from the authors in [62]; they performed
several preprocessing steps and trained an SVM with word n-grams, character n-grams,
and skip-grams of different lengths and different weighting schemes such as boolean, tf,
and tf-idf. The difference with respect to the second (81.20%) [57]
and third (80.90%) [16] best results is not statistically significant. The authors approached the task with character
n-grams and combinations of different types of n-grams. The best result for English
(82.21%) comes from the authors in [16]. The difference to the
second (81.21%) [62] and third (81.16%) [38] best results is not statistically significant. The authors in [38] used
Logistic Regression with word and character n-grams. Finally, for Spanish, the best result
(82%) is from the authors in [16]. Again, the difference to
the second (80.36%) [65] and third (80.27%) [38] best systems is not statistically significant. The authors in [65]
used a bi-LSTM with pre-trained word embeddings.</p>
        <p>With respect to the provided baselines, we can discard the statistical one since its
results are much lower than those obtained by the participants. The BOW baseline is at
rank 17 out of 22 in the overall ranking.<sup>12</sup> Furthermore, for Arabic the obtained result
(74.80%) is very close to the mean (74.85%), while 9 participants are below. For English
and Spanish, most participants were better than the baseline. For English, the obtained
result (74.11%) is lower than the mean (76.93%) and even lower than the Q1 (76.34%),
with 4 participants below (including the aforementioned outliers [27, 52]). For Spanish,
the obtained result (72.55%) is below the mean (75.46%) and the Q1 (73.70%), with
5 participants below (including one outlier).
<fn id="fn12"><label>12</label><p>The system of Karlgren et al. is not counted since they participated in the English tasks only.</p></fn></p>
        <p>As can be seen in Table 3, the best results with images were achieved for English (81.63%), with
statistical significance over Spanish (77.32%) and Arabic (77.80%). All best results
stem from the authors in [61], who used a pre-trained CNN on the basis of ImageNet.
Despite this higher value for the best obtained result for English, the distributions of
accuracies are very similar for the three languages, as can be seen in Figures 3 and 4.
The mean values are 62.37%, 63.41%, and 61.86% for Arabic, English, and Spanish
respectively, with standard deviations below 10%; the accuracies follow a normal distribution.</p>
        <p>For Arabic, the second best result (72.80%) has been obtained by the authors in [57],
who used VGG16 and ResNet50 from ImageNet. The third best result (70.10%) has
been obtained by the authors in [15]. Besides color histograms they have detected
faces, objects, and local binary patterns. Although the difference between them is not statistically
significant at the 95% confidence level, their difference to the best result is (though not at 99%).
For English, the second (74.42%) and third (69.63%) best results are from the authors
in [57] and [15] respectively. In both cases the difference is statistically significant.
Similarly, for Spanish the second (71%) and third (68.05%) best results are from the
authors in [57] and [15] respectively. Again, the difference is statistically significant.</p>
        <p>As before, we can discard the statistical baseline. Similarly, most of the participants
have achieved better results than the RGB baseline (52.60% on average); two
participants achieved slightly lower results (50.23% and 50.22%) [24]. For all languages the
baseline (54.10%, 51.79%, and 51.91%) is below the respective Q1s (55.57%, 56.89%,
and 56.40%). Also note that this baseline is only slightly better than the statistical one,
which shows that it is not suitable for the task.
</p>
        <p>
We now analyse how images can help to tackle the gender identification task. Table 4
shows the basic statistics about the improvement (in %) for the different languages. On
average, the improvement is very small (0.76% and 1.01% for Arabic and English), or
even negative (-0.06%) for Spanish. However, Figure 5 shows that
some systems perform much better, such as that of Takahashi et al., who achieved an
improvement of 7.73% for English.</p>
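        <p>The text does not spell out how the improvement is computed. One plausible reading, sketched below purely as an assumption, is the relative change of the combined accuracy over the text-only accuracy, expressed in percent. The function name and the example accuracies are hypothetical and are not taken from the results tables.</p>
        <preformat>
```python
def improvement(text_acc, combined_acc):
    """Relative change (in %) of the combined accuracy over the text-only one.

    Assumed definition; an absolute difference in percentage points would be
    another possible reading of "improvement (in %)".
    """
    return 100.0 * (combined_acc - text_acc) / text_acc


# Hypothetical accuracies, chosen only to illustrate the sign convention.
print(round(improvement(0.80, 0.82), 2))  # positive: adding images helped
print(round(improvement(0.80, 0.78), 2))  # negative: adding images hurt
```
        </preformat>
        <p>Under this reading, a negative value means that adding the image modality lowered the accuracy relative to using texts alone.</p>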
        <p>Tables 5, 6, and 7 show the accuracies obtained with texts, with images, with
their combination, and the percentage of improvement for Arabic, English, and Spanish
respectively. Similarly, Figures 6, 7, and 8 show for the same languages the density
of the improvement distribution over text classification.</p>
        <p>Table 5 shows the results for Arabic. As can be seen in Figure 6, the results do
not follow a normal distribution; the improvement of most of the participants is
between -0.26% and 0.53%, whereas three participants obtain higher improvements: 1.82% [61],
2.93% [5], and 3.36% [39]. It is noteworthy that the systems that obtained the highest
results tried to capture semantic features from images, and not only faces or colors.
For example, Gopal-Patra et al. [39] used an image captioning system [64], Aragon &amp;
Lopez [5] used ImageNet to obtain VGG16 features, and Takahashi et al. [61] used a pre-trained
CNN also on the basis of ImageNet.</p>
        <p>The distribution of improvements for English is even less normal, as can be seen in
Figure 7. There are three groups of systems (see Table 6): i) systems whose change ranges
from a deterioration of 4.65% to an improvement of 0.72%, ii) one system with an improvement of
2.37% [39], and iii) one system with an improvement of 7.73% [61]. Similar to Arabic,
the best results have been achieved by systems that exploit semantic features [61, 39].
Furthermore, the least negative results have been achieved either with the use of
ImageNet and VGG16 features [5] or with the combination of face recognition, object
detection, local binary patterns, and color histograms [15].</p>
        <p>For Spanish the systems’ improvements follow a normal distribution, with
spikes at both extremes. In particular, there is i) one system whose deterioration is
4.47% [57], ii) a group of systems with improvement/deterioration between -1.30% and
1.62%, and iii) one system with an improvement of 3.75% [61]. In this regard, the best
result has been obtained by Takahashi et al. with a pre-trained CNN from ImageNet,
followed by the use of an image captioning system [39], the combination of faces,
objects, and local binary patterns with color histograms [15], and the use of ImageNet
to obtain VGG16 features [5].</p>
        <p>This year 23 teams participated in the shared task; Table 8 shows the overall
performance per language and user’s ranking. The best results have been obtained for English
(85.84%), followed by Spanish (82%), and Arabic (81.80%).</p>
        <p>The overall best result (81.98%) is from the authors in [61], who approached the
task with deep neural networks. For text processing, they learned word embeddings from
a stream of tweets with FastText skip-grams and trained a Recurrent Neural Network.
For images, they used a pre-trained Convolutional Neural Network. They combined
both approaches with a fusion component. The authors in [16] obtained the second best
result on average (81.70%) by approaching the task only from the textual perspective.
They used an SVM with different types of word and character n-grams. The third best
overall result (80.68%) stems from the authors in [62]. They used an SVM with
combinations of word and character n-grams for texts and a variant of the Bag of Visual
Words for images, combining both predictions with a convex linear combination.
According to Student’s t-test, the differences among the three approaches are not statistically significant.
This is also supported by the Bayesian Signed-Rank test [12] between Takahashi et al.
and Daneshvar, as shown in Figure 9. However, for Takahashi et al. and Tellez et al.,
the probability of the first system performing better (62.96%) is higher than the sum of the
probabilities of being equal (20.64%) or worse (16.39%), as shown in Figure 10. The complete results
of this test are presented in Appendix B.</p>
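        <p>A convex linear combination of two modality-specific predictions, as mentioned above for the text and image classifiers, can be sketched as follows. This is a generic illustration and not the implementation of [62]: the weight value, the 0.5 decision threshold, and the function names are assumptions.</p>
        <preformat>
```python
def convex_combination(p_text, p_image, weight):
    """Blend two probabilities of the same class with a weight in [0, 1]."""
    if not (1.0 >= weight >= 0.0):
        raise ValueError("weight must lie in [0, 1]")
    return weight * p_text + (1.0 - weight) * p_image


def predict_gender(p_text_female, p_image_female, weight=0.7):
    """Fuse the per-modality probabilities and threshold the result at 0.5."""
    p_female = convex_combination(p_text_female, p_image_female, weight)
    return "female" if p_female >= 0.5 else "male"


# Hypothetical classifier outputs for one author.
print(predict_gender(0.6, 0.2, weight=0.7))
print(predict_gender(0.6, 0.2, weight=1.0))
```
        </preformat>
        <p>Because the weights are non-negative and sum to one, the fused score stays a valid probability; the weight would normally be tuned on held-out data.</p>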
        <p>With respect to the different languages, the best results have been obtained by the
same authors. The best results for Arabic (81.80%) stem from the authors in [62], the
best results for English (85.84%) from the authors in [61], and the best results for
Spanish (82%) from the authors in [16]. Note that the only result that is significantly higher
is the one obtained for English (85.84%).</p>
        <p>Table 9 shows the best results per language and modality. The results achieved with
the textual approach are higher than the results obtained with images, although they are
very similar for English. It should be highlighted that the best results were
obtained by combining texts and images, where in the case of English the improvement
is higher.</p>
        <p>In this paper we presented the results of the 6th International Author Profiling Shared
Task at PAN 2018, hosted at CLEF 2018. The participants had to identify the gender
of Twitter authors, considering both a multimodal and a multilingual perspective: the
provided data contain both tweets and images and cover the three languages Arabic,
English, and Spanish.</p>
        <p>The participants used different approaches to tackle the task, with deep learning
approaches prevailing. However, the best results in the textual subtask were
obtained with combinations of different types of n-grams and traditional machine
learning algorithms such as SVM and Logistic Regression. Only the second-best result for
Spanish was obtained with a bi-LSTM, which was trained with word embeddings.</p>
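        <p>As an illustration of the n-gram representations mentioned above, the following minimal sketch counts word and character n-grams of a tweet into a single sparse feature dictionary. The function names, the feature prefixes, and the chosen n-gram orders are assumptions made for this example; the participants' systems additionally apply weighting schemes and feed the features to classifiers such as SVM or Logistic Regression, which is not shown here.</p>
        <preformat>
```python
from collections import Counter


def word_ngrams(text, n):
    """Lowercased, whitespace-tokenized word n-grams."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Lowercased character n-grams, including spaces."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def ngram_features(text, word_orders=(1, 2), char_orders=(3,)):
    """Count word and character n-grams into one sparse feature dictionary."""
    feats = Counter()
    for n in word_orders:
        # The "w:" prefix keeps word features apart from character features.
        feats.update("w:" + g for g in word_ngrams(text, n))
    for n in char_orders:
        feats.update("c:" + g for g in char_ngrams(text, n))
    return feats
```
        </preformat>
        <p>Combining several n-gram orders in one dictionary mirrors the idea of mixing word-level and character-level evidence in a single feature space.</p>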
        <p>For the classification of images, the approaches can be grouped into three types: i)
approaches based on face recognition, ii) approaches based on pre-trained models and
image processing tools such as ImageNet, and iii) approaches with “hand-crafted”
features such as color histograms and bag-of-visual-words. Regarding the second type,
the best results were obtained with semantic features extracted from the images.
Approaches based on face recognition are not among the best, which may be rooted in
the fact that many images do not show faces and, when they do, the faces shown do not
necessarily depict the author.</p>
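        <p>To make the notion of a hand-crafted image feature concrete, the following minimal sketch computes a normalized joint RGB color histogram from a list of pixels. The function name, the pixel representation as (r, g, b) tuples with values from 0 to 255, and the default of 4 bins per channel are assumptions for illustration only and do not reproduce any participant's feature extractor.</p>
        <preformat>
```python
def color_histogram(pixels, bins=4):
    """Normalized joint RGB histogram of an iterable of (r, g, b) tuples in 0..255."""
    hist = [0.0] * (bins ** 3)
    count = 0
    for r, g, b in pixels:
        # Map each channel value in 0..255 to a bin index in 0..bins-1.
        rb = r * bins // 256
        gb = g * bins // 256
        bb = b * bins // 256
        hist[rb * bins * bins + gb * bins + bb] += 1.0
        count += 1
    if count == 0:
        return hist
    # Normalize so that images of different sizes become comparable.
    return [h / count for h in hist]
```
        </preformat>
        <p>The resulting fixed-length vector could then be passed to any standard classifier, alone or concatenated with other features.</p>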
        <p>According to the achieved results, text features discriminate better between genders
than images do. However, the combined use of both modalities is informative: on
average, there is no improvement when images are used, which is due to the low
performance of some of the weaker approaches. For more elaborate representations,
which obtain semantics from the images with tools such as ImageNet, the
improvement is up to 7.73% for English (bearing in mind that the accuracy obtained
with text features alone is already high).</p>
        <p>The best results in the shared task are over 80% on average, with the highest result
for English (85.84%) [61], followed by Spanish (82%) [16], and Arabic (81.80%) [62].
Takahashi et al. [61] approached the task with deep learning techniques: word
embeddings and an RNN for texts and an ImageNet-based CNN for images. Daneshvar [16]
approached the task using the textual modality only, training an SVM with
combinations of word and character n-grams. Finally, Tellez et al. [62] used an SVM
with different kinds of n-grams, combined with a variant of the Bag of Visual Words
(BoVW) using the DAISY feature descriptor. Altogether, traditional approaches
remain competitive, while some new approaches based on deep learning are gaining
ground.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Acknowledgements</title>
        <p>Our special thanks go to all PAN participants for providing high-quality
submissions, and to MeaningCloud (http://www.meaningcloud.com/) for sponsoring the author profiling shared task award.
The first author acknowledges the SomEMBED TIN2015-71147-C2-1-P MINECO
research project. The third author acknowledges the CONACyT FC-2016/2410. The work
on the data for Arabic as well as this publication were made possible by NPRP grant
#9-175-1-033 from the Qatar National Research Fund (a member of Qatar Foundation).
The statements made herein are solely the responsibility of the first two authors.
[4] Miguel-Angel Álvarez-Carmona, A.-Pastor López-Monroy, Manuel
Montes-Y-Gómez, Luis Villaseñor-Pineda, and Hugo Jair-Escalante. Inaoe’s
participation at pan’15: author profiling task—notebook for pan at clef 2015.
2015.
[5] Mario Ezra Aragón and A.-Pastor López-Monroy. A straightforward multimodal
approach for author profiling. In Patrice Bellot, Chiraz Trabelsi, Josiane Mothe,
Fionn Murtagh, Jian Yun Nie, Laure Soulier, Eric Sanjuan, Linda Cappellato, and
Nicola Ferro, editors, Experimental IR Meets Multilinguality, Multimodality, and
Interaction. Proceedings of the Ninth International Conference of the CLEF
Association (CLEF 2018), September 2018.
[6] Shlomo Argamon, Moshe Koppel, Jonathan Fine, and Anat Rachel Shimoni.</p>
        <p>Gender, genre, and writing style in formal written texts. TEXT, 23:321–346,
2003.
[7] Octavio Arriaga, Matias Valdenegro-Toro, and Paul Plöger. Real-time
convolutional neural networks for emotion and gender classification. arXiv
preprint arXiv:1710.07557, 2017.
[8] Samiul Azam and Marina Gavrilova. Gender prediction using individual
perceptual image aesthetics. 2016.
[9] Angelo Basile, Gareth Dwyer, Maria Medvedeva, Josine Rawee, Hessel
Haagsma, and Malvina Nissim. N-gram: New groningen author-profiling model.
arXiv preprint arXiv:1707.03764, 2017.
[10] Roy Bayot and Teresa Gonçalves. Multilingual author profiling using word
embedding averages and svms. In Software, Knowledge, Information
Management &amp; Applications (SKIMA), 2016 10th International Conference on,
pages 382–386. IEEE, 2016.
[11] Roy Khristopher Bayot and Teresa Gonçalves. Multilingual author profiling
using lstms. In Patrice Bellot, Chiraz Trabelsi, Josiane Mothe, Fionn Murtagh,
Jian Yun Nie, Laure Soulier, Eric Sanjuan, Linda Cappellato, and Nicola Ferro,
editors, Experimental IR Meets Multilinguality, Multimodality, and Interaction.
Proceedings of the Ninth International Conference of the CLEF Association
(CLEF 2018), September 2018.
[12] A. Benavoli, F. Mangili, G. Corani, M. Zaffalon, and F. Ruggeri. A Bayesian
Wilcoxon signed-rank test based on the Dirichlet process. In Proceedings of the
30th International Conference on Machine Learning (ICML 2014), pages 1–9,
2014. URL http://www.idsia.ch/~alessio/benavoli2014a.pdf.
[13] John D. Burger, John Henderson, George Kim, and Guido Zarrella.</p>
        <p>Discriminating gender on twitter. In Proceedings of the Conference on Empirical
Methods in Natural Language Processing, EMNLP ’11, pages 1301–1309,
Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
[14] Linda Cappellato, Nicola Ferro, Lorraine Goeuriot, and Thomas Mandl, editors.</p>
        <p>
          CLEF 2017 Working Notes. CEUR Workshop Proceedings (CEUR-WS.org),
ISSN 1613-0073, http://ceur-ws.org/Vol-/, 2017. CLEF and CEUR-WS.org.
[15] Giovanni Ciccone, Arthur Sultan, Léa Laporte, Elöd Egyed-Zsigmond, Alaa
Alhamzeh, and Michael Granitzer. Stacked gender prediction from tweet texts
and images. In Patrice Bellot, Chiraz Trabelsi, Josiane Mothe, Fionn Murtagh,
Jian Yun Nie, Laure Soulier, Eric Sanjuan, Linda Cappellato, and Nicola Ferro,
editors, Experimental IR Meets Multilinguality, Multimodality, and Interaction.
Proceedings of the Ninth International Conference of the CLEF Association
(CLEF 2018), September 2018.
[16] Saman Daneshvar. Gender identification in twitter using n-grams and lsa. In
Patrice Bellot, Chiraz Trabelsi, Josiane Mothe, Fionn Murtagh, Jian Yun Nie,
Laure Soulier, Eric Sanjuan, Linda Cappellato, and Nicola Ferro, editors,
Experimental IR Meets Multilinguality, Multimodality, and Interaction.
Proceedings of the Ninth International Conference of the CLEF Association
(CLEF 2018), September 2018.
[17] Makinen Erno, Roope Raisamo, et al. Evaluation of gender classification
methods with automatically detected and aligned faces. IEEE Transactions on
Pattern Analysis &amp; Machine Intelligence, (3):541–547, 2007.
[18] Dominique Estival, Tanja Gaustad, Ben Hutchinson, Son Bao Pham, and Will
        </p>
        <p>Radford. Author profiling for english and arabic emails. 2008.
[19] Francisco Manuel Rangel Pardo, Paolo Rosso, Martin Potthast, and Benno Stein.</p>
        <p>Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language
Variety Identification in Twitter. In Linda Cappellato, Nicola Ferro, Lorraine
Goeuriot, and Thomas Mandl, editors, Working Notes Papers of the CLEF 2017
Evaluation Labs, volume 1866 of CEUR Workshop Proceedings. CLEF and
CEUR-WS.org, September 2017. URL http://ceur-ws.org/Vol-1866/.
[20] Òscar Garibo-Orts. A big data approach to gender classification in twitter. In
Patrice Bellot, Chiraz Trabelsi, Josiane Mothe, Fionn Murtagh, Jian Yun Nie,
Laure Soulier, Eric Sanjuan, Linda Cappellato, and Nicola Ferro, editors,
Experimental IR Meets Multilinguality, Multimodality, and Interaction.
Proceedings of the Ninth International Conference of the CLEF Association
(CLEF 2018), September 2018.
[21] Tim Gollub, Benno Stein, and Steven Burrows. Ousting ivory tower research:
towards a web framework for providing experiments as a service. In Bill Hersh,
Jamie Callan, Yoelle Maarek, and Mark Sanderson, editors, 35th International
ACM Conference on Research and Development in Information Retrieval (SIGIR
12), pages 1125–1126. ACM, August 2012. ISBN 978-1-4503-1472-5.
[22] Tim Gollub, Benno Stein, Steven Burrows, and Dennis Hoppe. TIRA:
Configuring, executing, and disseminating information retrieval experiments. In
A Min Tjoa, Stephen Liddle, Klaus-Dieter Schewe, and Xiaofang Zhou, editors,
9th International Workshop on Text-based Information Retrieval (TIR 12) at
DEXA, pages 151–155, Los Alamitos, California, September 2012. IEEE. ISBN
978-1-4673-2621-6.
[23] Tim Gollub, Martin Potthast, Anna Beyer, Matthias Busse, Francisco Rangel,
Paolo Rosso, Efstathios Stamatatos, and Benno Stein. Recent trends in digital
text forensics and its evaluation. In Pamela Forner, Henning Müller, Roberto
Paredes, Paolo Rosso, and Benno Stein, editors, Information Access Evaluation
meets Multilinguality, Multimodality, and Visualization. 4th International
Conference of the CLEF Initiative (CLEF 13), pages 282–302, Berlin Heidelberg
New York, September 2013. Springer. ISBN 978-3-642-40801-4.
[24] Yaakov HaCohen-Kerner, Yair Yigal, Elyashiv Shayovitz, Daniel Miller, and
Toby Breckon. Author profiling: Gender prediction from tweets and images. In
Patrice Bellot, Chiraz Trabelsi, Josiane Mothe, Fionn Murtagh, Jian Yun Nie,
Laure Soulier, Eric Sanjuan, Linda Cappellato, and Nicola Ferro, editors,
Experimental IR Meets Multilinguality, Multimodality, and Interaction.
Proceedings of the Ninth International Conference of the CLEF Association
(CLEF 2018), September 2018.
[25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual
learning for image recognition. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 770–778, 2016.
[26] Janet Holmes and Miriam Meyerhoff. The handbook of language and gender.</p>
        <p>Blackwell Handbooks in Linguistics. Wiley, 2003.
[27] Jussi Karlgren, Lewis Esposito, Chantal Gratton, and Pentti Kanerva. Authorship
profiling without using topical information. In Patrice Bellot, Chiraz Trabelsi,
Josiane Mothe, Fionn Murtagh, Jian Yun Nie, Laure Soulier, Eric Sanjuan, Linda
Cappellato, and Nicola Ferro, editors, Experimental IR Meets Multilinguality,
Multimodality, and Interaction. Proceedings of the Ninth International
Conference of the CLEF Association (CLEF 2018), September 2018.
[28] Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically
categorizing written texts by author gender. Literary and Linguistic Computing,
17(4), 2002.
[29] Rick Kosse, Youri Schuur, and Guido Cnossen. Mixing traditional methods with
neural networks for gender prediction. In Patrice Bellot, Chiraz Trabelsi, Josiane
Mothe, Fionn Murtagh, Jian Yun Nie, Laure Soulier, Eric Sanjuan, Linda
Cappellato, and Nicola Ferro, editors, Experimental IR Meets Multilinguality,
Multimodality, and Interaction. Proceedings of the Ninth International
Conference of the CLEF Association (CLEF 2018), September 2018.
[30] A. Pastor Lopez-Monroy, Manuel Montes-Y-Gomez, Hugo Jair Escalante, Luis
Villasenor-Pineda, and Esau Villatoro-Tello. INAOE’s participation at PAN’13:
author profiling task—Notebook for PAN at CLEF 2013. In Pamela Forner,
Roberto Navigli, and Dan Tufis, editors, CLEF 2013 Evaluation Labs and
Workshop – Working Notes Papers, 23-26 September, Valencia, Spain, September
2013.
[31] A. Pastor López-Monroy, Manuel Montes y Gómez, Hugo Jair-Escalante, and
Luis Villaseñor Pineda. Using intra-profile information for author
profiling—Notebook for PAN at CLEF 2014. In L. Cappellato, N. Ferro,
M. Halvey, and W. Kraaij, editors, CLEF 2014 Evaluation Labs and Workshop –
Working Notes Papers, 15-18 September, Sheffield, UK, September 2014.
[32] A. Pastor López-Monroy, Manuel Montes y Gómez, Hugo Jair-Escalante,
Luis Villaseñor Pineda, and Thamar Solorio. Uh-inaoe participation at pan17:
Author profiling. In Cappellato et al. [14].
[33] Roberto López-Santillán, Luis-Carlos González-Gurrola, and Graciela
Ramírez-Alonso. Custom document embeddings via the centroids method:
Gender classification in an author profiling task. In Patrice Bellot, Chiraz
Trabelsi, Josiane Mothe, Fionn Murtagh, Jian Yun Nie, Laure Soulier, Eric
Sanjuan, Linda Cappellato, and Nicola Ferro, editors, Experimental IR Meets
Multilinguality, Multimodality, and Interaction. Proceedings of the Ninth
International Conference of the CLEF Association (CLEF 2018), September
2018.
[34] Xiaojun Ma, Yukihiro Tsuboshita, and Noriji Kato. Gender estimation for sns
user profiling using automatic image annotation. In Multimedia and Expo
Workshops (ICMEW), 2014 IEEE International Conference on, pages 1–6. IEEE,
2014.
[35] Suraj Maharjan, Prasha Shrestha, Thamar Solorio, and Ragib Hasan. A
straightforward author profiling approach in mapreduce. In Advances in Artificial
Intelligence. Iberamia, pages 95–107, 2014.
[36] Matej Martinc, Blaž Škrlj, and Senja Pollak. Multilingual gender classification
with multi-view deep learning. In Patrice Bellot, Chiraz Trabelsi, Josiane Mothe,
Fionn Murtagh, Jian Yun Nie, Laure Soulier, Eric Sanjuan, Linda Cappellato, and
Nicola Ferro, editors, Experimental IR Meets Multilinguality, Multimodality, and
Interaction. Proceedings of the Ninth International Conference of the CLEF
Association (CLEF 2018), September 2018.
[37] Baback Moghaddam and Ming-Hsuan Yang. Gender classification with support
vector machines. In Automatic Face and Gesture Recognition, 2000.</p>
        <p>Proceedings. Fourth IEEE International Conference on, pages 306–311. IEEE,
2000.
[38] Moniek Nieuwenhuis and Jeroen Wilkens. Twitter text and image gender
classification with a logistic regression n-gram model. In Patrice Bellot, Chiraz
Trabelsi, Josiane Mothe, Fionn Murtagh, Jian Yun Nie, Laure Soulier, Eric
Sanjuan, Linda Cappellato, and Nicola Ferro, editors, Experimental IR Meets
Multilinguality, Multimodality, and Interaction. Proceedings of the Ninth
International Conference of the CLEF Association (CLEF 2018), September
2018.
[39] Braja Gopal Patra, Kumar Gourav Das, and Dipankar Das. Multimodal author
profiling for arabic, english, and spanish. In Patrice Bellot, Chiraz Trabelsi,
Josiane Mothe, Fionn Murtagh, Jian Yun Nie, Laure Soulier, Eric Sanjuan, Linda
Cappellato, and Nicola Ferro, editors, Experimental IR Meets Multilinguality,
Multimodality, and Interaction. Proceedings of the Ninth International
Conference of the CLEF Association (CLEF 2018), September 2018.
[40] James W. Pennebaker. The secret life of pronouns: what our words say about us.</p>
        <p>Bloomsbury USA, 2013.
[41] James W. Pennebaker, Mathias R. Mehl, and Kate G. Niederhoffer.</p>
        <p>
          Psychological aspects of natural language use: our words, our selves. Annual
review of psychology, 54(1):547–577, 2003.
[42] Kashyap Raiyani, Paulo Quaresma, Teresa Gonçalves, and Vitor Beires-Nogueira.
        </p>
        <p>
          Multi-language neural network model with advance preprocessor for gender
classification over social media. In Patrice Bellot, Chiraz Trabelsi, Josiane
Mothe, Fionn Murtagh, Jian Yun Nie, Laure Soulier, Eric Sanjuan, Linda
Cappellato, and Nicola Ferro, editors, Experimental IR Meets Multilinguality,
Multimodality, and Interaction. Proceedings of the Ninth International
Conference of the CLEF Association (CLEF 2018), September 2018.
[43] Francisco Rangel and Paolo Rosso. On the multilingual and genre robustness of
emographs for author profiling in social media. In 6th international conference
of CLEF on experimental IR meets multilinguality, multimodality, and
interaction, pages 274–280. Springer-Verlag, LNCS(9283), 2015.
[44] Francisco Rangel and Paolo Rosso. On the impact of emotions on author
profiling. Information processing &amp; management, 52(1):73–92, 2016.
[45] Francisco Rangel, Paolo Rosso, Moshe Koppel, Efstathios Stamatatos,
and Giacomo Inches. Overview of the author profiling task at pan 2013. In
Forner P., Navigli R., Tufis D. (Eds.), CLEF 2013 labs and workshops, notebook
papers. CEUR-WS.org, vol. 1179, 2013.
[46] Francisco Rangel, Paolo Rosso, Irina Chugur, Martin Potthast, Martin
Trenkmann, Benno Stein, Ben Verhoeven, and Walter Daelemans. Overview of
the 2nd author profiling task at pan 2014. In Cappellato L., Ferro N., Halvey M.,
Kraaij W. (Eds.) CLEF 2014 labs and workshops, notebook papers.
        </p>
        <p>CEUR-WS.org, vol. 1180, 2014.
[47] Francisco Rangel, Paolo Rosso, Martin Potthast, Benno Stein, and Walter
Daelemans. Overview of the 3rd author profiling task at pan 2015. In Cappellato
L., Ferro N., Jones G., San Juan E. (Eds.) CLEF 2015 labs and workshops,
notebook papers. CEUR Workshop Proceedings. CEUR-WS.org, vol. 1391, 2015.
[48] Francisco Rangel, Paolo Rosso, and Marc Franco-Salvador. A low
dimensionality representation for language variety identification. In 17th
International Conference on Intelligent Text Processing and Computational
Linguistics, CICLing. Springer-Verlag, LNCS, arXiv:1705.10754, 2016.
[49] Francisco Rangel, Paolo Rosso, Martin Potthast, and Benno Stein. Overview of
the 5th Author Profiling Task at PAN 2017: Gender and Language Variety
Identification in Twitter. In Cappellato L., Ferro N., Goeuriot L., Mandl T. (Eds.)
CLEF 2017 Labs and Workshops, Notebook Papers. CEUR Workshop
Proceedings. CEUR-WS.org, vol. 1866, September 2017.
[50] Francisco Rangel, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin
Potthast, and Benno Stein. Overview of the 4th author profiling task at PAN
2016: cross-genre evaluations. In Working Notes Papers of the CLEF 2016
Evaluation Labs, CEUR Workshop Proceedings. CLEF and CEUR-WS.org,
September 2016.
[51] Charles A Russell and Bowman H Miller. Profile of a terrorist. Studies in</p>
        <p>
          Conflict &amp; Terrorism, 1(1):17–34, 1977.
[52] Rafael-Felipe Sandroni-Dias and Ivandré Paraboni. Author profiling using word
embeddings with subword information. In Patrice Bellot, Chiraz Trabelsi,
Josiane Mothe, Fionn Murtagh, Jian Yun Nie, Laure Soulier, Eric Sanjuan, Linda
Cappellato, and Nicola Ferro, editors, Experimental IR Meets Multilinguality,
Multimodality, and Interaction. Proceedings of the Ninth International
Conference of the CLEF Association (CLEF 2018), September 2018.
[53] Nils Schaetti. Unine at clef 2018: Character-based convolutional neural network
and resnet18 for twitter author profiling. In Patrice Bellot, Chiraz Trabelsi,
Josiane Mothe, Fionn Murtagh, Jian Yun Nie, Laure Soulier, Eric Sanjuan, Linda
Cappellato, and Nicola Ferro, editors, Experimental IR Meets Multilinguality,
Multimodality, and Interaction. Proceedings of the Ninth International
Conference of the CLEF Association (CLEF 2018), September 2018.
[54] Jonathan Schler, Moshe Koppel, Shlomo Argamon, and James W. Pennebaker.
        </p>
        <p>Effects of age and gender on blogging. In AAAI Spring Symposium:</p>
        <p>Computational Approaches to Analyzing Weblogs, pages 199–205. AAAI, 2006.
[55] Erhan Sezerer, Ozan Polatbilek, Özge Sevgili, and Selma Tekir. Gender
prediction from tweets with convolutional neural networks. In Patrice Bellot,
Chiraz Trabelsi, Josiane Mothe, Fionn Murtagh, Jian Yun Nie, Laure Soulier,
Eric Sanjuan, Linda Cappellato, and Nicola Ferro, editors, Experimental IR
Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Ninth
International Conference of the CLEF Association (CLEF 2018), September
2018.
[56] Ryosuke Shigenaka, Yukihiro Tsuboshita, and Noriji Kato. Content-aware
multi-task neural networks for user gender inference based on social media
images. In Multimedia (ISM), 2016 IEEE International Symposium on, pages
169–172. IEEE, 2016.
[57] Sebastián Sierra-Loaiza and Fabio A. González. Combining textual and visual
representations for multimodal author profiling. In Patrice Bellot, Chiraz
Trabelsi, Josiane Mothe, Fionn Murtagh, Jian Yun Nie, Laure Soulier, Eric
Sanjuan, Linda Cappellato, and Nicola Ferro, editors, Experimental IR Meets
Multilinguality, Multimodality, and Interaction. Proceedings of the Ninth
International Conference of the CLEF Association (CLEF 2018), September
2018.
[58] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for
large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[59] Luka Stout, Robert Musters, and Chris Pool. Author profiling based on text and
images. In Patrice Bellot, Chiraz Trabelsi, Josiane Mothe, Fionn Murtagh,
Jian Yun Nie, Laure Soulier, Eric Sanjuan, Linda Cappellato, and Nicola Ferro,
editors, Experimental IR Meets Multilinguality, Multimodality, and Interaction.
Proceedings of the Ninth International Conference of the CLEF Association
(CLEF 2018), September 2018.
[60] Zehang Sun, George Bebis, Xiaojing Yuan, and Sushil J Louis. Genetic feature
subset selection for gender classification: A comparison study. In Applications of
Computer Vision, 2002.(WACV 2002). Proceedings. Sixth IEEE Workshop on,
pages 165–170. IEEE, 2002.
[61] Takumi Takahashi, Takuji Tahara, Koki Nagatani, Yasuhide Miura, Tomoki
Taniguchi, and Tomoko Ohkuma. Text and image synergy with feature cross
technique for gender identification. In Patrice Bellot, Chiraz Trabelsi, Josiane
Mothe, Fionn Murtagh, Jian Yun Nie, Laure Soulier, Eric Sanjuan, Linda
Cappellato, and Nicola Ferro, editors, Experimental IR Meets Multilinguality,
Multimodality, and Interaction. Proceedings of the Ninth International
Conference of the CLEF Association (CLEF 2018), September 2018.
[62] Eric S. Tellez, Sabino Miranda-Jiménez, Daniela Moctezuma, Mario Graff,
Vladimir Salgado, and José Ortiz-Bejar. Gender identification through
multi-modal tweet analysis using microtc and bag of visual words. In Patrice
Bellot, Chiraz Trabelsi, Josiane Mothe, Fionn Murtagh, Jian Yun Nie, Laure
Soulier, Eric Sanjuan, Linda Cappellato, and Nicola Ferro, editors, Experimental
IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the
Ninth International Conference of the CLEF Association (CLEF 2018),
September 2018.
[63] Engin Tola, Vincent Lepetit, and Pascal Fua. Daisy: An efficient dense descriptor
applied to wide-baseline stereo. IEEE transactions on pattern analysis and
machine intelligence, 32(5):815–830, 2010.
[64] Satoshi Tsutsui and David Crandall. Using artificial tokens to control languages
for multilingual image caption generation. arXiv preprint arXiv:1706.06275,
2017.
[65] Robert Veenhoven, Stan Snijders, Daniël van der Hall, and Rik van Noord. Using
translated data to improve deep learning author profiling models. In Patrice
Bellot, Chiraz Trabelsi, Josiane Mothe, Fionn Murtagh, Jian Yun Nie, Laure
Soulier, Eric Sanjuan, Linda Cappellato, and Nicola Ferro, editors, Experimental
IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the
Ninth International Conference of the CLEF Association (CLEF 2018),
September 2018.
[66] Pius von Däniken, Ralf Grubenmann, and Mark Cieliebak. Word unigram
weighing for author profiling at pan 2018. In Patrice Bellot, Chiraz Trabelsi,
Josiane Mothe, Fionn Murtagh, Jian Yun Nie, Laure Soulier, Eric Sanjuan, Linda
Cappellato, and Nicola Ferro, editors, Experimental IR Meets Multilinguality,
Multimodality, and Interaction. Proceedings of the Ninth International
Conference of the CLEF Association (CLEF 2018), September 2018.
[67] Edson Weren, Anderson Kauer, Lucas Mizusaki, Viviane Moreira, Palazzo
de Oliveira, and Leandro Wives. Examining multiple features for author
profiling. In Journal of Information and Data Management, pages 266–279,
2014.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Appendix A</title>
    </sec>
    <sec id="sec-7">
      <title>Pairwise Comparison of all Systems</title>
      <p>For all subsequent tables, the significance levels are encoded with the symbols =, *, **, and ***.</p>
      <p>Table A1. Significance of accuracy differences between system pairs. Textual modality in Arabic.</p>
      <p>Table A3. Significance of accuracy differences between system pairs. Combined modality in Arabic.</p>
      <p>Table A4. Significance of accuracy differences between system pairs. Textual modality in English.</p>
      <p>Table A6. Significance of accuracy differences between system pairs. Combined modality in English.</p>
      <p>Table A7. Significance of accuracy differences between system pairs. Textual modality in Spanish.</p>
      <p>Table A9. Significance of accuracy differences between system pairs. Combined modality in Spanish.</p>
    </sec>
    <sec id="sec-8">
      <title>Appendix B</title>
    </sec>
    <sec id="sec-9">
      <title>Bayesian Signed-Rank Test Among Systems</title>
      <p>[Tables: Bayesian Signed-Rank test between system pairs, with columns Team (A) and Team (B).]</p>
      <p>In Table A10 the correspondence between team names in TIRA and working notes
authors is presented.</p>
      <p>Working note author: Aragon &amp; Lopez; Bayot &amp; Gonçalves; Daneshvar; Garibo; Gopal-Patra et al.; Karlgren et al.; Ciccone et al.; López-Santillán et al.; Martinc et al.; Hacohen-Kerner et al. (A); Tellez et al.; Stout et al.; Raiyani et al.; Sandroni-Dias and Paraboni; Schaetti; Kosse et al.; Sierra-Loaiza &amp; González; Veenhoven et al.; Takahashi et al.; Sezerer et al.; Nieuwenhuis &amp; Wilkens; von Däniken et al.; Hacohen-Kerner et al. (B)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Kholoud</given-names>
            <surname>Alsmearat</surname>
          </string-name>
          , Mahmoud Al-Ayyoub, and
          <string-name>
            <given-names>Riyad</given-names>
            <surname>Al-Shalabi</surname>
          </string-name>
          .
          <article-title>An extensive study of the bag-of-words approach for gender identification of arabic articles</article-title>
          .
          <source>In 2014 IEEE/ACS 11th International Conference on Computer Systems and Applications (AICCSA)</source>
          , pages
          <fpage>601</fpage>
          -
          <lpage>608</lpage>
          . IEEE,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Kholoud</given-names>
            <surname>Alsmearat</surname>
          </string-name>
          , Mohammed Shehab, Mahmoud Al-Ayyoub,
          <string-name>
            <given-names>Riyad</given-names>
            <surname>Al-Shalabi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ghassan</given-names>
            <surname>Kanaan</surname>
          </string-name>
          .
          <article-title>Emotion analysis of arabic articles and its impact on identifying the author's gender</article-title>
          .
          <source>In 2015 IEEE/ACS 12th International Conference on Computer Systems and Applications (AICCSA)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Emad</given-names>
            <surname>AlSukhni</surname>
          </string-name>
          and
          <string-name>
            <given-names>Qasem</given-names>
            <surname>Alequr</surname>
          </string-name>
          .
          <article-title>Investigating the use of machine learning algorithms in detecting gender of the arabic tweet author</article-title>
          .
          <source>International Journal of Advanced Computer Science &amp; Applications</source>
          ,
          <volume>1</volume>
          (
          <issue>7</issue>
          ):
          <fpage>319</fpage>
          -
          <lpage>328</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>