=Paper=
{{Paper
|id=Vol-2125/invited_paper_15
|storemode=property
|title=Overview of the 6th Author Profiling Task at PAN 2018: Multimodal Gender Identification in Twitter
|pdfUrl=https://ceur-ws.org/Vol-2125/invited_paper_15.pdf
|volume=Vol-2125
|authors=Francisco Rangel,Paolo Rosso,Manuel Montes-y-Gómez,Martin Potthast,Benno Stein
|dblpUrl=https://dblp.org/rec/conf/clef/PardoRMPS18
}}
==Overview of the 6th Author Profiling Task at PAN 2018: Multimodal Gender Identification in Twitter==
Francisco Rangel^1,2, Paolo Rosso^2, Manuel Montes-y-Gómez^3, Martin Potthast^4, Benno Stein^5
^1 Autoritas Consulting, S.A., Spain
^2 PRHLT Research Center, Universitat Politècnica de València, Spain
^3 INAOE, Mexico
^4 Leipzig University, Germany
^5 Web Technology & Information Systems, Bauhaus-Universität Weimar, Germany
pan@webis.de  http://pan.webis.de
Abstract This overview presents the framework and the results of the Author
Profiling shared task at PAN 2018. The objective of this year’s task is to address
gender identification from a multimodal perspective, where not only texts but also
images are given. For this purpose a corpus with Twitter data has been provided,
covering the languages Arabic, English, and Spanish. Altogether, the approaches
of 23 participants are evaluated.
1 Introduction
Author profiling is the analysis of shared content in order to predict different attributes
of authors such as gender, age, personality, native language, or political orientation.
Supported by the huge amount of information available on social media platforms, author profiling has gained a lot of interest. Being able to infer an author's gender, age, native language, dialects, or personality opens a world of possibilities: in marketing, companies may analyze online reviews to improve targeted advertising; in forensics, the profile of an author may serve as valuable additional evidence in criminal investigations; and in security, knowing the demographics of social media users (age and gender), as well as cultural and social context such as native language and dialects, may help to identify potential terrorists [51].
In the following we provide a historical outline of previous editions of this task. In the Author Profiling task at PAN 2013^1 [45], the identification of age and gender relied on a large corpus collected from social media, both for English and Spanish. In PAN 2014^2 [46], we continued focusing on age and gender aspects but, in addition, compiled a corpus of four different genres, namely social media, blogs, Twitter, and hotel reviews. Except for the hotel review subcorpus, which was available for English only, all documents were provided in both English and Spanish. Note that most of the
^1 http://webis.de/research/events/pan-13/pan13-web/author-profiling.html
^2 http://webis.de/research/events/pan-14/pan14-web/author-profiling.html
existing research in computational linguistics [6] and social psychology [40] focuses
on the English language, and the question is whether the observed relations pertain to
other languages and genres as well. In this vein, in PAN 2015^3 [47], we included two new languages, Italian and Dutch, as well as a new subtask on personality recognition in Twitter. In PAN 2016^4 [50], we investigated the effect of cross-genre information: the models were trained on one genre (here: Twitter) and evaluated on a different one. In PAN 2017^5 [19], we considered language variety identification together with the gender dimension. We evaluated this new subtask in four languages: Arabic, English, Portuguese, and Spanish.
Social media data cover a wide range of modalities such as text, images, audio, and video, all of which contain useful information that can be exploited for extracting valuable insights from users. Consequently, the objective of this year's evaluation^6 is to address gender identification from a multimodal perspective: not only texts but also images are given. For this purpose, a corpus with Twitter data has been provided, covering the languages Arabic, English, and Spanish.
The remainder of this paper is organized as follows. Section 2 covers the state of the art, Section 3 describes the corpus and the evaluation measures, and Section 4 presents the approaches submitted by the participants. Sections 5 and 6 discuss the results and draw conclusions, respectively.
2 Related Work
The relationship between personal traits and the use of language has been widely studied by the psycholinguist Pennebaker [41]. He analysed how the use of language varies depending on personal traits. For example, with regard to the authors' gender, he found that in English women use more negations and first-person forms, because they are more self-conscious, whereas men use more prepositions in order to describe their environment. These findings are the basis of LIWC (Linguistic Inquiry and Word Count) [40], one of the most widely used tools in author profiling.
Initial investigations in author profiling [6, 26, 13, 28, 54] focused mainly on formal texts and blogs, with reported accuracies ranging from 75% to 80%. Nowadays, however, researchers focus mainly on social media, where the language is more spontaneous and less formal. The contributions of different researchers who used the PAN datasets should be highlighted. For example, the authors in [35] showed how to deal with a dataset as large as PAN-AP-2013, with 3 million features, in a MapReduce configuration. With the same dataset, the authors in [67] showed the contribution of information retrieval-based features. Following Pennebaker's findings on the relationship between emotions and gender, the authors in [44] proposed EmoGraph, a graph-based approach to capture how users convey verbal emotions in the morphosyntactic structure of the discourse; it achieved results competitive with the best performing systems at PAN 2013 and demonstrated the robustness of the approach against
^3 http://pan.webis.de/clef15/pan15-web/author-profiling.html
^4 http://pan.webis.de/clef16/pan16-web/author-profiling.html
^5 http://pan.webis.de/clef17/pan17-web/author-profiling.html
^6 https://pan.webis.de/clef18/pan18-web/author-profiling.html
genres and languages at PAN 2014 [43]. Recently, Bayot and Gonçalves [10] used the PAN-AP-2016 dataset to show that word embeddings worked better than TF-IDF for gender identification. Finally, it is worth mentioning the second order representation based on relationships between documents and profiles, used by the best performing team in three editions of PAN [30, 31, 4], as well as the performance of the combination of n-grams shown by the authors [9] of the best performing team at PAN 2017.
Research on Arabic is scarcer, and most of it focused on genres other than social media. For example, Estival et al. [18] focused on Arabic emails, reporting accuracies of 72.10%. Similarly, Alsmearat et al. [2] focused on Arabic newsletters. They initially reported an accuracy of 86.4%, which was increased to 94% in an extension of their work [1]. With respect to social media, AlSukhni & Alequr [3] focused on Arabic tweets and reported accuracies of 99.50%. They improved a bag-of-words model with the use of the Twitter authors' names.
The use of visual features for author profiling has been studied less. A common approach for gender identification is the use of frontal facial images [37, 60, 17]. The authors in [37] trained an SVM with 1,755 low-resolution thumbnail faces (21x12 pixels) from the FERET face database^7, obtaining an error of 3.4%. The authors in [60] used Principal Component Analysis to represent each image in a smaller dimensional space, reducing the error from 17.7% to 11.3% with a neural network. The authors in [17] experimented with 120 combinations of automatic face detection, face alignment, and gender classification. They found that automatic face alignment did not increase the gender classification rates, whereas manual alignment did. The authors evaluated several machine learning algorithms, obtaining the best results with SVM. They also observed that the classification did not depend on the size of the images. Recently, user-annotated data have been used more and more. For example, Twitter has been used as a repository to learn and evaluate gender identification systems. In this sense, the authors in [34] used automatic image annotations, and the authors in [56] proposed a Multi-task Bilinear Model that combines a visual concept detector with the feature extractor to predict gender in Twitter. Similarly, the authors in [8] used 56 image aesthetic features for gender identification in 24,000 images provided by 120 FlickR users, obtaining an accuracy of 82.50%.
3 Evaluation Framework
The purpose of this section is to introduce the technical background. We outline the
construction of the corpus, introduce the performance measures and baselines, and de-
scribe the idea of so-called software submissions.
3.1 Corpus
The focus of this year’s task is on gender identification in Twitter from a multimodal
perspective: besides textual information, the participants are provided also with images.
The task is framed as a multilingual task, covering the languages Arabic, English, and
Spanish.
^7 https://www.nist.gov/programs-projects/face-recognition-technology-feret
Table 1. Number of authors per language and subset. The corpus is balanced regarding gender
and contains 100 tweets and 10 images per author.
(AR) Arabic (EN) English (ES) Spanish Total
Training 1,500 3,000 3,000 7,500
Test 1,000 1,900 2,200 5,100
Total 2,500 4,900 5,200 12,600
The PAN-AP-2018 corpus is based on the PAN-AP-2017 corpus [49], extended by
images that have been shared in the respective Twitter timelines. More specifically,
PAN-AP-2018 contains those authors from the PAN-AP-2017 corpus who still have a
Twitter account and who have shared at least 10 images. Table 1 overviews the key
figures of the corpus. Moreover, the corpus is balanced with regard to gender and it
contains 100 tweets per author.
3.2 Performance Measures
The participants were asked to submit three predictions per author, according to the following modalities: a) text-based, b) image-based, and c) a combination of both. Participants were allowed to approach the task in their favoured language and modality; however, we encouraged them to participate in all languages and all modalities.^8
For each language and for each modality the accuracy was computed. Note that the
accuracy of the combined approach has been chosen as overall accuracy for the given
language; if only the textual approach was submitted, its accuracy has been used. The
final ranking has been calculated as the average accuracy per language as defined by
the following equation:
ranking = (acc_ar + acc_en + acc_es) / 3    (1)
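For concreteness, a minimal sketch of this computation in Python (the function name is ours; the example values are the overall winner's per-language accuracies from Table 8):

```python
# Minimal sketch of Equation (1): the final ranking score is the average of
# the per-language accuracies of the combined (or, failing that, textual)
# approach. Example values: the overall winner in Table 8.
def ranking_score(acc_ar: float, acc_en: float, acc_es: float) -> float:
    return (acc_ar + acc_en + acc_es) / 3

print(round(ranking_score(0.7850, 0.8584, 0.8159), 4))  # 0.8198
```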
3.3 Baselines
In order to assess the complexity of the subtasks per language and to compare the performances of the participants' approaches, we propose the following baselines:
– BASELINE-stat. A statistical baseline that emulates random choice. As there are
two classes and the number of instances is balanced, the random choice baseline is
50% accuracy. This baseline applies for both modalities, images and texts.
– BASELINE-bow. To approach the textual modality, we have represented the documents under a bag-of-words model with the 5,000 most common words in the training set, weighted by absolute frequency. The texts are preprocessed as follows: lowercasing, removal of punctuation signs and numbers, and removal of stop words for the corresponding language. (A minimal sketch of this and the next baseline is given after this list.)
^8 From the 23 participants, 22 participated in the Arabic and Spanish tasks, and all of them in the English tasks. All of them approached the task with text features; 12 participants also used images.
– BASELINE-rgb. To approach the image modality, we represent the photos as follows. For each author, we obtain the RGB color of each pixel in his/her photos. We represent the author with the following descriptive statistics of the RGB values: minimum, maximum, mean, median, and standard deviation.
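The following sketch illustrates both non-trivial baselines; the vectorizer settings, token pattern, and helper names are our own assumptions, not the organizers' actual implementation.

```python
# Sketches of BASELINE-bow and BASELINE-rgb (illustrative only).
import numpy as np
from PIL import Image
from sklearn.feature_extraction.text import CountVectorizer

# BASELINE-bow: 5,000 most frequent training words, absolute-frequency
# weights. Lowercasing is CountVectorizer's default; a letters-only token
# pattern drops punctuation and numbers. Stop words are language-specific
# ("english" is an assumption for the English subtask).
bow = CountVectorizer(max_features=5000, lowercase=True,
                      token_pattern=r"(?u)\b[^\W\d_]+\b",
                      stop_words="english")

# BASELINE-rgb: descriptive statistics over all RGB pixel values of an
# author's photos (min, max, mean, median, std per channel).
def rgb_features(image_paths):
    pixels = np.concatenate([
        np.asarray(Image.open(p).convert("RGB")).reshape(-1, 3)
        for p in image_paths])
    return np.concatenate([pixels.min(0), pixels.max(0), pixels.mean(0),
                           np.median(pixels, 0), pixels.std(0)])
```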
3.4 Software Submissions
We asked for software submissions (as opposed to run submissions). With software submissions, participants submit executables of their author profiling software instead of just the output (the so-called "run") of their software on a given test set. Our rationale for doing so is to increase the sustainability of our shared task and to allow for the re-evaluation of approaches to Author Profiling later on, in particular on future evaluation corpora. To facilitate software submissions, we developed the TIRA experimentation platform [21, 22], which renders the handling of software submissions at scale as simple as handling run submissions. Using TIRA, participants deploy their software on virtual machines at our site, which allows us to keep them in a running state [23].
4 Overview of the Submitted Approaches
This year, 23 teams participated in the Author Profiling shared task and 22 of them submitted a notebook paper.^9 We analyse their approaches from three perspectives: preprocessing, features used to represent the authors' texts, and classification approaches.
4.1 Preprocessing
Various participants cleaned the textual content to obtain plain text. Most of them removed or normalised Twitter-specific elements such as URLs, user mentions, or hashtags [15, 61, 59, 42, 53, 24, 66, 36, 65, 38, 29]. Some participants also lowercased the words [66, 65, 38, 11, 29, 59, 53, 24]. The authors in [15, 59, 24, 65] removed punctuation signs; character flooding has been removed by the authors in [15, 42]. Stopwords have been removed by the authors in [15, 42, 24, 65], and contractions and abbreviations have been expanded by the authors in [59, 42]. The authors in [15] applied specific preprocessing to Arabic texts, such as normalisation and diacritics removal.
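To make these steps concrete, a hedged sketch of such a cleaning pipeline follows; the regular expressions and placeholder tokens are illustrative and do not reproduce any particular team's choices.

```python
import re

def clean_tweet(text: str) -> str:
    """Typical normalisation steps reported by the participants."""
    text = re.sub(r"https?://\S+", " <url> ", text)   # normalise URLs
    text = re.sub(r"@\w+", " <user> ", text)          # normalise mentions
    text = re.sub(r"#(\w+)", r" \1 ", text)           # keep the hashtag word
    text = re.sub(r"(\w)\1{2,}", r"\1\1", text)       # character flooding
    text = text.lower()                               # lowercasing
    text = re.sub(r"[^\w\s<>]", " ", text)            # punctuation signs
    return re.sub(r"\s+", " ", text).strip()
```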
Only three participants preprocessed the images. The authors in [61] applied direct resizing and resizing with cropping, as well as normalisation by subtracting the average RGB value per language. The authors in [36] rescaled all images to 64x64 and used only those containing human faces, while the authors in [57] rescaled all images to a width of 224 pixels, maintaining the aspect ratio.
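A small sketch of the two reported resizing strategies, assuming Pillow (the function names are ours):

```python
from PIL import Image

def rescale_fixed(path: str) -> Image.Image:
    # fixed 64x64 rescaling, as reported in [36]
    return Image.open(path).convert("RGB").resize((64, 64))

def rescale_to_width(path: str, width: int = 224) -> Image.Image:
    # rescale to a fixed width, keeping the aspect ratio, as in [57]
    img = Image.open(path).convert("RGB")
    height = max(1, round(img.height * width / img.width))
    return img.resize((width, height))
```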
4.2 Features
In previous editions of the author profiling task at PAN, as well as in the literature referred to above, features used for representing text documents have been distinguished as either content-based or style-based. However, this year several participants have employed deep learning techniques. It is interesting to differentiate between traditional features and these new methods in order to compare their performance in the author profiling task.
^9 Hacohen-Kerner et al. described in their working note the participation of two teams.
While the authors in [36, 65, 11, 33, 61] represented documents with word embeddings, the authors in [53] used character embeddings. Moreover, the authors in [59, 52, 33] also used traditional features such as character, word, and/or POS n-grams. The authors in [39] combined word embeddings with stylistic features for English; however, for Spanish and Arabic they used LSA instead of word embeddings.
Traditional features such as character and word n-grams have been widely used [66, 62, 38, 29, 16, 24, 59, 15]. Style features have also been used by some participants [39, 27, 24]. For example, the authors in [39] used the counts of stopwords, punctuation marks, emoticons, and slang words (only for English). The authors in [27] combined POS tag n-grams with syntactic dependencies to model the use of amplifiers, verbal constructions, pronouns, subjects and objects, types of adverbials, as well as the use of interjections and profanity. The authors in [24] counted the average number of characters and the average number of words per tweet. The authors in [66] also used emojis, whereas the authors in [20] used only the skewness calculated from a variation of the Low Dimensionality Statistical Embedding (LDSE) [48]. The authors in [5] combined ensembles of word and character n-grams with bag-of-terms and second order features [30, 31, 32], which relate documents to authors' profiles.
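As an illustration of this family of models, a minimal character/word n-gram pipeline with an SVM is sketched below; the n-gram ranges and hyperparameters are placeholders rather than those of any participant.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# One document per author (e.g. the concatenation of their 100 tweets).
model = Pipeline([
    ("ngrams", FeatureUnion([
        ("char", TfidfVectorizer(analyzer="char", ngram_range=(2, 4))),
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ])),
    ("svm", LinearSVC()),
])
# model.fit(train_docs, train_genders); model.predict(test_docs)
```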
With respect to the representation of images, several approaches have been presented. For example, some participants tried to detect faces in the images [59, 15, 65]. In this regard, the authors in [65] used face vectors from images that contained only faces. Besides faces, the authors in [15] also detected objects and quantified local binary patterns and color histograms. Other authors used image resources, such as [39], who applied an image captioning system [64]. Similarly, the authors in [38] used a known image feature extraction tool [7] to obtain features about the number of faces in the images, as well as the expressed emotions or their gender. The authors in [5] used ImageNet [58] to obtain VGG16^10 features, and the authors in [53] built a language-independent model with TorchVision.^11 The authors in [61] also used a pre-trained Convolutional Neural Network (CNN) based on VGG16. Other participants approached the task with their own set of features, such as the authors in [24], who combined three sets of characteristics: SIFT, RGB histograms, and VGG. The authors in [62] designed a variant of the Bag-of-Visual-Words (BoVW) by using the DAISY [63] feature descriptor and encoded the images by the set of visual words.
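For the VGG16-based representations mentioned above, a hedged sketch using a pre-trained torchvision model; the ImageNet normalisation constants are the standard ones and not necessarily those used by the teams.

```python
import torch
from PIL import Image
from torchvision import models, transforms

vgg = models.vgg16(pretrained=True).eval()   # pre-trained on ImageNet
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def vgg16_features(path: str) -> torch.Tensor:
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return vgg.features(x).flatten(1)    # convolutional feature map
```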
4.3 Classification Approaches
Regarding the deep learning approaches, the authors with the overall highest accuracy [61] used Recurrent Neural Networks (RNN) for texts and CNN for images. CNNs have also been used by the authors in [5, 53, 55, 36], while RNNs have also been used by the authors in [11]. Interestingly, the authors in [53] used CNN only for texts and ResNet18 [25] for images. In the same vein, the authors in [65] approached the images
^10 Visual Geometry Group: http://www.robots.ox.ac.uk/~vgg/research/very_deep
^11 https://pytorch.org/docs/stable/torchvision/index.html
with SVM but used Bi-LSTM for texts. The authors in [59] used CNN for images and
an ensemble of Naive Bayes and RNN for texts. Finally, the authors in [42] approached
the task with dense neural networks.
Some participants still used traditional machine learning algorithms such as logistic
regression [52, 24, 66, 38], SVMs [33, 5, 15, 39, 62, 65], multilayer perceptron [24],
a basic feed-forward network [29], and distance-based methods [62, 27]. It is worth mentioning the approach in [20], which used a simple IF condition on a single feature, allowing the system to process the whole dataset in seconds while achieving decent performance.
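Since the textual and visual classifiers are typically trained separately, their predictions must eventually be fused. A minimal sketch of a late fusion by convex linear combination (in the spirit of the approach in [62] described in Section 5.4; the weight alpha is a placeholder to be tuned on held-out data):

```python
import numpy as np

def fuse(p_text: np.ndarray, p_image: np.ndarray, alpha: float = 0.7):
    """p_text, p_image: per-author P(female) scores in [0, 1]."""
    p = alpha * p_text + (1.0 - alpha) * p_image
    return np.where(p >= 0.5, "female", "male")
```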
5 Evaluation and Discussion of the Submitted Approaches
Although we encouraged participants to consider both modalities, some approached the problem with text features only. We present the results separately to account for this fact.
5.1 Gender Identification with Text Features
As can be seen in Table 2, the best results were obtained for English (82.21%) [16] and Spanish (82%) [16], although these are only slightly better than the best result for Arabic (81.70%) [62]. This similarity is also reflected by the mean accuracies, which are 74.85% for Arabic, 76.93% for English, and 75.46% for Spanish. A closer look at the distributions (Figure 1) shows a different characteristic for English: the median is higher and approximately equal to the Q3 of the other languages, while the interquartile range is smaller. The similarity in the mean value is due to two outliers (55.21% [27] and 66.58% [52]). This fact is highlighted in the density chart (Figure 2), where the curve for the English language is more skewed to the right and the kurtosis is higher, since there are more results concentrated around 80%.
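The descriptive statistics reported at the bottom of Tables 2, 3, and 8 can be reproduced along the following lines; the use of the Shapiro-Wilk test for the normality p-value is our assumption, as the paper does not name the test.

```python
import numpy as np
from scipy import stats

def describe(accuracies):
    a = np.asarray(accuracies)
    return {
        "min": a.min(), "q1": np.percentile(a, 25), "median": np.median(a),
        "mean": a.mean(), "sdev": a.std(ddof=1),
        "q3": np.percentile(a, 75), "max": a.max(),
        "skewness": stats.skew(a),
        "kurtosis": stats.kurtosis(a, fisher=False),  # normal dist. ~ 3
        "normality_p": stats.shapiro(a)[1],           # assumed test
    }
```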
The best result for Arabic (81.70%) is from the authors in [62]; they performed several preprocessing steps and trained an SVM with word n-grams, character n-grams, and skip-grams of different lengths and different weighting schemes such as boolean, tf, and tf-idf. The differences with respect to the second (81.20%) [57] and third (80.90%) [16] best results are not statistically significant. These authors approached the task with character n-grams and combinations of different types of n-grams. The best result for English (82.21%) comes from the authors in [16]. The differences with respect to the second (81.21%) [62] and third (81.16%) [38] best results are not statistically significant. The authors in [38] used Logistic Regression with word and character n-grams. Finally, for Spanish, the best result (82%) is from the authors in [16]. Again, the differences with respect to the second (80.36%) [65] and third (80.27%) [38] best systems are not statistically significant. The authors in [65] used a bi-LSTM with pre-trained word embeddings.
With respect to the provided baselines, we can discard the statistical one, since its results are much lower than those obtained by the participants. The BOW baseline is at rank 17 out of 22 in the overall ranking.^12 Furthermore, for Arabic the obtained result
^12 The system of Karlgren et al. is not counted since they participated in the English task only.
Table 2. Accuracy per language in the gender identification task with text features.
Ranking Team Arabic English Spanish Average
1 Daneshvar 0.8090 0.8221 0.8200 0.8170
2 Tellez et al. 0.8170 0.8121 0.8005 0.8099
3 Nieuwenhuis & Wilkens 0.7830 0.8116 0.8027 0.7991
4 Sierra-Loaiza & González 0.8120 0.8011 0.7827 0.7986
5 Ciccone et al. 0.7910 0.8074 0.7959 0.7981
6 Kosse et al. 0.7920 0.8074 0.7918 0.7971
7 Takahashi et al. 0.7710 0.7968 0.7864 0.7847
8 Veenhoven et al. 0.7490 0.7926 0.8036 0.7817
9 Martinc et al. 0.7760 0.7900 0.7782 0.7814
10 López-Santillán et al. 0.7760 0.7847 0.7677 0.7761
11 Hacohen-Kerner et al. (B) 0.7590 0.7911 0.7650 0.7717
12 Hacohen-Kerner et al. (A) 0.7590 0.7911 0.7650 0.7717
13 Stout et al. 0.7600 0.7853 0.7405 0.7619
14 Gopal-Patra et al. 0.7430 0.7558 0.7586 0.7525
15 von Däniken et al. 0.7320 0.7742 0.7464 0.7509
16 Schaetti 0.7390 0.7711 0.7359 0.7487
baseline-bow 0.7480 0.7411 0.7255 0.7382
17 Aragon & Lopez 0.6480 0.7963 0.7686 0.7376
18 Bayot & Gonçalves 0.6760 0.7716 0.6873 0.7116
19 Garibo 0.6750 0.7363 0.7164 0.7092
20 Sezerer et al. 0.6920 0.7495 0.6655 0.7023
21 Raiyani et al. 0.7220 0.7279 0.6436 0.6978
22 Sandroni-Dias & Paraboni 0.6870 0.6658 0.6782 0.6770
baseline-stats 0.5000 0.5000 0.5000 0.5000
23 Karlgren et al. - 0.5521 - -
Min 0.6480 0.5521 0.6436 0.6770
Q1 0.7245 0.7634 0.7370 0.7404
Median 0.7590 0.7900 0.7663 0.7717
Mean 0.7485 0.7693 0.7546 0.7608
SDev 0.0480 0.0586 0.0487 0.0399
Q3 0.7812 0.7990 0.7904 0.7940
Max 0.8170 0.8221 0.8200 0.8170
Skewness -0.5191 -2.5275 -0.8785 -0.5855
Kurtosis 2.2985 9.5425 2.7640 2.2513
Normality (p-value) 0.4126 0.0006 0.0757 0.1942
(74.80%) is very close to the mean (74.85%), while 9 participants are below. For English
and Spanish, most participants were better than the baseline. For English, the obtained
result (74.11%) is lower than the mean (76.93%) and even lower than the Q1 (76.34%),
with 4 participants below (including the aforementioned outliers [27, 52]). For Spanish,
the obtained result (72.55%) is below the mean (75.46%) and the Q1 (73.70%), with
5 participants below (including one outlier).
Figure 1. Distribution of the results for gender identification in the different languages when using text features only.
Figure 2. Density of the results for gender identification in the different languages.
5.2 Gender Identification with Images
As can be seen in Table 3, the best results were achieved for English (81.63%), with statistical significance over Spanish (77.32%) and Arabic (77.20%). All best results stem from the authors in [61], who used a pre-trained CNN on the basis of ImageNet. Despite this higher value for the best obtained result for English, the distributions of accuracies are very similar for the three languages, as can be seen in Figures 3 and 4. The mean values are 62.37%, 63.41%, and 61.86% for Arabic, English, and Spanish, respectively, with standard deviations below 10% and following a normal distribution.
For Arabic, the second best result (72.80%) has been obtained by the authors in [57], who used VGG16 and ResNet50 from ImageNet. The third best result (70.10%) has been obtained by the authors in [15], who, besides color histograms, detected faces, objects, and local binary patterns. Although the difference between these two results is not statistically significant at 95% confidence, their difference with respect to the best result is (though not at 99%). For English, the second (74.42%) and third (69.63%) best results are from the authors in [57] and [15], respectively. In both cases the difference is statistically significant. Similarly, for Spanish, the second (71%) and third (68.05%) best results are from the authors in [57] and [15], respectively. Again, the difference is statistically significant.
As before, we can discard the statistical baseline. Similarly, most of the participants achieved better results than the RGB baseline (52.60% on average); two participants achieved slightly lower results (50.23% and 50.22%) [24]. For all languages, the baseline (54.10%, 51.79%, and 51.91%) is below the respective Q1s (55.57%, 56.89%, and 56.40%). Also note that this baseline is only slightly better than the statistical one, which shows that it is not suitable for the task.
Table 3. Accuracy per language in the gender identification task with images.
Ranking Team Arabic English Spanish Average
1 Takahashi et al. 0.7720 0.8163 0.7732 0.7872
2 Sierra-Loaiza & González 0.7280 0.7442 0.7100 0.7274
3 Ciccone et al. 0.7010 0.6963 0.6805 0.6926
4 Aragon & Lopez 0.6800 0.6921 0.6668 0.6796
5 Gopal-Patra et al. 0.6570 0.6747 0.6918 0.6745
6 Stout et al. 0.6230 0.6584 0.6232 0.6349
7 Nieuwenhuis & Wilkens 0.6230 0.6100 0.5873 0.6068
8 Tellez et al. 0.5900 0.5468 0.5691 0.5686
9 Schaetti 0.5430 0.5763 0.5782 0.5658
10 Martinc et al. 0.5600 0.5826 0.5486 0.5637
baseline-rgb 0.5410 0.5179 0.5191 0.5260
11 Hacohen-Kerner et al. (B) 0.5100 0.4942 0.5027 0.5023
12 Hacohen-Kerner et al. (A) 0.4970 0.5174 0.4923 0.5022
baseline-stats 0.5000 0.5000 0.5000 0.5000
Min 0.4970 0.4942 0.4923 0.5022
Q1 0.5557 0.5689 0.5640 0.5653
Median 0.6230 0.6342 0.6052 0.6209
Mean 0.6237 0.6341 0.6186 0.6255
SDev 0.0873 0.0964 0.0869 0.0893
Q3 0.6853 0.6932 0.6833 0.6828
Max 0.7720 0.8163 0.7732 0.7872
Skewness 0.1079 0.2716 0.1528 0.1984
Kurtosis 1.9374 2.2109 2.0109 2.0636
Normality (p-value) 0.9836 0.9031 0.7356 0.5964
Figure 3. Distribution of the results for gender identification in the different languages when using images only.
Figure 4. Density of the results for gender identification in the different languages.
5.3 Combined Approaches
We now analyse how images can help to tackle the gender identification task. Table 4 shows the basic statistics of the improvement (in %) for the different languages. On average, the improvement is very small (0.76% and 1.01% for Arabic and English, respectively), or even negative (-0.06%) for Spanish. However, as Figure 5 shows, some systems perform much better, such as that of Takahashi et al., who achieved an improvement of 7.73% for English.
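The improvement figures in Tables 4 to 7 are relative to the text-only accuracy, as the following snippet illustrates with the values of Takahashi et al. for English:

```python
def improvement(acc_text: float, acc_combined: float) -> float:
    # relative improvement over the text-only system, in percent
    return (acc_combined - acc_text) / acc_text * 100

print(improvement(0.7968, 0.8584))  # Takahashi et al., English: ~7.7309
```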
Table 4. Distribution of the improvement over text classification in the different languages.
Arabic English Spanish
Min -0.2635 -0.6526 -4.4717
Q1 -0.0616 -0.0647 -0.6613
Median 0.3185 0.4249 0.0257
Mean 0.7613 1.0102 -0.0609
SDev 1.2513 2.2473 1.9087
Q3 0.8487 0.6788 0.4898
Max 3.3647 7.7309 3.7513
Skewness 1.2095 2.4716 -0.3778
Kurtosis 2.9616 8.0027 4.4883
Normality (p-value) 0.0010 0.0000 0.1316
Figure 5. Distribution of the percentage of improvement over text classification.
Tables 5, 6, and 7 show the accuracies obtained with texts, with images, and with their combination, as well as the percentage of improvement, for Arabic, English, and Spanish, respectively. Similarly, Figures 6, 7, and 8 show, for the same languages, the density of the improvement distribution over text classification.
Table 5 shows the results for Arabic. As can be seen in Figure 6, the results do not follow a normal distribution; the improvement of most of the participants lies between -0.26% and 0.53%, whereas three teams obtained higher improvements: 1.82% [61], 2.93% [5], and 3.36% [39]. It is noteworthy that the systems that obtained the highest results tried to capture semantic features from images, and not only faces or colors. For example, Gopal-Patra et al. [39] used an image captioning system [64], Aragon & Lopez [5] used ImageNet to obtain VGG16 features, and Takahashi et al. [61] used a pre-trained CNN, also on the basis of ImageNet.
Table 5. Improvement over text classification for Arabic.
Team Texts Images Combined Improvement
Gopal-Patra et al. 0.7430 0.6570 0.7680 3.3647%
Aragon & Lopez 0.6480 0.6800 0.6670 2.9321%
Takahashi et al. 0.7710 0.7720 0.7850 1.8158%
Stout et al. 0.7600 0.6230 0.7640 0.5263%
Nieuwenhuis & Wilkens 0.7830 0.6230 0.7870 0.5109%
Ciccone et al. 0.7910 0.7010 0.7940 0.3793%
Martinc et al. 0.7760 0.5600 0.7780 0.2577%
Tellez et al. 0.8170 0.5900 0.8180 0.1224%
Schaetti 0.7390 0.5430 0.7390 0.0000%
Sierra-Loaiza & González 0.8120 0.7280 0.8100 -0.2463%
Hacohen-Kerner et al. (B) 0.7590 0.5100 0.7570 -0.2635%
Hacohen-Kerner et al. (A) 0.7590 0.4970 0.7570 -0.2635%
Figure 6. Density of the distribution of improvement over text classification for Arabic.
The distribution of improvements for English is even less normal, as can be seen in Figure 7. There are three groups of systems (see Table 6): i) systems with improvements between 0.72% and deteriorations of -0.65%, ii) one system with an improvement of 2.37% [39], and iii) one system with an improvement of 7.73% [61]. Similar to Arabic, the best results have been achieved by systems that exploit semantic features [61, 39]. Among the rest, the best results have been achieved either with the use of ImageNet and VGG16 features [5] or with the combination of face recognition, object detection, local binary patterns, and color histograms [15].
Table 6. Improvement over text classification for English.
Team Texts Images Combined Improvement (%)
Takahashi et al. 0.7968 0.8163 0.8584 7.7309
Gopal-Patra et al. 0.7558 0.6747 0.7737 2.3684
Ciccone et al. 0.8074 0.6963 0.8132 0.7184
Aragon & Lopez 0.7963 0.6921 0.8016 0.6656
Sierra-Loaiza & González 0.8011 0.7442 0.8063 0.6491
Hacohen-Kerner et al. (A) 0.7911 0.5174 0.7947 0.4551
Stout et al. 0.7853 0.6584 0.7884 0.3948
Martinc et al. 0.7900 0.5826 0.7926 0.3291
Schaetti 0.7711 0.5763 0.7711 0.0000
Nieuwenhuis & Wilkens 0.8116 0.6100 0.8095 -0.2587
Hacohen-Kerner et al. (B) 0.7911 0.4942 0.7889 -0.2781
Tellez et al. 0.8121 0.5468 0.8068 -0.6526
Figure 7. Density of the distribution of improvement over text classification for English.
For Spanish, the systems' improvements follow a normal distribution with spikes at both extremes. In particular, there is i) one system whose deterioration is -4.47% [57], ii) a group of systems with improvement/deterioration between -1.30% and 1.62%, and iii) one system with an improvement of 3.75% [61]. In this regard, the best result has been obtained by Takahashi et al. with a pre-trained CNN from ImageNet, followed by the use of an image captioning system [39], the combination of faces, objects, and local binary patterns with color histograms [15], and the use of ImageNet to obtain VGG16 features [5].
Table 7. Improvement over text classification for Spanish.
Team Texts Images Combined Improvement (%)
Takahashi et al. 0.7864 0.7732 0.8159 3.7513
Gopal-Patra et al. 0.7586 0.6918 0.7709 1.6214
Ciccone et al. 0.7959 0.6805 0.8000 0.5151
Aragon & Lopez 0.7686 0.6668 0.7723 0.4814
Stout et al. 0.7405 0.6232 0.7432 0.3646
Martinc et al. 0.7782 0.5486 0.7786 0.0514
Schaetti 0.7359 0.5782 0.7359 0.0000
Hacohen-Kerner et al. (A) 0.7650 0.4923 0.7623 -0.3529
Tellez et al. 0.8005 0.5691 0.7955 -0.6246
Hacohen-Kerner et al. (B) 0.7650 0.5027 0.7591 -0.7712
Nieuwenhuis & Wilkens 0.8027 0.5873 0.7923 -1.2956
Sierra-Loaiza & González 0.7827 0.7100 0.7477 -4.4717
Figure 8. Density of the distribution of improvement over text classification for Spanish.
5.4 Final Ranking and Best Results
This year, 23 teams participated in the shared task; Table 8 shows the overall performance per language and the teams' ranking. The best results have been obtained for English (85.84%), followed by Spanish (82%) and Arabic (81.80%).
Table 8. Accuracy per language and global ranking as average per language.
Ranking Team Arabic English Spanish Average
1 Takahashi et al. 0.7850 0.8584 0.8159 0.8198
2 Daneshvar 0.8090 0.8221 0.8200 0.8170
3 Tellez et al. 0.8180 0.8068 0.7955 0.8068
4 Ciccone et al. 0.7940 0.8132 0.8000 0.8024
5 Kosse et al. 0.7920 0.8074 0.7918 0.7971
6 Nieuwenhuis & Wilkens 0.7870 0.8095 0.7923 0.7963
7 Sierra-Loaiza & González 0.8100 0.8063 0.7477 0.7880
8 Martinc et al. 0.7780 0.7926 0.7786 0.7831
9 Veenhoven et al. 0.7490 0.7926 0.8036 0.7817
10 López-Santillán et al. 0.7760 0.7847 0.7677 0.7761
11 Hacohen-Kerner et al. (A) 0.7570 0.7947 0.7623 0.7713
12 Gopal-Patra et al. 0.7680 0.7737 0.7709 0.7709
13 Hacohen-Kerner et al. (B) 0.7570 0.7889 0.7591 0.7683
14 Stout et al. 0.7640 0.7884 0.7432 0.7652
15 von Däniken et al. 0.7320 0.7742 0.7464 0.7509
16 Schaetti 0.7390 0.7711 0.7359 0.7487
17 Aragon & Lopez 0.6670 0.8016 0.7723 0.7470
18 Bayot & Gonçalves 0.6760 0.7716 0.6873 0.7116
19 Garibo 0.6750 0.7363 0.7164 0.7092
20 Sezerer et al. 0.6920 0.7495 0.6655 0.7023
21 Raiyani et al. 0.7220 0.7279 0.6436 0.6978
22 Sandroni-Dias & Paraboni 0.6870 0.6658 0.6782 0.6770
23 Karlgren et al. - 0.5521 - -
Min 0.6670 0.5521 0.6436 0.6770
Q1 0.7245 0.7713 0.7377 0.7474
Median 0.7605 0.7889 0.7650 0.7711
Mean 0.7515 0.7735 0.7543 0.7631
SDev 0.0471 0.0614 0.0493 0.0409
Q3 0.7865 0.8065 0.7922 0.7942
Max 0.8180 0.8584 0.8200 0.8198
Skewness -0.4908 -2.2563 -0.7807 -0.6090
Kurtosis 2.0346 8.7093 2.6912 2.3341
Normality (p-value) 0.3490 0.0002 0.3341 0.1717
The overall best result (81.98%) is from the authors in [61], who approached the task with deep neural networks. For text processing, they learned word embeddings from a stream of tweets with FastText skip-grams and trained a Recurrent Neural Network. For images, they used a pre-trained Convolutional Neural Network. They combined both approaches with a fusion component. The authors in [16] obtained the second best result on average (81.70%) by approaching the task from the textual perspective only. They used an SVM with different types of word and character n-grams. The third best overall result (80.68%) stems from the authors in [62]. They used an SVM with combinations of word and character n-grams for texts and a variant of the Bag of Visual Words for images, combining both predictions with a convex linear combination. According to Student's t-test, there is no statistically significant difference among the three approaches. This is also supported by the Bayesian Signed-Rank test [12] between Takahashi et al. and Daneshvar, as shown in Figure 9. However, for Takahashi et al. and Tellez et al., the probability of the first system performing better (62.96%) is higher than the sum of the probabilities of it being equal (20.64%) or worse (16.39%), as shown in Figure 10. The complete results of this test are presented in Appendix B.
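The paper does not detail the setup of the t-test; one plausible reading, consistent with the reported outcome, is a paired test over the three per-language accuracies of Table 8 (a sketch, not necessarily the organizers' procedure):

```python
from scipy import stats

takahashi = [0.7850, 0.8584, 0.8159]  # AR, EN, ES (Table 8)
daneshvar = [0.8090, 0.8221, 0.8200]

# With only three languages the test has little power, so a large
# p-value (no significant difference) is expected here.
t, p = stats.ttest_rel(takahashi, daneshvar)
print(p)
```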
Figure 9. Bayesian Signed-Rank Test between Takahashi et al. and Daneshvar. P(A>B)=0.3416; P(A=B)=0.3191; P(A<B)=0.3393.
Figure 10. Bayesian Signed-Rank Test between Takahashi et al. and Tellez et al. P(A>B)=0.6296; P(A=B)=0.2064; P(A<B)=0.1639.
Appendix A Significance of Accuracy Differences Between System Pairs
In Tables A1-A9, = denotes p > 0.05 (not significant), * 0.05 ≥ p > 0.01 (significant), ** 0.01 ≥ p > 0.001 (very significant), and *** p ≤ 0.001 (highly significant). Rows and columns list the teams in the same alphabetical order; each row shows the comparisons with the teams that follow it (upper triangle), so the first entry of the first row compares that team with the second one, and so on.
Aragon = *** *** = *** *** *** *** *** *** *** *** * *** * *** *** *** *** *** ***
Bayot *** *** = *** *** *** *** *** *** *** ** = *** = *** *** *** *** *** ***
Ciccone = *** *** * * = = = = *** *** *** *** = * = * ** ***
Daneshvar *** *** *** *** = ** ** ** *** *** *** *** = *** *** = *** ***
Garibo *** *** *** *** *** *** *** * = *** = *** *** *** *** *** **
Gopal = = *** * * ** = ** = ** *** = * *** = =
Hacohen-Kerner (A) = ** = = = * *** = *** *** = = *** = =
Hacohen-Kerner (B) ** = = = * *** = *** *** = = *** = =
Kosse = = = *** *** *** *** = * = * ** ***
Lopez-Santillan = = *** *** * *** ** = = ** = **
Martinc = *** *** * *** ** = = ** = **
Nieuwenhuis *** *** *** *** ** = = ** * ***
Raiyani * = = *** ** ** *** = =
Sandroni-Dias ** = *** *** *** *** *** **
Schaetti ** *** = * *** = =
Sezerer *** *** *** *** ** *
Sierra-Loaiza *** ** = *** ***
Stout = *** = *
Takahashi *** = **
Tellez *** ***
Veenhoven =
Von-Daniken
Table A1. Significance of accuracy differences between system pairs. Textual modality in Arabic.
Aragon = = *** *** *** ** *** ** ** *** ***
Ciccone * *** *** *** *** *** = *** *** ***
Gopal *** *** *** = *** *** = *** **
Hacohen-Kerner (A) = ** *** = *** *** *** ***
Hacohen-Kerner (B) * *** = *** *** *** ***
Martinc ** = *** ** *** =
Nieuwenhuis *** *** = *** =
Schaetti *** *** *** *
Sierra-Loaiza *** *** ***
Stout *** =
Takahashi ***
Tellez
Table A2. Significance of accuracy differences between system pairs. Image modality in Arabic.
Aragon *** *** *** *** *** *** *** *** *** *** ***
Ciccone = ** ** = = *** = * = =
Gopal = = = = = ** = = ***
Hacohen-Kerner (A) = = * = *** = = ***
Hacohen-Kerner (B) = * = *** = = ***
Martinc = ** * = = **
Nieuwenhuis *** = = = *
Schaetti *** = ** ***
Sierra-Loaiza ** = =
Stout = ***
Takahashi **
Tellez
Table A3. Significance of accuracy differences between system pairs. Combined modality in Arabic.
Aragon ** = ** *** *** = = *** = = = = *** *** * *** = = = = = *
Bayot *** *** ** = = = *** *** = = *** *** *** = * ** = ** *** * =
Ciccone * *** *** = = *** = ** * = *** *** *** *** = * = = = ***
Daneshvar *** *** *** *** *** = *** *** = *** *** *** *** * *** ** = *** ***
Garibo = *** *** *** *** *** *** *** = *** ** = *** *** *** *** *** ***
Gopal ** ** *** *** ** ** *** ** *** = = *** ** *** *** *** =
Hacohen-Kerner (A) = *** = = = * *** *** = *** = = = * = =
Hacohen-Kerner (B) *** = = = * *** *** * *** = = = * = =
Karlgren *** *** *** *** *** *** *** *** *** *** *** *** *** ***
Kosse ** * = *** *** *** *** = * = = = ***
Lopez-Santillan = *** *** *** = ** = = = ** = =
Martinc * *** *** = *** = = = * = =
Nieuwenhuis *** *** *** *** = ** = = * ***
Raiyani *** *** = *** *** *** *** *** ***
Sandroni-Dias *** *** *** *** *** *** *** ***
Schaetti * ** = ** *** * =
Sezerer *** *** *** *** *** *
Sierra-Loaiza = = = = **
Stout = ** = =
Takahashi = = *
Tellez * ***
Veenhoven =
Von-Daniken
Table A4. Significance of accuracy differences between system pairs. Textual modality in English.
Aragon = = *** *** *** *** *** *** * *** ***
Ciccone = *** *** *** *** *** *** ** *** ***
Gopal *** *** *** *** *** *** = *** ***
Hacohen-Kerner (A) = *** *** *** *** *** *** =
Hacohen-Kerner (B) *** *** *** *** *** *** **
Martinc = = *** *** *** *
Nieuwenhuis * *** ** *** ***
Schaetti *** *** *** =
Sierra-Loaiza *** *** ***
Stout *** ***
Takahashi ***
Tellez
Table A5. Significance of accuracy differences between system pairs. Image modality in English.
Aragon = ** = = = = ** = = *** =
Ciccone *** * ** * = *** = ** *** =
Gopal = = = *** = ** = *** **
Hacohen-Kerner (A) = = = * = = *** =
Hacohen-Kerner (B) = * = = = *** =
Martinc = * = = *** =
Nieuwenhuis *** = * *** =
Schaetti ** = *** ***
Sierra-Loaiza = *** =
Stout *** =
Takahashi ***
Tellez
Table A6. Significance of accuracy differences between system pairs. Combined modality in English.
Aragon *** ** *** *** = = = * = = *** *** *** *** *** = ** = *** *** *
Bayot *** *** ** *** *** *** *** *** *** *** *** = *** = *** *** *** *** *** ***
Ciccone *** *** *** *** *** = ** * = *** *** *** *** = *** = = = ***
Daneshvar *** *** *** *** *** *** *** ** *** *** *** *** *** *** *** * * ***
Garibo *** *** *** *** *** *** *** *** ** = *** *** * *** *** *** **
Gopal = = *** = * *** *** *** * *** * = ** *** *** =
Hacohen-Kerner (A) = ** = = *** *** *** ** *** * * * *** *** =
Hacohen-Kerner (B) ** = = *** *** *** ** *** * * * *** *** =
Kosse ** = = *** *** *** *** = *** = = = ***
Lopez-Santillan = *** *** *** ** *** = ** * *** *** *
Martinc ** *** *** *** *** = *** = * ** ***
Nieuwenhuis *** *** *** *** * *** = = = ***
Raiyani ** *** = *** *** *** *** *** ***
Sandroni-Dias *** = *** *** *** *** *** ***
Schaetti *** *** = *** *** *** =
Sezerer *** *** *** *** *** ***
Sierra-Loaiza *** = = * ***
Stout *** *** *** =
Takahashi = = ***
Tellez = ***
Veenhoven ***
Von-Daniken
Table A7. Significance of accuracy differences between system pairs. Textual modality in Spanish.
Aragon = * *** *** *** *** *** *** *** *** ***
Ciccone = *** *** *** *** *** ** *** *** ***
Gopal *** *** *** *** *** = *** *** ***
Hacohen-Kerner (A) = *** *** *** *** *** *** ***
Hacohen-Kerner (B) ** *** *** *** *** *** ***
Martinc ** *** *** *** *** =
Nieuwenhuis = *** * *** =
Schaetti *** *** *** =
Sierra-Loaiza *** *** ***
Stout *** ***
Takahashi ***
Tellez
Table A8. Significance of accuracy differences between system pairs. Image modality in Spanish.
Aragon ** = = = = * *** * ** *** *
Ciccone ** *** *** ** = *** *** *** = =
Gopal = = = * ** * ** *** *
Hacohen-Kerner (A) = = *** * = = *** **
Hacohen-Kerner (B) * *** * = = *** ***
Martinc = *** ** ** *** =
Nieuwenhuis *** *** *** * =
Schaetti = = *** ***
Sierra-Loaiza = *** ***
Stout *** ***
Takahashi *
Tellez
Table A9. Significance of accuracy differences between system pairs. Combined modality in Spanish.
Appendix B Bayesian Signed-Rank Test Among Systems
Team (A)  Team (B)  P(A>B)  P(A=B)  P(A<B)