The CERTH-UNITN Participation @ Verifying Multimedia Use 2015

Christina Boididou (1), Symeon Papadopoulos (1), Duc-Tien Dang-Nguyen (2), Giulia Boato (2), and Yiannis Kompatsiaris (1)

(1) Information Technologies Institute, CERTH, Greece. [boididou,papadop,ikom]@iti.gr
(2) University of Trento, Italy. [dangnguyen,boato]@disi.unitn.it

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

ABSTRACT
We propose an approach that predicts whether a tweet accompanied by multimedia content (image/video) is trustworthy or deceptive. We test different combinations of quality- and trust-oriented features (tweet-based, user-based and forensics) in tandem with a standard classification scheme and an agreement-retraining technique, with the goal of predicting the most likely label (fake or real) for each tweet. The experiments carried out on the Verifying Multimedia Use dataset show that the best performance is achieved when using all available features in combination with the agreement-retraining method.

1. INTRODUCTION
Since social media have gained momentum over the years as a fast and real-time means of sharing news, a huge amount of information constantly flows through them, quickly reaching massive numbers of readers. Such content can easily become viral and affect public opinion and sentiment. This has motivated a number of malicious efforts to spread misleading content, highlighting the need for fast verification. In this setting, the goal of the Verifying Multimedia Use task is to automatically predict whether a tweet that shares multimedia content is misleading (referred to as fake) or trustworthy (real) [1]. To this end, we make use of the tweet text content, a set of tweet- and user-based features, and multimedia forensics features for the images embedded in the tweet.

In our work, we present an extension of our original approach [2], combining different sets of the aforementioned features. The conducted experiments include plain classification models and an agreement-retraining method that uses part of its own predictions as new training samples, with the goal of adapting to the new event. In the next sections, we present the adopted methodology in detail.

2. SYSTEM OVERVIEW

2.1 Features
The approach uses three types of features: a) tweet-based (TB), which make use of information coming from the tweet and its metadata, b) user-based (UB), which are computed using information and metadata about the user posting (or retweeting) the tweet, and c) multimedia forensics (FOR) features, which are computed from the image that accompanies the tweet. For the first two sets we test two variants: i) baseline (base), which corresponds to the features shared by the organisers, and ii) extended (ext), which includes a few new features. The forensics features include both the ones distributed by the organisers and some additional ones. Table 1 summarises the feature sets.

Table 1: List of features used in the experiments.

  Feature set   Description
  TB-base       Baseline tweet-based
  TB-ext        Extended tweet-based
  UB-base       Baseline user-based
  UB-ext        Extended user-based
  FOR           Forensic features

TB-ext: We extract additional features based on the tweet text, such as the presence of a word, symbol or external link. We also use language-specific binary features that correspond to the presence of specific terms; for languages in which we cannot define such terms, we consider the values of these features missing. We perform language detection with a publicly available library (https://code.google.com/p/language-detection/). We add a feature for the number of slang words in a text, using slang lists in English (http://onlineslangdictionary.com/word-list/0-a/) and Spanish (http://www.languagerealm.com/spanish/spanishslang.php). For the number of nouns, we use the Stanford parser (http://nlp.stanford.edu/software/lex-parser.shtml) to assign parts of speech to each word (supported only in English). For the readability of the text, we use the Flesch Reading Ease method (http://simple.wikipedia.org/wiki/Flesch_Reading_Ease), which computes the complexity of a piece of text as a score in the interval [0, 100] (0: hard to read, 100: easy to read).
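The following minimal Python sketch illustrates how the readability and slang features can be computed; the slang list and the vowel-group syllable heuristic are simplified stand-ins for the dictionaries and tools referenced above, not the exact implementation used in our runs.

```python
import re

# Toy slang list; the actual feature uses published English and Spanish
# slang dictionaries (see the URLs above).
SLANG = {"lol", "omg", "wtf", "smh"}

def count_syllables(word):
    # Crude heuristic: one syllable per group of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    # Flesch Reading Ease:
    #   206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    score = (206.835 - 1.015 * (len(words) / sentences)
             - 84.6 * (syllables / len(words)))
    # Clamp to the [0, 100] interval reported above.
    return min(100.0, max(0.0, score))

def num_slang_words(text):
    return sum(1 for w in re.findall(r"[A-Za-z']+", text.lower()) if w in SLANG)

tweet = "OMG look at this photo of a shark swimming on the highway!"
print(flesch_reading_ease(tweet), num_slang_words(tweet))
```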
UB-ext: We extract user-specific features, such as the number of media items shared, the account age, and others that refer to the information exposed by the profile. For example, we check whether the user declares his/her geographic location and whether that location can be matched to a city name from the Geonames dataset (http://download.geonames.org/export/dump/cities1000.zip).

Next, for both the TB and UB sets, we adopt trust-oriented features for the links shared, through the tweet itself (TB) or the user profile (UB). The WOT metric (https://www.mywot.com/) is a score indicating how trustworthy a website is, based on reputation ratings by Web users. We also include the in-degree and harmonic centralities, rankings computed on the graph formed by the links of the Web (http://wwwranking.webdatacommons.org/more.html). Trust analysis of the links is also performed using four Web metrics provided by the Alexa API (http://data.alexa.com/data?cli=10&dat=snbamz&url=google.gr).

FOR: For each image, the additional forensics features are extracted from the provided BAG feature, based on the maps obtained from AJPG and NAJPG. First, a binary map is created by thresholding the AJPG map (we use 0.6 as the threshold); the largest region is then selected as the object and the rest of the map is considered as the background. For both regions, seven descriptive statistics (maximum, minimum, mean, median, most frequent value, standard deviation, and variance) are computed from the BAG values and concatenated into a 14-dimensional vector. We apply the same process to the NAJPG map to obtain a second feature vector.
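The statistics-extraction step can be sketched as follows, assuming the AJPG map and the BAG values are available as equally-shaped NumPy arrays; the function and variable names are illustrative, not those of the provided feature code.

```python
import numpy as np
from scipy import ndimage

def region_stats(values):
    # Seven descriptive statistics: maximum, minimum, mean, median,
    # most frequent value, standard deviation, and variance.
    uniq, counts = np.unique(values, return_counts=True)
    mode = uniq[np.argmax(counts)]
    return np.array([values.max(), values.min(), values.mean(),
                     np.median(values), mode, values.std(), values.var()])

def forensics_vector(ajpg_map, bag_values, threshold=0.6):
    # Binarise the AJPG map at the 0.6 threshold used above.
    binary = ajpg_map > threshold
    # Keep the largest connected region as the object; the rest is background.
    labels, n = ndimage.label(binary)
    if n == 0:
        return np.zeros(14)
    sizes = ndimage.sum(binary, labels, range(1, n + 1))
    object_mask = labels == (np.argmax(sizes) + 1)
    # Seven statistics of the BAG values inside and outside the object,
    # concatenated into a 14-dimensional vector.
    return np.concatenate([region_stats(bag_values[object_mask]),
                           region_stats(bag_values[~object_mask])])
```

The same function applied to the NAJPG map yields the second 14-dimensional vector.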
2.2 Agreement-based retraining method

[Figure 1: Overview of agreement-based retraining.]

The main extension of this system compared to [2] is an agreement-based retraining step that aims to improve the prediction accuracy on unseen events. It is motivated by a similar approach implemented in [3] for the problem of polarity classification. Figure 1 illustrates the adopted process. In step (a), we build two classifiers, CL1 and CL2, on the training set, each using a different type of features, and we combine their outputs in a Semi-Supervised Learning (SSL) fashion. We compare the two predictions for each sample of the test set and, depending on their agreement, divide the test set into two subsets: the agreed and the disagreed samples. These two subsets are treated differently by the classification framework. Assuming that the agreed predictions are correct with high likelihood, we use them as training samples to build a new classifier for classifying the disagreed samples. To this end, in step (b), we add the agreed samples to the training set of the better performing of the two initial models, CL1 and CL2 (compared on the basis of their cross-validation performance on the training set). The goal of this method is to retrain the initial model and make it adaptable to the specific characteristics of the new event. In this way, the model can predict more accurately the labels of the samples on which CL1 and CL2 did not agree in the first step.

2.3 Bagging

Due to the unequal number of fake and real tweets, building a single class-balanced model would exploit only part of the data. In order to take advantage of the whole training dataset, we use bagging, which tends to improve accuracy by aggregating the predictions of numerous predictors. Bagging creates m different subsets of the training set, each including an equal number of samples from each class (some samples may appear in multiple subsets), leading to the creation of m instances of the CL1 and CL2 classifiers (m = 9). The final prediction for each test sample is the majority vote of the m predictions.
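The following condensed sketch illustrates how agreement-based retraining and bagging fit together, using scikit-learn's RandomForestClassifier as a stand-in for the Weka Random Forest used in the submitted runs (Section 3); the data layout and helper names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def bagged_predict(X_train, y_train, X_test, m=9, seed=0):
    # Bagging: m class-balanced subsets (samples may repeat across subsets);
    # the final label is the majority vote of the m predictions.
    rng = np.random.default_rng(seed)
    classes = np.unique(y_train)
    n_per_class = min((y_train == c).sum() for c in classes)
    votes = np.zeros((m, len(X_test)))
    for i in range(m):
        idx = np.concatenate([
            rng.choice(np.flatnonzero(y_train == c), n_per_class, replace=False)
            for c in classes])
        clf = RandomForestClassifier(random_state=i)
        clf.fit(X_train[idx], y_train[idx])
        votes[i] = clf.predict(X_test)
    return (votes.mean(axis=0) > 0.5).astype(int)  # majority vote, labels in {0, 1}

def agreement_retraining(X1_tr, X2_tr, y_tr, X1_te, X2_te):
    # Step (a): two classifiers on different feature sets; split the test set
    # into agreed and disagreed samples.
    p1 = bagged_predict(X1_tr, y_tr, X1_te)
    p2 = bagged_predict(X2_tr, y_tr, X2_te)
    agreed = p1 == p2
    # Select the stronger initial model via cross-validation on the training set.
    s1 = cross_val_score(RandomForestClassifier(), X1_tr, y_tr).mean()
    s2 = cross_val_score(RandomForestClassifier(), X2_tr, y_tr).mean()
    X_tr, X_te = (X1_tr, X1_te) if s1 >= s2 else (X2_tr, X2_te)
    # Step (b): add the agreed test samples, with their predicted labels, to the
    # training set, retrain, and classify the disagreed samples.
    X_aug = np.vstack([X_tr, X_te[agreed]])
    y_aug = np.concatenate([y_tr, p1[agreed]])
    return np.where(agreed, p1, bagged_predict(X_aug, y_aug, X_te))
```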
3. SUBMITTED RUNS AND RESULTS

The five submitted runs explore different combinations of features and the use of a standard supervised learning scheme (SL) versus the newly proposed agreement-based retraining (SSL-AR). The specific run configurations are given in Table 2.

Table 2: Description of runs. For the SSL-AR cases, the two sets of features used for building the two classifiers are listed; for example, in RUN-3 the TB-base + FOR features are used for CL1 and the UB-base features for CL2.

  Run    Learning  Features
  RUN-1  SL        TB-base
  RUN-2  SL        TB-base + FOR
  RUN-3  SSL-AR    (TB-base + FOR) + UB-base
  RUN-4  SL        TB-ext + UB-ext + FOR
  RUN-5  SSL-AR    (TB-ext + FOR) + UB-ext

RUN-1, RUN-2 and RUN-4 are built using a plain classification model. RUN-3 and RUN-5 are built with the agreement-based retraining technique (Figure 1), in which CL1 and CL2 use the sets of features specified in Table 2. All models use a Random Forest classifier from the Weka implementation.

Table 3: Results.

  Run    Recall  Precision  F-score
  RUN-1  0.794   0.733      0.762
  RUN-2  0.749   0.994      0.854
  RUN-3  0.922   0.736      0.819
  RUN-4  0.798   0.860      0.828
  RUN-5  0.969   0.861      0.911

Table 3 presents the performance of each run. In terms of F-score, the primary evaluation metric of the task, RUN-5 achieved the best score, using the ext and FOR features with the SSL-AR technique. We also observe that RUN-2, in which the FOR features are added, performed considerably better than RUN-1, which uses only the TB-base features. Comparing RUN-4 and RUN-5, one may observe the considerable performance benefit stemming from the use of the SSL-AR approach, as it is the only difference between the two runs (the same sets of features are used). Additionally, the contribution of the ext features is notable, as RUN-5 (ext) performs better than RUN-3 (base).

4. ACKNOWLEDGEMENTS

This work is supported by the REVEAL project, partially funded by the European Commission (FP7-610928).

5. REFERENCES

[1] C. Boididou, K. Andreadou, S. Papadopoulos, D.-T. Dang-Nguyen, G. Boato, M. Riegler, and Y. Kompatsiaris. Verifying multimedia use at MediaEval 2015. In MediaEval 2015 Workshop, Wurzen, Germany, 2015.

[2] C. Boididou, S. Papadopoulos, Y. Kompatsiaris, S. Schifferes, and N. Newman. Challenges of computational verification in social multimedia. In Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web, pages 743–748, 2014.

[3] A. Tsakalidis, S. Papadopoulos, and I. Kompatsiaris. An ensemble model for cross-domain polarity classification on Twitter. In Web Information Systems Engineering – WISE 2014, pages 168–177. Springer, 2014.