MediaEval 2016: A multimodal system for the Verifying Multimedia Use task

Cédric Maigrot¹, Vincent Claveau², Ewa Kijak¹, and Ronan Sicre²

IRISA, ¹Univ. of Rennes 1, ²CNRS, Rennes, France
Cedric.Maigrot@irisa.fr, Vincent.Claveau@irisa.fr, Ewa.Kijak@irisa.fr, Ronan.Sicre@irisa.fr


ABSTRACT
This paper presents a multi-modal hoax detection system composed of text,
source, and image analysis. As hoaxes can be very diverse, we want to analyze
several modalities to better detect them. This system is applied in the
context of the Verifying Multimedia Use task of MediaEval 2016. Experiments
show the performance of each separate modality as well as their combination.

1.    INTRODUCTION
   Social Networks (SN) have been of increasing importance in our daily
lives. When studying SN, one interesting aspect is the propagation of
publications, e.g. news, facts, or any information considered important and
shared across communities. A major characteristic of this propagation is its
speed. However, users rarely verify the veracity of the information they
share. Moreover, information that has been verified as false is often still
shared, and its spread cannot be contained [11, 9].
   Therefore, we are studying how to directly verify the veracity of any
information. Our goal is to create systems that can inform users before they
share false information. Consequently, we are extremely interested in the
Verifying Multimedia Use task of MediaEval 2016, which aims at classifying
Twitter publications to detect fake information [2]. Considering the nature
of tweet data, diverse information coming from the message and its meta-data
can be extracted. We explored in this work the predictive power of various
features. We propose different approaches based on text information, source
credibility, and image content.

2.    APPROACHES
   We propose four approaches: text-based (run-T), source-based (run-S),
image-based (run-I), and the combination of the three approaches (run-C).
For all of these methods, the prediction is first made at the image level,
then propagated to the tweets that contain the image, according to the
following rule: a tweet is predicted as real if all its associated images
are classified as real; if at least one of the images is classified as fake,
the tweet is considered fake.

2.1      Text-based nearest neighbors prediction
   This approach exploits the textual content of the tweets and does not
rely on any external data apart from the training set. As previously
explained, a tweet is classified based on the images it contains; an image
is described by the concatenated texts of every tweet containing this image.
The idea here is to capture similar comments between an unknown image and an
image from the training set (such as "It's photoshopped") or similar genres
of comments (presence of smileys, slang/journalistic language, etc.).
   Let I_q denote such a description for an unknown image, and {I_d_i} the
training set of image descriptions. The class of I_q is decided based on the
classes of the k most similar image descriptions in {I_d_i}. In practice, to
compute the similarities, we use a state-of-the-art information retrieval
approach called Okapi-BM25 [5]. A language-detection system (based on the
Google Translate service¹) is used to detect non-English tweets, which are
then translated into English with Google Translate. As a further
preprocessing step, we use orthographic and smiley normalization tools
developed in-house. The parameter k was set to 1 by cross-validation.

2.2      Trusted sources prediction
   This approach, already used by [4], is conceptually the simplest but
relies on external (static) knowledge. As for the previous run, the
prediction is made at the image level, and an image is represented as the
concatenation of every tweet (translated into English if needed) in which it
appears. The prediction is made by detecting trustworthy sources in the
image description. Two types of sources are searched for: 1) a known
news-related organization; 2) an explicit citation of the source of the
image. For the first type, we gathered lists of press agencies around the
world, newspapers (mostly French and English ones), and news TV networks
(French and English ones). For the second type, we manually defined some
patterns, such as "photographed by + Name", "captured by + Name", etc.
Finally, an image is classified as fake by default, unless a trustworthy
source is found in its text description.
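
   As an illustration of the text-based run of Section 2.1, the following
minimal sketch ranks training descriptions with a standard Okapi-BM25
formulation and returns the label of the k best matches. It is not the
system used for the submitted run: the whitespace tokenizer stands in for
the in-house normalization tools, the translation step is omitted, and the
class name BM25NearestNeighbor as well as the values k1 = 1.2 and b = 0.75
are assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain lowercase whitespace tokenization, standing in for
    # the orthographic and smiley normalization described in the paper.
    return text.lower().split()


class BM25NearestNeighbor:
    """k-nearest-neighbor classifier over image descriptions (hypothetical helper)."""

    def __init__(self, descriptions, labels, k1=1.2, b=0.75):
        self.docs = [Counter(tokenize(d)) for d in descriptions]
        self.labels = labels
        self.k1, self.b = k1, b  # assumed BM25 parameters
        self.lengths = [sum(doc.values()) for doc in self.docs]
        self.avg_len = sum(self.lengths) / len(self.docs)
        n_docs = len(self.docs)
        # Document frequency: number of descriptions containing each term.
        doc_freq = Counter(term for doc in self.docs for term in doc)
        # Smoothed IDF variant that stays positive for frequent terms.
        self.idf = {
            term: math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            for term, df in doc_freq.items()
        }

    def score(self, query_tokens, index):
        doc, length = self.docs[index], self.lengths[index]
        norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
        total = 0.0
        for term in query_tokens:
            tf = doc.get(term, 0)
            if tf:
                total += self.idf[term] * tf * (self.k1 + 1) / (tf + norm)
        return total

    def predict(self, query, k=1):
        tokens = tokenize(query)
        ranked = sorted(
            range(len(self.docs)),
            key=lambda i: self.score(tokens, i),
            reverse=True,
        )
        votes = Counter(self.labels[i] for i in ranked[:k])
        return votes.most_common(1)[0][0]


# Toy training set of image descriptions (invented for the example).
model = BM25NearestNeighbor(
    ["it's photoshopped lol so fake", "breaking news flood hits the city center"],
    ["fake", "real"],
)
prediction = model.predict("this is clearly photoshopped", k=1)
```

With k = 1, as selected by cross-validation in the paper, the unknown
description simply inherits the class of its best-scoring training
description.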

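   The source-based run of Section 2.2 and the tweet-level rule of
Section 2 can be sketched along the same lines. The organization list and
the citation pattern below are small hypothetical stand-ins, since the
actual lists and patterns are not given here; the function names are
illustrative as well.

```python
import re

# Hypothetical stand-in for the gathered lists of press agencies,
# newspapers and news TV networks.
TRUSTED_ORGANIZATIONS = {"afp", "reuters", "associated press", "bbc", "le monde"}

# Hypothetical patterns of the form "photographed by + Name".
CITATION_PATTERNS = [
    re.compile(r"\b(?:[Pp]hotographed|[Cc]aptured) by [A-Z][\w'-]+"),
]


def describe_image(tweet_texts):
    """An image is described by the concatenated texts of its tweets."""
    return " ".join(tweet_texts)


def predict_from_sources(description):
    """Return 'real' if a trustworthy source is found, 'fake' by default."""
    lowered = description.lower()
    for organization in TRUSTED_ORGANIZATIONS:
        if re.search(r"\b" + re.escape(organization) + r"\b", lowered):
            return "real"
    if any(pattern.search(description) for pattern in CITATION_PATTERNS):
        return "real"
    return "fake"


def predict_tweet(image_labels):
    """Tweet-level rule, assuming every label is either 'real' or 'fake':
    one fake image makes the tweet fake, otherwise it is real."""
    return "fake" if "fake" in image_labels else "real"


cited = predict_from_sources(
    describe_image(["Amazing shot, photographed by John Smith", "wow"])
)
agency = predict_from_sources("Photo via Reuters")
default = predict_from_sources("omg is this real??")
```

Because fake is the default class, such a classifier can only err toward
fake when its lists are incomplete.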
¹ https://translate.google.com/

Copyright is held by the author/owner(s).
MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands

2.3      Image retrieval prediction
   In this approach only the image content is used to provide a prediction,
at the image level. Note that some tweets do not contain images but videos;
such tweets are thus labeled as unknown.
   Images from the Verifying Multimedia Use task are classified using
external information. We perform image retrieval, which consists in querying
a database of known fake/real images to discover already known fake images.
The database is built by collecting images from 5 specialized websites, i.e.
www.hoaxbuster.com/, hoax-busters.org, urbanlegends.about.com, snopes.com,
and www.hoax-slayer.com/. The set contains around 500 original images and
7500 fake samples.
   Generic image descriptors are computed using a very deep Convolutional
Neural Network (CNN) [8]. First, we apply the convolutional layers [10] of
the network on images scaled to a standard size of 544 × 544. Then, the
first two fully connected layers are kernelized and applied to the output
feature map, producing a new 11 × 11 × 4096-dimensional feature map.
Finally, average pooling followed by l2-normalization is performed, giving a
4096-dimensional descriptor [3, 7, 6]. Once all image descriptors are
obtained, cosine similarity is computed between the query and all images
from the database. If the highest similarity is higher than a threshold of
0.9 (set on the training dataset), then the query receives the label of the
most similar image. Otherwise, the query is labeled as unknown.

Figure 1: Recall (red), precision (green) and F-Measure (blue) scores of the
fake class on the test set. [Bar chart not reproduced; x-axis: Approaches
(run-T, run-I, run-S, run-C); y-axis: Score in %.]
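
   The decision rule that closes Section 2.3 can be sketched as follows,
assuming the descriptors have already been extracted. The 3-dimensional toy
vectors stand in for the 4096-dimensional CNN descriptors, and the function
names are hypothetical; only the 0.9 threshold comes from the text.

```python
import math


def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def label_by_retrieval(query, database, threshold=0.9):
    """Return (label, similarity) for a query descriptor.

    database is a list of (descriptor, label) pairs. The query receives the
    label of its most similar database image if the cosine similarity
    exceeds the threshold, and 'unknown' otherwise.
    """
    q = l2_normalize(query)
    best_similarity, best_label = -1.0, None
    for descriptor, label in database:
        d = l2_normalize(descriptor)
        # On l2-normalized vectors, the dot product is the cosine similarity.
        similarity = sum(a * b for a, b in zip(q, d))
        if similarity > best_similarity:
            best_similarity, best_label = similarity, label
    if best_similarity > threshold:
        return best_label, best_similarity
    return "unknown", best_similarity


# Toy reference database (invented for the example).
reference = [([1.0, 0.0, 0.0], "fake"), ([0.0, 1.0, 0.0], "real")]
near_duplicate = label_by_retrieval([0.99, 0.05, 0.0], reference)
ambiguous = label_by_retrieval([1.0, 1.0, 0.0], reference)
```

A high threshold keeps only near-duplicates of database images, which is
why a small reference database yields many unknown labels.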

2.4    Combination
   This last approach aims at combining the three preceding ones in a late
fusion process. Thus, for a given image, it takes as input the predictions
given by the three systems described above. As before, the final prediction
on the image is then propagated to the tweets containing it.
   Instead of using a simple fusion process (for instance, a majority vote),
we try to automatically build a fusion model fine-tuned to the task. We thus
use a machine learning algorithm, namely boosting (AdaBoost.MH) over
decision trees [1], which takes as input the predictions of the three
previous approaches, and also the scores associated with these predictions
(for run-T and run-I). The parameters of the machine learning algorithm are
set by cross-validation on the training data: the number of boosting
iterations is 500 and the depth of the trees is 3. Finally, the fusion model
is learned on the whole training set; it is then used on the test set
images.

3.    RESULTS
   The four approaches are applied to the MediaEval 2016 test set and
results are reported in Figure 1. The test set is composed of 2228 Twitter
messages associated with 130 images. Moreover, 65% and 26% of the tweets of
the development and test sets, respectively, are associated with a single
event.
   We observe that the approach based on the source trustworthiness level
(run-S) outperforms the text-based approach (run-T), which outperforms the
image-based approach (run-I). We can see that the text-based approach
competes with the source-based approach in terms of recall. This means that
the text approach tends to classify every tweet as fake. This may be
explained by the fact that the training set is unbalanced, as it contains
3 times more fake examples than real ones.
   We note that the prediction based on the image approach has several
drawbacks and performs poorly. In particular, the precision is low compared
to what we estimated on the training set. Several explanations can be given.
First, only 86% of the test tweets are associated with one or more images
(the rest are associated with video content), meaning that the image
approach is evaluated only on this portion of the dataset. Therefore, recall
and F-score are directly impacted. Secondly, the reference database that we
built is small and unbalanced, resulting in a high number of unknown labels
in the predictions. Thirdly, the database does not always contain the
original images, and a forged image and its original version, differing only
by small modifications, can be considered similar. Finally, images shared on
SN often present specific editing characteristics, such as visible added
watermarks like "fake", "rumor" or "real", circles, text annotations, etc.
Such edits impair the similarity computation between images.
   Concerning run-C, we note that the combination using late fusion does not
offer any gain, and even performs worse than run-S alone. This result is
disappointing, as it differs from what we evaluated on the training set by
cross-validation. It may be explained by an overfitting problem when
learning the fusion model, and by the lower precision (compared to the one
estimated on the training set) obtained by run-I, which is used as input.

4.   CONCLUSION
   A multi-modal hoax detection system based on text, source, and image
analysis is presented. This system uses different categories of external
knowledge: static and general ones, such as press agency lists, and dynamic
and dedicated ones, such as hoax-listing websites. Our evaluation confirms
previous results on the good performance of the source analysis; conversely,
the image approach shows poor results. Yet, we still consider this latter
approach promising; several improvements are foreseen for both the database
and the content comparison. Finally, multimodality remains a challenge, as
integrating different sources of knowledge may result in performance loss.

5.   ACKNOWLEDGEMENTS
   This work is partly supported by the Direction Générale de l'Armement,
France (DGA).

References
 [1] Nathalie Camelin, Antoine Laurent, and Christian Raymond. Boosting
     bonsai trees for efficient features combination: application to
     speaker role identification. In Proc. of InterSpeech, 2014.
 [2] Christina Boididou, Symeon Papadopoulos, Duc-Tien Dang-Nguyen, Giulia
     Boato, Michael Riegler, Stuart E. Middleton, Katerina Andreadou, and
     Yiannis Kompatsiaris. Verifying multimedia use at MediaEval 2016. In
     Working Notes Proceedings of the MediaEval 2016 Workshop, 2016.
 [3] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, and Andrea Vedaldi.
     Deep filter banks for texture recognition, description, and
     segmentation. International Journal of Computer Vision,
     118(1):65–94, 2016.
 [4] Stuart Middleton. Extracting attributed verification and debunking
     reports from social media: MediaEval-2015 trust and credibility
     analysis of image and video. 2015.
 [5] Stephen E. Robertson, Steve Walker, and Micheline Hancock-Beaulieu.
     Okapi at TREC-7: Automatic Ad Hoc, Filtering, VLC and Interactive. In
     Proc. of the 7th Text Retrieval Conference, TREC-7, pages 199–210,
     1998.
 [6] Ronan Sicre and Hervé Jégou. Memory vectors for particular object
     retrieval with multiple queries. In Proceedings of the 5th ACM on
     International Conference on Multimedia Retrieval, pages 479–482.
     ACM, 2015.
 [7] Ronan Sicre and Frédéric Jurie. Discriminative part model for visual
     recognition. Computer Vision and Image Understanding, 141:28–37, 2015.
 [8] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks
     for large-scale image recognition. arXiv preprint arXiv:1409.1556,
     2014.
 [9] Hokky Situngkir. Spread of hoax in social media. 2011.
[10] Giorgos Tolias, Ronan Sicre, and Hervé Jégou. Particular object
     retrieval with integral max-pooling of CNN activations. In ICLR, 2016.
[11] Jaewon Yang and Jure Leskovec. Modeling information diffusion in
     implicit networks. In 2010 IEEE International Conference on Data
     Mining, pages 599–608. IEEE, 2010.