Verifying Multimedia Use at MediaEval 2016

Christina Boididou1, Symeon Papadopoulos1, Duc-Tien Dang-Nguyen2, Giulia Boato2, Michael Riegler3, Stuart E. Middleton4, Andreas Petlund3, and Yiannis Kompatsiaris1
1 Information Technologies Institute, CERTH, Greece. [boididou,papadop,ikom]@iti.gr
2 University of Trento, Italy. [dangnguyen,boato]@disi.unitn.it
3 Simula Research Laboratory, Norway. michael@simula.no, apetlund@ifi.uio.no
4 University of Southampton IT Innovation Centre, Southampton, UK. sem@it-innovation.soton.ac.uk

ABSTRACT
This paper provides an overview of the Verifying Multimedia Use task that takes place as part of the 2016 MediaEval Benchmark. The task motivates the development of automated techniques for detecting manipulated and misleading use of web multimedia content. Splicing, tampering and reposting of videos and images are examples of manipulation that are part of the task definition. For the 2016 edition of the task, a corpus of images/videos and their associated posts is made available, together with labels indicating whether each case constitutes misuse (fake) or not (real), as well as some useful post metadata.

Figure 1: Examples of misleading (fake) image use: (a) reposting of a real photo claiming to show two Vietnamese siblings at the Nepal 2015 earthquake; (b) reposting of artwork as a photo of the solar eclipse (March 2015); (c) sharks spliced onto a photo during Hurricane Sandy in 2012.

1. INTRODUCTION
Social media platforms, such as Twitter and Facebook, are very popular means of news sharing and are often used by governments and politicians to reach the public. The speed at which news spreads on such platforms often leads to the appearance of large amounts of misleading multimedia content. Given the need for automated real-time verification of this content, researchers have presented several techniques. For instance, previous work focused on classifying tweets spread during Hurricane Sandy [6] and other events [2] as fake or real, or on automatic methods for assessing post credibility [3]. Several systems for checking content credibility have been proposed, such as Truthy [8], TweetCred [5] and Hoaxy [9]. The second edition of this task aims to encourage the development of new verification approaches. This year, the task is extended by introducing a sub-task focused on identifying digitally manipulated multimedia content. To this end, we encourage text-focused and image-focused approaches equally.

2. TASK OVERVIEW
Main task. The definition of the main task is the following: "Given a social media post, comprising a text component, an associated piece of visual content (image/video) and a set of metadata originating from the social media platform, the task requires participants to return a decision (fake, real or unknown) on whether the information presented by this post sufficiently reflects the reality." In practice, participants receive a list of posts that are associated with images and are required to automatically predict, for each post, whether it is trustworthy or deceptive (real or fake respectively). In addition to fully automated approaches, the task also considers human-assisted approaches, provided that they are practical (i.e., fast enough) in real-world settings. The following definitions should also be taken into account:

• A post is considered fake when it shares multimedia content that does not represent the event that it refers to. Figure 1 presents examples of such content.
• A post is considered real when it shares multimedia content that legitimately represents the event it refers to.
• A post that shares multimedia content that does not represent the event it refers to, but reports the false information or refers to it with a sense of humour, is considered neither fake nor real (and hence is not included in the task dataset).

Sub-task. This edition of the task also addresses the problem of detecting digitally manipulated (tampered) images. The definition of the sub-task is the following: "Given an image, the task requires participants to return a decision (tampered, non-tampered or unknown) on whether the image has been digitally modified or not." In practice, participants receive a list of images and are required to predict, for each image, whether it is tampered or not, using multimedia forensic analysis. An image is considered tampered when it has been digitally altered.

In both cases, the task also asks participants to optionally return an explanation (a text string, or URLs pointing to evidence) that supports the verification decision. The explanation is not used for quantitative evaluation, but rather for gaining qualitative insights into the results.

3. VERIFICATION CORPUS
Development dataset (devset): This is provided together with ground truth and is used by participants to develop their approach. For the main task, it contains posts related to the 17 events of Table 1, comprising in total 193 cases of real and 220 cases of misused images/videos, associated with 6,225 real and 9,596 fake posts, posted by 5,895 and 9,216 unique users respectively. This data is the union of last year's devset and testset [1]. Note that several of the events, e.g., Columbian Chemicals and Passport Hoax, are hoaxes, hence all multimedia content associated with them is misused. For several real events (e.g., MA flight 370) no real images (and hence no real posts) are included in the dataset, since none came up as a result of the data collection process described below. For the sub-task, the development set contains 33 cases of non-tampered and 33 cases of tampered images, derived from the same events, along with their labels (tampered and non-tampered).

Table 1: devset events. For each event, we report the numbers of unique real (if available) and fake images/videos (IR, IF respectively), unique posts that shared those images (PR, PF) and unique Twitter accounts that posted those tweets (UR, UF).

Name                       IR    PR      UR      IF    PF      UF
Hurricane Sandy            148   4,664   4,446   62    5,559   5,432
Boston Marathon bombing    28    344     310     35    189     187
Sochi Olympics             -     -       -       26    274     252
Bring Back Our Girls       -     -       -       7     131     126
MA flight 370              -     -       -       29    501     493
Columbian Chemicals        -     -       -       15    185     87
Passport hoax              -     -       -       2     44      44
Rock Elephant              -     -       -       1     13      13
Underwater bedroom         -     -       -       3     113     112
Livr mobile app            -     -       -       4     9       9
Pig fish                   -     -       -       1     14      14
Nepal earthquake           11    1,004   934     21    356     343
Solar Eclipse              4     140     133     6     137     135
Garissa Attack             2     73      72      2     6       6
Samurai and Girl           -     -       -       4     218     212
Syrian Boy                 -     -       -       1     1,786   1,692
Varoufakis and ZDF         -     -       -       1     61      59
Total                      193   6,225   5,895   220   9,596   9,216

Test dataset (testset): This is used for evaluation. For the main task, it comprises 104 cases of real and misused images and 25 cases of real and misused videos, in total associated with 1,107 and 1,121 posts respectively. For the sub-task, it includes 64 cases of both tampered and non-tampered images from the testset events.

The data for both datasets are publicly available (https://github.com/MKLab-ITI/image-verification-corpus/tree/master/mediaeval2016). Similar to the 2015 edition of the task, the posts were collected around a number of known events or news stories and contain fake and real multimedia content manually verified by cross-checking online sources (articles and blogs). Having defined a set of keywords K for each testset event, we collected a set of posts P (using the Twitter API and the specific keywords) and a set of unique fake and real pictures around these events, resulting in the fake and real image sets IF and IR respectively. We then used the image sets as seeds to create our reference verification corpus PC ⊂ P, which includes only those posts that contain at least one image of the predefined sets IF, IR. However, in order not to restrict the posts to those pointing to the exact image, we employed a scalable visual near-duplicate search strategy [10]: we used IF and IR as visual queries and, for each query, checked whether each post image from the set P exists as an image item or a near-duplicate image item of the IF or IR set. In addition to this process, we also used a real-time system that collects posts using keywords and a location filter [7]. This was done mainly to increase the number of real samples for events that occurred in known locations.
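To make the seed-based filtering step above concrete, the snippet below is a minimal sketch, not the task's actual pipeline: the corpus was built with a scalable VLAD/Product-Quantization near-duplicate search [10], whereas this sketch substitutes a simple perceptual hash (Pillow and the imagehash library), and the Hamming-distance threshold and the posts' image_path field are assumptions made only for illustration.

```python
# Sketch of the seed-based corpus filtering (P_C ⊂ P) described above.
# A perceptual hash stands in for the actual VLAD/PQ near-duplicate search [10].
from PIL import Image
import imagehash

HAMMING_THRESHOLD = 8  # assumed maximum pHash distance for a near-duplicate


def build_index(seed_paths):
    """Hash every seed image (the I_F or I_R set)."""
    return [imagehash.phash(Image.open(p)) for p in seed_paths]


def matches_seed(image_path, index):
    """True if the post image is a duplicate or near-duplicate of any seed."""
    h = imagehash.phash(Image.open(image_path))
    return any(h - seed <= HAMMING_THRESHOLD for seed in index)


def build_corpus(posts, fake_index, real_index):
    """Keep only posts whose image matches a fake or a real seed image."""
    return [
        post for post in posts  # each post is assumed to carry an 'image_path'
        if matches_seed(post["image_path"], fake_index)
        or matches_seed(post["image_path"], real_index)
    ]
```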
To further extend the testset, we carried out a crowdsourcing campaign using the microWorkers platform (https://microworkers.com/). We asked each worker to provide three cases of manipulated multimedia content that they had found on the web. Furthermore, they had to provide a link with information and a description for each case, along with online resources containing evidence of its misleading nature. We also asked them to provide the original content, if available. To avoid cheating, they had to provide a manual description of the manipulation. We also tested the task in two pilot studies to make sure that the collected information would be useful. Overall, the collected data proved very useful. We ran 75 tasks and each worker earned $2.75 per task.

For every item of the datasets, we extracted and made available three types of features, similar to the ones we made available for the 2015 edition of the task: (i) features extracted from the post itself, e.g., the number of words, hashtags and mentions in the post's text [1]; (ii) features extracted from the user account, e.g., the number of friends and followers and whether the user is verified [1]; and (iii) forensic features extracted from the image, i.e., the probability map of the aligned double JPEG compression, the estimated quantization steps for the first six DCT coefficients of the non-aligned JPEG compression, and the Photo-Response Non-Uniformity (PRNU) [4].
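As an illustration of the first two feature types, the following sketch derives a few post- and user-level features from a tweet-like record. The field names follow common Twitter API conventions and are assumptions made for this example; the feature files released with the dataset [1] are the authoritative definitions and contain additional features.

```python
# Illustrative post- and user-level features (types (i) and (ii) above).
# Field names follow Twitter API conventions and are assumed for this sketch.

def post_features(tweet: dict) -> dict:
    """Features extracted from the post itself."""
    text = tweet.get("text", "")
    entities = tweet.get("entities", {})
    return {
        "num_words": len(text.split()),
        "text_length": len(text),
        "num_hashtags": len(entities.get("hashtags", [])),
        "num_mentions": len(entities.get("user_mentions", [])),
        "num_urls": len(entities.get("urls", [])),
    }


def user_features(tweet: dict) -> dict:
    """Features extracted from the account that posted the tweet."""
    user = tweet.get("user", {})
    return {
        "num_friends": user.get("friends_count", 0),
        "num_followers": user.get("followers_count", 0),
        "is_verified": bool(user.get("verified", False)),
        "has_profile_url": user.get("url") is not None,
    }
```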
4. EVALUATION
Overall, the main task is concerned with the accuracy with which an automatic method can distinguish between uses of multimedia in posts that faithfully reflect reality and uses that spread false impressions. Hence, given a set of labelled instances (post + image + label) and a set of predicted labels (included in the submitted runs) for these instances, the classic IR measures (Precision P, Recall R and F-score) are used to quantify the classification performance, where the target class is the class of fake tweets. Since the two classes (fake/real) are represented in a relatively balanced way in the testset, the classic IR measures are good proxies of classifier accuracy. Note that task participants are allowed to classify a tweet as unknown. If a system produces many unknown outputs, its precision is likely to benefit, assuming that unknown is selected wisely, i.e., erroneous classifications are successfully avoided. However, the recall of such a system will suffer whenever tweets labelled as unknown turn out to be fake (the target class). Similarly, for the sub-task, given the instances of (image + label), we use the same IR measures to quantify the performance of the approach, where the target class is tampered.
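The sketch below shows one way these measures can be computed for the main task, with fake as the target class. Treating unknown as simply "not predicted fake" is an assumption consistent with the behaviour described above (it does not hurt precision, but it lowers recall whenever the unknown item is actually fake); the official scoring may handle the unknown label differently.

```python
# Precision/Recall/F-score with 'fake' as the target class.
# 'unknown' predictions are treated as not-fake (an assumption): they never
# count as false positives, but they miss truly fake items and reduce recall.

def evaluate(gold_labels, predicted_labels):
    """gold_labels: 'fake'/'real'; predicted_labels: 'fake'/'real'/'unknown'."""
    pairs = list(zip(gold_labels, predicted_labels))
    tp = sum(1 for g, p in pairs if g == "fake" and p == "fake")
    fp = sum(1 for g, p in pairs if g != "fake" and p == "fake")
    fn = sum(1 for g, p in pairs if g == "fake" and p != "fake")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score


# Example: one correct 'fake', one missed 'fake' that was labelled 'unknown'.
print(evaluate(["fake", "fake", "real"], ["fake", "unknown", "real"]))
# -> (1.0, 0.5, 0.666...)
```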
5. ACKNOWLEDGEMENTS
This work is supported by the REVEAL and InVID projects, partially funded by the European Commission (FP7-610928 and H2020-687786 respectively).

6. REFERENCES
[1] C. Boididou, K. Andreadou, S. Papadopoulos, D.-T. Dang-Nguyen, G. Boato, M. Riegler, and Y. Kompatsiaris. Verifying multimedia use at MediaEval 2015. In Working Notes Proceedings of the MediaEval 2015 Workshop, 2015.
[2] C. Boididou, S. Papadopoulos, Y. Kompatsiaris, S. Schifferes, and N. Newman. Challenges of computational verification in social multimedia. In Proceedings of the 23rd International Conference on World Wide Web, pages 743–748. ACM, 2014.
[3] C. Castillo, M. Mendoza, and B. Poblete. Information credibility on Twitter. In Proceedings of the 20th International Conference on World Wide Web, pages 675–684. ACM, 2011.
[4] V. Conotter, D.-T. Dang-Nguyen, M. Riegler, G. Boato, and M. Larson. A crowdsourced data set of edited images online. In Proceedings of the 2014 International ACM Workshop on Crowdsourcing for Multimedia, CrowdMM '14, pages 49–52, New York, NY, USA, 2014. ACM.
[5] A. Gupta, P. Kumaraguru, C. Castillo, and P. Meier. TweetCred: Real-time credibility assessment of content on Twitter. In International Conference on Social Informatics, pages 228–243. Springer, 2014.
[6] A. Gupta, H. Lamba, P. Kumaraguru, and A. Joshi. Faking Sandy: characterizing and identifying fake images on Twitter during Hurricane Sandy. In Proceedings of the 22nd International Conference on World Wide Web Companion, pages 729–736, 2013.
[7] S. E. Middleton, L. Middleton, and S. Modafferi. Real-time crisis mapping of natural disasters using social media. IEEE Intelligent Systems, 29(2):9–17, 2014.
[8] J. Ratkiewicz, M. Conover, M. Meiss, B. Gonçalves, S. Patil, A. Flammini, and F. Menczer. Truthy: mapping the spread of astroturf in microblog streams. In Proceedings of the 20th International Conference Companion on World Wide Web, pages 249–252. ACM, 2011.
[9] C. Shao, G. L. Ciampaglia, A. Flammini, and F. Menczer. Hoaxy: A platform for tracking online misinformation. In Proceedings of the 25th International Conference Companion on World Wide Web, pages 745–750. International World Wide Web Conferences Steering Committee, 2016.
[10] E. Spyromitros-Xioufis, S. Papadopoulos, I. Kompatsiaris, G. Tsoumakas, and I. Vlahavas. A comprehensive study over VLAD and Product Quantization in large-scale image retrieval. IEEE Transactions on Multimedia, 16(6):1713–1728, 2014.