Verifying Multimedia Use at MediaEval 2016

Christina Boididou1, Symeon Papadopoulos1, Duc-Tien Dang-Nguyen2, Giulia Boato2, Michael Riegler3, Stuart E. Middleton4, Andreas Petlund3, and Yiannis Kompatsiaris1
1 Information Technologies Institute, CERTH, Greece. [boididou,papadop,ikom]@iti.gr
2 University of Trento, Italy. [dangnguyen,boato]@disi.unitn.it
3 Simula Research Laboratory, Norway. michael@simula.no, apetlund@ifi.uio.no
4 University of Southampton IT Innovation Centre, Southampton, UK. sem@it-innovation.soton.ac.uk

ABSTRACT
This paper provides an overview of the Verifying Multimedia Use task that takes place as part of the 2016 MediaEval Benchmark. The task motivates the development of automated techniques for detecting manipulated and misleading use of web multimedia content. Splicing, tampering and reposting of videos and images are examples of manipulation that are part of the task definition. For the 2016 edition of the task, a corpus of images/videos and their associated posts is made available, together with labels indicating whether each case constitutes misuse (fake) or not (real), as well as some useful post metadata.

Figure 1: Examples of misleading (fake) image use: (a) reposting of a real photo claiming to show two Vietnamese siblings at the Nepal 2015 earthquake; (b) reposting of artwork as a photo of the solar eclipse (March 2015); (c) sharks spliced onto a photo during Hurricane Sandy in 2012.

1. INTRODUCTION
Social media platforms, such as Twitter and Facebook, are very popular means of news sharing and are often used by governments and politicians to reach the public. The speed at which news spreads on such platforms often leads to the appearance of large amounts of misleading multimedia content. Given the need for automated real-time verification of this content, researchers have presented several techniques. For instance, previous work focused on classifying tweets spread during Hurricane Sandy [6] and other events [2] as fake or real, or on automatic methods for assessing post credibility [3]. Several systems for checking content credibility have been proposed, such as Truthy [8], TweetCred [5] and Hoaxy [9]. The second edition of this task aims to encourage the development of new verification approaches. This year, the task is extended by introducing a sub-task focused on identifying digitally manipulated multimedia content. To this end, we encourage text-focused and image-focused approaches equally.

2. TASK OVERVIEW
Main task. The definition of the main task is the following: "Given a social media post, comprising a text component, an associated piece of visual content (image/video) and a set of metadata originating from the social media platform, the task requires participants to return a decision (fake, real or unknown) on whether the information presented by this post sufficiently reflects the reality." In practice, participants receive a list of posts that are associated with images and are required to automatically predict, for each post, whether it is trustworthy or deceptive (real or fake respectively). In addition to fully automated approaches, the task also considers human-assisted approaches, provided that they are practical (i.e., fast enough) in real-world settings. The following definitions should also be taken into account:

• A post is considered fake when it shares multimedia content that does not represent the event that it refers to. Figure 1 presents examples of such content.
• A post is considered real when it shares multimedia content that legitimately represents the event it refers to.
• A post that shares multimedia content that does not represent the event it refers to, but reports the false information or refers to it with a sense of humour, is considered neither fake nor real (and hence is not included in the task dataset).

Sub-task. This edition of the task also addresses the problem of detecting digitally manipulated (tampered) images. The definition of the sub-task is the following: "Given an image, the task requires participants to return a decision (tampered, non-tampered or unknown) on whether the image has been digitally modified or not." In practice, participants receive a list of images and are required to predict, for each image, whether it is tampered or not, using multimedia forensic analysis. An image is considered tampered when it has been digitally altered.

In both cases, the task also asks participants to optionally return an explanation (a text string, or URLs pointing to evidence) that supports the verification decision. The explanation is not used for quantitative evaluation, but rather for gaining qualitative insights into the results.

3. VERIFICATION CORPUS
Development dataset (devset): This is provided together with ground truth and is used by participants to develop their approach. For the main task, it contains posts related to the 17 events of Table 1, comprising in total 193 cases of real and 220 cases of misused images/videos, associated with 6,225 real and 9,596 fake posts, posted by 5,895 and 9,216 unique users respectively. This data is the union of last year's devset and testset [1]. Note that several of the events, e.g., Columbian Chemicals and Passport Hoax, are hoaxes, hence all multimedia content associated with them is misused. For several real events (e.g., MA flight 370) no real images (and hence no real posts) are included in the dataset, since none came up as a result of the data collection process described below. For the sub-task, the development set contains 33 cases of non-tampered and 33 cases of tampered images, derived from the same events, along with their labels (tampered and non-tampered).

Table 1: devset events. For each event, we report the numbers of unique real (if available) and fake images/videos (IR, IF respectively), unique posts that shared those images (PR, PF) and unique Twitter accounts that posted those tweets (UR, UF).

Name                       IR    PR      UR      IF    PF      UF
Hurricane Sandy            148   4,664   4,446   62    5,559   5,432
Boston Marathon bombing    28    344     310     35    189     187
Sochi Olympics             -     -       -       26    274     252
Bring Back Our Girls       -     -       -       7     131     126
MA flight 370              -     -       -       29    501     493
Columbian Chemicals        -     -       -       15    185     87
Passport hoax              -     -       -       2     44      44
Rock Elephant              -     -       -       1     13      13
Underwater bedroom         -     -       -       3     113     112
Livr mobile app            -     -       -       4     9       9
Pig fish                   -     -       -       1     14      14
Nepal earthquake           11    1,004   934     21    356     343
Solar Eclipse              4     140     133     6     137     135
Garissa Attack             2     73      72      2     6       6
Samurai and Girl           -     -       -       4     218     212
Syrian Boy                 -     -       -       1     1,786   1,692
Varoufakis and ZDF         -     -       -       1     61      59
Total                      193   6,225   5,895   220   9,596   9,216

Test dataset (testset): This is used for evaluation. For the main task, it comprises 104 cases of real and misused images and 25 cases of real and misused videos, in total associated with 1,107 and 1,121 posts respectively. For the sub-task, it includes 64 cases of both tampered and non-tampered images from the testset events.

The data for both datasets are publicly available (https://github.com/MKLab-ITI/image-verification-corpus/tree/master/mediaeval2016). Similar to the 2015 edition of the task, the posts were collected around a number of known events or news stories and contain fake and real multimedia content manually verified by cross-checking online sources (articles and blogs). Having defined a set of keywords K for each testset event, we collected a set of posts P (using the Twitter API and the specific keywords) and a set of unique fake and real pictures around these events, resulting in the fake and real image sets IF and IR respectively. We then used the image sets as seeds to create our reference verification corpus PC ⊂ P, which includes only those posts that contain at least one image of the predefined sets IF, IR. However, in order not to restrict the posts to those pointing to the exact image, we employed a scalable visual near-duplicate search strategy [10]: we used IF and IR as visual queries and, for each query, checked whether each post image from the set P exists as an image item or a near-duplicate image item of the IF or IR set. In addition to this process, we also used a real-time system that collects posts using keywords and a location filter [7]. This was done mainly to increase the number of real samples for events that occurred in known locations.
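To make the seed-based filtering step above concrete, the snippet below is a minimal sketch, not the task's actual pipeline: the corpus was built with a scalable VLAD/Product-Quantization near-duplicate search [10], whereas this sketch substitutes a simple perceptual hash (Pillow and the imagehash library), and the Hamming-distance threshold and the posts' image_path field are assumptions made only for illustration.

```python
# Sketch of the seed-based corpus filtering (P_C ⊂ P) described above.
# A perceptual hash stands in for the actual VLAD/PQ near-duplicate search [10].
from PIL import Image
import imagehash

HAMMING_THRESHOLD = 8  # assumed maximum pHash distance for a near-duplicate


def build_index(seed_paths):
    """Hash every seed image (the I_F or I_R set)."""
    return [imagehash.phash(Image.open(p)) for p in seed_paths]


def matches_seed(image_path, index):
    """True if the post image is a duplicate or near-duplicate of any seed."""
    h = imagehash.phash(Image.open(image_path))
    return any(h - seed <= HAMMING_THRESHOLD for seed in index)


def build_corpus(posts, fake_index, real_index):
    """Keep only posts whose image matches a fake or a real seed image."""
    return [
        post for post in posts  # each post is assumed to carry an 'image_path'
        if matches_seed(post["image_path"], fake_index)
        or matches_seed(post["image_path"], real_index)
    ]
```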
To further extend the testset, we carried out a crowdsourcing campaign using the microWorkers platform (https://microworkers.com/). We asked each worker to provide three cases of manipulated multimedia content that they had found on the web. Furthermore, they had to provide a link with information and a description for each case, along with online resources containing evidence of its misleading nature. We also asked them to provide the original content, if available. To avoid cheating, they had to provide a manual description of the manipulation. We also tested the task in two pilot studies to make sure that the collected information would be useful. Overall, the collected data proved very useful. We ran 75 tasks and each worker earned $2.75 per task.

For every item of the datasets, we extracted and made available three types of features, similar to the ones we made available for the 2015 edition of the task: (i) features extracted from the post itself, e.g., the number of words, hashtags and mentions in the post's text [1]; (ii) features extracted from the user account, e.g., the number of friends and followers and whether the user is verified [1]; and (iii) forensic features extracted from the image, i.e., the probability map of the aligned double JPEG compression, the estimated quantization steps for the first six DCT coefficients of the non-aligned JPEG compression, and the Photo-Response Non-Uniformity (PRNU) [4].
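As an illustration of the first two feature types, the following sketch derives a few post- and user-level features from a tweet-like record. The field names follow common Twitter API conventions and are assumptions made for this example; the feature files released with the dataset [1] are the authoritative definitions and contain additional features.

```python
# Illustrative post- and user-level features (types (i) and (ii) above).
# Field names follow Twitter API conventions and are assumed for this sketch.

def post_features(tweet: dict) -> dict:
    """Features extracted from the post itself."""
    text = tweet.get("text", "")
    entities = tweet.get("entities", {})
    return {
        "num_words": len(text.split()),
        "text_length": len(text),
        "num_hashtags": len(entities.get("hashtags", [])),
        "num_mentions": len(entities.get("user_mentions", [])),
        "num_urls": len(entities.get("urls", [])),
    }


def user_features(tweet: dict) -> dict:
    """Features extracted from the account that posted the tweet."""
    user = tweet.get("user", {})
    return {
        "num_friends": user.get("friends_count", 0),
        "num_followers": user.get("followers_count", 0),
        "is_verified": bool(user.get("verified", False)),
        "has_profile_url": user.get("url") is not None,
    }
```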
4. EVALUATION
Overall, the main task is concerned with the accuracy with which an automatic method can distinguish between uses of multimedia in posts that faithfully reflect reality and uses that spread false impressions. Hence, given a set of labelled instances (post + image + label) and a set of predicted labels (included in the submitted runs) for these instances, the classic IR measures (Precision P, Recall R and F-score) are used to quantify the classification performance, where the target class is the class of fake tweets. Since the two classes (fake/real) are represented in a relatively balanced way in the testset, the classic IR measures are good proxies of classifier accuracy. Note that task participants are allowed to classify a tweet as unknown. If a system produces many unknown outputs, its precision is likely to benefit, assuming that unknown is selected wisely, i.e., erroneous classifications are successfully avoided. However, the recall of such a system will suffer whenever tweets labelled as unknown turn out to be fake (the target class). Similarly, for the sub-task, given the instances of (image + label), we use the same IR measures to quantify the performance of the approach, where the target class is tampered.
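The sketch below shows one way these measures can be computed for the main task, with fake as the target class. Treating unknown as simply "not predicted fake" is an assumption consistent with the behaviour described above (it does not hurt precision, but it lowers recall whenever the unknown item is actually fake); the official scoring may handle the unknown label differently.

```python
# Precision/Recall/F-score with 'fake' as the target class.
# 'unknown' predictions are treated as not-fake (an assumption): they never
# count as false positives, but they miss truly fake items and reduce recall.

def evaluate(gold_labels, predicted_labels):
    """gold_labels: 'fake'/'real'; predicted_labels: 'fake'/'real'/'unknown'."""
    pairs = list(zip(gold_labels, predicted_labels))
    tp = sum(1 for g, p in pairs if g == "fake" and p == "fake")
    fp = sum(1 for g, p in pairs if g != "fake" and p == "fake")
    fn = sum(1 for g, p in pairs if g == "fake" and p != "fake")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score


# Example: one correct 'fake', one missed 'fake' that was labelled 'unknown'.
print(evaluate(["fake", "fake", "real"], ["fake", "unknown", "real"]))
# -> (1.0, 0.5, 0.666...)
```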
5. ACKNOWLEDGEMENTS
This work is supported by the REVEAL and InVID projects, partially funded by the European Commission (FP7-610928 and H2020-687786 respectively).

6. REFERENCES
[1] C. Boididou, K. Andreadou, S. Papadopoulos, D.-T. Dang-Nguyen, G. Boato, M. Riegler, and Y. Kompatsiaris. Verifying multimedia use at MediaEval 2015. In Working Notes Proceedings of the MediaEval 2015 Workshop, 2015.
[2] C. Boididou, S. Papadopoulos, Y. Kompatsiaris, S. Schifferes, and N. Newman. Challenges of computational verification in social multimedia. In Proceedings of the 23rd International Conference on World Wide Web, pages 743–748. ACM, 2014.
[3] C. Castillo, M. Mendoza, and B. Poblete. Information credibility on Twitter. In Proceedings of the 20th International Conference on World Wide Web, pages 675–684. ACM, 2011.
[4] V. Conotter, D.-T. Dang-Nguyen, M. Riegler, G. Boato, and M. Larson. A crowdsourced data set of edited images online. In Proceedings of the 2014 International ACM Workshop on Crowdsourcing for Multimedia, CrowdMM '14, pages 49–52, New York, NY, USA, 2014. ACM.
[5] A. Gupta, P. Kumaraguru, C. Castillo, and P. Meier. TweetCred: Real-time credibility assessment of content on Twitter. In International Conference on Social Informatics, pages 228–243. Springer, 2014.
[6] A. Gupta, H. Lamba, P. Kumaraguru, and A. Joshi. Faking Sandy: characterizing and identifying fake images on Twitter during Hurricane Sandy. In Proceedings of the 22nd International Conference on World Wide Web Companion, pages 729–736, 2013.
[7] S. E. Middleton, L. Middleton, and S. Modafferi. Real-time crisis mapping of natural disasters using social media. IEEE Intelligent Systems, 29(2):9–17, 2014.
[8] J. Ratkiewicz, M. Conover, M. Meiss, B. Gonçalves, S. Patil, A. Flammini, and F. Menczer. Truthy: mapping the spread of astroturf in microblog streams. In Proceedings of the 20th International Conference Companion on World Wide Web, pages 249–252. ACM, 2011.
[9] C. Shao, G. L. Ciampaglia, A. Flammini, and F. Menczer. Hoaxy: A platform for tracking online misinformation. In Proceedings of the 25th International Conference Companion on World Wide Web, pages 745–750. International World Wide Web Conferences Steering Committee, 2016.
[10] E. Spyromitros-Xioufis, S. Papadopoulos, I. Kompatsiaris, G. Tsoumakas, and I. Vlahavas. A comprehensive study over VLAD and Product Quantization in large-scale image retrieval. IEEE Transactions on Multimedia, 16(6):1713–1728, 2014.