MediaEval 2016: A multimodal system for the Verifying Multimedia Use task

Cédric Maigrot1, Vincent Claveau2, Ewa Kijak1, and Ronan Sicre2
1,2 IRISA, 1 Univ. of Rennes 1, 2 CNRS, Rennes, France
Cedric.Maigrot@irisa.fr, Vincent.Claveau@irisa.fr, Ewa.Kijak@irisa.fr, Ronan.Sicre@irisa.fr

Copyright is held by the author/owner(s).
MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands

ABSTRACT
This paper presents a multi-modal hoax detection system composed of text, source, and image analysis. As hoaxes can be very diverse, we analyze several modalities to better detect them. This system is applied in the context of the Verifying Multimedia Use task of MediaEval 2016. Experiments show the performance of each modality taken separately as well as of their combination.

1. INTRODUCTION
Social Networks (SN) have been of increasing importance in our daily lives. When studying SN, one interesting aspect is the propagation of publications, e.g. news, facts, or any information considered important and shared across communities. A major characteristic of this propagation is its speed. However, users rarely verify the veracity of the shared information. Moreover, information that has been verified as false is often still shared, and its spread cannot be contained [11, 9].

Therefore, we are studying how to directly verify the veracity of any information. Our goal is to create systems that can inform users before they share false information. Consequently, we are extremely interested in the Verifying Multimedia Use task of MediaEval 2016, which aims at classifying Twitter publications to detect fake information [2]. Considering the nature of tweet data, diverse information coming from the message and its meta-data can be extracted. We explored in this work the predictive power of various features. We propose different approaches based on text information, source credibility, and image content.

2. APPROACHES
We propose four approaches: text-based (run-T), source-based (run-S), image-based (run-I), and the combination of the three approaches (run-C). For all of these methods the prediction is first made at the image level, then propagated to the tweets that contain the image, according to the following rule: the tweet is predicted as real if all the associated images are classified as real; if at least one of the images is classified as fake, the tweet is considered as fake.

2.1 Text-based nearest neighbors prediction
This approach exploits the textual contents of the tweets and does not rely on any external data apart from the training set. As previously explained, a tweet is classified based on the images it contains; an image is described by the concatenated texts of every tweet containing this image. The idea here is to capture similar comments between an unknown image and an image from the training set (such as "It's photoshopped") or similar genres of comments (presence of smileys, slang/journalistic language...).

Let us note I_q such a description for an unknown image, and {I_d_i} the training set of image descriptions. The class of I_q is decided based on the classes of the k most similar image descriptions in {I_d_i}. In practice, to compute the similarities, we use a state-of-the-art information retrieval approach called Okapi-BM25 [5]. A language-detection system (based on the Google Translate service, https://translate.google.com/) is used to detect non-English tweets, which are then translated into English with Google Translate. As a further preprocessing step, we use orthographic and smiley normalization tools developed in-house. The parameter k was set to 1 by cross-validation.

2.2 Trusted sources prediction
This approach, already used by [4], is conceptually the simplest but relies on external (static) knowledge. As for the previous run, the prediction is made at the image level, and an image is represented as the concatenation of every tweet (translated into English if needed) in which it appears. The prediction is made by detecting trustworthy sources in the image description. Two types of sources are searched for: 1) a known news-related organization; 2) an explicit citation of the source of the image. For the first type, we gathered lists of press agencies in the world, newspapers (mostly French and English ones), and news TV networks (French and English ones). For the second type, we manually defined some patterns, like "photographed by + Name", "captured by + Name", etc. Finally, an image is classified as fake by default, unless a trustworthy source is found in its text description.

2.3 Image retrieval prediction
In this approach only the image content is used to provide a prediction, at the image level. Note that some tweets do not contain images but videos; such tweets are thus labeled as unknown.

Images from the Verifying Multimedia Use task are classified using external information. We perform image retrieval, which consists in querying a database of known fake/real images to discover already known fake images. The database is built by collecting images from 5 specialized websites, namely www.hoaxbuster.com/, hoax-busters.org, urbanlegends.about.com, snopes.com, and www.hoax-slayer.com/. The set contains around 500 original images and 7500 fake samples.

Generic image descriptors are computed using very deep Convolutional Neural Networks (CNN) [8]. First, we apply the convolutional layers [10] of the network to images scaled to a standard size of 544 × 544. Then, the first two fully connected layers are kernelized and applied to the output feature map, producing a new 11 × 11 × 4096 dimensional feature map. Finally, average pooling followed by l2-normalization is performed, giving a 4096-dimensional descriptor [3, 7, 6]. Once all image descriptors are obtained, cosine similarity is computed between the query and all images from the database. If the highest similarity is higher than a threshold of 0.9 (set on the training dataset), then the query receives the label of the most similar image. Otherwise, the query is labeled as unknown.

2.4 Combination
This last approach aims at combining the three preceding ones in a late fusion process. Thus, for a given image, it takes as input the predictions given by the three systems described above. As before, the final prediction on the image is then propagated to the tweets containing it.

Instead of using a simple fusion process (for instance, a majority vote), we try to automatically build a fusion model fine-tuned to the task. We thus use a machine learning algorithm, namely boosting (adaboost.MH) over decision trees [1], which takes as input the predictions of the three previous approaches, and also the scores associated with these predictions (for run-T and run-I). The parameters of the machine learning algorithm are set by cross-validation on the training data: the number of iterations for boosting is 500 and the depth of the trees is 3. Finally, the fusion model is learned on the whole training set; it is then used on the test set images.

3. RESULTS
The four approaches are applied to the MediaEval 2016 test set and results are reported in Figure 1. The test set is composed of 2228 Twitter messages associated with 130 images. Moreover, 65% and 26% of the tweets of the development and test sets respectively are associated with a single event.

[Figure 1: bar chart. x-axis: Approaches (run-T, run-I, run-S, run-C); y-axis: Score in %. Bar value labels (mapping to individual bars not recoverable): 94.63%, 92.42%, 92.23%, 91.22%, 90.3%, 82.47%, 75.57%, 75.25%, 63.98%, 49.18%, 40.25%, 34.07%.]
Figure 1: Recall (red), precision (green) and F-Measure (blue) scores of the fake class on the test set.

We observe that the approach based on the source trustworthiness level (run-S) outperforms the text-based approach (run-T), which outperforms the image-based approach (run-I). We can see that the text-based approach competes with the source-based approach in terms of recall. It means that the text approach tends to classify every tweet as fake. This may be explained by the fact that the training set is unbalanced, as it contains three times more fake examples than real ones.

We note that the prediction based on the image approach has several drawbacks and performs poorly. In particular, the precision is low compared to what we estimated on the training set. Several explanations can be given. First, only 86% of the test tweets are associated with one or more images (the rest are associated with video content), meaning that the image approach is evaluated only on this portion of the dataset. Therefore, recall and F-score are directly impacted. Secondly, the reference database that we built is small and unbalanced, resulting in a high number of unknown labels in the predictions. Thirdly, the database does not always contain the original images, and a forged image that differs from its original version only by small modifications can be considered as similar to it. Finally, images shared on SN often present specific editing characteristics, such as visible added watermarks like "fake", "rumor" or "real", circles, text annotations, etc. Such edits impair the similarity computation between images.

Concerning run-C, we note that the combination using late fusion does not offer any gain, and performs even worse than run-S alone. This result is disappointing, as it differs from what we evaluated on the training set by cross-validation. It may be explained by an overfitting problem when learning the fusion model, and by the lower precision (compared to the one estimated on the training set) obtained by run-I, which is used as input.

4. CONCLUSION
A multi-modal hoax detection system based on text, source, and image analysis is presented. This system uses different categories of external knowledge: static and general ones, such as press agency lists, and dynamic and dedicated ones, such as hoax listing websites. Our evaluation confirms previous results on the good performance of the source analysis; conversely, the image approach shows poor results. Yet, we still consider this latter approach as promising; several improvements are foreseen for both the database and the content comparison. Finally, multimodality remains a challenge, as integrating different sources of knowledge may result in performance loss.

5. ACKNOWLEDGEMENTS
This work is partly supported by the Direction Générale de l'Armement, France (DGA).

References
[1] Nathalie Camelin, Antoine Laurent, and Christian Raymond. Boosting bonsai trees for efficient features combination: application to speaker role identification. In Proc. of InterSpeech, 2014.
[2] Christina Boididou, Symeon Papadopoulos, Duc-Tien Dang-Nguyen, Giulia Boato, Michael Riegler, Stuart E. Middleton, Katerina Andreadou, and Yiannis Kompatsiaris. Verifying multimedia use at MediaEval 2016. In Working Notes Proceedings of the MediaEval 2016 Workshop, 2016.
[3] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, and Andrea Vedaldi. Deep filter banks for texture recognition, description, and segmentation. International Journal of Computer Vision, 118(1):65–94, 2016.
[4] Stuart Middleton. Extracting attributed verification and debunking reports from social media: MediaEval-2015 trust and credibility analysis of image and video. 2015.
[5] Stephen E. Robertson, Steve Walker, and Micheline Hancock-Beaulieu. Okapi at TREC-7: Automatic Ad Hoc, Filtering, VLC and Interactive. In Proc. of the 7th Text Retrieval Conference, TREC-7, pages 199–210, 1998.
[6] Ronan Sicre and Hervé Jégou. Memory vectors for particular object retrieval with multiple queries. In Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, pages 479–482. ACM, 2015.
[7] Ronan Sicre and Frédéric Jurie. Discriminative part model for visual recognition. Computer Vision and Image Understanding, 141:28–37, 2015.
[8] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[9] Hokky Situngkir. Spread of hoax in social media. 2011.
[10] Giorgos Tolias, Ronan Sicre, and Hervé Jégou. Particular object retrieval with integral max-pooling of CNN activations. In ICLR, 2016.
[11] Jaewon Yang and Jure Leskovec. Modeling information diffusion in implicit networks. In 2010 IEEE International Conference on Data Mining, pages 599–608. IEEE, 2010.
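
As a supplementary illustration, the two decision rules stated in Sections 2 and 2.3 (labeling a query image from its most similar database image when the cosine similarity exceeds 0.9, and propagating image labels to a tweet) can be sketched as follows. This is only a minimal sketch and not the system's actual code: the function names retrieval_label and tweet_label are hypothetical, toy 3-dimensional vectors stand in for the 4096-dimensional CNN descriptors, and returning "unknown" for a tweet that has no fake image but at least one unlabeled image (or no image at all) is an assumption, since the paper does not specify that case.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean norm (returned unchanged if all-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def retrieval_label(query, database, threshold=0.9):
    """Label a query descriptor from its most similar database entry.

    database is a list of (descriptor, label) pairs, label in {"fake", "real"}.
    The 0.9 default threshold is the value reported in Section 2.3.
    """
    q = l2_normalize(query)
    best_sim, best_label = -1.0, "unknown"
    for descriptor, label in database:
        d = l2_normalize(descriptor)
        # Cosine similarity reduces to a dot product on unit-norm vectors.
        sim = sum(a * b for a, b in zip(q, d))
        if sim > best_sim:
            best_sim, best_label = sim, label
    return best_label if best_sim > threshold else "unknown"


def tweet_label(image_labels):
    """Propagate image-level labels to a tweet (rule of Section 2)."""
    if "fake" in image_labels:
        return "fake"
    if image_labels and all(label == "real" for label in image_labels):
        return "real"
    # Assumption: anything else (video-only tweet, unlabeled image) is unknown.
    return "unknown"


# Toy database with two reference images (hypothetical descriptors).
toy_db = [([1.0, 0.0, 0.0], "fake"), ([0.0, 1.0, 0.0], "real")]
near_fake = retrieval_label([0.99, 0.05, 0.0], toy_db)
ambiguous = retrieval_label([1.0, 1.0, 0.0], toy_db)
print(near_fake, ambiguous, tweet_label([near_fake, "real"]))
```

In this toy setting the second query lies at equal distance from both reference images (cosine similarity of about 0.71 to each), so it falls below the threshold and is left unlabeled, which mirrors the high number of unknown predictions discussed in Section 3.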