The CERTH-UNITN Participation @ Verifying Multimedia Use 2015

Christina Boididou (1), Symeon Papadopoulos (1), Duc-Tien Dang-Nguyen (2), Giulia Boato (2), and Yiannis Kompatsiaris (1)

(1) Information Technologies Institute, CERTH, Greece. [boididou,papadop,ikom]@iti.gr
(2) University of Trento, Italy. [dangnguyen,boato]@disi.unitn.it

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

ABSTRACT
We propose an approach that predicts whether a tweet accompanied by multimedia content (image/video) is trustworthy or deceptive. We test different combinations of quality- and trust-oriented features (tweet-based, user-based and forensics) in tandem with a standard classification scheme and an agreement-retraining technique, with the goal of predicting the most likely label (fake or real) for each tweet. The experiments carried out on the Verifying Multimedia Use dataset show that the best performance is achieved when using all available features in combination with the agreement-retraining method.

1. INTRODUCTION
Since social media have gained momentum over the years as a fast and real-time means of sharing news, a huge amount of information constantly flows through them, quickly reaching massive numbers of readers. Such content can easily become viral and affect public opinion and sentiment. This has motivated a number of malicious efforts to spread misleading content, highlighting the need for fast verification. In this setting, the goal of the Verifying Multimedia Use task is to automatically predict whether a tweet that shares multimedia content is misleading (referred to as fake) or trustworthy (real) [1]. To this end, we make use of the tweet text content, a set of tweet- and user-based features, and multimedia forensics features for the images embedded in the tweet.

In our work, we present an extension of our original approach [2], combining different sets of the aforementioned features. The conducted experiments include plain classification models and an agreement-retraining method that uses part of its own predictions as new training samples, with the goal of adapting to the new event. In the next sections, we present the adopted methodology in detail.

2. SYSTEM OVERVIEW

2.1 Features
The approach uses three types of features: a) tweet-based (TB), which make use of information coming from the tweet and its metadata, b) user-based (UB), which are computed using information and metadata about the user posting (or retweeting) the tweet, and c) multimedia forensics (FOR) features, which are computed from the image that accompanies the tweet. For the first two sets we test two variants: i) baseline (base), which corresponds to the features shared by the organisers, and ii) extended (ext), which includes a few new features. The forensics features include both the ones distributed by the organisers and some additional ones. Table 1 summarises the feature sets.

Table 1: List of features used in the experiments.

  Feature set   Description
  TB-base       Baseline tweet-based
  TB-ext        Extended tweet-based
  UB-base       Baseline user-based
  UB-ext        Extended user-based
  FOR           Forensic features

TB-ext: We extract additional features based on the tweet text, such as the presence of a word, symbol or external link. We also use language-specific binary features that correspond to the presence of specific terms; for languages in which we cannot define such terms, we consider the values of these features missing. We perform language detection with a publicly available library (https://code.google.com/p/language-detection/). We add a feature for the number of slang words in a text, using slang lists in English (http://onlineslangdictionary.com/word-list/0-a/) and Spanish (http://www.languagerealm.com/spanish/spanishslang.php). For the number of nouns, we use the Stanford parser (http://nlp.stanford.edu/software/lex-parser.shtml) to assign parts of speech to each word (supported only in English). For the readability of the text, we use the Flesch Reading Ease method (http://simple.wikipedia.org/wiki/Flesch_Reading_Ease), which computes the complexity of a piece of text as a score in the interval [0, 100] (0: hard to read, 100: easy to read).
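The following minimal Python sketch illustrates how the readability and slang features can be computed; the slang list and the vowel-group syllable heuristic are simplified stand-ins for the dictionaries and tools referenced above, not the exact implementation used in our runs.

```python
import re

# Toy slang list; the actual feature uses published English and Spanish
# slang dictionaries (see the URLs above).
SLANG = {"lol", "omg", "wtf", "smh"}

def count_syllables(word):
    # Crude heuristic: one syllable per group of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    # Flesch Reading Ease:
    #   206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    score = (206.835 - 1.015 * (len(words) / sentences)
             - 84.6 * (syllables / len(words)))
    # Clamp to the [0, 100] interval reported above.
    return min(100.0, max(0.0, score))

def num_slang_words(text):
    return sum(1 for w in re.findall(r"[A-Za-z']+", text.lower()) if w in SLANG)

tweet = "OMG look at this photo of a shark swimming on the highway!"
print(flesch_reading_ease(tweet), num_slang_words(tweet))
```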
UB-ext: We extract user-specific features, such as the number of media items shared, the account age, and others that refer to the information exposed by the profile. For example, we check whether the user declares his/her geographic location and whether that location can be matched to a city name from the Geonames dataset (http://download.geonames.org/export/dump/cities1000.zip).

Next, for both the TB and UB sets, we adopt trust-oriented features for the links shared, through the tweet itself (TB) or the user profile (UB). The WOT metric (https://www.mywot.com/) is a score indicating how trustworthy a website is, based on reputation ratings by Web users. We also include the in-degree and harmonic centralities, rankings computed on the graph formed by the links of the Web (http://wwwranking.webdatacommons.org/more.html). Trust analysis of the links is also performed using four Web metrics provided by the Alexa API (http://data.alexa.com/data?cli=10&dat=snbamz&url=google.gr).

FOR: For each image, the additional forensics features are extracted from the provided BAG feature, based on the maps obtained from AJPG and NAJPG. First, a binary map is created by thresholding the AJPG map (we use 0.6 as the threshold); the largest region is then selected as the object and the rest of the map is considered as the background. For both regions, seven descriptive statistics (maximum, minimum, mean, median, most frequent value, standard deviation, and variance) are computed from the BAG values and concatenated into a 14-dimensional vector. We apply the same process to the NAJPG map to obtain a second feature vector.
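The statistics-extraction step can be sketched as follows, assuming the AJPG map and the BAG values are available as equally-shaped NumPy arrays; the function and variable names are illustrative, not those of the provided feature code.

```python
import numpy as np
from scipy import ndimage

def region_stats(values):
    # Seven descriptive statistics: maximum, minimum, mean, median,
    # most frequent value, standard deviation, and variance.
    uniq, counts = np.unique(values, return_counts=True)
    mode = uniq[np.argmax(counts)]
    return np.array([values.max(), values.min(), values.mean(),
                     np.median(values), mode, values.std(), values.var()])

def forensics_vector(ajpg_map, bag_values, threshold=0.6):
    # Binarise the AJPG map at the 0.6 threshold used above.
    binary = ajpg_map > threshold
    # Keep the largest connected region as the object; the rest is background.
    labels, n = ndimage.label(binary)
    if n == 0:
        return np.zeros(14)
    sizes = ndimage.sum(binary, labels, range(1, n + 1))
    object_mask = labels == (np.argmax(sizes) + 1)
    # Seven statistics of the BAG values inside and outside the object,
    # concatenated into a 14-dimensional vector.
    return np.concatenate([region_stats(bag_values[object_mask]),
                           region_stats(bag_values[~object_mask])])
```

The same function applied to the NAJPG map yields the second 14-dimensional vector.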
2.2 Agreement-based retraining method

[Figure 1: Overview of agreement-based retraining.]

The main extension of this system compared to [2] is an agreement-based retraining step that aims to improve the prediction accuracy on unseen events. It is motivated by a similar approach implemented in [3] for the problem of polarity classification. Figure 1 illustrates the adopted process. In step (a), we build two classifiers, CL1 and CL2, on the training set, each using a different type of features, and we combine their outputs in a Semi-Supervised Learning (SSL) fashion. We compare the two predictions for each sample of the test set and, depending on their agreement, divide the test set into two subsets: the agreed and the disagreed samples. These two subsets are treated differently by the classification framework. Assuming that the agreed predictions are correct with high likelihood, we use them as training samples to build a new classifier for classifying the disagreed samples. To this end, in step (b), we add the agreed samples to the training set of the better performing of the two initial models, CL1 and CL2 (compared on the basis of their cross-validation performance on the training set). The goal of this method is to retrain the initial model and make it adaptable to the specific characteristics of the new event. In this way, the model can predict more accurately the labels of the samples on which CL1 and CL2 did not agree in the first step.

2.3 Bagging

Due to the unequal number of fake and real tweets, building a single class-balanced model would exploit only part of the data. In order to take advantage of the whole training dataset, we use bagging, which tends to improve accuracy by aggregating the predictions of numerous predictors. Bagging creates m different subsets of the training set, each including an equal number of samples from each class (some samples may appear in multiple subsets), leading to the creation of m instances of the CL1 and CL2 classifiers (m = 9). The final prediction for each test sample is the majority vote of the m predictions.
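The following condensed sketch illustrates how agreement-based retraining and bagging fit together, using scikit-learn's RandomForestClassifier as a stand-in for the Weka Random Forest used in the submitted runs (Section 3); the data layout and helper names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def bagged_predict(X_train, y_train, X_test, m=9, seed=0):
    # Bagging: m class-balanced subsets (samples may repeat across subsets);
    # the final label is the majority vote of the m predictions.
    rng = np.random.default_rng(seed)
    classes = np.unique(y_train)
    n_per_class = min((y_train == c).sum() for c in classes)
    votes = np.zeros((m, len(X_test)))
    for i in range(m):
        idx = np.concatenate([
            rng.choice(np.flatnonzero(y_train == c), n_per_class, replace=False)
            for c in classes])
        clf = RandomForestClassifier(random_state=i)
        clf.fit(X_train[idx], y_train[idx])
        votes[i] = clf.predict(X_test)
    return (votes.mean(axis=0) > 0.5).astype(int)  # majority vote, labels in {0, 1}

def agreement_retraining(X1_tr, X2_tr, y_tr, X1_te, X2_te):
    # Step (a): two classifiers on different feature sets; split the test set
    # into agreed and disagreed samples.
    p1 = bagged_predict(X1_tr, y_tr, X1_te)
    p2 = bagged_predict(X2_tr, y_tr, X2_te)
    agreed = p1 == p2
    # Select the stronger initial model via cross-validation on the training set.
    s1 = cross_val_score(RandomForestClassifier(), X1_tr, y_tr).mean()
    s2 = cross_val_score(RandomForestClassifier(), X2_tr, y_tr).mean()
    X_tr, X_te = (X1_tr, X1_te) if s1 >= s2 else (X2_tr, X2_te)
    # Step (b): add the agreed test samples, with their predicted labels, to the
    # training set, retrain, and classify the disagreed samples.
    X_aug = np.vstack([X_tr, X_te[agreed]])
    y_aug = np.concatenate([y_tr, p1[agreed]])
    return np.where(agreed, p1, bagged_predict(X_aug, y_aug, X_te))
```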
3. SUBMITTED RUNS AND RESULTS

The five submitted runs explore different combinations of features and the use of a standard supervised learning scheme (SL) versus the newly proposed agreement-based retraining (SSL-AR). The specific run configurations are given in Table 2.

Table 2: Description of runs. For the SSL-AR cases, the two sets of features used for building the two classifiers are listed; for example, in RUN-3 the TB-base + FOR features are used for CL1 and the UB-base features for CL2.

  Run    Learning  Features
  RUN-1  SL        TB-base
  RUN-2  SL        TB-base + FOR
  RUN-3  SSL-AR    (TB-base + FOR) + UB-base
  RUN-4  SL        TB-ext + UB-ext + FOR
  RUN-5  SSL-AR    (TB-ext + FOR) + UB-ext

RUN-1, RUN-2 and RUN-4 are built using a plain classification model. RUN-3 and RUN-5 are built with the agreement-based retraining technique (Figure 1), in which CL1 and CL2 use the sets of features specified in Table 2. All models use a Random Forest classifier from the Weka implementation.

Table 3: Results.

  Run    Recall  Precision  F-score
  RUN-1  0.794   0.733      0.762
  RUN-2  0.749   0.994      0.854
  RUN-3  0.922   0.736      0.819
  RUN-4  0.798   0.860      0.828
  RUN-5  0.969   0.861      0.911

Table 3 presents the performance of each run. In terms of F-score, the primary evaluation metric of the task, RUN-5 achieved the best score, using the ext and FOR features with the SSL-AR technique. We also observe that RUN-2, in which the FOR features are added, performed considerably better than RUN-1, which uses only the TB-base features. Comparing RUN-4 and RUN-5, one may observe the considerable performance benefit stemming from the use of the SSL-AR approach, as it is the only difference between the two runs (the same sets of features are used). Additionally, the contribution of the ext features is notable, as RUN-5 (ext) performs better than RUN-3 (base).

4. ACKNOWLEDGEMENTS

This work is supported by the REVEAL project, partially funded by the European Commission (FP7-610928).

5. REFERENCES

[1] C. Boididou, K. Andreadou, S. Papadopoulos, D.-T. Dang-Nguyen, G. Boato, M. Riegler, and Y. Kompatsiaris. Verifying multimedia use at MediaEval 2015. In MediaEval 2015 Workshop, Wurzen, Germany, 2015.

[2] C. Boididou, S. Papadopoulos, Y. Kompatsiaris, S. Schifferes, and N. Newman. Challenges of computational verification in social multimedia. In Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web, pages 743–748, 2014.

[3] A. Tsakalidis, S. Papadopoulos, and I. Kompatsiaris. An ensemble model for cross-domain polarity classification on Twitter. In Web Information Systems Engineering – WISE 2014, pages 168–177. Springer, 2014.