<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The CERTH-UNITN Participation @ Verifying Multimedia Use 2015</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christina Boididou</string-name>
          <email>boididou@iti.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Symeon Papadopoulos</string-name>
          <email>papadop@iti.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Duc-Tien Dang-Nguyen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giulia Boato</string-name>
          <email>boato@disi.unitn.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yiannis Kompatsiaris</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Technologies Institute</institution>
          ,
          <addr-line>CERTH</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Trento</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>We propose an approach that predicts whether a tweet, which is accompanied by multimedia content (image/video), is trustworthy or deceptive. We test different combinations of quality- and trust-oriented features (tweet-based, user-based and forensics) in tandem with a standard classification and an agreement-retraining technique, with the goal of predicting the most likely label (fake or real) for each tweet. The experiments carried out on the Verifying Multimedia Use dataset show that the best performance is achieved when using all available features in combination with the agreement-retraining method.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Since social media have gained momentum over the years as a fast and real-time means of sharing news, a huge amount of information constantly flows through them, quickly reaching massive numbers of readers. Content can thus easily become viral and affect public opinion and sentiment. This has motivated a number of malicious efforts to spread misleading content, highlighting the need for fast verification. In this setting, the goal of the Verifying Multimedia Use task is to automatically predict whether a tweet that shares multimedia content is misleading (referred to as fake) or trustworthy (real) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. To this end, we make use of the tweet text content, a set of tweet- and user-based features, and multimedia forensic features for the images embedded in the tweet.
      </p>
      <p>
        In our work, we present an extension of our original approach [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], combining different sets of the aforementioned features. The conducted experiments include plain classification models and an agreement-retraining method that uses part of its own predictions as new training samples, with the goal of adapting to the new event. In the next sections, we present the adopted methodology in detail.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. SYSTEM OVERVIEW</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Features</title>
      <p>The approach uses three types of features: a) tweet-based (TB), which make use of information coming from the tweet and its metadata; b) user-based (UB), which are computed using information and metadata about the user posting (or retweeting) the tweet; c) multimedia forensics (FOR) features, which are computed based on the image that accompanies the tweet. We test two variants of the first two sets of features: i) baseline (base), which correspond to the features shared by the organisers, and ii) extended (ext), which include a few new features. The forensics features include both the ones distributed by the organisers and some additional ones. Table 1 lists the feature sets used in the experiments.</p>
      <table-wrap id="tbl1">
        <label>Table 1</label>
        <caption><p>List of features used in the experiments.</p></caption>
        <table>
          <thead><tr><th>Feature set</th><th>Description</th></tr></thead>
          <tbody>
            <tr><td>TB-base</td><td>Baseline tweet-based</td></tr>
            <tr><td>TB-ext</td><td>Extended tweet-based</td></tr>
            <tr><td>UB-base</td><td>Baseline user-based</td></tr>
            <tr><td>UB-ext</td><td>Extended user-based</td></tr>
            <tr><td>FOR</td><td>Forensic features</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>TB-ext: We extract additional features based on the tweet text, such as the presence of a word, symbol or external link. We also use language-specific binary features that correspond to the presence of specific terms; for languages in which we cannot define such terms, we consider the values of these features missing. We perform language detection with a publicly available library1. We add a feature for the number of slang words in a text, using slang lists in English2 and Spanish3. For the number of nouns, we use the Stanford parser4 to assign parts of speech to each word (supported only in English). For the readability of the text, we use the Flesch Reading Ease method5, which computes the complexity of a piece of text as a score in the interval [0, 100] (0: hard-to-read, 100: easy-to-read).</p>
      <p>UB-ext: We extract user-specific features such as the number of media items, the account age, and others that refer to the information that the profile shares. For example, we check whether the user declares his/her geographic location and whether the location can be matched to a city name from the Geonames dataset6.</p>
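      <p>The readability feature described above can be sketched as follows; this is a minimal illustration assuming a naive vowel-group syllable counter, not the exact implementation used in the system:</p>

```python
import re

def count_syllables(word):
    # Naive heuristic: count groups of consecutive vowels (incl. 'y').
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    # Flesch Reading Ease:
    #   206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

score = flesch_reading_ease("The cat sat on the mat. It was happy.")
```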
      <p>Next, for both the TB and UB features, we adopt trust-oriented features for the links shared through the tweet itself (TB) or the user profile (UB). The WOT metric7 is a score indicating how trustworthy a website is, based on reputation ratings by Web users. We also include the in-degree and harmonic centralities, rankings computed from the graph formed by the links of the Web8. Trust analysis of the links is also performed using four Web metrics provided by the Alexa API9.</p>
      <p>FOR: For each image, the additional forensics features are extracted from the provided BAG feature, based on the maps obtained from AJPG and NAJPG. First, a binary map is created by thresholding the AJPG map (we use 0.6 as the threshold); the largest region is then selected as the object and the rest of the map is considered the background. For both regions, seven descriptive statistics (maximum, minimum, mean, median, most frequent value, standard deviation, and variance) are computed from the BAG values and concatenated into a 14-dimensional vector. We apply the same process to the NAJPG map to obtain a second feature vector.</p>
      <p>1https://code.google.com/p/language-detection/
2http://onlineslangdictionary.com/word-list/0-a/
3http://www.languagerealm.com/spanish/spanishslang.php
4http://nlp.stanford.edu/software/lex-parser.shtml
5http://simple.wikipedia.org/wiki/Flesch_Reading_Ease
6http://download.geonames.org/export/dump/cities1000.zip</p>
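      <p>A sketch of the 14-dimensional forensic descriptor, assuming NumPy arrays for the BAG and AJPG maps; function names are illustrative, and the largest-connected-region selection is simplified to the whole thresholded area:</p>

```python
import numpy as np

def region_stats(values):
    # Seven descriptive statistics of the BAG values in one region:
    # max, min, mean, median, most frequent value, std, variance.
    vals, counts = np.unique(values, return_counts=True)
    mode = vals[np.argmax(counts)]
    return np.array([values.max(), values.min(), values.mean(),
                     np.median(values), mode, values.std(), values.var()])

def forensic_vector(bag, ajpg, threshold=0.6):
    # Binarise the AJPG map; treat the thresholded area as "object"
    # and the rest as "background", then concatenate both regions'
    # statistics into a single 14-dimensional vector.
    mask = ajpg > threshold
    return np.concatenate([region_stats(bag[mask]),
                           region_stats(bag[~mask])])

rng = np.random.default_rng(0)
bag = rng.random((64, 64))   # stand-in BAG map
ajpg = rng.random((64, 64))  # stand-in AJPG map
vec = forensic_vector(bag, ajpg)
```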
    </sec>
    <sec id="sec-4">
      <title>2.2 Agreement-based retraining method</title>
      <p>
        The main extension of this system compared to [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is an agreement-based retraining step that aims to improve prediction accuracy on unseen events. It is motivated by a similar approach implemented in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] (for the problem of polarity classification). Figure 1 illustrates the adopted process. In step (a), we build two classifiers CL1 and CL2 on the training set, each using different types of features, and we combine their outputs in a Semi-Supervised Learning (SSL) fashion. We compare the two predictions for each sample of the test set and, depending on their agreement, divide the test set into two subsets, the agreed and the disagreed samples. These two subsets are treated differently by the classification framework.
      </p>
      <p>Assuming that the agreed predictions are correct with high likelihood, we use them as training samples to build a new classifier for classifying the disagreed samples. To this end, in step (b), we add the agreed samples to the better performing of the two initial models CL1, CL2 (compared on the basis of their cross-validation performance on the training set). The goal of this method is to retrain the initial model and adapt it to the specific characteristics of the new event. In that way, the model can more accurately predict the labels of the samples on which CL1 and CL2 did not agree in the first step.</p>
      <p>7https://www.mywot.com/
8http://wwwranking.webdatacommons.org/more.html
9http://data.alexa.com/data?cli=10&amp;dat=snbamz&amp;url=google.gr</p>
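      <p>Steps (a) and (b) can be sketched as follows. The paper uses Weka's Random Forest; this sketch substitutes scikit-learn's RandomForestClassifier, assumes the first model is the better performing one, and uses illustrative function and variable names:</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def agreement_retrain(X1_tr, X2_tr, y_tr, X1_te, X2_te):
    # Step (a): two classifiers trained on different feature sets.
    cl1 = RandomForestClassifier(random_state=0).fit(X1_tr, y_tr)
    cl2 = RandomForestClassifier(random_state=0).fit(X2_tr, y_tr)
    p1, p2 = cl1.predict(X1_te), cl2.predict(X2_te)
    agreed = p1 == p2

    # Step (b): retrain the (assumed better) first model, augmenting its
    # training set with the agreed test samples and their predicted
    # labels, then re-classify only the disagreed samples.
    X_aug = np.vstack([X1_tr, X1_te[agreed]])
    y_aug = np.concatenate([y_tr, p1[agreed]])
    cl_new = RandomForestClassifier(random_state=0).fit(X_aug, y_aug)

    final = p1.copy()
    if (~agreed).any():
        final[~agreed] = cl_new.predict(X1_te[~agreed])
    return final

# Toy demonstration data (two correlated feature views).
rng = np.random.default_rng(0)
X1_tr = rng.normal(size=(60, 4))
y_tr = (X1_tr[:, 0] > 0).astype(int)
X2_tr = X1_tr + rng.normal(scale=0.1, size=(60, 4))
X1_te = rng.normal(size=(20, 4))
X2_te = X1_te + rng.normal(scale=0.1, size=(20, 4))
preds = agreement_retrain(X1_tr, X2_tr, y_tr, X1_te, X2_te)
```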
    </sec>
    <sec id="sec-5">
      <title>2.3 Bagging</title>
      <p>Due to the unequal number of fake and real tweets, only a part of the data would be exploited when building a single model. In order to take advantage of the whole training dataset, we use bagging, which tends to improve the accuracy of the method, as it produces predictions using the average result of numerous predictors. Bagging creates m different subsets of the training set, each including an equal number of samples from each class (some samples may appear in multiple subsets), leading to the creation of m instances of the CL1 and CL2 classifiers (m = 9). The final prediction for each test sample is calculated as the majority vote of the m predictions.</p>
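      <p>A minimal sketch of this balanced bagging scheme with majority voting, again substituting scikit-learn's RandomForestClassifier for Weka's; names and data are illustrative:</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def balanced_bagging(X, y, X_test, m=9, seed=0):
    # Build m class-balanced subsets (equal fake/real counts, sampled
    # without replacement within each subset), train one classifier per
    # subset, and majority-vote the m predictions.
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    n = min(int((y == c).sum()) for c in classes)  # minority-class size
    votes = []
    for _ in range(m):
        idx = np.concatenate([rng.choice(np.flatnonzero(y == c), n,
                                         replace=False)
                              for c in classes])
        clf = RandomForestClassifier(random_state=0).fit(X[idx], y[idx])
        votes.append(clf.predict(X_test))
    votes = np.array(votes)
    # Majority vote across the m predictors (m odd, so no ties).
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Toy imbalanced demonstration data.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 4))
y = (X[:, 0] > 0.5).astype(int)
X_test = rng.normal(size=(15, 4))
preds = balanced_bagging(X, y, X_test)
```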
    </sec>
    <sec id="sec-6">
      <title>3. SUBMITTED RUNS AND RESULTS</title>
      <p>The five submitted runs explore different combinations of features and the use of a standard supervised learning scheme (SL) versus the newly proposed agreement-based retraining (SSL-AR). The specific run configurations are specified in Table 2.</p>
      <p>RUN-1, RUN-2 and RUN-4 are built using a plain classification model. RUN-3 and RUN-5 are built with the agreement-based retraining technique, in which we build CL1 and CL2 (Figure 1) using the sets of features specified in Table 2. All models use a Random Forest classifier from the Weka implementation.</p>
      <p>Table 3 presents the performance of each run. In terms of F-score, which is the primary evaluation metric of the task, RUN-5 achieved the best score, using the ext and the FOR features with the SSL-AR technique. As we observe, RUN-2, in which the FOR features are added, performed considerably better than RUN-1, which uses just the TB-base features. Comparing RUN-4 and RUN-5, one may observe the considerable performance benefit stemming from the use of the SSL-AR approach, as it is the only difference between the two runs (the same sets of features are used). Additionally, it is important to note the contribution of the ext features, as RUN-5 (ext) performs better than RUN-3 (base).</p>
    </sec>
    <sec id="sec-7">
      <title>ACKNOWLEDGEMENTS</title>
      <p>This work is supported by the REVEAL project, partially
funded by the European Commission (FP7-610928).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>C.</given-names> <surname>Boididou</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Andreadou</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Papadopoulos</surname></string-name>,
          <string-name><given-names>D.-T.</given-names> <surname>Dang-Nguyen</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Boato</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Riegler</surname></string-name>, and
          <string-name><given-names>Y.</given-names> <surname>Kompatsiaris</surname></string-name>.
          <article-title>Verifying multimedia use at MediaEval 2015</article-title>.
          <source>In MediaEval 2015 Workshop</source>, Sept.
          <fpage>14</fpage>-<lpage>15</lpage>, Wurzen, Germany,
          <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>C.</given-names> <surname>Boididou</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Papadopoulos</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Kompatsiaris</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Schifferes</surname></string-name>, and
          <string-name><given-names>N.</given-names> <surname>Newman</surname></string-name>.
          <article-title>Challenges of computational verification in social multimedia</article-title>.
          <source>In Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web Companion</source>, pages
          <fpage>743</fpage>-<lpage>748</lpage>,
          <year>2014</year>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>A.</given-names> <surname>Tsakalidis</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Papadopoulos</surname></string-name>, and
          <string-name><given-names>I.</given-names> <surname>Kompatsiaris</surname></string-name>.
          <article-title>An ensemble model for cross-domain polarity classification on Twitter</article-title>.
          <source>In Web Information Systems Engineering - WISE 2014</source>, pages
          <fpage>168</fpage>-<lpage>177</lpage>. Springer,
          <year>2014</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>