MCG-ICT at MediaEval 2015: Verifying Multimedia Use with a Two-Level Classification Model

Zhiwei Jin1,2, Juan Cao1, Yazi Zhang1,2, Yongdong Zhang1
1 Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China
2 University of Chinese Academy of Sciences, Beijing, China
{jinzhiwei, caojuan, zhangyazi, zhyd}@ict.ac.cn

ABSTRACT
The Verifying Multimedia Use task aims to detect misuse of online multimedia content and verify tweets as real or fake. This is a highly challenging problem because of strong variations among tweets from different events. Traditional approaches train a classifier at the message level, which ignores inter-message relations. We propose a two-level classification model that exploits the observation that tweets on the same topic probably have similar credibility values. In this model, a topic level is introduced to eliminate message-level variations: messages are aggregated into topics as a higher-level representation. Preliminary results obtained from classification at the topic level are then fused with the original message-level features to train a better classifier. Results indicate that the topic level is very helpful and that our two-level approach performs significantly better than a traditional one-level method. Our best result on this task achieves an F-score of 0.94 using features extracted only from tweet content.

Figure 1: The framework of the proposed two-level classification model. Topic-level classification results are fused with message-level features to produce the final result.

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

1. PROPOSED APPROACH
This paper presents the approach developed by MCG-ICT for the MediaEval 2015 Verifying Multimedia Use task. The task deals with the automatic detection of manipulation and misuse of Web multimedia content. Since online content verification is a fairly new problem, participants are encouraged to propose effective features and methods. The goal of the task is to evaluate a set of tweets from several events and identify each tweet as real or fake. More details about the task can be found in [1].

1.1 Two-Level Classification Model
Traditional approaches formulate the verification problem as a two-class classification task [2]. Features extracted from tweet text content and users are used to train a classifier at the message (tweet) level. One problem with this training strategy is that tweets are trained and tested individually. In reality, however, tweets have strong relations with each other; in particular, tweets on the same topic would probably share the same verification result: real or fake.

Rather than classifying each tweet individually, some recent studies propose to verify tweets jointly using inter-tweet information. Gupta et al. [3] propose a network which consists of tweets and users with similarity links among them. In our recent work [4, 5], we cluster tweets into sub-events and build links among tweets, sub-events and events. The resulting three-layer network captures relations between entities at different scales and achieves good verification performance.

Our network model is designed to evaluate the credibility of a specific event. However, in the dataset of the target task [1], some events are actually a set of many related events (e.g. Hurricane Sandy), while other events contain only a few tweets (e.g. Pig Fish). Moreover, the task asks for a verification label for each tweet rather than an overall verification label for the event. These differences in the dataset and task definition prevent our model from being applied directly. Nevertheless, the idea of exploiting inter-tweet implications inspired us to propose a two-level classification method. Figure 1 gives an overview of this method.
As illustrated in Figure 1, the proposed model has two levels of classification. One is the message level, which is the same as previous message-level methods: features extracted from tweet text content, user information and other aspects are used for training. The other is the topic level, which is the main contribution of this paper. Assuming that tweets under the same topic probably have similar credibility values, we cluster tweets into different topics. A topic is a specific subject in an event; it consists of all tweets concerning the same subject. Compared with raw tweets, topics eliminate variations among tweets by averaging over them, which also reduces the impact of noisy data. Compared with whole events, topics preserve most tweet details.

Topics Clustering: In [4], a clustering algorithm is used to cluster tweets into sub-events. However, this algorithm performs poorly at forming topics on the task dataset, as it is difficult to decide the optimal number of clusters. We observe instead that each tweet contains an image or video, and each image or video can be contained in more than one tweet. This intrinsic one-to-many relation in the dataset is a clue for forming topics: each image/video corresponds to a topic, and the tweets containing that image/video belong to that topic.

Topics Labeling: We label each topic with the majority label of its tweets: if more than half of the tweets in a topic are real, we label the topic as real. These labels are used for training the topic-level classifier. (In fact, with the proposed topic formation, almost all tweets in a topic have the same label.)

Topic Level Feature Aggregation: We take the average of the message-level features of all tweets in a topic as the topic-level feature. Some nominal features, such as "contains question/exclamation mark", are also aggregated into corresponding numeric features.
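As a rough illustration of the three steps above (not the authors' actual implementation), the following Python sketch groups tweets into topics by their shared image or video, assigns majority-vote topic labels, and averages the message-level features; the tweet fields `image_id`, `features` and `label` are hypothetical names introduced only for this sketch.

```python
from collections import defaultdict

import numpy as np


def build_topics(tweets):
    """Group tweets into topics: one topic per shared image/video.

    Each tweet is assumed to be a dict with the (hypothetical) keys
    'image_id', 'features' (a numeric vector with nominal features
    already mapped to 0/1) and 'label' (1 = fake, 0 = real).
    """
    topics = defaultdict(list)
    for tweet in tweets:
        topics[tweet['image_id']].append(tweet)
    return topics


def aggregate_topic(topic_tweets):
    """Topic-level feature = average of the message-level features;
    topic label = majority vote over the tweets' labels."""
    features = np.mean([t['features'] for t in topic_tweets], axis=0)
    fake_ratio = np.mean([t['label'] for t in topic_tweets])
    label = 1 if fake_ratio > 0.5 else 0
    return features, label


def topic_level_dataset(tweets):
    """Turn a tweet collection into topic-level training data."""
    X, y = [], []
    for topic_tweets in build_topics(tweets).values():
        features, label = aggregate_topic(topic_tweets)
        X.append(features)
        y.append(label)
    return np.array(X), np.array(y)
```

Because every tweet in the dataset carries exactly one image or video, this grouping needs no tuning of a cluster count, which is what makes it more robust here than the sub-event clustering of [4].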
Fusing Topic Level Result: After topic-level classification, we obtain for each topic a probability of being fake. For each tweet in the topic, we then add this pre-result value as an additional feature to the tweet's original feature vector. Finally, we train a message-level classifier on the extended message features to produce the final result.
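A minimal sketch of this fusion step, reusing the hypothetical helpers from the previous snippet. The submitted runs used Weka's J48 decision tree at the topic level and a Random Forest at the message level (see Section 2); here scikit-learn's `DecisionTreeClassifier` and `RandomForestClassifier` stand in for them, so hyper-parameters will not match the original runs, and for brevity the topic-level pre-results on the training set come from the fitted topic classifier itself rather than from the 10-fold cross validation used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier


def train_two_level(train_tweets):
    """Train the topic-level and message-level classifiers.

    Relies on build_topics/aggregate_topic/topic_level_dataset from the
    previous sketch; class labels are assumed to be {0: real, 1: fake}.
    """
    # Topic level: train on aggregated topic features and labels.
    topics = build_topics(train_tweets)
    X_topic, y_topic = topic_level_dataset(train_tweets)
    topic_clf = DecisionTreeClassifier()        # stand-in for Weka's J48
    topic_clf.fit(X_topic, y_topic)

    # Message level: extend each tweet's feature vector with the
    # topic-level probability of being fake, then train the final
    # classifier. (The paper obtains these pre-results for the training
    # set via 10-fold cross validation; this sketch simply reuses the
    # fitted topic classifier.)
    X_msg, y_msg = [], []
    for topic_tweets in topics.values():
        topic_feat, _ = aggregate_topic(topic_tweets)
        p_fake = topic_clf.predict_proba([topic_feat])[0, 1]
        for t in topic_tweets:
            X_msg.append(np.append(t['features'], p_fake))
            y_msg.append(t['label'])
    msg_clf = RandomForestClassifier()          # stand-in for Weka's Random Forest
    msg_clf.fit(np.array(X_msg), np.array(y_msg))
    return topic_clf, msg_clf
```

At test time, the same topic classifier supplies the pre-result feature for unseen tweets before the message-level classifier makes the final per-tweet prediction.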
1.2 Feature Set
In [2], 18 content features and 7 user features are extracted at the message level. We use these two kinds of features as base features. In addition, we also experimented with some new features: word term features and several image features.

We extracted the commonly used term frequency (tf) and tf-idf features to represent each tweet. In experiments on the training (development) set, this kind of feature was found to overfit severely: it reached very high performance under cross validation but very low performance under event-separated validation. Because few words co-occur across different events, we expect other purely term-based features (e.g. LDA features) to contribute little to this task.

Almost every tweet in the dataset contains an image, so we also extracted several image-related features (e.g. image popularity, resolution). Since a topic is generated for each image as described above, these image features can replace the aggregated topic-level features when training the topic-level classifier. Experiments on the development set show that they give slightly worse topic-level classification performance than the content features, and much worse final performance after fusion with the message-level features. Moreover, these image features cannot be applied directly to the videos included in the test set. As they are not the main concern of this paper, we leave them to future research.

2. RESULTS AND DISCUSSION
In the task definition, runs 3-5 are experiments with external resources. As our approach focuses on the classification method rather than on external materials, we only submitted results for the first two runs (Table 1). Run 1 uses only content features, while run 2 uses both content and user features. Both runs follow the two-level classification model illustrated in Figure 1. We use a J48 decision tree classifier for topic-level classification and a Random Forest classifier for message-level classification. The topic-level classification for the training set is obtained by 10-fold cross validation on it. The three evaluation measures reported in Table 1 are computed with respect to fake tweets.

Table 1: Verifying Multimedia Use Results

              Run 1    Run 2
  Recall      0.9212   0.9220
  Precision   0.9645   0.9374
  F-Score     0.9423   0.9296

From the results we observe that our two-level classification method achieves very promising results on both runs. Specifically, it reaches a verification F-score of 0.9423 for run 1 and a slightly lower 0.9296 for run 2. Moreover, our method achieves high recall as well as high precision, which demonstrates its strong discriminative ability for both fake and real tweets. We also notice that run 2 performs slightly worse than run 1, which indicates that the user features may be redundant; we obtained a similar result in our experiments on the development set.

In the future, we want to explore other features, such as image forensics features, with our model. The model also needs to be tested on a much larger dataset and in real-time settings to validate its effectiveness.

3. ACKNOWLEDGMENTS
This work was supported by the National Nature Science Foundation of China (61172153, 61571424) and the National High Technology Research and Development Program of China (2014AA015202).

4. REFERENCES
[1] C. Boididou, K. Andreadou, S. Papadopoulos, D.-T. Dang-Nguyen, G. Boato, M. Riegler, and Y. Kompatsiaris. Verifying multimedia use at MediaEval 2015. In Proceedings of the MediaEval 2015 Multimedia Benchmark Workshop, 2015.
[2] C. Boididou, S. Papadopoulos, and Y. Kompatsiaris. Challenges of computational verification in social multimedia. In Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web Companion, pages 743-748, 2014.
[3] M. Gupta, P. Zhao, and J. Han. Evaluating event credibility on Twitter. In Proceedings of the SIAM International Conference on Data Mining, page 153. Society for Industrial and Applied Mathematics, 2012.
[4] Z. Jin, J. Cao, Y.-G. Jiang, and Y. Zhang. News credibility evaluation on microblog with a hierarchical propagation model. In 2014 IEEE International Conference on Data Mining (ICDM), pages 230-239. IEEE, 2014.
[5] X. Zhou, J. Cao, Z. Jin, X. Fei, Y. Su, J. Zhang, D. Chu, and X. Cao. Real-time news certification system on Sina Weibo. In Proceedings of the 24th International Conference on World Wide Web Companion, pages 983-988, 2015.