MCG-ICT at MediaEval 2015: Verifying Multimedia Use with a Two-Level Classification Model

Zhiwei Jin1,2, Juan Cao1, Yazi Zhang1,2, Yongdong Zhang1
1 Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China
2 University of Chinese Academy of Sciences, Beijing, China
{jinzhiwei, caojuan, zhangyazi, zhyd}@ict.ac.cn

ABSTRACT
The Verifying Multimedia Use task aims to detect misuse of online multimedia content and verify tweets as real or fake. This is a highly challenging problem because of strong variations among tweets from different events. Traditional approaches train a classifier at the message level, which ignores inter-message relations. We propose a two-level classification model that exploits the observation that tweets on the same topic probably have similar credibility values. In this model, a topic level is introduced to eliminate message-level variations: messages are aggregated into topics as a higher-level representation. Preliminary results obtained from classification at the topic level are then fused with the original message-level features to train a better classifier. Results indicate that the topic level is very helpful and that our two-level approach performs significantly better than a traditional one-level method. Our best result on this task achieves an F-score of 0.94 using features extracted only from tweet content.

Figure 1: The framework of the proposed two-level classification model. Topic-level classification results are fused with message-level features to produce the final result.

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

1. PROPOSED APPROACH
This paper presents the approach developed by MCG-ICT for the MediaEval 2015 Verifying Multimedia Use task. The task deals with the automatic detection of manipulation and misuse of Web multimedia content. Since online content verification is a fairly new problem, participants are encouraged to propose effective features and methods. The goal of the task is to evaluate a set of tweets from several events and identify each tweet as real or fake. More details about the task can be found in [1].

1.1 Two-Level Classification Model
Traditional approaches formulate the verification problem as a two-class classification task [2]. Features extracted from tweet text content and users are used to train a classifier at the message (tweet) level. One problem with this training strategy is that tweets are trained and tested individually. In reality, however, tweets have strong relations with each other; in particular, tweets on the same topic would probably share the same verification result: real or fake.

Rather than classifying each tweet individually, some recent studies propose to verify tweets jointly using inter-tweet information. Gupta et al. [3] propose a network which consists of tweets and users with similarity links among them. In our recent work [4, 5], we cluster tweets into sub-events and build links among tweets, sub-events and events. The resulting three-layer network captures relations between entities at different scales and achieves good verification performance.

Our network model is designed to evaluate the credibility of a specific event. However, in the dataset of the target task [1], some events are actually a set of many related events (e.g. Hurricane Sandy), while other events contain only a few tweets (e.g. Pig Fish). Moreover, the task asks for a verification label for each tweet rather than an overall verification label for the event. These differences in the dataset and task definition prevent our model from being applied directly. Nevertheless, the idea of exploiting inter-tweet implications inspired us to propose a two-level classification method. Figure 1 gives an overview of this method.
As illustrated in Figure 1, the proposed model has two levels of classification. One is the message level, which is the same as previous message-level methods: features extracted from tweet text content, user information and other aspects are used for training. The other is the topic level, which is the main contribution of this paper. Assuming that tweets under the same topic probably have similar credibility values, we cluster tweets into different topics. A topic is a specific subject in an event; it consists of all tweets concerning the same subject. Compared with raw tweets, topics eliminate variations among tweets by averaging over them, which also reduces the impact of noisy data. Compared with whole events, topics preserve most tweet details.

Topics Clustering: In [4], a clustering algorithm is used to cluster tweets into sub-events. However, this algorithm performs poorly at forming topics on the task dataset, as it is difficult to decide the optimal number of clusters. We observe instead that each tweet contains an image or video, and each image or video can be contained in more than one tweet. This intrinsic one-to-many relation in the dataset is a clue for forming topics: each image/video corresponds to a topic, and the tweets containing that image/video belong to that topic.

Topics Labeling: We label each topic with the majority label of its tweets: if more than half of the tweets in a topic are real, we label the topic as real. These labels are used for training the topic-level classifier. (In fact, with the proposed topic formation, almost all tweets in a topic have the same label.)

Topic Level Feature Aggregation: We take the average of the message-level features of all tweets in a topic as the topic-level feature. Some nominal features, such as "contains question/exclamation mark", are also aggregated into corresponding numeric features.
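As a rough illustration of the three steps above (not the authors' actual implementation), the following Python sketch groups tweets into topics by their shared image or video, assigns majority-vote topic labels, and averages the message-level features; the tweet fields `image_id`, `features` and `label` are hypothetical names introduced only for this sketch.

```python
from collections import defaultdict

import numpy as np


def build_topics(tweets):
    """Group tweets into topics: one topic per shared image/video.

    Each tweet is assumed to be a dict with the (hypothetical) keys
    'image_id', 'features' (a numeric vector with nominal features
    already mapped to 0/1) and 'label' (1 = fake, 0 = real).
    """
    topics = defaultdict(list)
    for tweet in tweets:
        topics[tweet['image_id']].append(tweet)
    return topics


def aggregate_topic(topic_tweets):
    """Topic-level feature = average of the message-level features;
    topic label = majority vote over the tweets' labels."""
    features = np.mean([t['features'] for t in topic_tweets], axis=0)
    fake_ratio = np.mean([t['label'] for t in topic_tweets])
    label = 1 if fake_ratio > 0.5 else 0
    return features, label


def topic_level_dataset(tweets):
    """Turn a tweet collection into topic-level training data."""
    X, y = [], []
    for topic_tweets in build_topics(tweets).values():
        features, label = aggregate_topic(topic_tweets)
        X.append(features)
        y.append(label)
    return np.array(X), np.array(y)
```

Because every tweet in the dataset carries exactly one image or video, this grouping needs no tuning of a cluster count, which is what makes it more robust here than the sub-event clustering of [4].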
Fusing Topic Level Result: After topic-level classification, we obtain for each topic a probability of being fake. For each tweet in the topic, we then add this pre-result value as an additional feature to the tweet's original feature vector. Finally, we train a message-level classifier on the extended message features to produce the final result.
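A minimal sketch of this fusion step, reusing the hypothetical helpers from the previous snippet. The submitted runs used Weka's J48 decision tree at the topic level and a Random Forest at the message level (see Section 2); here scikit-learn's `DecisionTreeClassifier` and `RandomForestClassifier` stand in for them, so hyper-parameters will not match the original runs, and for brevity the topic-level pre-results on the training set come from the fitted topic classifier itself rather than from the 10-fold cross validation used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier


def train_two_level(train_tweets):
    """Train the topic-level and message-level classifiers.

    Relies on build_topics/aggregate_topic/topic_level_dataset from the
    previous sketch; class labels are assumed to be {0: real, 1: fake}.
    """
    # Topic level: train on aggregated topic features and labels.
    topics = build_topics(train_tweets)
    X_topic, y_topic = topic_level_dataset(train_tweets)
    topic_clf = DecisionTreeClassifier()        # stand-in for Weka's J48
    topic_clf.fit(X_topic, y_topic)

    # Message level: extend each tweet's feature vector with the
    # topic-level probability of being fake, then train the final
    # classifier. (The paper obtains these pre-results for the training
    # set via 10-fold cross validation; this sketch simply reuses the
    # fitted topic classifier.)
    X_msg, y_msg = [], []
    for topic_tweets in topics.values():
        topic_feat, _ = aggregate_topic(topic_tweets)
        p_fake = topic_clf.predict_proba([topic_feat])[0, 1]
        for t in topic_tweets:
            X_msg.append(np.append(t['features'], p_fake))
            y_msg.append(t['label'])
    msg_clf = RandomForestClassifier()          # stand-in for Weka's Random Forest
    msg_clf.fit(np.array(X_msg), np.array(y_msg))
    return topic_clf, msg_clf
```

At test time, the same topic classifier supplies the pre-result feature for unseen tweets before the message-level classifier makes the final per-tweet prediction.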
1.2 Feature Set
In [2], 18 content features and 7 user features are extracted at the message level. We use these two kinds of features as base features. In addition, we also experimented with some new features: word term features and several image features.

We extracted the commonly used term frequency (tf) and tf-idf features to represent each tweet. In experiments on the training (development) set, this kind of feature was found to overfit severely: it reached very high performance under cross validation but very low performance under event-separated validation. Because few words co-occur across different events, we expect other purely term-based features (e.g. LDA features) to contribute little to this task.

Almost every tweet in the dataset contains an image, so we also extracted several image-related features (e.g. image popularity, resolution). Since a topic is generated for each image as described above, these image features can replace the aggregated topic-level features when training the topic-level classifier. Experiments on the development set show that they give slightly worse topic-level classification performance than the content features, and much worse final performance after fusion with the message-level features. Moreover, these image features cannot be applied directly to the videos included in the test set. As they are not the main concern of this paper, we leave them to future research.

2. RESULTS AND DISCUSSION
In the task definition, runs 3-5 are experiments with external resources. As our approach focuses on the classification method rather than on external materials, we only submitted results for the first two runs (Table 1). Run 1 uses only content features, while run 2 uses both content and user features. Both runs follow the two-level classification model illustrated in Figure 1. We use a J48 decision tree classifier for topic-level classification and a Random Forest classifier for message-level classification. The topic-level classification for the training set is obtained by 10-fold cross validation on it. The three evaluation measures reported in Table 1 are computed with respect to fake tweets.

Table 1: Verifying Multimedia Use Results

              Run 1    Run 2
  Recall      0.9212   0.9220
  Precision   0.9645   0.9374
  F-Score     0.9423   0.9296

From the results we observe that our two-level classification method achieves very promising results on both runs. Specifically, it reaches a verification F-score of 0.9423 for run 1 and a slightly lower 0.9296 for run 2. Moreover, our method achieves high recall as well as high precision, which demonstrates its strong discriminative ability for both fake and real tweets. We also notice that run 2 performs slightly worse than run 1, which indicates that the user features may be redundant; we obtained a similar result in our experiments on the development set.

In the future, we want to explore other features, such as image forensics features, with our model. The model also needs to be tested on a much larger dataset and in real-time settings to validate its effectiveness.

3. ACKNOWLEDGMENTS
This work was supported by the National Nature Science Foundation of China (61172153, 61571424) and the National High Technology Research and Development Program of China (2014AA015202).

4. REFERENCES
[1] C. Boididou, K. Andreadou, S. Papadopoulos, D.-T. Dang-Nguyen, G. Boato, M. Riegler, and Y. Kompatsiaris. Verifying multimedia use at MediaEval 2015. In Proceedings of the MediaEval 2015 Multimedia Benchmark Workshop, 2015.
[2] C. Boididou, S. Papadopoulos, and Y. Kompatsiaris. Challenges of computational verification in social multimedia. In Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web Companion, pages 743-748, 2014.
[3] M. Gupta, P. Zhao, and J. Han. Evaluating event credibility on Twitter. In Proceedings of the SIAM International Conference on Data Mining, page 153. Society for Industrial and Applied Mathematics, 2012.
[4] Z. Jin, J. Cao, Y.-G. Jiang, and Y. Zhang. News credibility evaluation on microblog with a hierarchical propagation model. In 2014 IEEE International Conference on Data Mining (ICDM), pages 230-239. IEEE, 2014.
[5] X. Zhou, J. Cao, Z. Jin, X. Fei, Y. Su, J. Zhang, D. Chu, and X. Cao. Real-time news certification system on Sina Weibo. In Proceedings of the 24th International Conference on World Wide Web Companion, pages 983-988, 2015.