HCMUS MediaEval 2021: Multi-Model Decision Method Applied on Data Augmentation for COVID-19 Conspiracy Theories Classification

Tuan-An To∗1,3, Nham-Tan Nguyen∗1,3, Dinh-Khoi Vo∗1,3, Nhat-Quynh Le-Pham∗1,3, Hai-Dang Nguyen1,2,3, Minh-Triet Tran1,2,3
1University of Science, VNU-HCM, 2John von Neumann Institute, VNU-HCM, 3Vietnam National University, Ho Chi Minh City, Vietnam
ttan20@apcs.fitus.edu.vn, nntan20@apcs.fitus.edu.vn, vdkhoi20@clc.fitus.edu.vn, lpnquynh20@apcs.fitus.edu.vn, nhdang@selab.hcmus.edu.vn, tmtriet@fit.hcmus.edu.vn

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'21, 13-15 December 2021, Online

ABSTRACT
The Corona Virus and Conspiracies Multimedia Analysis Task is the task in the MediaEval 2021 Challenge that concentrates on conspiracy theories assuming some kind of nefarious actions related to COVID-19. Our HCMUS team applies different approaches based on multiple pretrained models and several supporting techniques to address the two subtasks. Based on our experiments, we submit five runs for subtask 1 and one run for subtask 2. Runs 1 and 2 both build on a pretrained BERT[5] model; the difference between them is that in the first run we add a sentiment analysis step to extract semantic features before training. In runs 3 and 4, we propose a Naive Bayes classifier[4] and an LSTM[8] model to diversify our methods. Run 5 utilizes an ensemble of machine learning and deep learning models, a multimodal approach for text-based analysis[3]. Finally, in the only run for subtask 2, we apply a simple Naive Bayes algorithm to classify the theories. In the final results, our method achieves 0.5987 in task 1 and 0.3136 in task 2.

1 INTRODUCTION
The COVID-19 pandemic has severely affected people worldwide, and consequently, it has dominated world news for months. Thus, it has been the topic of a massive amount of misinformation, most likely amplified by the fact that many details about the virus were unknown at the start of the pandemic. In the Multimedia Evaluation Challenge 2021 (MediaEval 2021), the purpose of the Corona Virus and Conspiracies Multimedia Analysis Task is to develop methods capable of detecting such misinformation. In this way, the task helps prevent the spread of misinformation that causes social anxiety and vaccination doubts. We propose different methods, mainly based on deep learning models, to address the problem from various aspects, as described in the following sections.

2 DATASET AND RELATED WORK
2.1 Dataset
In subtask 1, we received two datasets in total:
* dev1-task1.csv: an unbalanced dataset of 500 tweets, with (Non-conspiracy, Discusses, Promotes) counts of (340, 76, 84)
* dev-task1.csv: an unbalanced dataset of 1011 tweets, with (Non-conspiracy, Discusses, Promotes) counts of (414, 186, 411)

2.2 Data Centric Approach
Text-based misinformation detection shares similar objectives with the text classification task. Hence, we take advantage of pretrained NLP models and fine-tune them for this task. However, the validation result is biased towards the non-conspiracy class, since the given dataset is small and unbalanced. Therefore, we use those models to generate new data: we crawl additional tweets from Twitter and assign each tweet the label that receives the most votes, which increases both the effectiveness and the balance of the dataset.

3 METHOD
3.1 Data Processing
From the raw training data, we need to preprocess the tweets to make them easier for our models to learn. The first step is to replace contracted words with their original forms, e.g., "I'm" to "I am". The second step is to remove stopwords (about, above, ...) and to lemmatize word families (roofing, roofers, ... into roof). Finally, we also remove other meaningless features in the tweets, such as "https" URLs, "(amp)" artifacts, and emoji, to obtain clean tweets for training.
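The steps above can be sketched as follows. The exact contraction table, stopword list, and lemmatizer are not specified in the paper, so the small dictionaries here are illustrative stand-ins (a real pipeline would typically use NLTK's stopword list and WordNet lemmatizer), and the lemmatization step is omitted.

```python
import re

# Illustrative stand-ins for the resources the pipeline would use;
# the paper does not list the actual tables.
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "it's": "it is"}
STOPWORDS = {"about", "above", "the", "a", "an", "is", "to"}

def preprocess(tweet: str) -> str:
    text = tweet.lower()
    # Step 1: expand short forms into their original forms (I'm -> I am).
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Step 3 (applied early here): strip URLs, "&amp;" artifacts, and emoji.
    text = re.sub(r"https?://\S+", " ", text)
    text = text.replace("&amp;", " ")
    text = re.sub(r"[^\x00-\x7f]", " ", text)  # drop emoji / non-ASCII
    text = re.sub(r"[^a-z\s]", " ", text)      # keep only letters
    # Step 2: remove stopwords.
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)
```

For example, preprocess("I'm worried about https://t.co/abc &amp; the vaccine") yields "i am worried vaccine".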
3.2 Run 01 - Subtask 1
First, we use a sentiment analysis method to categorize all tweets into two classes: optimism and anger. Based on our observations, non-conspiracy tweets show a higher rate of optimism, while discuss/promote tweets dominate in anger. Therefore, we pick out the tweets in the test set with an optimism rate greater than 0.8 and an anger rate less than 0.2 and directly label them as non-conspiracy. The remaining tweets are predicted by BERT[1], a pretrained deep bidirectional Transformer, trained for 20 epochs with batch size 4 and Adam optimization.

3.6 Run 05 - Subtask 1
We combine the results produced by all of the deep learning and machine learning algorithms above and label each tweet with its highest-voted class[3]. In the rare cases where a tweet gets equal votes for different classes, we use the result given by the BERT model.

3.7 Run 01 - Subtask 2
Similar to the method in run 3, we use a tf-idf vectorizer to extract features from the text. Based on our observations, the test set is extremely unbalanced with respect to the multilabel problem, so we try to resolve this by downsampling the data, keeping only the dominant sentences in the biased class. To handle the multilabel problem, we utilize three different methods: Binary Relevance[7], Classifier Chains[6], and Label Powerset[2], combined with Naive Bayes and Logistic Regression. According to our experiments, Binary Relevance with Logistic Regression gives the best result.

4 EXPERIMENTS AND RESULTS
Table 1 shows the results of our runs in terms of the Matthews correlation coefficient.
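The Run 05 decision rule (majority vote with a BERT tie-break) can be sketched as follows; the class names and the exact set of voting models are illustrative, since the paper does not enumerate them.

```python
from collections import Counter
from typing import Sequence

def ensemble_label(votes: Sequence[str], bert_vote: str) -> str:
    """Label a tweet by its highest-voted class; BERT breaks ties."""
    top = Counter(votes).most_common()
    # A unique highest-voted class wins outright.
    if len(top) == 1 or top[0][1] > top[1][1]:
        return top[0][0]
    # Several classes are tied: fall back to the BERT model's prediction.
    return bert_vote
```

For example, ensemble_label(["promote", "discuss", "promote"], bert_vote="discuss") returns "promote", while a tied vote such as ["promote", "discuss"] falls back to the BERT label.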
Figure 1: Tweet sentiment analysis

Team-run                            Task-1   Task-2
SelabHCMUSJunior BERT/run1          0.5581   -
SelabHCMUSJunior BERT/run2          0.5106   -
SelabHCMUSJunior Naive-Bayes/run3   0.4469   0.3136
SelabHCMUSJunior LSTM/run4          0.2570   -
SelabHCMUSJunior Multi-model/run5   0.5987   -

Table 1: HCMUS Team submission results for the Corona Virus and Conspiracies Multimedia Analysis Task

3.3 Run 02 - Subtask 1
Unlike Run 01, we try keeping the stopwords, only cleaning the tweets and replacing shortened terms with their fully written forms, in order to retain the original structure of each sentence for the best performance of the Transformer model[9]. Training is still conducted on the pretrained BERT model with the augmented dataset (batch_size=16, epochs=10).

3.4 Run 03 - Subtask 1
We use a tf-idf vectorizer to extract features from the text. After trying both Logistic Regression and Naive Bayes[4], the latter algorithm performs better. The result of this run is our baseline score.

3.5 Run 04 - Subtask 1
We use pretrained GloVe embeddings to transform each word in a sentence into an array of 300 numbers representing the "meaning" of the word. Finally, we build a 2D LSTM[8] with Adam optimization (batch_size = 64, epochs = 4).

5 CONCLUSION AND FUTURE WORKS
In summary, we identify the challenges of the dataset and propose different approaches to address them. We conclude that classifying whether a tweet promotes/supports or merely discusses a conspiracy is heavily biased towards the writer's attitude, making it difficult for an NLP model to learn the true label. In the current study, we can only extract basic sentiment states of a tweet, such as sadness or optimism, so we aim to tackle the challenge at a higher level in the future.

ACKNOWLEDGMENTS
This research was funded by SeLab-HCMUS and VNUHCM-University of Science.

REFERENCES
[1] Andrey Malakhov, Alessandro Patruno, and Stefano Bocconi. 2020. Fake News Classification with BERT.
[2] Preeti Gupta, Tarun K. Sharma, and Deepti Mehrotra. 2018. Label Powerset Based Multi-label Classification for Mobile Applications. In Soft Computing: Theories and Applications.
[3] Chahat Raj and Mihir P. Mehta. 2020. MediaEval 2020: An Ensemble-based Multimodal Approach for Coronavirus and 5G Conspiracy Tweet Detection.
[4] Haiyi Zhang and Di Li. 2007. Naïve Bayes Text Classifier. In IEEE International Conference on Granular Computing (GrC 2007).
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
[6] Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier Chains for Multi-label Classification.
[7] Min-Ling Zhang, Yu-Kun Li, Xu-Ying Liu, and Xin Geng. 2018. Binary relevance for multi-label learning: an overview. In Frontiers of Computer Science.
[8] Seyed Mahdi Rezaeinia, Ali Ghodsi, and Rouhollah Rahmani. 2017. Improving the Accuracy of Pre-trained Word Embeddings for Sentiment Analysis.
[9] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. HuggingFace's Transformers: State-of-the-art Natural Language Processing.