HCMUS MediaEval 2021: Multi-Model Decision Method Applied on Data Augmentation for COVID-19 Conspiracy Theories Classification

Tuan-An To∗1,3, Nham-Tan Nguyen∗1,3, Dinh-Khoi Vo∗1,3, Nhat-Quynh Le-Pham∗1,3, Hai-Dang Nguyen1,2,3, Minh-Triet Tran1,2,3
1University of Science, VNU-HCM, 2John von Neumann Institute, VNU-HCM, 3Vietnam National University, Ho Chi Minh City, Vietnam
ttan20@apcs.fitus.edu.vn, nntan20@apcs.fitus.edu.vn, vdkhoi20@clc.fitus.edu.vn, lpnquynh20@apcs.fitus.edu.vn, nhdang@selab.hcmus.edu.vn, tmtriet@fit.hcmus.edu.vn

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'21, 13-15 December 2021, Online

ABSTRACT
The Corona Virus and Conspiracies Multimedia Analysis Task is the task in the MediaEval 2021 Challenge that concentrates on conspiracy theories assuming some kind of nefarious actions related to COVID-19. Our HCMUS team applies different approaches based on multiple pretrained models and several supporting techniques to address the two subtasks. Based on our experiments, we submit five runs for subtask 1 and one run for subtask 2. Runs 1 and 2 both build on a pretrained BERT[5] model; the difference between them is that in the first run we add a sentiment analysis step to extract semantic features before training. In runs 3 and 4, we propose a Naive Bayes classifier[4] and an LSTM[8] model to diversify our methods. Run 5 utilizes an ensemble of machine learning and deep learning models, a multimodal approach for text-based analysis[3]. Finally, in the only run for subtask 2, we apply a simple Naive Bayes algorithm to classify the theories. In the final results, our method achieves 0.5987 in task 1 and 0.3136 in task 2.

1 INTRODUCTION
The COVID-19 pandemic has severely affected people worldwide, and consequently, it has dominated world news for months. Thus, it has been the topic of a massive amount of misinformation, most likely amplified by the fact that many details about the virus were unknown at the start of the pandemic. In the Multimedia Evaluation Challenge 2021 (MediaEval 2021), the purpose of the Corona Virus and Conspiracies Multimedia Analysis Task is to develop methods capable of detecting such misinformation. In this way, the task helps prevent the spread of misinformation that causes social anxiety and vaccination doubts. We propose different methods, mainly based on deep learning models, to address the problem from various aspects, as described in the following sections.

2 DATASET AND RELATED WORK
2.1 Dataset
In subtask 1, we received two datasets in total:
* dev1-task1.csv: an unbalanced dataset of 500 tweets, with (Non-conspiracy, Discusses, Promotes) counts of (340, 76, 84)
* dev-task1.csv: an unbalanced dataset of 1011 tweets, with (Non-conspiracy, Discusses, Promotes) counts of (414, 186, 411)

2.2 Data Centric Approach
Text-based misinformation detection shares similar objectives with the text classification task. Hence, we take advantage of pretrained NLP models and fine-tune them for this task. However, the validation result is biased towards the non-conspiracy class, since the given dataset is small and unbalanced. Therefore, we use those models to generate new data: we crawl additional tweets from Twitter and assign each tweet the label that receives the most votes, which increases both the effectiveness and the balance of the dataset.

3 METHOD
3.1 Data Processing
From the raw training data, we need to preprocess the tweets to make them easier for our models to learn. The first step is to replace contracted words with their original forms, e.g., "I'm" to "I am". The second step is to remove stopwords (about, above, ...) and to lemmatize word families (roofing, roofers, ... into roof). Finally, we also remove other meaningless features in the tweets, such as "https" URLs, "(amp)" artifacts, and emoji, to obtain clean tweets for training.
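The steps above can be sketched as follows. The exact contraction table, stopword list, and lemmatizer are not specified in the paper, so the small dictionaries here are illustrative stand-ins (a real pipeline would typically use NLTK's stopword list and WordNet lemmatizer), and the lemmatization step is omitted.

```python
import re

# Illustrative stand-ins for the resources the pipeline would use;
# the paper does not list the actual tables.
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "it's": "it is"}
STOPWORDS = {"about", "above", "the", "a", "an", "is", "to"}

def preprocess(tweet: str) -> str:
    text = tweet.lower()
    # Step 1: expand short forms into their original forms (I'm -> I am).
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Step 3 (applied early here): strip URLs, "&amp;" artifacts, and emoji.
    text = re.sub(r"https?://\S+", " ", text)
    text = text.replace("&amp;", " ")
    text = re.sub(r"[^\x00-\x7f]", " ", text)  # drop emoji / non-ASCII
    text = re.sub(r"[^a-z\s]", " ", text)      # keep only letters
    # Step 2: remove stopwords.
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)
```

For example, preprocess("I'm worried about https://t.co/abc &amp; the vaccine") yields "i am worried vaccine".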
3.2 Run 01 - Subtask 1
First, we use a sentiment analysis method to categorize all tweets into two classes: optimism and anger. Based on our observations, non-conspiracy tweets show a higher rate of optimism, while discuss/promote tweets dominate in anger. Therefore, we pick out the tweets in the test set with an optimism rate greater than 0.8 and an anger rate less than 0.2 and directly label them as non-conspiracy. The remaining tweets are predicted by BERT[1], a pretrained deep bidirectional Transformer, trained for 20 epochs with batch size 4 and Adam optimization.

3.6 Run 05 - Subtask 1
We combine the results produced by all of the deep learning and machine learning algorithms above and label each tweet with its highest-voted class[3]. In the rare cases where a tweet gets equal votes for different classes, we use the result given by the BERT model.

3.7 Run 01 - Subtask 2
Similar to the method in run 3, we use a tf-idf vectorizer to extract features from the text. Based on our observations, the test set is extremely unbalanced with respect to the multilabel problem, so we try to resolve this by downsampling the data, keeping only the dominant sentences in the biased class. To handle the multilabel problem, we utilize three different methods: Binary Relevance[7], Classifier Chains[6], and Label Powerset[2], combined with Naive Bayes and Logistic Regression. According to our experiments, Binary Relevance with Logistic Regression gives the best result.

4 EXPERIMENTS AND RESULTS
Table 1 shows the results of our runs in terms of the Matthews correlation coefficient.
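The Run 05 decision rule (majority vote with a BERT tie-break) can be sketched as follows; the class names and the exact set of voting models are illustrative, since the paper does not enumerate them.

```python
from collections import Counter
from typing import Sequence

def ensemble_label(votes: Sequence[str], bert_vote: str) -> str:
    """Label a tweet by its highest-voted class; BERT breaks ties."""
    top = Counter(votes).most_common()
    # A unique highest-voted class wins outright.
    if len(top) == 1 or top[0][1] > top[1][1]:
        return top[0][0]
    # Several classes are tied: fall back to the BERT model's prediction.
    return bert_vote
```

For example, ensemble_label(["promote", "discuss", "promote"], bert_vote="discuss") returns "promote", while a tied vote such as ["promote", "discuss"] falls back to the BERT label.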
Figure 1: Tweet sentiment analysis

Team-run                            Task-1   Task-2
SelabHCMUSJunior BERT/run1          0.5581   -
SelabHCMUSJunior BERT/run2          0.5106   -
SelabHCMUSJunior Naive-Bayes/run3   0.4469   0.3136
SelabHCMUSJunior LSTM/run4          0.2570   -
SelabHCMUSJunior Multi-model/run5   0.5987   -

Table 1: HCMUS Team submission results for the Corona Virus and Conspiracies Multimedia Analysis Task

3.3 Run 02 - Subtask 1
Unlike Run 01, we try keeping the stopwords, only cleaning the tweets and replacing shortened terms with their fully written forms, in order to retain the original structure of each sentence for the best performance of the Transformer model[9]. Training is still conducted on the pretrained BERT model with the augmented dataset (batch_size=16, epochs=10).

3.4 Run 03 - Subtask 1
We use a tf-idf vectorizer to extract features from the text. After trying both Logistic Regression and Naive Bayes[4], the latter algorithm performs better. The result of this run is our baseline score.

3.5 Run 04 - Subtask 1
We use pretrained GloVe embeddings to transform each word in a sentence into an array of 300 numbers representing the "meaning" of the word. Finally, we build a 2D LSTM[8] with Adam optimization (batch_size = 64, epochs = 4).

5 CONCLUSION AND FUTURE WORKS
In summary, we identify the challenges of the dataset and propose different approaches to address them. We conclude that classifying whether a tweet promotes/supports or merely discusses a conspiracy is heavily biased towards the writer's attitude, making it difficult for an NLP model to learn the true label. In the current study, we can only extract basic sentiment states of a tweet, such as sadness or optimism, so we aim to tackle the challenge at a higher level in the future.

ACKNOWLEDGMENTS
This research was funded by SeLab-HCMUS and VNUHCM-University of Science.

REFERENCES
[1] Andrey Malakhov, Alessandro Patruno, and Stefano Bocconi. 2020. Fake News Classification with BERT.
[2] Preeti Gupta, Tarun K. Sharma, and Deepti Mehrotra. 2018. Label Powerset Based Multi-label Classification for Mobile Applications. In Soft Computing: Theories and Applications.
[3] Chahat Raj and Mihir P. Mehta. 2020. MediaEval 2020: An Ensemble-based Multimodal Approach for Coronavirus and 5G Conspiracy Tweet Detection.
[4] Haiyi Zhang and Di Li. 2007. Naïve Bayes Text Classifier. In IEEE International Conference on Granular Computing (GrC 2007).
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
[6] Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier Chains for Multi-label Classification.
[7] Min-Ling Zhang, Yu-Kun Li, Xu-Ying Liu, and Xin Geng. 2018. Binary relevance for multi-label learning: an overview. In Frontiers of Computer Science.
[8] Seyed Mahdi Rezaeinia, Ali Ghodsi, and Rouhollah Rahmani. 2017. Improving the Accuracy of Pre-trained Word Embeddings for Sentiment Analysis.
[9] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. HuggingFace's Transformers: State-of-the-art Natural Language Processing.