=Paper=
{{Paper
|id=Vol-3181/paper63
|storemode=property
|title=HCMUS MediaEval 2021: Multi-Model Decision Method Applied on Data
Augmentation for COVID-19 Conspiracy Theories Classification
|pdfUrl=https://ceur-ws.org/Vol-3181/paper63.pdf
|volume=Vol-3181
|authors=Tuan-An To,Nham-Tan Nguyen,Dinh-Khoi Vo,Nhat-Quynh Le-Pham,Hai-Dang Nguyen,Minh-Triet Tran
|dblpUrl=https://dblp.org/rec/conf/mediaeval/ToNVLNT21
}}
==HCMUS MediaEval 2021: Multi-Model Decision Method Applied on Data Augmentation for COVID-19 Conspiracy Theories Classification==
Tuan-An To∗1,3, Nham-Tan Nguyen∗1,3, Dinh-Khoi Vo∗1,3, Nhat-Quynh Le-Pham∗1,3,
Hai-Dang Nguyen1,2,3, Minh-Triet Tran1,2,3
1University of Science, VNU-HCM, 2John von Neumann Institute, VNU-HCM
3Vietnam National University, Ho Chi Minh city, Vietnam
ttan20@apcs.fitus.edu.vn,nntan20@apcs.fitus.edu.vn,vdkhoi20@clc.fitus.edu.vn
lpnquynh20@apcs.fitus.edu.vn,nhdang@selab.hcmus.edu.vn,tmtriet@fit.hcmus.edu.vn
Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval’21, 13-15 December 2021, Online
ABSTRACT
The Corona Virus and Conspiracies Multimedia Analysis Task of the MediaEval 2021 Challenge concentrates on conspiracy theories that assume some kind of nefarious action related to COVID-19. Our HCMUS team applies several approaches, based on multiple pretrained models and a range of techniques, to the 2 subtasks. Based on our experiments, we submit 5 runs for subtask 1 and 1 run for subtask 2. Runs 1 and 2 both use a pretrained BERT[5] model; they differ in that the first run adds a sentiment analysis step to extract semantic features before training. In runs 3 and 4, we propose a Naive Bayes classifier[4] and an LSTM[8] model to diversify our methods. Run 5 utilizes an ensemble of machine learning and deep learning models, following the multimodal approach for text-based analysis[3]. Finally, in our only run for subtask 2, we apply a simple Naive Bayes algorithm to classify the theories. In the final results, our method achieves 0.5987 in task 1 and 0.3136 in task 2.
1 INTRODUCTION
The COVID-19 pandemic has severely affected people worldwide, and consequently, it has dominated world news for months. Thus, it has been the topic of a massive amount of misinformation, which was most likely amplified by the fact that many details about the virus were unknown at the start of the pandemic. In the Multimedia Evaluation Challenge 2021 (MediaEval2021), the purpose of the Corona Virus and Conspiracies Multimedia Analysis Task is to develop methods capable of detecting such misinformation. In this way, the task aids in preventing the spread of misinformation that causes social anxiety and vaccination doubts. We propose several methods, mainly based on deep learning models, that address the problem from different angles; they are described in the following sections.
2 DATASET AND RELATED WORK
2.1 Dataset
In subtask 1, we received two datasets in total:
* dev1-task1.csv: an unbalanced dataset of 500 tweets, in which the (Non-Discuss-Promote) counts are (340, 76, 84)
* dev-task1.csv: an unbalanced dataset of 1011 tweets, in which the (Non-Discuss-Promote) counts are (414, 186, 411)
2.2 Data Centric Approach
Text-Based Misinformation Detection shares similar objectives with the text classification task. Hence, we take advantage of pretrained NLP models and fine-tune them for this task. However, the validation results are biased towards the non-conspiracy class, since the given dataset is small and unbalanced. Therefore, we use those models to generate new data: we crawl tweets from Twitter and assign each tweet the label that receives the most votes, which improves both the effectiveness and the balance of the dataset.
3 METHOD
3.1 Data Processing
Starting from the raw training data, we preprocess the tweets to make them easier for our models to learn from. The first step is to replace short forms with their original forms, e.g. I’m becomes I am. The second step is to remove stopwords (about, above, ...) and to lemmatize word families, e.g. roofing, roofers, ... become roof. Finally, we also try to remove other uninformative features from the tweets, such as "https", "(amp)" and emoji, to obtain clean tweets for training.
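The preprocessing steps described above can be sketched in Python using only the standard library. The paper does not name the tools or word lists it used, so the contraction table, stopword set and lemma table below are small hypothetical stand-ins, and lowercasing is an added assumption. A real pipeline would likely rely on a full stopword list and a proper lemmatizer.

```python
import re

# Illustrative tables only: the paper does not list its actual resources.
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "it's": "it is", "can't": "cannot"}
STOPWORDS = {"about", "above", "the", "a", "an", "is", "am", "to", "of", "and"}
LEMMAS = {"roofing": "roof", "roofers": "roof", "roofs": "roof"}

URL_RE = re.compile(r"https?://\S+")
AMP_RE = re.compile(r"&amp;|\(amp\)")
# Anything that is not a lowercase letter, digit, apostrophe or whitespace
# (this covers emoji and punctuation).
NOISE_RE = re.compile(r"[^a-z0-9'\s]")


def preprocess(tweet: str) -> str:
    """Clean one tweet following the three steps described in the text."""
    text = tweet.lower().replace("’", "'")  # assumption: lowercase, plain apostrophes
    text = URL_RE.sub(" ", text)            # step 3: drop links
    text = AMP_RE.sub(" ", text)            # step 3: drop ampersand residue
    text = NOISE_RE.sub(" ", text)          # step 3: drop emoji and punctuation
    tokens = []
    for token in text.split():
        expanded = CONTRACTIONS.get(token, token)  # step 1: expand short forms
        for word in expanded.split():
            if word in STOPWORDS:                  # step 2: remove stopwords
                continue
            tokens.append(LEMMAS.get(word, word))  # step 2: lemmatize
    return " ".join(tokens)
```

With these toy tables, `preprocess("I’m reading about roofers 😷 &amp; roofing https://t.co/xyz")` returns `"i reading roof roof"`. Links and ampersand residue are removed before the generic noise filter so that the URL pattern still sees the intact `https://` prefix.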
3.2 Run 01 - Subtask1
First, we use a sentiment analysis method to rate all tweets on two sentiment classes, optimism and anger. We observe that non-conspiracy tweets have a higher optimism rate, while discuss/promote tweets dominate the anger rate. Therefore, we pick out the tweets in the test set whose optimism rate is greater than 0.8 and whose anger rate is less than 0.2, and directly label them as non-conspiracy. The remaining tweets are predicted by BERT[1], a pretrained deep bidirectional transformer, with epochs = 20 and batch-size = 4 using Adam optimization.
Figure 1: Tweet sentiment analysis
3.3 Run 02 - Subtask1
Unlike Run 01, we keep the stopwords and only clean the tweets and replace all shortened terms with their fully written forms, in order to preserve the original sentence structure for the best performance of the Transformer model[9]. Training is still conducted on the pretrained BERT model, with the augmented dataset (batch_size=16, epochs=10).
3.4 Run 03 - Subtask1
We use a tf-idf vectorizer to extract features from the text. After trying both Logistic Regression and Naive Bayes[4], we find that the latter performs better. The result of this run is our baseline score.
3.5 Run 04 - Subtask1
We use pretrained GloVe embeddings to transform each word in the sentence into an array of 300 numbers that represents the "meaning" of the word. Finally, we build a 2D LSTM[8] with Adam optimization (batch_size = 64, epoch = 4).
3.6 Run 05 - Subtask1
We combine the results of all the deep learning and machine learning algorithms above and label each tweet with its highest-voted class[3]. In the rare cases where a tweet receives equal votes for different classes, we label it with the result given by the BERT model.
3.7 Run 01 - Subtask2
Similar to the method in run 3, we use a tf-idf vectorizer to extract features from the text. Based on our observation, the test set is extremely unbalanced with regard to the multilabel problem, so we try to resolve this by downsampling the data, keeping only the dominant sentences in the biased class. To handle the multilabel problem, we utilize three different methods: Binary Relevance[7], Classifier Chain[6] and Label Powerset[2], combined with Naive-Bayes and Logistic Regression. According to our experiment, Binary Relevance with Logistic Regression gives the best result.
4 EXPERIMENTS AND RESULTS
Table 1 shows the results of our runs in terms of the Matthews correlation coefficient score.
Team-run                              Task-1    Task-2
SelabHCMUSJunior BERT/run1            0.5581    -
SelabHCMUSJunior BERT/run2            0.5106    -
SelabHCMUSJunior Naive-Bayes/run3     0.4469    0.3136
SelabHCMUSJunior LSTM/run4            0.2570    -
SelabHCMUSJunior Multi-model/run5     0.5987    -
Table 1: HCMUS Team Submission results for Corona Virus and Conspiracies Multimedia Analysis Task
5 CONCLUSION AND FUTURE WORKS
In summary, we identify challenges of the dataset and propose different approaches to address them. We conclude that classifying whether a tweet promotes/supports or merely discusses a conspiracy theory depends heavily on the writer's attitude, which makes it difficult for an NLP model to learn the true label. In the present study, we can only extract basic sentiment states of a tweet, such as sadness or optimism, so we aim to tackle the challenge at a higher level in the future.
ACKNOWLEDGMENTS
This research was funded by SeLab-HCMUS and VNUHCM-University of Science.
REFERENCES
[1] Andrey Malakhov, Alessandro Patruno, Stefano Bocconi. 2020. Fake News Classification with BERT.
[2] Preeti Gupta, Tarun K. Sharma, Deepti Mehrotra. 2018. Label Powerset Based Multi-label Classification for Mobile Applications. In Soft Computing: Theories and Applications.
[3] Chahat Raj, Mihir P Mehta. 2020. MediaEval 2020: An Ensemble-based Multimodal Approach for Coronavirus and 5G Conspiracy Tweet Detection.
[4] Haiyi Zhang, Di Li. 2007. Naïve Bayes Text Classifier. In 2007 IEEE International Conference on Granular Computing (GRC 2007).
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
[6] Jesse Read, Bernhard Pfahringer, Geoff Holmes, Eibe Frank. Classifier Chains for Multi-label Classification.
[7] Min-Ling Zhang, Yu-Kun Li, Xu-Ying Liu, Xin Geng. 2018. Binary relevance for multi-label learning: an overview. In Frontiers of Computer Science.
[8] Seyed Mahdi Rezaeinia, Ali Ghodsi, Rouhollah Rahmani. 2017. Improving the Accuracy of Pre-trained Word Embeddings for Sentiment Analysis.
[9] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, Alexander M. Rush. 2020. HuggingFace’s Transformers: State-of-the-art Natural Language Processing.