HCMUS at MediaEval2021: Content-Based Misinformation Detection Using
Contextualized Word Embedding from BERT

Tuan-Luc Huynh1,2, Nhat-Khang Ngo1,2, Phu-Van Nguyen1,2, Thien-Tri Cao1,2,
Thanh-Danh Le1,2, Hai-Dang Nguyen1,2, Minh-Triet Tran1,2,3

1 University of Science, VNU-HCM
2 Vietnam National University, Ho Chi Minh City, Vietnam
3 John von Neumann Institute, VNU-HCM

{htluc,nnkhang,npvan,cttri,ltdanh}19@apcs.fitus.edu.vn,
nhdang@selab.hcmus.edu.vn, tmtriet@fit.hcmus.edu.vn

ABSTRACT
The FakeNews task in MediaEval2021 explores the challenge of building accurate
and high-performance algorithms. Despite the dominance of deep learning
approaches in fake news detection, in this paper we propose different
approaches that leverage pretrained BERT family transformers for extracting
word embeddings. Experimental results show that machine learning classifiers
trained on these embeddings, used either stand-alone or as estimators in voting
ensembles, achieve up to 0.6478 Matthews Correlation Coefficient (MCC) on the
test set of the run submission.

1 INTRODUCTION
On social media, where information can no longer be taken as trustworthy, the
MediaEval2021 task FakeNews: Corona Virus and Conspiracies Multimedia Analysis
calls on participants to tackle the misinformation disseminated during the
long-lasting COVID-19 crisis. The first subtask, Text-Based Misinformation
Detection, is based on tweets from Twitter. The goal is to classify tweets into
categories such as "promote/support", "discuss", or "not related" with respect
to COVID-19 related fake news and conspiracy theories. In this paper, we follow
the content-based fake news detection approach. We experiment with both machine
learning and deep learning approaches and propose ensemble models built from
different scikit-learn machine learning algorithms[8]. We also propose some
features that are particularly effective with these classifiers.

2 RELATED WORK
Fake news detection and classification are no longer new problems;
nevertheless, more accurate and efficient models are needed every year because
the task remains challenging. Zhou et al.[12] group fake news detection methods
into four categories (knowledge-based, style-based, propagation-based, and
source-based) and describe using Bag-of-Words to obtain lexicon frequencies for
classifying fake news. In general, there are two different approaches to this
problem: news content-based learning and social context-based learning. Our
work is greatly inspired by the previous attempt of Tuan et al.[9].

3 APPROACH
3.1 Preprocess
We follow a conventional text preprocessing pipeline. In addition, we expand
contractions; expand internet slang that is popular in tweets; extract the
domains of URLs; and convert emojis and emoticons to text. Finally, the
Ekphrasis[2] library segments words written without spaces and corrects
misspellings or typos. For data augmentation, we follow Tuan et al.[9] and use
EDA[10]. Our data consist of around 1500 sentences. Therefore, following the
recommendations of Wei et al.[10], we use the following parameters for data
augmentation: "--num_aug=8 --alpha_sr=0.05 --alpha_rd=0.1 --alpha_ri=0.1
--alpha_rs=0.1".

3.2 Features
Inspired by the work of Tuan et al.[9], we use the COVID-Twitter-BERT-v2
pretrained model provided by Müller et al.[7] to extract word embeddings. The
preprocessed data are fed into the BERT tokenizer[11] with "max_length" set to
64, which approximates the mean tweet length. The output of the tokenizer is
then fed directly to the pretrained model to obtain all the hidden states. We
process the hidden states into 5 different features, inspired by Dhami's
article[4]: concatenation of the last 4 hidden states (Concat), last hidden
state (LHS), sum of the last 4 hidden states (Sum), mean of the last 4 hidden
states (Mean), and sentence embedding (Sentence). All of these features are
self-explanatory except the "Sentence" feature, which is the mean over all 64
tokens of the "LHS" feature.

3.3 Models
For the deep learning approach, we use two models: a dense model and a dense
model with a convolutional layer[1], as illustrated in Figure 1. The dense
model shares the same structure as the convolutional one, except that it does
not have the convolutional layer[1]. The input can be any of the features
described above. For machine learning, we apply the "Sentence", "LHS", and
"Mean" features to different classifiers provided by scikit-learn[8].
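
The five features of Section 3.2 reduce to simple array operations on the
hidden states. The following is a minimal NumPy sketch, not the authors' code.
Random arrays stand in for the model output, and the shape used here (25
hidden-state layers of 64 tokens by 1024 dimensions), the function name
pool_features, and the order in which the last 4 layers are concatenated are
all assumptions made for illustration.

```python
import numpy as np


def pool_features(hidden_states):
    """Build the five features from the per-layer hidden states of one tweet.

    hidden_states: array of shape (num_layers, num_tokens, hidden_dim).
    """
    last4 = hidden_states[-4:]   # last 4 layers: (4, tokens, dim)
    lhs = hidden_states[-1]      # last hidden state: (tokens, dim)
    return {
        # Concat: join the last 4 layers along the feature axis
        # (layer order is an assumption): (tokens, 4 * dim)
        "Concat": np.concatenate(list(last4), axis=-1),
        "LHS": lhs,
        # Sum and Mean: element-wise over the last 4 layers: (tokens, dim)
        "Sum": last4.sum(axis=0),
        "Mean": last4.mean(axis=0),
        # Sentence: mean of LHS over all tokens: (dim,)
        "Sentence": lhs.mean(axis=0),
    }


# Stand-in for the model output (assumed shape, random values).
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(25, 64, 1024))

features = pool_features(hidden_states)
for name, value in features.items():
    print(name, value.shape)
```

Only "Sentence" yields one vector per tweet. The per-token features ("LHS",
"Mean") would have to be flattened or pooled before being passed to a
scikit-learn classifier; the text does not state how this was done.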
4 EXPERIMENTS
In the deep learning approach, we set the learning rate of both dense models
to 1e-04 and use AdamW[6] as the optimizer for both. The train/test ratio is
8:2. Since the dataset is small, we also apply EDA[10] and evaluate with the
same method as in the non-augmented attempt. The results of the two attempts
are illustrated in Figure 2 and Figure 3, respectively.

Figure 1: Architecture of Conv1d model
Figure 2: Deep learning: no augmentation
Figure 3: Deep learning: EDA

As for the machine learning approach, the detailed results on the test set are
shown in Table 1. Most of the classifiers obtain an MCC greater than 0.5, which
is better than some deep learning models. Some classifiers even reach an MCC of
0.63, which is as good as the dense model using the augmented "LHS" feature.
This is the main reason why we move from the deep learning approach to the
machine learning approach.

Table 1: Machine learning: MCC metric on test set

Classifier            Sentence    LHS       Mean
Perceptron            0.5645      0.6014    0.6096
SGDHinge              0.5516      0.6252    0.6305
SGDLog                0.5458      0.5899    0.6305
SGDModified_huber     0.5719      0.6125    0.5849
SGDSquared_hinge      0.5115      0.6068    0.5747
SGDPerceptron         0.5661      0.6054    0.5894
LogisticRegression    0.5502      0.6058    0.5946
SVC                   0.6279      0.483     0.4739
LinearSVC             0.5251      0.6012    0.6127

We perform 5-fold stratified cross-validation to check that the classifiers
work well on new data. Table 2 shows the cross_val_score using MCC as the
metric. After cross-validation, some classifiers still retain a high MCC (e.g.,
SVC using the "Sentence" feature). Based on this cross-validation result, we
choose SVC and Logistic Regression using the "Sentence" feature, and the
classifiers that have an MCC greater than 0.5 using the "Mean" feature, as
candidate classifiers for later fine-tuning with BayesSearchCV[5]. Finally, all
fine-tuned classifiers serve as estimators for the voting classifiers for
better generalization. We use both soft and hard voting ensemble methods.

Table 2: Machine learning: MCC cross_val_score of some efficient classifiers
and features

Classifier            Sentence    LHS       Mean
Perceptron            0.5114      0.4735    0.4926
SGDHinge              0.4875      0.4659    0.4898
SGDLog                0.5127      0.4823    0.5106
SGDModified_huber     0.4888      0.4831    0.5161
SGDSquared_hinge      0.5045      0.4699    0.4889
SGDPerceptron         0.4868      0.4645    0.479
LogisticRegression    0.5141      0.4936    0.5167
SVC                   0.5822      0.4008    0.3939
LinearSVC             0.4775      0.4938    0.504

5 RESULTS AND ANALYSIS
All runs have an MCC greater than 0.6, as shown in Table 3. The stand-alone SVC
classifier in run 5 is the best classifier in terms of performance; however, we
recommend using the ensemble classifiers for better generalization. Classifiers
using the "Sentence" feature obtain competitive results compared with
classifiers using the "Mean" feature.

Table 3: Run submission results

Run ID    Classifier                MCC
1         Soft voting "Mean"        0.6145
2         Hard voting "Mean"        0.6282
3         Soft voting "Sentence"    0.6016
4         Hard voting "Sentence"    0.6154
5         SVC "Sentence"            0.6478

6 CONCLUSION
We propose using different features derived from BERT's[3] hidden states for
training lightweight and high-performance machine learning classifiers in this
text classification task. Our approach achieves an average score over 0.6 MCC
in the run submission without using any augmentation or extra information. In
future work, we will experiment more thoroughly with more deep learning models.

ACKNOWLEDGMENTS
This work was funded by Gia Lam Urban Development and Investment Company
Limited, Vingroup and supported by Vingroup Innovation Foundation (VINIF) under
project code VINIF.2019.DA19.

Copyright 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval'21, December 13-15 2021, Online
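
As a supplementary illustration of the soft and hard voting described in
Section 4 and of the MCC metric reported in the tables, the sketch below
combines the class probabilities of three estimators and scores both ensembles.
It is a minimal NumPy sketch under stated assumptions, not the authors'
pipeline: the probabilities and labels are made-up toy values, the function
names are hypothetical, and the hard votes are derived from each estimator's
most probable class.

```python
import numpy as np


def soft_vote(probas):
    """Average class probabilities over estimators, then take the argmax."""
    return probas.mean(axis=0).argmax(axis=1)


def hard_vote(probas):
    """Each estimator votes for its most probable class; majority wins.

    Ties go to the lowest class index.
    """
    votes = probas.argmax(axis=2)                  # (estimators, samples)
    n_classes = probas.shape[2]
    counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
    return counts.argmax(axis=0)                   # (samples,)


def mcc(y_true, y_pred, n_classes):
    """Multi-class Matthews correlation coefficient from the confusion matrix."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    t = cm.sum(axis=1)    # samples per true class
    p = cm.sum(axis=0)    # samples per predicted class
    c = np.trace(cm)      # correctly classified samples
    s = cm.sum()          # all samples
    denom = np.sqrt(float(s**2 - p @ p) * float(s**2 - t @ t))
    return 0.0 if denom == 0 else float(c * s - t @ p) / denom


# Toy probabilities: 3 estimators x 4 tweets x 3 classes (illustrative only).
probas = np.array([
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7], [0.4, 0.35, 0.25]],
    [[0.5, 0.4, 0.1], [0.3, 0.4, 0.3], [0.2, 0.2, 0.6], [0.3, 0.45, 0.25]],
    [[0.2, 0.7, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.1, 0.6, 0.3]],
])
y_true = np.array([0, 1, 2, 1])

hard = hard_vote(probas)   # majority of per-estimator labels
soft = soft_vote(probas)   # argmax of averaged probabilities
print("hard:", hard, "MCC:", mcc(y_true, hard, 3))
print("soft:", soft, "MCC:", mcc(y_true, soft, 3))
```

On the first toy tweet the two schemes disagree: two of three estimators prefer
class 0, but the third is confident enough in class 1 to tip the averaged
probabilities. This is the kind of case in which soft and hard voting can give
different MCC values.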


REFERENCES
 [1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng
     Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean,
     Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp,
     Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz
     Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat
     Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster,
     Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul
     Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol
     Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu,
     and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning
     on Heterogeneous Systems. (2015). https://www.tensorflow.org/
     Software available from tensorflow.org.
 [2] Christos Baziotis, Nikos Pelekis, and Christos Doulkeridis. 2017.
     DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for
     Message-level and Topic-based Sentiment Analysis. In Proceedings of the
     11th International Workshop on Semantic Evaluation (SemEval-2017).
     Association for Computational Linguistics, Vancouver, Canada, 747–754.
 [3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
     2018. BERT: Pre-training of deep bidirectional transformers for language
     understanding. arXiv preprint arXiv:1810.04805 (2018).
 [4] Dharti Dhami. 2020. Understanding BERT — Word Embeddings. (2020).
     https://medium.com/@dhartidhami/understanding-bert-word-embeddings-7dc4d2ea54ca
 [5] Tim Head, Manoj Kumar, Holger Nahrstaedt, Gilles Louppe, and
     Iaroslav Shcherbatyi. 2021. scikit-optimize/scikit-optimize. (Oct. 2021).
     https://doi.org/10.5281/zenodo.5565057
 [6] Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay
     Regularization. (2019). arXiv:cs.LG/1711.05101
 [7] Martin Müller, Marcel Salathé, and Per E Kummervold. 2020.
     COVID-Twitter-BERT: A Natural Language Processing Model to Analyse
     COVID-19 Content on Twitter. (2020). arXiv:cs.CL/2005.07503
 [8] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O.
     Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas,
     A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay.
     2011. Scikit-learn: Machine Learning in Python. Journal of Machine
     Learning Research 12 (2011), 2825–2830.
 [9] Nguyen Manh Duc Tuan and Pham Quang Nhat Minh. 2020. FakeNews
     Detection Using Pre-trained Language Models and Graph Convolutional
     Networks. (2020).
[10] Jason Wei and Kai Zou. 2019. EDA: Easy Data Augmentation Techniques
     for Boosting Performance on Text Classification Tasks. (2019).
     arXiv:cs.CL/1901.11196
[11] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond,
     Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi
     Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von
     Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le
     Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander
     M. Rush. 2020. Transformers: State-of-the-Art Natural Language
     Processing. In Proceedings of the 2020 Conference on Empirical
     Methods in Natural Language Processing: System Demonstrations.
     Association for Computational Linguistics, Online, 38–45.
     https://www.aclweb.org/anthology/2020.emnlp-demos.6
[12] Xinyi Zhou and Reza Zafarani. 2020. A Survey of Fake News: Fundamental
     Theories, Detection Methods, and Opportunities. ACM Comput.
     Surv. 53, 5, Article 109 (sep 2020), 40 pages.
     https://doi.org/10.1145/3395046