<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>HCMUS at MediaEval2021: Content-Based Misinformation Detection Using Contextualized Word Embedding from BERT</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tuan-Luc Huynh</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nhat-Khang Ngo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Phu-Van Nguyen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thien-Tri Cao</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thanh-Danh Le</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hai-Dang Nguyen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Minh-Triet Tran</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>John von Neumann Institute</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Science</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Vietnam National University</institution>
          ,
          <addr-line>Ho Chi Minh city</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>The FakeNews task in MediaEval2021 explores the challenge of building accurate and high-performance algorithms. Despite the dominance of deep learning approaches in fake news detection, in this paper we propose different approaches that leverage pretrained BERT-family transformers for extracting word embeddings. Our experiments show that averaging ensemble methods using machine learning classifiers as estimators can achieve up to 0.6478 Matthews Correlation Coefficient (MCC) on the run submission's test set.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        In the context of social media, where information is no longer
trustworthy, the MediaEval2021 FakeNews: Corona Virus and Conspiracies
Multimedia Analysis task calls on participants to tackle the problem of
misinformation disseminated during the long-lasting COVID-19
crisis. The first subtask, Text-Based Misinformation
Detection, is based on tweets from Twitter. The mission is to
classify tweets into categories such as "promote/support", "discuss",
or "not related" with respect to COVID-19-related fake news and
conspiracy theories. In this paper, we follow the content-based news
detection approach. We experiment with both machine learning and
deep learning approaches and propose ensemble models of different
scikit-learn machine learning algorithms[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We also propose some
features that are exceptionally effective on these classifiers.
      </p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        Fake news detection and classification are no longer new problems;
nevertheless, more accurate and efficient models are needed every
year because the task remains surprisingly challenging. Zhou et al.[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
divide fake news into four categories (knowledge-based, style-based,
propagation-based, and source-based) and use Bag-of-Words to obtain
the frequency of lexicons for classifying fake news. In general, there
are two different approaches to this problem: news content-based
learning and social context-based learning. Our work is greatly
inspired by the previous attempt of Tuan et al.[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>APPROACH</title>
    </sec>
    <sec id="sec-4">
      <title>Preprocess</title>
      <p>
        We follow a conventional text preprocessing pipeline. Additionally, we
expand contractions; expand internet slang that is popular
in tweets; extract URL domains; and convert emojis and emoticons
to text. Finally, the Ekphrasis[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] library segments words written
without spaces and corrects misspellings and typos. For data
augmentation, we follow the augmentation method used by Tuan et al.[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
in their work: EDA[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Our data consist of around 1500 sentences.
Therefore, according to the paper of Wei et al.[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], we decide to
use the following parameters for data augmentation: "--num_aug=8
--alpha_sr=0.05 --alpha_rd=0.1 --alpha_ri=0.1 --alpha_rs=0.1".
      </p>
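      <p>
        The steps above can be sketched in Python. This is a minimal illustration only: the lookup tables and helper names below are hypothetical stand-ins, since the real pipeline relies on full contraction/slang dictionaries and the Ekphrasis library for segmentation and spell correction.

```python
import re

# Hypothetical, tiny lookup tables standing in for the full dictionaries.
CONTRACTIONS = {"it's": "it is", "can't": "cannot", "don't": "do not"}
SLANG = {"lol": "laughing out loud", "imo": "in my opinion"}

def extract_url_domains(text):
    """Replace each URL with its bare domain, as a stand-in for URL-domain extraction."""
    return re.sub(r"https?://(?:www\.)?([^/\s]+)\S*", r"\1", text)

def preprocess(tweet):
    """Lowercase, strip URLs to domains, then expand contractions and slang token by token."""
    text = extract_url_domains(tweet.lower())
    tokens = [CONTRACTIONS.get(t, SLANG.get(t, t)) for t in text.split()]
    return " ".join(tokens)

print(preprocess("It's a hoax lol https://example.com/covid"))
# -> "it is a hoax laughing out loud example.com"
```
      </p>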
    </sec>
    <sec id="sec-5">
      <title>Features</title>
      <p>
        Inspired by the work of Tuan et al.[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], we decide to use the
COVID-Twitter-BERT-v2 pretrained model provided by Müller et al.[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] for
extracting word embeddings. The preprocessed data are fed into
the BERT tokenizer[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] with "max_length" set to 64, an
approximation of the mean tweet length. The output of
the tokenizer is then fed directly to the pretrained model to obtain
all the hidden states. We process the hidden states into 5 different
features, inspired by Dharti Dhami's article[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]: concatenation of the last 4
hidden states (Concat), last hidden state (LHS), sum of the last 4 hidden
states (Sum), mean of the last 4 hidden states (Mean), and sentence
embedding (Sentence). All the mentioned features are self-explanatory
except "Sentence", which is the mean over all 64 tokens of
the "LHS" feature.
      </p>
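      <p>
        The five features can be sketched as simple tensor reductions. The block below is an illustration only: it uses a random NumPy array in place of the real COVID-Twitter-BERT-v2 outputs, assuming 13 hidden-state layers (embeddings plus 12 transformer layers), max_length 64, and hidden dimension 1024.

```python
import numpy as np

# Dummy stand-in for the model's hidden states: (num_layers, seq_len, hidden_dim).
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(13, 64, 1024))

last4 = hidden_states[-4:]                  # the 4 last hidden states
concat = np.concatenate(last4, axis=-1)     # "Concat": (64, 4096)
lhs = hidden_states[-1]                     # "LHS":    (64, 1024)
summed = last4.sum(axis=0)                  # "Sum":    (64, 1024)
mean4 = last4.mean(axis=0)                  # "Mean":   (64, 1024)
sentence = lhs.mean(axis=0)                 # "Sentence": mean of all 64 tokens of LHS -> (1024,)

print(concat.shape, sentence.shape)
```
      </p>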
    </sec>
    <sec id="sec-6">
      <title>Models</title>
      <p>
        We use two models: a dense model and a dense model with a
convolutional layer[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], as illustrated in Figure 1, for the deep learning
approach. The dense model shares the same structure as the
convolutional one, except that it lacks the convolutional layer[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Moreover, the input can be any of the features described above. For
machine learning, we try applying the "Sentence", "LHS", and "Mean"
features on different classifiers provided by scikit-learn[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
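      <p>
        The machine learning side can be sketched with scikit-learn directly. The block below is a minimal illustration, not the tuned setup from the paper: it fits two of the classifier families we tried on random vectors standing in for the 1024-dimensional "Sentence" feature, with three labels mirroring the task's classes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic stand-in for the "Sentence" feature: 120 tweets, 1024-d embeddings, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 1024))
y = rng.integers(0, 3, size=120)

# Two representative scikit-learn classifiers; probability=True enables soft voting later.
classifiers = {
    "svc": SVC(probability=True),
    "logreg": LogisticRegression(max_iter=1000),
}
for name, clf in classifiers.items():
    clf.fit(X, y)
    print(name, clf.score(X, y))
```
      </p>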
    </sec>
    <sec id="sec-7">
      <title>EXPERIMENTS</title>
      <p>
        In the deep learning approach, we set the learning rate for both
dense models to 1e-04. The optimizer for both models is AdamW[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
The train/test ratio is 8:2. Since the dataset is small, we also try
applying EDA[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and evaluate with the same method as the
non-augmented attempt. The results for the two attempts are illustrated
in Figure 2 and Figure 3, respectively.
      </p>
      <p>As for the machine learning approach, the detailed results on the
test set are illustrated in Table 1. Most of the classifiers obtain an MCC
greater than 0.5, which is better than some deep learning models.
Some classifiers even reach an MCC of 0.63, which is as good
as the dense model using the augmented "LHS" feature. In the end,
this is the main reason why we decide to move from the deep learning
approach to the machine learning approach.</p>
      <p>
        We perform 5-fold stratified cross-validation to ensure the
classifiers work well on new data. Table 2 shows the cross_val_score
results using MCC as the metric. After cross-validation, some
classifiers still retain a high MCC (e.g., SVC using the "Sentence"
feature). Based on this cross-validation result, we choose
SVC and Logistic Regression using the "Sentence" feature, and
classifiers with an MCC greater than 0.5 using the "Mean" feature, as
potential classifiers for later BayesSearchCV[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] fine-tuning. Finally,
all fine-tuned classifiers serve as estimators for the voting classifiers for
better generalization. We use both soft and hard voting ensemble
methods.
      </p>
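      <p>
        The evaluation loop above can be sketched as follows: MCC-scored 5-fold stratified cross-validation of a soft-voting ensemble. This is an illustration under assumptions, with well-separated synthetic clusters standing in for the real BERT features and only two untuned estimators standing in for the BayesSearchCV-tuned set.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic 3-class data: three well-separated Gaussian clusters in 16 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(40, 16)) for c in (-2.0, 0.0, 2.0)])
y = np.repeat([0, 1, 2], 40)

# Soft-voting ensemble over two estimators; probability=True is required for soft voting.
voter = VotingClassifier(
    estimators=[("svc", SVC(probability=True)), ("logreg", LogisticRegression(max_iter=1000))],
    voting="soft",
)

# 5-fold stratified cross-validation scored with MCC, as in the paper's evaluation.
mcc = make_scorer(matthews_corrcoef)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(voter, X, y, cv=cv, scoring=mcc)
print(scores.mean())
```
      </p>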
    </sec>
    <sec id="sec-8">
      <title>RESULTS AND ANALYSIS</title>
      <p>All the runs have an MCC greater than 0.6, as illustrated in Table
3. The stand-alone SVC classifier in run 5 is the best classifier in
terms of performance; however, we recommend using the ensemble
classifiers for better generalization. Classifiers using the "Sentence"
feature obtain competitive results in comparison with classifiers
using the "Mean" feature.</p>
    </sec>
    <sec id="sec-9">
      <title>CONCLUSION</title>
      <p>
        We propose using different features derived from BERT's[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] hidden
states to train lightweight, high-performance machine
learning classifiers for this text classification task. Our approach achieves
an average score of over 0.6 MCC in the run submission without using
any augmentation or extra information. In future work, we will
experiment more thoroughly with deep learning models.
      </p>
    </sec>
    <sec id="sec-10">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was funded by Gia Lam Urban Development and
Investment Company Limited, Vingroup, and supported by the Vingroup
Innovation Foundation (VINIF) under project code VINIF.2019.DA19.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Martín</given-names>
            <surname>Abadi</surname>
          </string-name>
          , Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis,
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Matthieu</given-names>
            <surname>Devin</surname>
          </string-name>
          , Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke,
          <string-name>
            <given-names>Yuan</given-names>
            <surname>Yu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xiaoqiang</given-names>
            <surname>Zheng</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <source>TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems</source>
          . (
          <year>2015</year>
          ). https://www.tensorflow.org/ Software available from tensorflow.org.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Christos</given-names>
            <surname>Baziotis</surname>
          </string-name>
          , Nikos Pelekis, and
          <string-name>
            <given-names>Christos</given-names>
            <surname>Doulkeridis</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Messagelevel and Topic-based Sentiment Analysis</article-title>
          .
          <source>Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)</source>
          . Association for Computational Linguistics
          , Vancouver, Canada,
          <fpage>747</fpage>
          -
          <lpage>754</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          . arXiv preprint arXiv:1810.04805 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Dharti</given-names>
            <surname>Dhami</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Understanding BERT - Word Embeddings</article-title>
          . (
          <year>2020</year>
          ). https://medium.com/@dhartidhami/understanding-bert-word-embeddings-7dc4d2ea54ca
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Tim</given-names>
            <surname>Head</surname>
          </string-name>
          , Manoj Kumar, Holger Nahrstaedt, Gilles Louppe, and
          <string-name>
            <given-names>Iaroslav</given-names>
            <surname>Shcherbatyi</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>scikit-optimize/scikit-optimize</article-title>
          . (Oct.
          <year>2021</year>
          ). https://doi.org/10.5281/zenodo.5565057
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          and
          <string-name>
            <given-names>Frank</given-names>
            <surname>Hutter</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Decoupled Weight Decay Regularization</article-title>
          . (
          <year>2019</year>
          ).
          <source>arXiv:cs.LG/1711.05101</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Müller</surname>
          </string-name>
          , Marcel Salathé, and
          <string-name>
            <given-names>Per E</given-names>
            <surname>Kummervold</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter</article-title>
          . (
          <year>2020</year>
          ). arXiv:cs.CL/2005.07503
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Duchesnay</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Scikit-learn: Machine Learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          ),
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Nguyen</surname>
            <given-names>Manh Duc Tuan</given-names>
          </string-name>
          and
          <string-name>
            <surname>Pham</surname>
            <given-names>Quang Nhat Minh</given-names>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>FakeNews Detection Using Pre-trained Language Models and Graph Convolutional Networks</article-title>
          . (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Jason</given-names>
            <surname>Wei</surname>
          </string-name>
          and
          <string-name>
            <given-names>Kai</given-names>
            <surname>Zou</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks</article-title>
          . (
          <year>2019</year>
          ).
          arXiv:cs.CL/1901.11196
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Wolf</surname>
          </string-name>
          , Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and
          <string-name>
            <given-names>Alexander M.</given-names>
            <surname>Rush</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Transformers: State-of-the-Art Natural Language Processing</article-title>
          .
          <source>In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics</source>
          , Online,
          <fpage>38</fpage>
          -
          <lpage>45</lpage>
          . https://www.aclweb.org/anthology/2020.emnlp-demos.6
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Xinyi</given-names>
            <surname>Zhou</surname>
          </string-name>
          and
          <string-name>
            <given-names>Reza</given-names>
            <surname>Zafarani</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>A Survey of Fake News: Fundamental Theories, Detection Methods, and Opportunities</article-title>
          .
          <source>ACM Comput. Surv</source>
          .
          <volume>53</volume>
          ,
          <issue>5</issue>
          , Article 109 (sep
          <year>2020</year>
          ),
          <volume>40</volume>
          pages. https://doi.org/10.1145/3395046
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>