<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>HCMUS MediaEval 2021: Multi-Model Decision Method Applied on Data Augmentation for COVID-19 Conspiracy Theories Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tuan-An To</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nham-Tan Nguyen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dinh-Khoi Vo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nhat-Quynh Le-Pham</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hai-Dang Nguyen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Minh-Triet Tran</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>John von Neumann Institute</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Science</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Vietnam National University</institution>
          ,
          <addr-line>Ho Chi Minh city</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>The Corona Virus and Conspiracies Multimedia Analysis Task is a task in the MediaEval 2021 challenge that concentrates on conspiracy theories assuming some kind of nefarious action related to COVID-19. Our HCMUS team applies different approaches based on multiple pretrained models and several techniques to address the two subtasks. Based on our experiments, we submit five runs for subtask 1 and one run for subtask 2. Runs 1 and 2 both use a pretrained BERT [5] model; the difference between them is that in the first run we add a sentiment analysis step to extract semantic features before training. In runs 3 and 4, we propose a Naive Bayes classifier [4] and an LSTM [8] model to diversify our methods. Run 5 utilizes an ensemble of machine learning and deep learning models, a multimodal approach for text-based analysis [3]. Finally, in the only run for subtask 2, we apply a simple Naive Bayes algorithm to classify the theories. In the final results, our method achieves 0.5987 in subtask 1 and 0.3136 in subtask 2.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>The COVID-19 pandemic has severely affected people worldwide, and consequently it has dominated world news for months. It has therefore been the topic of a massive amount of misinformation, most likely amplified by the fact that many details about the virus were unknown at the start of the pandemic. In the Multimedia Evaluation Challenge 2021 (MediaEval 2021), the purpose of the Corona Virus and Conspiracies Multimedia Analysis Task is to develop methods capable of detecting such misinformation. In this way, the task helps prevent the spread of misinformation that causes social anxiety and vaccination doubts. We propose different methods, mainly based on deep learning models, to solve the problem from various aspects, as described in the following sections.</p>
    </sec>
    <sec id="sec-3">
      <title>DATASET AND RELATED WORK</title>
      <p>For subtask 1, we received two datasets in total:
* dev1-task1.csv: an unbalanced dataset of 500 tweets, with (Non-conspiracy, Discuss, Promote) counts of (340, 76, 84)
* dev-task1.csv: an unbalanced dataset of 1011 tweets, with (Non-conspiracy, Discuss, Promote) counts of (414, 186, 411)</p>
    </sec>
    <sec id="sec-4">
      <title>Data Centric Approach</title>
      <p>Text-based misinformation detection has objectives similar to the text classification task. Hence, we take advantage of pretrained NLP models and fine-tune them for this task. However, the validation result is biased towards the non-conspiracy class, since the given dataset is small and unbalanced. Therefore, we adapt those models to generate new data: we crawl tweets from Twitter and assign to each tweet the label that receives the most votes from the models, which increases both the effectiveness and the balance of the dataset.</p>
    </sec>
    <sec id="sec-5">
      <title>METHOD</title>
    </sec>
    <sec id="sec-6">
      <title>Data Processing</title>
      <p>From the raw training data, we need to preprocess the tweets to make them easier for our model to learn. The first step is to replace words written in short form with their original form, e.g., I’m to I am. The second step is to remove stopwords (about, above, ...) and to lemmatize word families (roofing, roofers, ...) into their root form (roof). Finally, we also remove other meaningless features in the tweets, such as "https" fragments, "(amp)" artifacts, and emoji, to obtain clean tweets for training.</p>
    </sec>
    <sec id="sec-6a">
      <title>Run 01 - Subtask1</title>
      <p>First, we use a sentiment analysis method to categorize all tweets into two classes, optimism and anger. Based on our observation, non-conspiracy tweets show a higher rate of optimism, while discuss/promote tweets dominate the anger rate. Therefore, we pick out the tweets in the test set with an optimism rate greater than 0.8 and an anger rate less than 0.2 and directly label them as non-conspiracy. The remaining tweets are predicted by BERT [<xref ref-type="bibr" rid="ref1">1</xref>], a pretrained deep bidirectional Transformer, trained for 20 epochs with batch size 4 and Adam optimization.</p>
    </sec>
    <sec id="sec-6b">
      <title>Run 02 - Subtask1</title>
      <p>Different from Run 01, we keep the stopwords and only clean the tweets and replace all shortened terms with their fully written forms, in order to retain the original sentence structure for the best performance of the Transformer model [<xref ref-type="bibr" rid="ref9">9</xref>]. The training process is still conducted on the pretrained BERT model with the augmented dataset (batch size 16, 10 epochs).</p>
    </sec>
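The cleaning steps described in the Data Processing section can be sketched as follows. This is a minimal illustration, not the authors' exact pipeline: the clean_tweet helper is hypothetical, and CONTRACTIONS and STOPWORDS are tiny toy subsets of the real lists.

```python
import re

# Toy lookup tables (the paper's actual contraction and stopword lists
# would be much larger).
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "can't": "can not"}
STOPWORDS = {"about", "above", "the", "a", "an", "is", "are"}

def clean_tweet(text):
    """Expand contractions, strip URLs and '(amp)' artifacts, drop stopwords."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"https?://\S+", "", text)   # remove links
    text = text.replace("(amp)", " ")          # remove HTML-entity residue
    tokens = [t for t in re.findall(r"[a-z]+", text) if t not in STOPWORDS]
    return " ".join(tokens)

print(clean_tweet("I'm worried about https://t.co/xyz (amp) the vaccine"))
# i am worried vaccine
```

Lemmatization (roofing, roofers → roof) is omitted here; in practice it would typically come from an NLP library rather than hand-written rules.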
    <sec id="sec-7">
      <title>Run 03 - Subtask1</title>
      <p>We use a TF-IDF vectorizer to extract features from the text. After trying both Logistic Regression and Naive Bayes [<xref ref-type="bibr" rid="ref4">4</xref>], the latter algorithm performs better. The result of this run is our baseline score.</p>
    </sec>
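A Run 03-style pipeline can be sketched with scikit-learn as below. The corpus and labels are illustrative toys, not the challenge data, and this is a sketch of the general technique rather than the authors' exact configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the tweet dataset.
texts = [
    "vaccine is safe",
    "5g towers spread the virus",
    "masks help a lot",
    "the virus is a hoax",
]
labels = ["non-conspiracy", "promote", "non-conspiracy", "promote"]

# TF-IDF features fed into a multinomial Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

pred = model.predict(["the virus is spread by 5g"])[0]
print(pred)
```

On real data the same pipeline would be fit on the training tweets and evaluated on the held-out set; swapping MultinomialNB for LogisticRegression reproduces the comparison the run describes.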
    <sec id="sec-8">
      <title>Run 04 - Subtask1</title>
      <p>We use pretrained GloVe embeddings to transform each word in the sentence into an array of 300 numbers representing the "meaning" of the word. Finally, we build a 2D LSTM [<xref ref-type="bibr" rid="ref8">8</xref>] model with Adam optimization (batch size 64, 4 epochs).</p>
    </sec>
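The embedding step feeding the LSTM can be illustrated as below. The vectors here are toy 4-dimensional stand-ins for the real 300-dimensional GloVe embeddings, the encode helper is hypothetical, and the LSTM itself is not shown.

```python
import numpy as np

# Toy GloVe-style lookup table (real GloVe vectors have 300 dimensions).
glove = {
    "virus":   np.array([0.1, 0.2, 0.3, 0.4]),
    "vaccine": np.array([0.5, 0.1, 0.0, 0.2]),
}
DIM = 4
MAX_LEN = 5

def encode(tokens):
    """Map tokens to vectors (zeros for out-of-vocabulary words),
    then pad/truncate to a fixed length so sequences can be batched."""
    vecs = [glove.get(t, np.zeros(DIM)) for t in tokens[:MAX_LEN]]
    while len(vecs) != MAX_LEN:
        vecs.append(np.zeros(DIM))
    return np.stack(vecs)

x = encode(["virus", "vaccine", "hoax"])
print(x.shape)  # (5, 4) -- one padded sequence of word vectors
```

Each tweet becomes a fixed-size (sequence length, embedding dim) matrix, which is the input shape an LSTM layer expects.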
    <sec id="sec-9">
      <title>Run 05 - Subtask1</title>
      <p>We combine the results of all the deep learning and machine learning models and label each tweet with its highest-voted class [<xref ref-type="bibr" rid="ref3">3</xref>]. In the rare cases where a tweet gets equal votes for different classes, we label it with the result given by the BERT model.</p>
    </sec>
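The majority vote with a BERT tiebreak can be sketched in a few lines. The vote helper and the label strings are illustrative; only the voting rule itself comes from the run description.

```python
from collections import Counter

def vote(predictions, bert_label):
    """Return the majority class among the models' predictions;
    fall back to the BERT model's label when the top classes tie."""
    top = Counter(predictions).most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        return bert_label  # tie: defer to BERT
    return top[0][0]

print(vote(["promote", "discuss", "promote"], "discuss"))  # promote
print(vote(["promote", "discuss"], "discuss"))             # discuss (tie)
```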
    <sec id="sec-10">
      <title>Run 01 - Subtask2</title>
      <p>Similar to the method in Run 3, we use a TF-IDF vectorizer to extract features from the text. Based on our observation, the test set is extremely unbalanced with regard to the multi-label problem, so we try to resolve this by downsampling the data, keeping only the dominant sentences in the biased class. To handle the multi-label problem, we utilize three different methods: Binary Relevance [<xref ref-type="bibr" rid="ref7">7</xref>], Classifier Chain [<xref ref-type="bibr" rid="ref6">6</xref>], and Label Powerset [<xref ref-type="bibr" rid="ref2">2</xref>], combined with Naive Bayes and Logistic Regression. According to our experiments, Binary Relevance with Logistic Regression gives the best result.</p>
    </sec>
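Binary Relevance trains one independent binary classifier per label. A minimal scikit-learn sketch of the winning combination (Binary Relevance with Logistic Regression) is shown below; the texts, the two conspiracy labels, and the indicator matrix are toy examples, not the challenge data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy corpus with two hypothetical conspiracy labels.
texts = [
    "5g spreads the virus",
    "bill gates made the virus",
    "5g and gates conspiracy",
    "vaccines are fine",
]
# Multi-label indicator matrix; columns: [5g, gates].
y = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])

# OneVsRestClassifier fits one LogisticRegression per column of y,
# which is exactly the Binary Relevance decomposition.
X = TfidfVectorizer().fit_transform(texts)
clf = OneVsRestClassifier(LogisticRegression()).fit(X, y)
print(clf.predict(X).shape)  # (4, 2): one binary decision per label
```

Classifier Chain and Label Powerset differ only in the decomposition: chains feed earlier label predictions into later classifiers, while Label Powerset treats each observed label combination as a single class.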
    <sec id="sec-11">
      <title>EXPERIMENTS AND RESULTS</title>
      <p>In the final evaluation, our best submissions achieve 0.5987 in subtask 1 and 0.3136 in subtask 2.</p>
    </sec>
    <sec id="sec-12">
      <title>CONCLUSION AND FUTURE WORKS</title>
      <p>In summary, we identify the challenges of the dataset and propose different approaches to address them. We conclude that classifying whether a tweet promotes/supports or discusses a conspiracy is heavily biased towards the writer's attitude, which makes it difficult for an NLP model to learn the true label. In our current study, we can only extract basic sentiment states of a tweet, such as sadness or optimism, so we aim to tackle the challenge at a higher level in the future.</p>
    </sec>
    <sec id="sec-13">
      <title>ACKNOWLEDGMENTS</title>
      <p>This research was funded by SeLab-HCMUS and VNUHCM-University of Science.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>Stefano</given-names> <surname>Bocconi</surname></string-name>,
          <string-name><given-names>Andrey</given-names> <surname>Malakhov</surname></string-name>,
          <string-name><given-names>Alessandro</given-names> <surname>Patruno</surname></string-name>.
          <year>2020</year>.
          <article-title>Fake News Classification with BERT</article-title>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>Preeti</given-names> <surname>Gupta</surname></string-name>,
          <string-name><given-names>Tarun K.</given-names> <surname>Sharma</surname></string-name>,
          <string-name><given-names>Deepti</given-names> <surname>Mehrotra</surname></string-name>.
          <year>2018</year>.
          <article-title>Label Powerset Based Multi-label Classification for Mobile Applications</article-title>.
          In <source>Soft Computing: Theories and Applications</source>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>Mihir P.</given-names> <surname>Mehta</surname></string-name>,
          <string-name><given-names>Chahat</given-names> <surname>Raj</surname></string-name>.
          <year>2020</year>.
          <article-title>MediaEval 2020: An Ensemble-based Multimodal Approach for Coronavirus and 5G Conspiracy Tweet Detection</article-title>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>Di</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>Haiyi</given-names> <surname>Zhang</surname></string-name>.
          <year>2007</year>.
          <article-title>Naïve Bayes Text Classifier</article-title>.
          In <source>2007 IEEE International Conference on Granular Computing (GRC 2007)</source>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>Jacob</given-names> <surname>Devlin</surname></string-name>,
          <string-name><given-names>Ming-Wei</given-names> <surname>Chang</surname></string-name>,
          <string-name><given-names>Kenton</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>Kristina</given-names> <surname>Toutanova</surname></string-name>.
          <year>2019</year>.
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>Jesse</given-names> <surname>Read</surname></string-name>,
          <string-name><given-names>Bernhard</given-names> <surname>Pfahringer</surname></string-name>,
          <string-name><given-names>Geoff</given-names> <surname>Holmes</surname></string-name>,
          <string-name><given-names>Eibe</given-names> <surname>Frank</surname></string-name>.
          <article-title>Classifier Chains for Multi-label Classification</article-title>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>Min-Ling</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>Yu-Kun</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>Xu-Ying</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>Xin</given-names> <surname>Geng</surname></string-name>.
          <year>2018</year>.
          <article-title>Binary relevance for multi-label learning: an overview</article-title>.
          In <source>Frontiers of Computer Science</source>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>Seyed Mahdi</given-names> <surname>Rezaeinia</surname></string-name>,
          <string-name><given-names>Ali</given-names> <surname>Ghodsi</surname></string-name>,
          <string-name><given-names>Rouhollah</given-names> <surname>Rahmani</surname></string-name>.
          <year>2017</year>.
          <article-title>Improving the Accuracy of Pre-trained Word Embeddings for Sentiment Analysis</article-title>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>Thomas</given-names> <surname>Wolf</surname></string-name>,
          <string-name><given-names>Lysandre</given-names> <surname>Debut</surname></string-name>,
          <string-name><given-names>Victor</given-names> <surname>Sanh</surname></string-name>,
          <string-name><given-names>Julien</given-names> <surname>Chaumond</surname></string-name>,
          <string-name><given-names>Clement</given-names> <surname>Delangue</surname></string-name>,
          <string-name><given-names>Anthony</given-names> <surname>Moi</surname></string-name>,
          <string-name><given-names>Pierric</given-names> <surname>Cistac</surname></string-name>,
          <string-name><given-names>Tim</given-names> <surname>Rault</surname></string-name>,
          <string-name><given-names>Rémi</given-names> <surname>Louf</surname></string-name>,
          <string-name><given-names>Morgan</given-names> <surname>Funtowicz</surname></string-name>,
          <string-name><given-names>Joe</given-names> <surname>Davison</surname></string-name>,
          <string-name><given-names>Sam</given-names> <surname>Shleifer</surname></string-name>,
          <string-name><given-names>Patrick</given-names> <surname>von Platen</surname></string-name>,
          <string-name><given-names>Clara</given-names> <surname>Ma</surname></string-name>,
          <string-name><given-names>Yacine</given-names> <surname>Jernite</surname></string-name>,
          <string-name><given-names>Julien</given-names> <surname>Plu</surname></string-name>,
          <string-name><given-names>Canwen</given-names> <surname>Xu</surname></string-name>,
          <string-name><given-names>Teven</given-names> <surname>Le Scao</surname></string-name>,
          <string-name><given-names>Sylvain</given-names> <surname>Gugger</surname></string-name>,
          <string-name><given-names>Mariama</given-names> <surname>Drame</surname></string-name>,
          <string-name><given-names>Quentin</given-names> <surname>Lhoest</surname></string-name>,
          <string-name><given-names>Alexander M.</given-names> <surname>Rush</surname></string-name>.
          <year>2020</year>.
          <article-title>HuggingFace's Transformers: State-of-the-art Natural Language Processing</article-title>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>