<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Detecting COVID-19 Conspiracy Theories with Transformers and TF-IDF</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Haoming Guo</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tianyi Huang</string-name>
          <email>tianyihuang@berkeley.edu</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Huixuan Huang</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mingyue Fan</string-name>
          <email>migofan@berkeley.edu</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gerald Friedland</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>The sharing of fake news and conspiracy theories on social media has widespread negative effects. By designing and applying different machine learning models, researchers have made progress in detecting fake news from text. However, existing research places a heavy emphasis on general, common-sense fake news, while in reality fake news often involves rapidly changing topics and domain-specific vocabulary. In this paper, we present our methods and results for three fake news detection tasks at the MediaEval 2021 benchmark that specifically involve COVID-19 related topics. We experiment with a group of text-based models including Support Vector Machines, Random Forest, BERT, and RoBERTa. We find that a pre-trained transformer yields the best validation results, but a randomly initialized transformer with smart design can also be trained to reach accuracies close to that of the pre-trained transformer.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        This paper presents several methods for detecting online conspiracy
theories in tweets. The task overview papers [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] describe
the dataset in depth and explain how it was constructed. In the
sections that follow, we describe our methods, including support
vector machines, random forest, and transformers, and present the
performance of these models on the provided dataset.
      </p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        Methods for detecting text stance and fake news draw on a variety
of signals, including false knowledge, writing style, propagation
patterns, and source credibility [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Most purely textual methods focus on writing style, as the
information is embedded entirely in the textual data.
      </p>
      <p>
        FNC-1, a similar benchmark for fake news and stance detection
from text, has received much attention from researchers [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
A Multi-Layer Perceptron over handcrafted features has proved to
perform well on this task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. More recently, the transformer architecture was shown to exceed
previous results on a wide range of natural language tasks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Furthermore, Slovikovskaya showed that fine-tuning pre-trained
transformers achieves state-of-the-art results on the FNC-1 benchmark [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and BERT has been shown to be top-performing on other fake news
detection tasks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>APPROACH</title>
      <p>In this section, we describe the methodologies behind our feature
design, model choice, and model training.</p>
      <sec id="sec-3-1">
        <title>TF-IDF</title>
        <p>We used Term Frequency-Inverse Document Frequency (TF-IDF)
features to build classifiers from the plain tweet texts. TF-IDF is
useful for finding important and distinctive words or phrases in a
text. We used the TF-IDF vectorizer from the scikit-learn framework
to weight different words: after removing all punctuation and stop
words, we applied the vectorizer to preprocess the training data. We
then created several classifiers based on the scikit-learn framework,
including a naïve Bayes classifier, a decision tree classifier, a
Random Forest (RF) classifier, and a Support Vector Machine (SVM)
classifier. A quick look at the training dataset showed that it is
imbalanced, since one category contained far fewer examples than the
others. To overcome this, we applied class weights to balance the
dataset. During our experiments, we observed that the RF and SVM
classifiers perform much better and more efficiently than the other
methods, so we chose them for further prediction. Furthermore, we
optimized the RF classifier by adjusting the number of estimators to
obtain more accurate predictions.</p>
      </sec>
      <sec id="sec-3-2">
        <title>BERT</title>
        <p>Given the bidirectional architecture of Bidirectional Encoder
Representations from Transformers (BERT) and its Masked Language
Model (MLM) pre-training, BERT can model the meaning of sentences in
complex contexts. For our purpose of analyzing tweet content and
detecting conspiracy features, BERT is thus a practical language
model to leverage for fake news detection. To assess its performance
on the provided datasets, we experimented with BERT both with and
without pre-trained weights. We split the provided dataset into an
80% training set and a 20% validation set, tokenized the tweet
contents with the BERT tokenizer, and used the label columns as
targets. We then trained the BERT model together with two fully
connected hidden layers that map the BERT hidden outputs to
prediction probabilities, and optimized the model with a
cross-entropy loss.</p>
      </sec>
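      <p>The TF-IDF pipeline described above can be sketched with scikit-learn as follows. This is a minimal illustration, not our exact code: the toy tweets, labels, and parameter values other than the 150 estimators are stand-ins rather than the MediaEval data or our final configuration.</p>
      <preformat>
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy tweets and labels standing in for the MediaEval data (illustrative).
tweets = [
    "5g towers spread the virus", "vaccines contain microchips",
    "wash your hands and wear a mask", "hospitals report rising cases",
    "the outbreak was planned years ago", "new study on vaccine efficacy",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = conspiracy, 0 = other

# TF-IDF features (stop words removed) feeding a class-weighted SVM;
# class_weight="balanced" compensates for the imbalanced categories.
svm = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LinearSVC(class_weight="balanced"),
).fit(tweets, labels)

# Random Forest with the number of estimators as the tuned parameter.
rf = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    RandomForestClassifier(n_estimators=150, class_weight="balanced"),
).fit(tweets, labels)

print(svm.predict(["the virus was planned"]))
```
      </preformat>
      <p>Setting class_weight="balanced" rescales each class's contribution to the loss inversely to its frequency, which corresponds to the class-weighting fix described above.</p>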
    </sec>
    <sec id="sec-4">
      <title>Ensemble</title>
      <p>In this section, we present two different ensemble methods.</p>
      <p>The first method is a multi-layer ensemble. On top of the fully
connected neural network classifier, we use the outputs of different
layers of RoBERTa to obtain the classification result. Since
different layers of a neural network capture different levels of
syntactic and semantic information, to adapt RoBERTa to a specific
downstream task we also draw on lower layers, which contain more
general information. We concatenate multiple layers' outputs into a
joint representation and obtain the experimental results by assigning
weights to the contributions of the different layers.</p>
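      <p>A minimal sketch of this weighted multi-layer combination follows, assuming the per-layer [CLS] embeddings of RoBERTa have already been extracted (e.g. via output_hidden_states=True); the random tensors, the layer weights, and the three output classes are illustrative assumptions, while the layer indices follow our experiments:</p>
      <preformat>
```python
import torch
import torch.nn as nn

hidden = 768              # RoBERTa-base hidden size
layers = [8, 10, 11, 12]  # selected lower layers plus the final layer

# Stand-ins for per-layer [CLS] embeddings (batch of 4 tweets); in
# practice these come from the RoBERTa forward pass.
cls_per_layer = {i: torch.randn(4, hidden) for i in layers}

# Weight each layer's contribution, then concatenate into a joint
# representation for the classifier (weights here are illustrative).
weights = {8: 0.2, 10: 0.3, 11: 0.5, 12: 1.0}
joint = torch.cat([weights[i] * cls_per_layer[i] for i in layers], dim=-1)

classifier = nn.Linear(hidden * len(layers), 3)  # 3 classes (assumed)
logits = classifier(joint)
print(logits.shape)  # torch.Size([4, 3])
```
      </preformat>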
      <p>The second method is a multi-model ensemble. By combining
multiple pre-trained models, the width of the feature matrix is
doubled, and a mixed output is obtained that jointly predicts the
class of the data. This method effectively improves the robustness of
the model. Specifically, we selected the pre-trained models RoBERTa
and BERT, which have the same model depth, and concatenated their
output embeddings into one feature matrix. When the input data passes
through the pipeline, we obtain the joint classification of RoBERTa
and BERT, which can compensate for errors made by either model alone,
yielding more accurate and stable results.</p>
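      <p>The joint classification head can be sketched as follows, with random tensors standing in for the two encoders' pooled outputs; the three output classes are an assumption, while the 768-to-1536 concatenation follows the description above:</p>
      <preformat>
```python
import torch
import torch.nn as nn

class TwoEncoderHead(nn.Module):
    """Joint classifier over concatenated BERT and RoBERTa embeddings."""

    def __init__(self, hidden=768, n_classes=3):
        super().__init__()
        # Concatenation doubles the feature width: 768 + 768 = 1536.
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, bert_emb, roberta_emb):
        joint = torch.cat([bert_emb, roberta_emb], dim=-1)
        return self.fc(joint)

# Random stand-ins for the pooled outputs of each encoder (batch of 4).
head = TwoEncoderHead()
logits = head(torch.randn(4, 768), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 3])
```
      </preformat>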
    </sec>
    <sec id="sec-5">
      <title>RESULTS AND ANALYSIS</title>
      <p>In this section, we present our models' Matthews Correlation
Coefficient (MCC) on each of the three tasks. We also provide a
thorough analysis of the experiments we conducted and the motivations
behind our choices of models and hyperparameters.</p>
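      <p>For reference, MCC can be computed with scikit-learn; the labels below are a made-up illustration, not our actual predictions:</p>
      <preformat>
```python
from sklearn.metrics import matthews_corrcoef

# MCC ranges from -1 to +1, with 0 at chance level, which makes it a
# robust summary statistic for imbalanced label distributions.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 1]
print(matthews_corrcoef(y_true, y_pred))  # 0.5
```
      </preformat>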
      <sec id="sec-5-1">
        <title>TF-IDF</title>
        <p>After training models with several different methods, including
naïve Bayes, decision tree, Random Forest (RF), and Support Vector
Machine (SVM), we found that the RF and SVM classifiers performed
much better than the other models. For the RF model, we fine-tuned
the number of trees in the forest by trying a wide range of values
and found that 150 estimators reached the peak results. The best RF
run shows a training accuracy of 63.5% for Task 1, and 92.1% and
89.6% for Task 2 and Task 3 respectively. The best SVM run shows the
same accuracy as the RF model on Task 1 and approximately 92.5% for
Task 2 and 89.9% for Task 3, performing slightly better than the RF
model. Table 1 shows that the MCC scores of the RF and SVM models for
single-label classification were much higher than for multi-label
classification, since we selected the most simple-to-implement
approaches with only simple additions to the initial preprocessed
data.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Transformers</title>
      <p>The initial results showed a training accuracy of around 40% and a
testing accuracy no greater than 35%, indicating that the model was
not generalizing across training epochs. We therefore tuned the model
by adjusting hyperparameters such as the hidden layer size, the
number of hidden layers, the learning rate, and the number of
training iterations. The best model yields approximately 61.5%
validation accuracy on the text-based dataset, approximately 91.5%
accuracy on the structure-based dataset, and 89.7% accuracy on the
combined structure- and text-based dataset, with a hidden layer size
of 516, 4 hidden layers, a learning rate of 1e-4, and 9 epochs. This
boost in validation accuracy after shrinking the layer size and
increasing the learning rate stems from the limited size of the
training set and the absence of pre-trained weights. Typically, a
model without pre-trained weights needs to be trained for more
epochs, with a smaller layer size, fewer layers, and a larger
learning rate: a larger learning rate drives faster weight updates,
and fewer, smaller layers force the model to converge more quickly.
These adjusted hyperparameters enable the model to find near-optimal
weights.</p>
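      <p>Under one possible reading of this setup, the randomly initialized model can be sketched in PyTorch as below. The hidden size of 516 and the 4 layers follow the text, while the vocabulary size, attention head count, sequence length, and three output classes are illustrative assumptions:</p>
      <preformat>
```python
import torch
import torch.nn as nn

class ScratchTransformer(nn.Module):
    """Randomly initialized transformer classifier (no pre-trained weights)."""

    def __init__(self, vocab=30522, hidden=516, layers=4, n_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))
        return self.head(h.mean(dim=1))  # mean-pool over tokens

model = ScratchTransformer()
# Learning rate 1e-4 as reported above.
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
logits = model(torch.randint(0, 30522, (2, 16)))  # batch of 2, 16 tokens
print(logits.shape)  # torch.Size([2, 3])
```
      </preformat>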
    </sec>
    <sec id="sec-7">
      <title>Results of Ensemble Model</title>
      <p>In the multi-layer ensemble experiment, we tested each hidden
layer of the RoBERTa model as an independent output. The results show
that the outputs of layers 11, 10, and 8 are better than the original
output layer's. We therefore connected these layers to the original
output layer to obtain the multi-layer ensemble model. The
experimental results of the multi-layer combined output are better
than those of any single-layer output.</p>
      <p>In the multi-model ensemble experiment, we merged the RoBERTa and
BERT models into one ensemble model. In this experiment, we expanded
the feature size from 768 to 1536 and obtained a higher test accuracy
within 10 epochs. We preserved RoBERTa's fine-tuning hyperparameters
and achieved 0.529 MCC and 69.19% accuracy on Task 1 and 0.567 MCC
and 91.94% accuracy on Task 2, much higher than either BERT or
RoBERTa alone. From the training results of the final round, it can
be seen that RoBERTa has a significant influence on the multi-model
ensemble.</p>
    </sec>
    <sec id="sec-8">
      <title>DISCUSSION AND OUTLOOK</title>
      <p>We present several methods to detect COVID-19 conspiracy theories
in social media content. A pre-trained transformer achieves the best
performance, but a transformer trained from scratch on the small
dataset also yields reasonable accuracies. Our future work includes
finding a better initialization than random for training an attention
model from scratch, as well as a more robust ensemble model for
datasets with rapidly trending vocabulary.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Hanselowski</surname>
          </string-name>
          , Avinesh P.V.S.,
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Schiller</surname>
          </string-name>
          , Felix Caspelherr, Debanjan Chaudhuri, Christian M. Meyer, and
          <string-name>
            <given-names>Iryna</given-names>
            <surname>Gurevych</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>A Retrospective Analysis of the Fake News Challenge Stance Detection Task</article-title>
          .
          <source>In Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018)</source>
          . http://tubiblio.ulb.tu-darmstadt.de/105434/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Junaed Younus</given-names>
            <surname>Khan</surname>
          </string-name>
          , Md. Tawkat Islam Khondaker, Sadia Afroz, Gias Uddin, and
          <string-name>
            <given-names>Anindya</given-names>
            <surname>Iqbal</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>A benchmark study of machine learning models for online fake news detection</article-title>
          .
          <source>Machine Learning with Applications</source>
          <volume>4</volume>
          (Jun 2021), 100032. https://doi.org/10.1016/j.mlwa.2021.100032
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Konstantin</given-names>
            <surname>Pogorelov</surname>
          </string-name>
          , Daniel Thilo Schroeder, Stefan Brenner, and Johannes Langguth.
          <year>2021</year>
          .
          <article-title>FakeNews: Corona Virus and Conspiracies Multimedia Analysis Task at MediaEval 2021</article-title>
          .
          <source>In Proc. of the MediaEval 2021 Workshop, 13-15 December 2021.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Konstantin</given-names>
            <surname>Pogorelov</surname>
          </string-name>
          , Daniel Thilo Schroeder, Petra Filkuková, Stefan Brenner, and
          <string-name>
            <given-names>Johannes</given-names>
            <surname>Langguth</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>WICO Text: A Labeled Dataset of Conspiracy Theory and 5G-Corona Misinformation Tweets</article-title>
          .
          <article-title>Association for Computing Machinery</article-title>
          , New York, NY, USA,
          <fpage>21</fpage>
          -
          <lpage>25</lpage>
          . https://doi.org/10.1145/3472720.3483617
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Valeriya</given-names>
            <surname>Slovikovskaya</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Transfer Learning from Transformers to Fake News Challenge Stance Detection (FNC-1) Task</article-title>
          . CoRR abs/1910.14353 (2019). arXiv:1910.14353 http://arxiv.org/abs/1910.14353
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Ashish</given-names>
            <surname>Vaswani</surname>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
          <string-name>
            <given-names>Aidan N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Lukasz Kaiser, and
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention Is All You Need</article-title>
          .
          <source>arXiv:cs.CL/1706.03762</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Xinyi</given-names>
            <surname>Zhou</surname>
          </string-name>
          and
          <string-name>
            <given-names>Reza</given-names>
            <surname>Zafarani</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>A Survey of Fake News</article-title>
          .
          <source>Comput. Surveys</source>
          <volume>53</volume>
          ,
          <issue>5</issue>
          (Oct
          <year>2020</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>40</lpage>
          . https://doi.org/10.1145/3395046
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>