<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating TF-IDF and Transformers-based Models for Detecting COVID-19 related Conspiracies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rohullah Akbari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Simula Research Laboratory</institution>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The proliferation of misinformation and conspiracy theories on online social media platforms has become a significant concern for public health and safety. To effectively combat this issue, a new generation of data mining and analysis algorithms is essential for early detection and tracking of these information cascades. In this paper, we employ a multifaceted approach for detecting and identifying conspiracy theories and misinformation spreaders related to the Coronavirus pandemic. Specifically, we use Text-Based Detection (Task 1) through a combination of TF-IDF-based and Transformers-based methods, Graph-Based Detection (Task 2) through a graph convolutional network, and alternative Transformers-based methods that combine text with user information (Task 3) in an attempt to improve the results of Task 1. Our best models achieved MCC scores of 0.705 for Task 1, 0.041 for Task 2, and 0.698 for Task 3.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Text-Based Misinformation and Conspiracies Detection</title>
      <sec id="sec-2-1">
        <title>2.1. The TF-IDF approach</title>
        <p>In this section, we create nine distinct TF-IDF models, one for each of the nine categories.
We are interested to see whether the TF-IDF technique can outperform the CT-BERT model and, if
not, how close it can come. This approach is based on the TfidfVectorizer and the Stochastic
Gradient Descent classifier (SGDClassifier) from the scikit-learn framework [8]. SGD is a simple but very
efficient approach to fitting linear classifiers such as linear Support Vector Machines (SVM). SGD
does not belong to any particular family of machine learning models; it is only an optimization
technique. Often, an instance of SGDClassifier has an equivalent estimator in the scikit-learn
API, potentially using a different optimization technique. For example,
SGDClassifier(loss='log_loss') produces a logistic regression model. The TF-IDF approaches in previous
work have only been run with unigrams [7]. This loses information, since
bigrams and trigrams can carry important signals. We can see in Table 2 that N-grams
such as "bill gate" and "new world order" could be very important for the classification of the
conspiracies. Based on this, we have chosen to implement the TF-IDF with various N-gram ranges
covering unigrams, bigrams, trigrams, and longer N-grams. In addition, we have
chosen to implement the SGD classifier with different loss functions and penalties (see Table 1 for the
parameters).</p>
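        <p>The setup described above can be sketched as follows. This is a minimal illustration, not the submitted implementation: the toy tweets, the helper name build_model, and the default loss and penalty are assumptions made for the example, and the actual parameter grid is the one listed in Table 1.</p>
        <preformat>
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.pipeline import make_pipeline

# Toy tweets and binary labels for ONE category (illustrative, not task data).
texts = [
    "bill gates wants population control",
    "the new world order is behind the virus",
    "new world order plan for population control",
    "wash your hands and wear a mask",
    "vaccines are tested in clinical trials",
    "stay home and keep your distance",
]
labels = [1, 1, 1, 0, 0, 0]


def build_model(ngram_range=(1, 4), loss="hinge", penalty="l2"):
    """TF-IDF features over an N-gram range, fed to a linear model fitted with SGD."""
    return make_pipeline(
        TfidfVectorizer(ngram_range=ngram_range, lowercase=True),
        SGDClassifier(loss=loss, penalty=penalty, random_state=0),
    )


model = build_model()
model.fit(texts, labels)
predictions = model.predict(texts)

# MCC on the training tweets only; a real evaluation would use held-out data.
train_mcc = matthews_corrcoef(labels, predictions)

# The fitted vocabulary contains unigrams up to 4-grams, e.g. "new world order".
vocabulary = model.named_steps["tfidfvectorizer"].vocabulary_
```
        </preformat>
        <p>In this sketch, one such pipeline would be fitted per category, and ngram_range, loss, and penalty would be varied to compare configurations.</p>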
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Transformers-based approaches</title>
        <p>The first Transformers approach (One-for-All) is based on training one CT-BERT model for
classifying all of the conspiracy categories at once (see Figure 1). The CT-BERT is fine-tuned
with nine different weighted cross-entropy loss functions, one per category. For each category, the
weight of a subcategory is the number of samples in that category divided by the number of
samples in that subcategory. The optimizer used in this approach is AdamW
[9]. Before feeding the text data into the model, we preprocessed it by converting the emojis
into their textual meaning. Furthermore, the model was trained with 5-fold
cross-validation and the model with the best test MCC score was chosen. The One-for-One approach
is based on training nine separate CT-BERT models for the nine categories (the approach is
shown in Figure 2). In this approach, we do not use a weighted loss function. Other than
that, we apply the same cross-entropy loss, optimizer, and preprocessing method. The model
was trained with stratified 5-fold cross-validation and the model with the best MCC
score was chosen.</p>
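        <p>As an illustration of the class weighting used in the One-for-All approach, the following sketch computes one weight per subcategory as the category's sample count divided by the subcategory's sample count. The function name class_weights and the label distribution are hypothetical, and this reading of the weighting rule is an assumption based on the description above.</p>
        <preformat>
```python
from collections import Counter


def class_weights(labels):
    """Weight of each class = samples in the category / samples in that class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / n for cls, n in sorted(counts.items())}


# Hypothetical label distribution for one category with three subcategories.
labels = [0] * 80 + [1] * 15 + [2] * 5
weights = class_weights(labels)
# Rare subcategories receive larger weights than frequent ones.
```
        </preformat>
        <p>Such weights could then be supplied to a weighted cross-entropy loss, for example through the weight argument of torch.nn.CrossEntropyLoss, with one loss instance per category.</p>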
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Graph-Based Conspiracy Source Detection</title>
      <p>For this task, we applied simple node classification in which each node represents a
user and carries a label indicating whether that user is a misinformation spreader. We created a network for
each labeled user. The network consisted of all other users that had an
edge directed to the main user; users connected by low-weight edges were removed. We chose
to work with a graph convolutional network (GCN) [10]. The implementation used
the GCNConv class from the torch_geometric library with PyTorch.</p>
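      <p>The per-user network construction described above can be sketched in plain Python as follows. The function name ego_network, the edge-list format, and the threshold value are assumptions made for illustration; the sketch covers only the filtering step, not the GCN itself.</p>
      <preformat>
```python
def ego_network(edges, main_user, min_weight):
    """Keep users with an edge directed to main_user whose weight is at least min_weight.

    edges: iterable of (source, target, weight) tuples of a directed, weighted graph.
    Returns (nodes, kept_edges), with main_user listed first in nodes.
    """
    kept_edges = [
        (src, dst, w)
        for src, dst, w in edges
        if dst == main_user and src != main_user and w >= min_weight
    ]
    nodes = [main_user] + sorted({src for src, _, _ in kept_edges})
    return nodes, kept_edges


# Hypothetical weighted, directed edges between users.
edges = [
    ("a", "u", 5.0),
    ("b", "u", 0.5),  # low-weight edge, dropped
    ("c", "u", 2.0),
    ("u", "a", 9.0),  # points away from the main user, ignored
    ("a", "c", 4.0),  # does not touch the main user, ignored
]
nodes, kept = ego_network(edges, main_user="u", min_weight=1.0)
```
      </preformat>
      <p>The sketch also makes the trade-off visible: every neighbor below the threshold disappears from the network, so an aggressive threshold leaves the classifier with very little structure to learn from.</p>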
    </sec>
    <sec id="sec-4">
      <title>4. Graph and Text-Based Conspiracy Detection</title>
      <p>In this section, we examine whether we can improve the results from Section 2 by
combining the data from Section 2 and Section 3. The output of the classifiers is enriched
by combining text with numerical features. Our first approach consists of
training the CT-BERT with the text data and concatenating the last layer of the CT-BERT
with user information such as verified_account, description_length, num_favourites,
num_followers, num_statuses, num_friends and location_country. The concatenated
representation is then passed through a multilayer perceptron (MLP) and into an output
layer (see Figure 3). Our second approach is based on extending the text data with the tweeters’
statistics and then feeding it into the One-for-All approach (Section 2.2). The numerical features
inserted in the text are separated by the [SEP] token, for example:</p>
      <p>Tweet_text [SEP] 0 [SEP] 159 [SEP] 2812 [SEP] 566
[SEP] 1426 [SEP] 1041 [SEP] 3</p>
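      <p>A minimal sketch of how such an extended input string could be assembled is given below. The helper extend_text, the FEATURE_ORDER list, and the mapping of the example values to feature names are assumptions for illustration; the source does not state the order in which the features are appended.</p>
      <preformat>
```python
# Assumed feature order; the paper lists these features but not their order in the text.
FEATURE_ORDER = [
    "verified_account",
    "description_length",
    "num_favourites",
    "num_followers",
    "num_statuses",
    "num_friends",
    "location_country",
]


def extend_text(tweet_text, user, sep=" [SEP] "):
    """Append the user's numerical features to the tweet text, separated by [SEP]."""
    values = [str(int(user[name])) for name in FEATURE_ORDER]
    return sep.join([tweet_text] + values)


# Hypothetical user record; location_country is assumed to be an integer code.
user = {
    "verified_account": False,
    "description_length": 159,
    "num_favourites": 2812,
    "num_followers": 566,
    "num_statuses": 1426,
    "num_friends": 1041,
    "location_country": 3,
}
extended = extend_text("Tweet_text", user)
```
      </preformat>
      <p>The resulting string is then tokenized like any other tweet, so the model sees the statistics as additional tokens and not as a separate numerical input.</p>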
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>
        As expected, the TF-IDF approach obtained a lower MCC score than the Transformers-based
approaches (see Table 3). The One-for-One approach achieved the best score of all submitted
runs. The TF-IDF approach does quite well for some of the categories, especially
Population reduction and New World Order. Bigrams such as "population control" and
"bill gate" are very important for Population reduction, while "world order" and "new world"
clearly point to the New World Order category (Table 2). Furthermore, we can
see that N-gram ranges such as (2,3) and (2,4) did not do well and that the dominating range is
(1,4) (Figure 4). This suggests that unigrams are crucial for the classification of conspiracies, since the
N-gram ranges without them performed poorly. We submitted only one run for Task 2, which
resulted in an MCC score of 0.041, clearly indicating that our implementation was not successful.
The main reason for the poor performance could be that we removed all neighbors
of the main user node that had low edge values. The combination of CT-BERT with numerical
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion and Outlook</title>
      <p>
        We successfully implemented three approaches for Task 1: one TF-IDF approach and two
Transformers-based approaches. We experimented with different N-gram ranges and found
that the N-gram range (1,4) was best suited for most of the categories. The best MCC score
(0.705) was obtained with the One-for-One approach. We presented two approaches for improving
the Task 1 results, but neither of them improved on those results.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Pogorelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. T.</given-names>
            <surname>Schroeder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brenner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Moe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maulana</surname>
          </string-name>
          , J. Langguth,
          <article-title>Combining tweets and connections graph for fakenews detection at mediaeval 2022</article-title>
          , in:
          <source>Proceedings of MediaEval 2022 CEUR Workshop</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Al-Rakhami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Al-Amri</surname>
          </string-name>
          ,
          <article-title>Lies kill, facts save: Detecting covid-19 misinformation in twitter</article-title>
          ,
          <source>IEEE Access 8</source>
          (
          <year>2020</year>
          )
          <fpage>155961</fpage>
          -
          <lpage>155970</lpage>
          . doi:10.1109/ACCESS.2020.3019600.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wani</surname>
          </string-name>
          , I. Joshi,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khandve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Wagh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <article-title>Evaluating deep learning approaches for covid19 fake news detection</article-title>
          ,
          <source>in: Combating Online Hostile Posts in Regional Languages during Emergency Situation</source>
          , Springer International Publishing,
          <year>2021</year>
          , pp.
          <fpage>153</fpage>
          -
          <lpage>163</lpage>
          . URL: https://doi.org/10.1007%2F978-3-030-73696-5_15. doi:10.1007/978-3-030-73696-5_15.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Glazkova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Glazkov</surname>
          </string-name>
          , T. Trifonov,
          <article-title>g2tmn at constraint@AAAI2021: Exploiting CT-BERT and ensembling learning for COVID-19 fake news detection</article-title>
          ,
          <source>in: Combating Online Hostile Posts in Regional Languages during Emergency Situation</source>
          , Springer International Publishing,
          <year>2021</year>
          , pp.
          <fpage>116</fpage>
          -
          <lpage>127</lpage>
          . URL: https://doi.org/10.1007%2F978-3-030-73696-5_12. doi:10.1007/978-3-030-73696-5_12.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Patwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pykl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Guptha</surname>
          </string-name>
          , G. Kumari,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Akhtar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ekbal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <article-title>Fighting an infodemic: COVID-19 fake news dataset</article-title>
          ,
          <source>in: Combating Online Hostile Posts in Regional Languages during Emergency Situation</source>
          , Springer International Publishing,
          <year>2021</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>29</lpage>
          . URL: https://doi.org/10.1007%2F978-3-030-73696-5_3. doi:10.1007/978-3-030-73696-5_3.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Shahi</surname>
          </string-name>
          , D. Nandini,
          <article-title>FakeCovid - A Multilingual Cross-domain Fact Check News Dataset for COVID-19</article-title>
          , ICWSM,
          <year>2020</year>
          . URL: https://doi.org/10.36190/2020.14. doi:10.36190/2020.14.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Peskine</surname>
          </string-name>
          , G. Alfarano,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ismail</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <article-title>Detecting covid-19-related conspiracy theories in tweets</article-title>
          (
          <year>2021</year>
          ). URL: https://2021.multimediaeval.com/paper65.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          , et al.,
          <article-title>Scikit-learn: Machine learning in python</article-title>
          ,
          <source>Journal of Machine Learning Research 12</source>
          (
          <year>2011</year>
          )
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>I.</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <article-title>Decoupled weight decay regularization</article-title>
          ,
          <year>2017</year>
          . URL: https://arxiv.org/abs/1711.05101. doi:10.48550/ARXIV.1711.05101.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T. N.</given-names>
            <surname>Kipf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          ,
          <article-title>Semi-supervised classification with graph convolutional networks</article-title>
          ,
          <year>2016</year>
          . URL: https://arxiv.org/abs/1609.02907. doi:10.48550/ARXIV.1609.02907.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>