<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic Identification of Conspiracy Theories Using BERT</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniel Yacob Espinosa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Grigori Sidorov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eusebio Ricárdez-Vázquez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Instituto Politécnico Nacional, Centro de Investigación en Computación</institution>
          ,
          <addr-line>Mexico City</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>In recent years, we have seen an increase in conspiracy theories on social media. Users create these narratives mainly out of ignorance and distrust of government institutions. These problems became especially prominent during the Covid-19 pandemic, representing a serious public health issue. This year, PAN 2024 organized the task "Oppositional thinking analysis: Conspiracy theories vs critical thinking narratives", presented at CLEF 2024, in which we participated. For this task, we present a solution using a BERT configuration with a preprocessing layer that has previously given us good results for various types of classification. We also tried combinations of n-grams of words (2-3-4) and characters (2-3-4-5).</p>
      </abstract>
      <kwd-group>
        <kwd>BERT</kwd>
        <kwd>Conspiracy Theories</kwd>
        <kwd>Telegram</kwd>
        <kwd>Fake news</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>With the evolution of the Internet comes the evolution of social media, now one of the primary means
of communication and dissemination, where millions of people worldwide can exchange their opinions
in real time. With the onset of the Covid-19 pandemic, millions of stories were created by social media
users, primarily driven by fear or societal ignorance. Conspiracy theories on platforms such as Facebook,
Twitter/X, Telegram and YouTube can be highly persuasive, exploiting users' psychological and social
vulnerabilities.</p>
      <p>
        The lack of rigorous information verification and the ease with which these theories can be shared
and amplified contribute to their persistence and spread. These narratives exist due to a combination
of psychological, social, and technological factors when used on these platforms. Their impact can
be significant, and addressing them requires an approach that includes education, regulation, and
communication [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        There is a study by Heidari et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which evaluates tweets from two datasets. The Cresci 2017
dataset includes all kinds of information from Twitter accounts of genuine users and spambots, while the
other dataset, CoAID, is a collection of misinformation about Covid-19 whose data are not
exclusively from Twitter [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. With these datasets, Heidari et al. demonstrate a promising approach,
achieving an 87% F1-score using BERT, which provides a solid foundation for future
research. In this work, we also use BERT as the classification model. We believe it is possible to improve
the accuracy and effectiveness of identifying misleading and conspiratorial content on social media.
      </p>
      <p>
        A notable study using BERT is by Guo et al., which discusses the use of transformers [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. They improve
the performance of RoBERTa by using it as an independent output and testing layer efficiency. This
study demonstrates the efficiency of RoBERTa on the dataset extended in 2022 from the FakeNews task
at MediaEval 2021 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This corpus comprises English tweets that contain many conspiracy
theories, primarily related to Covid-19. An important point is that this dataset is only in English and
is unbalanced, which complicates effective categorization. Significant progress is shown,
with an accuracy of 91% using RoBERTa.
      </p>
      <p>
        López Tárraga describes how, due to the COVID-19 lockdown, traffic on social networks like WhatsApp
and Telegram grew [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], as certain channels were used to share messages. In this way, certain government
agencies used these methods to reach more communities, keeping everyone informed about procedures
while maintaining social distancing. The research primarily describes the use of these technologies to
reach a wider audience, which, although smaller than expected, was relevant during critical public health
moments. However, not everything shared through these channels is done with good intentions.
      </p>
      <p>
        In the research of Herasimenka et al., it is shown how certain communities manage to share
misinformation via Telegram, often through links to deceptive sources [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. They also note that misinformation
can spread virally even on platforms without an algorithmic timeline, such as Telegram, depending on
whether communities are involved in its dissemination. Furthermore, the people who share this type of
information are a very specific set of users, and few of them are content creators. Since Telegram does not
use many mechanisms to eradicate these problems, these users utilize it as a channel to disseminate
such content.
      </p>
      <p>
        This year, PAN 2024 has organized a series of tasks called "Oppositional Thinking Analysis:
Conspiracy Theories vs. Critical Thinking Narratives" [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ], aimed at classifying conspiracy theories and
critical thinking narratives using Telegram texts. These tasks seek to address the growing issue of
misinformation and the proliferation of conspiracy theories on social media platforms, where information
spreads rapidly and can have a significant impact on public opinion.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>For this work, we used a combination of BERT models to perform the binary classification. We have
employed these combinations in other tasks and wanted to see how they would perform in this context,
as we have primarily used them for tweet analysis. As part of our experiments, we utilized
certain variations of BERT, such as BERTweet and DistilBERT.</p>
      <p>2.1. Preprocessing steps</p>
      <p>In each research project related to natural language processing that we conduct, we recommend including
a data preprocessing layer. This step is often essential for obtaining better results than the training
model can achieve on its own. Data preprocessing includes several essential tasks
such as data cleaning, normalization, tokenization, and noise removal. For this research, we
performed the following steps:</p>
      <p>• Lowercase. We converted all text to lowercase to standardize it.</p>
      <p>• Links. The links in the messages were replaced with the label ’link’.</p>
      <p>• User Mentions. The messages in this dataset contain the ’@’ sign followed by a space and then
the username. These were replaced with the tag ’usermention’.</p>
      <p>• Hashtags. Hashtags can be identified because they start with the ’#’ sign, followed by a space and
then the word. These were replaced with the tag ’hashtag’.</p>
      <p>• Emojis. In previous experiments, emojis have been shown to improve classifier learning, so we did
not make any changes to them.</p>
      <p>• Other symbols. All symbols not registered in the ASCII standard were removed from the dataset.</p>
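      <p>As a minimal sketch of this preprocessing layer (assuming Python with only the standard library; the function name and the emoji character range used to preserve emojis are our own illustrative choices, since the exact set of preserved code points is not specified), the steps above can be implemented as follows:</p>

```python
import re

# Emoji ranges kept after the ASCII filter (an illustrative approximation;
# the paper does not specify which code points it treated as emojis).
EMOJI = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def preprocess(text: str) -> str:
    """Apply the preprocessing steps listed above to one message."""
    text = text.lower()                             # lowercase everything
    text = re.sub(r"https?://\S+", "link", text)    # replace links with 'link'
    text = re.sub(r"@\s*\w+", "usermention", text)  # replace '@ user' mentions
    text = re.sub(r"#\s*\w+", "hashtag", text)      # replace '# word' hashtags
    # drop symbols outside the ASCII standard, but keep emojis unchanged
    return "".join(ch for ch in text if ord(ch) < 128 or EMOJI.match(ch))
```

      <p>For example, preprocess("Hello @ user123 check https://x.com # covid") yields "hello usermention check link hashtag".</p>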
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>
        Given our previous research experience with social media, we also decided to try a methodology
based on n-grams, which had previously given us interesting results despite its simplicity [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. While
the messages in this dataset share similarities, it is crucial to note that each social media platform has
its own dynamics and peculiarities. Therefore, adapting our approach to account for these specific
variations ensures more precise and relevant results. We present the F1-score results obtained with this
setup.
      </p>
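      <p>The n-gram extraction itself can be sketched as follows (standard-library Python; the function names are illustrative, and the default sizes follow the word 2-3-4 and character 2-3-4-5 combinations we experimented with):</p>

```python
def ngrams(seq, n):
    """All contiguous n-grams of a sequence (a word list or a character string)."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def word_ngrams(text, sizes=(2, 3, 4)):
    """Word n-grams of each size, joined back into strings."""
    words = text.split()
    return [" ".join(g) for n in sizes for g in ngrams(words, n)]

def char_ngrams(text, sizes=(2, 3, 4, 5)):
    """Character n-grams of each size, taken over the raw text."""
    return [g for n in sizes for g in ngrams(text, n)]
```

      <p>The resulting n-grams can then be counted into a bag-of-n-grams matrix and fed to any standard classifier.</p>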
      <p>
        Given that the initial results could be improved, we decided to use BERT and some of its variants
for the task. One of these variants is DistilBERT, which is a lighter and more efficient version of BERT.
However, a significant issue with this version is that it was primarily trained on English data, which
may limit its effectiveness in contexts requiring data analysis in other languages [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        Similarly, another model we considered was RoBERTa. Although RoBERTa is known for being an
extremely robust and powerful model, it also shares the limitation of being trained primarily on English
data, which could affect its performance in multilingual applications [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>To evaluate and compare the performance of these models, we decided to train and test BERT,
DistilBERT, and RoBERTa under the same conditions. Specifically, we configured the experiments with
a batch size of 16 and trained the models for 9 epochs. This setup allowed us to maintain a balanced use
of GPU resources and training time. It is important to note that while RoBERTa is more robust and
generally requires more time to fully train due to its complexity and size, this approach allowed us to
efficiently manage time and available resources without overwhelming the capabilities of the GPUs
used.</p>
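      <p>A fine-tuning configuration matching this setup (batch size 16, 9 epochs) might look as follows with the Hugging Face transformers library; the checkpoint name, output directory, and the train_dataset placeholder are illustrative assumptions rather than our exact code:</p>

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Checkpoint chosen for illustration; since the dataset also contains
# non-English text, a multilingual checkpoint is assumed here.
CHECKPOINT = "bert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT, num_labels=2)  # binary: conspiracy vs. critical thinking

args = TrainingArguments(
    output_dir="oppositional-bert",   # illustrative output directory
    per_device_train_batch_size=16,   # batch size used in our experiments
    num_train_epochs=9,               # epochs used in our experiments
)

# train_dataset is assumed to be the tokenized Telegram corpus.
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
```

      <p>The same TrainingArguments were reused across BERT, DistilBERT, and RoBERTa so that the models could be compared under identical conditions.</p>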
      <p>Based on the experiments, we decided to use BERT as the model to execute this task. The model
performed very close to our expectations, and we believe that with a bit more training data, it could
have achieved even better results. DistilBERT ended up in last place; however, since it is quick in training and
execution, it may be suitable for projects with limited storage and computational capabilities.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>The implementation and analysis of models such as BERT, DistilBERT, and RoBERTa were highlighted,
with a focus on evaluating their performance under comparable conditions. The results indicated that
BERT was the most effective model for the task at hand, showing results close to expectations.
DistilBERT, although fast, demonstrated limitations in contexts requiring deep analysis and multilingual
data. These findings underscore the importance of adapting models to the specific characteristics of each
task and platform, thereby optimizing the accuracy and relevance of the results obtained.</p>
      <p>The rise of conspiracy theories on social media poses a significant challenge to society, especially
during crises like the COVID-19 pandemic. The spread of misinformation can erode trust in institutions,
polarize society, and lead to dangerous behaviors. Engaging in this task not only has the potential
to advance academic research in natural language processing but also contributes significantly to
combating misinformation in the public sphere. We believe that with a meticulous, evidence-based
approach, we can develop effective tools to distinguish between conspiracy theories and critical thinking
narratives, thereby helping to mitigate the negative impact of misinformation on society.</p>
      <p>We would like to promote the use of BERT and RoBERTa for text classification in Spanish. This
could lead to discoveries and improvements that will benefit not only Spanish speakers but also other
languages that share similar linguistic features. We are excited about the possibilities this initiative
presents and are confident that our contributions will be valuable in the global effort to understand and
counter misinformation on social media.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Esayas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Alemu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tareke</surname>
          </string-name>
          ,
          <article-title>The negative impact of social media during covid-19 pandemic</article-title>
          ,
          <source>Trends in Psychology</source>
          31 (
          <year>2022</year>
          ). doi:10.1007/s43076-022-00192-5
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Heidari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hajibabaee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Malekzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hekmatiathar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Uzuner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <article-title>Bert model for fake news detection based on social bot activities in the covid-19 pandemic</article-title>
          ,
          <year>2021</year>
          , pp.
          <fpage>0103</fpage>
          -
          <lpage>0109</lpage>
          . doi:10.1109/UEMCON53757.2021.9666618
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>CoAID: COVID-19 healthcare misinformation dataset</article-title>
          ,
          <year>2020</year>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Friedland</surname>
          </string-name>
          ,
          <article-title>Detecting covid-19 conspiracy theories with transformers and tf-idf</article-title>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          L. Sweeney, M.-G. Constantin, C.-H. Demarty, C. Fosco, A. García Seco de Herrera, S. Halder,
          G. Healy, B. Ionescu, A. Matran-Fernandez, A. Smeaton, M. Sultana,
          <article-title>Overview of the mediaeval 2022 predicting video memorability task</article-title>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A. B.</given-names>
            <surname>López Tárraga</surname>
          </string-name>
          ,
          <article-title>Comunicación de crisis y ayuntamientos: el papel de Telegram durante la crisis sanitaria de la COVID-19</article-title>
          [Crisis communication and city councils: the role of Telegram during the COVID-19 health crisis],
          <source>Revista de la Asociación Española de Investigación de la Comunicación</source>
          7 (
          <year>2020</year>
          )
          <fpage>104</fpage>
          -
          <lpage>126</lpage>
          . doi:10.24137/raeic.7.14.5
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Herasimenka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bright</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Knuutila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <article-title>Misinformation and professional news on largely unmoderated platforms: the case of telegram</article-title>
          ,
          <source>Journal of Information Technology and Politics</source>
          <volume>20</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          . doi:10.1080/19331681.2022.2076272
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          D. Korenčić, B. Chulvi, X. Bonet Casals, M. Taulé, P. Rosso, F. Rangel,
          <article-title>Overview of the oppositional thinking analysis pan task at clef 2024</article-title>
          ,
          <source>Working Notes of CLEF 2024</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ayele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Babakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. B.</given-names>
            <surname>Casals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elnagar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Freitag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Korenčić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moskovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rizwan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smirnova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Taulé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ustalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Yimam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zangerle</surname>
          </string-name>
          ,
          <article-title>Overview of PAN 2024: Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification. Condensed Lab Overview</article-title>
          , in:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024)</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          D. Espinosa, H. Gómez-Adorno, G. Sidorov,
          <article-title>Bots and Gender Profiling using Character Bigrams</article-title>
          , in: L. Cappellato, N. Ferro, D. Losada, H. Müller (Eds.),
          <source>CLEF 2019 Labs and Workshops, Notebook Papers, CEUR-WS.org</source>
          ,
          <year>2019</year>
          . URL: http://ceur-ws.org/Vol-2380/
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <article-title>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</article-title>
          ,
          <source>ArXiv abs/1910.01108</source>
          (
          <year>2019</year>
          ). URL: https://api.semanticscholar.org/CorpusID:203626972
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wayne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jun</surname>
          </string-name>
          ,
          <article-title>A robustly optimized BERT pre-training approach with post-training</article-title>
          ,
          <year>2021</year>
          . URL: https://aclanthology.org/2021.ccl-1.108
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>