<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Study on NLP Model Ensembles and Data Augmentation Techniques for Separating Critical Thinking from Conspiracy Theories in English Texts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Iñaki del Campo Sánchez-Hermosilla</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Angel Panizo-Lledot</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Camacho</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer System Engineering, Universidad Politécnica de Madrid</institution>
          ,
          <addr-line>Calle de Alan Turing 28031, Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
<p>Conspiracy theories propose that significant events are orchestrated by secretive, powerful groups; they gain traction especially during periods of social upheaval and spread rapidly via social media. These theories have real-world consequences, as seen in incidents like Pizzagate, where false claims led to a violent attack on a pizzeria in Washington, D.C., and the COVID-19 vaccine conspiracies, which fueled public distrust and hampered the vaccination campaign. In the age of social media, distinguishing between conspiracy theories and critical thinking is crucial for accurate content moderation, because misidentification can push individuals questioning legitimate issues towards conspiracy communities. This highlights the importance of developing effective methods for identifying conspiratorial content. Our study addresses this challenge by leveraging advanced NLP models. Specifically, we build ensembles using variations of the BERT model, including BERT-base, BERT-large, and RoBERTa. We experimented with different loss functions, such as cross-entropy, Mix-Up, and Supervised Contrastive Loss, and with data augmentation techniques like synonym replacement and random word insertion. Our final model achieved a Matthews correlation coefficient (MCC) of 0.8149 on the competition set, securing 8th place in the ranking and demonstrating a considerable level of effectiveness in identifying conspiratorial content.</p>
      </abstract>
      <kwd-group>
<kwd>PAN 2024</kwd>
        <kwd>Oppositional thinking analysis</kwd>
        <kwd>Conspiracy theories vs critical thinking narratives</kwd>
        <kwd>NLP</kwd>
        <kwd>BERT</kwd>
        <kwd>data augmentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Conspiracy theories are intricate narratives suggesting that major events result from the covert
actions of secret, powerful, and malicious groups. They have a long history, often surfacing during
times of social upheaval, but their spread has accelerated with the advent of social media.
Conspiracy theories have become a social issue, and in recent years we have seen them provoke real-world
consequences. Examples include PizzaGate [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], where a man fired a gun in a pizzeria in
Washington, D.C. while attempting to investigate a fake child-trafficking ring, and the COVID-19 vaccine
conspiracy theories [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which claimed that Bill Gates was introducing microchips with the
COVID-19 vaccine to spy on people, raising doubts among the population and leading to vaccine hesitancy.
Identifying conspiratorial content is therefore more important than ever; however, it is a challenging
task. There is a fine line between conspiracy theory and critical thinking, and drawing this distinction
correctly is crucial, because mislabeling a critical message as conspiratorial could inadvertently push
individuals who are merely asking questions into conspiracy communities. This highlights the importance
of developing effective methods for identifying conspiratorial content.
      </p>
      <p>
        This edition of PAN 2024 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] includes a challenge [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] that tackles the aforementioned problem. The
challenge comprises two tasks: a binary classification of full messages that differentiates between
critical thinking and conspiracy theories, and a token-level classification of text spans that
correspond to the key elements of the narratives. In this work we tackle the first task, i.e. the binary
classification task, where we decide whether a message published in English follows a conspiracy-theory
framework or, instead, is simply engaging in critical thinking. Our study addresses this
challenge by leveraging advanced NLP techniques, specifically foundational models like BERT [5]
and RoBERTa [6]. These models have been trained on millions of data points in a self-supervised
manner and perform well across a wide variety of NLP tasks, especially tasks where few
labeled examples are available. In this article, we focus on creating ensembles of various BERT models.
We fine-tune these models using the classical cross-entropy loss function as well as alternatives
such as Mix-Up and Supervised Contrastive Loss, and we employ data augmentation techniques
such as sentence rephrasing, translation, and contextual word replacement.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Experimental Design</title>
      <sec id="sec-2-1">
        <title>2.1. Data Processing</title>
        <p>To ensure a fair comparison, the original dataset was split into two sets, one for testing (10%) and
another for training (90%). A stratified split was used due to the unbalanced nature of the dataset.</p>
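The stratified 90/10 split can be sketched as follows. This is a minimal pure-Python illustration of the approach (the paper does not state which tooling was used; scikit-learn's `train_test_split(..., stratify=labels)` is the usual one-liner):

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.10, seed=42):
    """Split example indices into train/test, preserving per-class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, test = [], []
    for y, idxs in by_class.items():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_frac)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)

# Unbalanced toy labels: 80 "critical" (0) vs. 20 "conspiracy" (1)
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)
```

Because the split is stratified, the 10% test set keeps the same class imbalance as the full dataset.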
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Models</title>
        <p>We fine-tuned several pre-trained models using the 90% of the data reserved for training. All the models
tested were variations of the BERT model [5], featuring a transformer-based, encoder-only architecture.
As a baseline, the smaller version of the BERT model, "bert-base", was used so that a sufficient number
of experiments could be conducted given the limitations of time and computation. Experiments were also conducted with
the larger version of BERT, "bert-large", and an optimized version called RoBERTa [6]. Additionally, tests
were conducted with a BERT-large model pre-trained on texts related to the SARS-CoV-2 pandemic [7],
which yielded the best results. Finally, we tested whether creating ensembles of these models yielded
better results. To create the ensembles, we followed a 5-fold cross-validation approach, where the
training dataset was split into 5 folds. An ensemble was created by combining 5 models, each trained
on a different combination of 4 out of the 5 folds.</p>
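The 5-fold ensemble scheme described above can be sketched as follows. The actual model training is not shown; this illustrates only the fold rotation and the prediction averaging (the averaging-over-members step is an assumption consistent with Section 3):

```python
import random

def make_folds(n_examples, k=5, seed=0):
    """Shuffle indices and split them into k roughly equal folds."""
    idxs = list(range(n_examples))
    random.Random(seed).shuffle(idxs)
    return [idxs[i::k] for i in range(k)]

def ensemble_members(folds):
    """Each member trains on 4 of the 5 folds; the held-out fold rotates."""
    members = []
    for held_out in range(len(folds)):
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        members.append({"held_out": held_out, "train_idx": train_idx})
    return members

def ensemble_predict(member_probs):
    """Average the members' positive-class probabilities, then threshold."""
    avg = sum(member_probs) / len(member_probs)
    return int(avg >= 0.5), avg

folds = make_folds(100)
members = ensemble_members(folds)
label, avg = ensemble_predict([0.9, 0.7, 0.4, 0.8, 0.6])
```

Each of the 5 members sees a different 80% of the training data, so their errors are partially decorrelated, which is what makes the averaged prediction more stable than any single model.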
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Loss Functions</title>
        <p>As a baseline, the Binary Cross-Entropy (CE) loss function was selected. In addition, we tested two
more exotic loss functions to fine-tune the models: Mix-Up [8] and Supervised Contrastive Loss
(SCL) [9]. The latter, SCL, adds a new term to the cross-entropy that penalizes dispersed embeddings
for examples of the same class. This is achieved by calculating the distance between
embeddings of the same class within the batch. Thus, both the batch size and the weighting of each term
in the new loss function affect the final loss. Additionally, we tried an alternative to
the hybrid objective proposed in the original paper, consisting of an initial training phase using only
the Supervised Contrastive Loss and a final training phase with binary cross-entropy. The
former, Mix-Up, lies midway between data augmentation and a loss function. The idea behind Mix-Up
is that a classifier may perform better if, instead of being trained on discrete examples (i.e., label 0 or 1), the
model is trained on interpolated examples (i.e., X% of a label-0 example and Y% of a label-1 example).
Therefore, instead of predicting 0 or 1, the model must predict the proportion of each label in the sample.
This technique has proven very useful, particularly in the field of computer vision, as it promotes more
linear behavior in the classifier. However, applying this loss directly to the field of NLP is not trivial;
thus, we implemented Mix-Up at the embedding level, inspired by the paper "Mixup-Transformer" [10].</p>
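The embedding-level Mix-Up idea can be sketched as follows. This is a simplified, framework-free illustration (the Mixup-Transformer approach interpolates encoder outputs inside the model; the function names here are hypothetical):

```python
import random

def beta_sample(alpha, rng):
    # Mix-Up draws the mixing coefficient lam ~ Beta(alpha, alpha)
    return rng.betavariate(alpha, alpha)

def mixup_embeddings(emb_a, emb_b, y_a, y_b, lam):
    """Interpolate two sentence embeddings and their one-hot labels."""
    mixed_emb = [lam * a + (1 - lam) * b for a, b in zip(emb_a, emb_b)]
    mixed_y = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return mixed_emb, mixed_y

rng = random.Random(0)
lam = beta_sample(0.2, rng)  # alpha = 0.2, one of the configurations tested
emb, y = mixup_embeddings([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], 0.25)
```

The classifier head is then trained against the soft label `y` (here 25% class 0, 75% class 1) rather than a hard 0/1 target.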
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Data Augmentation</title>
        <p>Regarding data augmentation, we tested several different approaches. First, we tried rewriting the
dataset’s sentences using Llama-3 8B. In addition, we also used Llama to translate the Spanish dataset of
Task 1 of this challenge into English to increase the training data. Finally, we tested
more common augmentation strategies using the nlpaug library [11]. Specifically, we decided to apply:
word replacement (WR), random word insertion (WI), and synonym replacement (SR). On the one hand,
for the WR and WI configurations, we used a bert-base model, assigning a percentage of words to insert or
replace, while ensuring that the replaced words were not stop-words. On the other hand, for SR we used
WordNet [12].</p>
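The synonym-replacement strategy can be sketched as follows. This toy version uses a hand-made synonym dictionary as a stand-in for WordNet (the paper used nlpaug's WordNet-backed augmenter; the dictionary and function below are illustrative only):

```python
import random

# Toy stand-in for a WordNet lookup; nlpaug's SynonymAug does this for real.
SYNONYMS = {
    "hidden": ["covert", "secret"],
    "group": ["organization", "cabal"],
    "event": ["incident", "occurrence"],
}
STOP_WORDS = {"a", "the", "by", "was", "this"}

def synonym_replace(sentence, frac=0.5, seed=1):
    """Replace a fraction of eligible (non-stop-word) tokens with a synonym."""
    rng = random.Random(seed)
    tokens = sentence.split()
    eligible = [i for i, t in enumerate(tokens)
                if t.lower() in SYNONYMS and t.lower() not in STOP_WORDS]
    n_replace = max(1, round(len(eligible) * frac)) if eligible else 0
    for i in rng.sample(eligible, n_replace):
        tokens[i] = rng.choice(SYNONYMS[tokens[i].lower()])
    return " ".join(tokens)

original = "the event was planned by a hidden group"
augmented = synonym_replace(original)
```

Each augmented sentence keeps its original label, so the technique cheaply multiplies the training data while preserving meaning.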
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Validation Framework</title>
        <p>We used the 10% of data points reserved for testing to measure the performance of the models, evaluated
using the Matthews Correlation Coefficient (MCC). Additionally, due to the large variability
in the results between experiments with the same configuration and different seeds, we used 5-fold
cross-validation with 10 different random seeds, resulting in a total of 50 single models and 10 5-model
ensembles per experiment. Moreover, to compare the ensembles against the single models, we
evaluated each of the 50 models against the test set, as well as the 10 ensembles resulting from each
5-fold run, and calculated the median and IQR of the MCCs obtained.</p>
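The MCC used throughout the evaluation is computed from the confusion matrix; a minimal implementation (equivalent to scikit-learn's `matthews_corrcoef` for binary labels):

```python
import math

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Degenerate confusion matrices (a zero row or column) conventionally score 0
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Unlike accuracy, MCC accounts for all four confusion-matrix cells, which makes it a robust choice for this unbalanced dataset: 1.0 is a perfect prediction, 0.0 is no better than chance.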
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental results</title>
      <p>For fine-tuning the models, we followed the proposals by [13]. They recommend a batch size of 16,
the Adam optimizer with bias correction, a learning rate of 2e-5, and a triangular scheduler with a linear
increase during the first 10% of steps, followed by a linear decay to zero. Although the paper suggests
training the models for up to 20 epochs, noting that overtraining does not seem to have a negative
impact, we observed no improvement beyond epoch 3. Thus, to expedite testing for the competition,
we decided to train all models for only 5 epochs. The selected hyperparameters are available in Table 1.</p>
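The triangular schedule described above can be sketched as a step-to-learning-rate function (a simplified version of what `transformers`' linear-warmup schedulers compute; the function name is ours):

```python
def triangular_lr(step, total_steps, peak_lr=2e-5, warmup_frac=0.10):
    """Linear warm-up over the first 10% of steps, then linear decay to zero."""
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return peak_lr * (step + 1) / warmup        # ramp up to the peak
    return peak_lr * (total_steps - step) / (total_steps - warmup)  # decay
```

With 100 total steps, the rate climbs to 2e-5 by step 9, then falls linearly, reaching zero at the final step.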
      <p>For the baseline comparison, we selected a small BERT model (bert-base) with no data augmentation,
fine-tuned on the training dataset with a cross-entropy loss function. We consider a technique
worthwhile if it improves the performance of this baseline model. Table 2 shows the results of the
experiments for the small BERT model (bert-base). The first row shows the results of the baseline
model, i.e., small BERT with no data augmentation and a cross-entropy loss function. The baseline model
achieves a median MCC of 0.7934 when testing the 50 models and a median MCC of 0.8156 when testing
the 10 ensembles of 5 folds. As the results in this table show, only experiment number 8
improves on the baseline. Each experiment is described in detail below.</p>
      <p>Rows 2 and 3 show the results for the Mix-Up training loss. These experiments show a slight
improvement over the baseline of +0.005 in the median MCC of the 50 models but a performance drop
of 0.005 when ensemble models are used. This occurred in both experiments tested: one using a mixing
distribution Beta(0.1, 0.1) and another using Beta(0.2, 0.2). These results led us to discard this technique, as it
does not present a clear improvement over the baseline and adds considerable complexity and overhead
to the training process.</p>
      <p>Similarly, rows 4-5 show the results for the Supervised Contrastive Learning (SCL) loss. Row 4 shows
the results for the original version proposed by the authors, using their recommended configuration of a
temperature of 0.3 and an objective-function weighting of 0.9 for the distance between embeddings and
0.1 for cross-entropy. Row 5 shows our alternative, where the embedding-distance term was used for the first
2 epochs and cross-entropy for the remaining 3. The results were poor. On the one hand, although the original
approach showed a very slight improvement of 0.001 in the results of the 50 models, it
produced a drop of 0.01 in MCC when evaluating the ensembles. On the other hand, the version
proposed by us performed significantly worse than the baseline, with a 0.008 decrease in the median
MCC of the 50 models and a drop of 0.03 when evaluating the ensembles. Considering
these results, the effectiveness of this method cannot be assured for this problem. Nevertheless, fully
testing the method would require an exhaustive search for optimal parameters; given the
nature of the challenge, we decided to explore other, more promising avenues.</p>
      <p>The sixth row involved training with a balanced dataset ensuring 50% positive and 50%
negative examples, created by oversampling the minority class. As we can
see, the individual models performed similarly to the baseline; however, there was a 0.004 decrease in
performance in the ensembles. Since no significant improvement was observed, this modification was
discarded for future iterations.</p>
      <p>Row seven shows the results of augmenting the training dataset by asking Llama-3 8B to rewrite
the original sentences (llama_aug). The results showed a clear detriment to the performance of both
individual models and ensembles, leading to the decision to abandon this line of experimentation for
future iterations.</p>
      <p>Finally, the last row shows the result of extending the dataset with additional data obtained by
translating the Spanish dataset from Task 1 of this competition into English (sp_into_en). As we can
see, this approach is undoubtedly the only successful addition to the model so far, providing an average
improvement of +0.025 in the evaluation of individual models and +0.018 in the ensembles.</p>
      <p>Once the different configurations were tested on the BERT-base model, we developed a series of
experiments to test these configurations on larger models. The results are available in Table 3. The
first row shows the baseline for this round of experiments. Compared with the results in Table 2,
we can see that increasing the BERT model size provides a significant performance boost, with an
improvement in MCC of 0.023 in individual models and 0.026 in the ensembles. Given the success of
the large model, we tried adding the sp_into_en data augmentation, which had yielded good results on
BERT-base. Row 2 shows these results; as we can see, this combination achieves a solid improvement,
raising the median by 0.012 and 0.0147 in individual models and ensembles, respectively.</p>
      <p>Next, rows 3 and 4 show the results of the baseline configuration with two new models, RoBERTa-large
and a BERT-large model pre-trained with texts related to the COVID-19 pandemic (bert-large-covid).
The former shows a mild improvement over the BERT-large benchmark, with an increase of 0.01 in the
median of the individual models; however, it only shows an improvement of 0.002 in the ensembles.
Meanwhile, the latter model, bert-large-covid, presents a significant improvement over all previously
tested models, with an improvement of 0.034 in the single model and 0.036 in the ensembles. Given
these good results, the rest of the experiments will focus on the bert-large-covid model.</p>
      <p>Row 5 shows the results of incorporating the sp_into_en data augmentation into the bert-large-covid
model. To our surprise, this caused a significant performance drop, with losses of 0.03 and
0.04 in the individual models and ensembles, respectively.</p>
      <p>Finally, rows 6-9 show experiments with simple data augmentation techniques such as synonym
replacement (SR), word replacement (WR), and insertions with BERT-base (WI). As we can see, synonym
replacement was the technique that yielded the best results, providing a slight improvement, while
word replacement and random insertion negatively impacted the models.</p>
      <p>Based on these results, the final model used for the submission of task 1 in its English version was
an ensemble averaging the predictions of all the trained bert-large-covid SR 0.5 models. This model
obtained an MCC of 0.8149, F1-MACRO of 0.9072, F1-CONSPIRACY of 0.8770, and F1-CRITICAL of
0.9374, resulting in 8th place in the ranking.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions and future work</title>
      <p>In this work, we tackled the challenge of distinguishing between conspiracy theories and critical thinking
using advanced NLP models. Specifically, we built ensembles using variations of the BERT model,
including BERT-base, BERT-large, and RoBERTa. We experimented with different loss functions, such
as cross-entropy, Mix-Up, and Supervised Contrastive Loss, and used data augmentation techniques like
synonym replacement and random word insertion. From our experimentation, we conclude that
increasing the BERT model size significantly boosts performance, with bert-large-covid showing the
best results for future experiments. Additionally, our experimentation shows that the classic cross-entropy
loss achieves better results than more complex techniques like Mix-Up and Supervised Contrastive Loss.
Finally, we conclude that simpler data augmentation techniques, in particular synonym replacement,
work better than more sophisticated techniques involving state-of-the-art LLMs. Nevertheless,
more experimentation is needed with the prompts used for the LLMs, for example by including some examples
in them. Additionally, it would be interesting to try more models, for example, pre-training RoBERTa
on a large COVID corpus and then fine-tuning it for classification.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>This work has been supported by MICINN under FightDIS (PID2020-117263GB-I00); by MCIN/AEI/10.13039/501100011033 and the European Union NextGenerationEU/PRTR for the XAI-Disinfodemics grant (PLEC2021-007681); and by project PCI2022-134990-2 (MARTINI) of the CHISTERA IV Cofund 2021 programme, funded by MCIN/AEI/10.13039/501100011033 and by the “European Union NextGenerationEU/PRTR”.</p>
    </sec>
    <sec id="sec-6">
      <title>References</title>
      <p>[5] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.</p>
      <p>[6] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).</p>
      <p>[7] M. Müller, M. Salathé, P. E. Kummervold, COVID-Twitter-BERT: A natural language processing model to analyse COVID-19 content on Twitter, Frontiers in Artificial Intelligence 6 (2023) 1023281.</p>
      <p>[8] H. Zhang, M. Cisse, Y. N. Dauphin, D. Lopez-Paz, mixup: Beyond empirical risk minimization, in: International Conference on Learning Representations, 2018.</p>
      <p>[9] B. Gunel, J. Du, A. Conneau, V. Stoyanov, Supervised contrastive learning for pre-trained language model fine-tuning, in: International Conference on Learning Representations, 2021.</p>
      <p>[10] L. Sun, C. Xia, W. Yin, T. Liang, S. Y. Philip, L. He, Mixup-Transformer: Dynamic data augmentation for NLP tasks, in: Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 3436–3440.</p>
      <p>[11] E. Ma, NLP augmentation, https://github.com/makcedward/nlpaug, 2019.</p>
      <p>[12] G. A. Miller, WordNet: A lexical database for English, Communications of the ACM 38 (1995) 39–41.</p>
      <p>[13] M. Mosbach, M. Andriushchenko, D. Klakow, On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines, in: 9th International Conference on Learning Representations, 2021.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fisher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Cox</surname>
          </string-name>
          , P. Hermann, Pizzagate: From rumor, to hashtag, to gunfire in dc, Washington Post 6 (
          <year>2016</year>
          )
          <fpage>8410</fpage>
          -
          <lpage>8415</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Connelly</surname>
          </string-name>
          ,
          <article-title>Misinformation of covid-19 vaccines and vaccine hesitancy</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>12</volume>
          (
          <year>2022</year>
          )
          <fpage>13681</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. B.</given-names>
            <surname>Casals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elnagar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Freitag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Korenčić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          , et al.,
          <article-title>Overview of PAN 2024: multi-author writing style analysis, multilingual text detoxification, oppositional thinking analysis, and generative AI authorship verification</article-title>
          ,
          <source>in: European Conference on Information Retrieval</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Korenčić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bonet Casals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Taulé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , F. Rangel,
          <source>Overview of the oppositional thinking analysis PAN task at CLEF 2024</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>