=Paper=
{{Paper
|id=Vol-3740/paper-233
|storemode=property
|title=DSVS at PAN 2024: Ensemble Approach of Transformer-based Language Models for Analyzing
Conspiracy Theories Against Critical Thinking Narratives
|pdfUrl=https://ceur-ws.org/Vol-3740/paper-233.pdf
|volume=Vol-3740
|authors=Sergio Damián,Brian Herrera,David Vázquez,Hiram Calvo,Edgardo Felipe-Riverón,Cornelio Yáñez-Márquez
|dblpUrl=https://dblp.org/rec/conf/clef/DamianHVCRY24
}}
==DSVS at PAN 2024: Ensemble Approach of Transformer-based Language Models for Analyzing
Conspiracy Theories Against Critical Thinking Narratives==
Notebook for PAN at CLEF 2024
Sergio Damián1,* , Brian Herrera1 , David Vázquez1 , Hiram Calvo1 , Edgardo Felipe-Riverón1
and Cornelio Yáñez-Márquez1
¹Centro de Investigación en Computación (CIC), Instituto Politécnico Nacional (IPN), Mexico City, Mexico
Abstract
This paper presents a comprehensive analysis of ensemble models for the shared task "Conspiracy Theories
Against Critical Thinking Narratives" for PAN at CLEF 2024. Through a data collection involving Telegram
conversations on COVID-19, two distinct corpora in English and Spanish were assembled and manually labeled
to differentiate between "critical" and "conspiracy" texts. The study employed ensemble models, comprising
seven trained transformer-based models per language-task pair, to address two key tasks: distinguishing between
critical and conspiracy texts (binary classification) and detecting spans for six different categories that can be
found in the texts (multi-label span classification). The results revealed the competitive performance of ensemble
models, which secured rankings above the mean of all participants’ results in both tasks.
Keywords
Conspiracy Theories, Critical Thinking Narratives, Multi-label Token Classification, Ensemble Model, Small
Language Models
1. Introduction
Conspiracy theories (CT) are narratives that seek to explain the causes of situations or events that are
significant for society, suggesting the existence of secret plans carried out by actors who abuse their
power to achieve their own objectives without caring about depriving people of their rights, freedoms,
prosperity, health, or knowledge [1, 2, 3]. These narratives can cause great harm, as they can modify the
behavior of people who believe in them, fostering attitudes that put both believers and other members
of society at risk. The potential risk increases when it comes to health-related conspiracy theories, as
they can lead some people to make decisions that are detrimental to their well-being and that of those
around them.
In addition to the behavioral change in believers of these theories, another significant harm is the
mistrust they generate towards various medical treatments and the decrease in trust in public health
institutions and health professionals. This hinders the implementation of public health measures and
the response to health emergencies. For these reasons, it is urgent to identify and address conspiracy
theories to mitigate their harmful effects.
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
Emails: sdamians2019@cic.ipn.mx (S. Damián); bherrerag2019@cic.ipn.mx (B. Herrera); dvazquezs2019@cic.ipn.mx (D. Vázquez);
hcalvo@cic.ipn.mx (H. Calvo); edgardo@cic.ipn.mx (E. Felipe-Riverón); cyanez@cic.ipn.mx (C. Yáñez-Márquez)
Websites: https://github.com/sdamians (S. Damián); https://github.com/Hiram02 (D. Vázquez); http://hiramcalvo.com/ (H. Calvo)
ORCID: 0000-0003-2836-2102 (H. Calvo)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
1.1. Medical conspiracy theories
Although conspiracy theories are not limited to the field of health, they have been a persistent issue
over the years, causing significant harm to the population. A clear example is the case of the smallpox
vaccine, discovered by Edward Jenner in 1796, which represented a monumental advance with the
potential to improve public health significantly. However, it also led to the creation of a CT [4]. It
is likely that people did not properly understand how it worked, which led to the spread of rumors
warning of horn growth resulting from its use.
This is not the only case of conspiracy theories related to vaccines; in fact, they have been a
recurring theme. For instance, in 1981, Dr. John Wilson claimed that the DPT vaccine caused convulsions
and brain damage [5]. In 1998, Andrew Wakefield published an article suggesting a link between the
MMR vaccine and autism [6], although it should be noted that this article was retracted by the journal
in which it was published. More recently, the COVID-19 pandemic has fueled the spread of numerous
conspiracy theories regarding vaccination against this virus [7].
1.2. Negative impacts of conspiracy theories
In general, the propagation of conspiracy theories can have several negative effects, among which the
following stand out:
• Social Division and Polarization: They exacerbate social divisions by promoting extreme and
exclusionary beliefs, hindering rational dialogue and societal cohesion.
• Dissemination of Misinformation: They contribute to spreading false and unverified information,
leading to confusion and potentially harmful decisions.
• Loss of Trust in Authorities and Experts: They foster distrust towards governmental, scientific,
and public health institutions, as well as towards experts in different areas.
• Psychological Impact: They induce anxiety, fear, and paranoia among believers, negatively
affecting their emotional and mental well-being.
• Impaired Decision-Making: Believers may base decisions on misinformation or biased information,
impeding informed and rational decision-making processes.
In the health field, conspiracy theories have had significant adverse effects. In Pakistan, for example,
there is a belief that the polio vaccine was developed by the CIA to sterilize Muslim men [8], which
has led many people to reject it. Another example is the theory that the U.S. government created
HIV/AIDS to reduce the African-American population, a widespread belief among this community that
has resulted in less frequent condom use [9].
Furthermore, certain sectors of society maintain mistrust towards specific drugs, alleging they inflict
greater harm than the diseases they aim to cure. For instance, there exists a theory attributing the
majority of deaths among AIDS patients to retroviral drugs. This conspiracy theory holds particular
influence in sub-Saharan Africa, where it receives support from influential figures [10].
There are several reasons why conspiracy theories can be widely spread. Among the most prominent
ones is their propagation by celebrities through digital media [11], which causes many of their followers
to start believing in them. In addition, the difficulty of conclusively determining their falsity, together
with the degree of plausibility attributed to them by each person [12], significantly contributes to their
dissemination. Critical thinking can help people to better evaluate the information they receive in
daily life and thus avoid fraud and harmful habits. For example, critical thinking can be useful in
differentiating reliable medical information from unfounded claims, helping in decision-making about
appropriate treatments and lifestyle. When a person with high levels of intelligence but low levels of
critical thinking believes in a conspiracy theory, they can generate well-constructed arguments in support
of the false information [13]. These arguments can be quickly propagated through digital media
and are difficult to detect.
This year’s goal at PAN 2024 is to analyze texts reflecting oppositional thinking, specifically
distinguishing between conspiracy theories and critical thinking narratives. This task addresses two
significant challenges for the NLP community: (subtask 1, a binary classification task) differentiating
between conspiracy and critical narratives, and (subtask 2, a multi-label span classification task)
identifying key elements of narratives that fuel intergroup conflict. Making this distinction is crucial because
mislabeling a text as conspiratorial when it is merely oppositional to mainstream views could push
individuals who are simply questioning mainstream perspectives closer to conspiracy communities
[14, 15].
2. Related Work
Conspiracy theories represent a significant danger, as they can negatively influence people’s behavior,
affecting trust in institutions and fostering disinformation. Intelligence is often thought of as
synonymous with critical thinking; however, these terms are not the same. In reality, intelligence alone does
not always translate into critical thinking. Throughout history, there have been people with high
levels of intelligence who nevertheless have demonstrated a lack of critical thinking in some areas; for
instance, Sir Arthur Conan Doyle, a brilliant writer, believed in spiritualism and fairies despite
clear evidence to the contrary [16].
A recent study [13] explores the connections between critical thinking, intelligence, and the
predisposition to believe in conspiracy theories. The authors note that while intelligence can help people
formulate more sophisticated arguments, it does not always protect them from false beliefs. On the
other hand, critical thinkers use logical rules, standards of evidence and other criteria that must be met
for the product of a thought to be considered good, making them less likely to believe in unsubstantiated
claims.
Intelligence is generally associated with good cognitive processing or intellectual abilities and the
potential to learn and reason well. Intelligent people tend to perform well in basic real-world domains,
such as academic performance and job success, but sometimes find it difficult to adapt to other real-world
situations [17]. Intelligence without critical thinking can sometimes result in more convincing
arguments that support false beliefs. These persuasive arguments can mislead many people into
accepting these false ideas.
In a companion study [18], the impact of cognitive styles, such as analytical thinking, critical thinking,
and scientific reasoning, on the propensity to believe in conspiracy theories was examined. The findings
suggest that individuals who exhibit a stronger inclination towards analytical thinking and scientific
reasoning are less susceptible to conspiracy theories due to their more rigorous and evidence-based
approach to evaluating information.
As a matter of fact, in recent years, there has been a notable surge in the recognition and analysis of
conspiracy theories. This trend mirrors the growing acknowledgment of the significant impact that
misinformation and disinformation can have on societies, particularly in the age of digital
interconnectedness. Research endeavors [19] have increasingly focused on understanding the dynamics behind the
propagation of conspiracy theories.
However, it’s crucial to recognize that the identification and mitigation of conspiracy theories are
part of a broader spectrum of tasks aimed at combating misinformation and preserving the integrity of
information ecosystems. Alongside the detection of conspiracy theories, researchers and practitioners
are also confronted with related challenges, including the identification and containment of rumors [20],
the mitigation of the spread of fake news [21], the recognition of clickbait content [22] designed to
manipulate user engagement, and the indispensable task of fact-checking [23].
In this contemporary landscape, where information dissemination is facilitated by sophisticated
technologies and platforms, the importance of discerning false information from genuine content cannot
be overstated. The rise of AI-driven text generation capabilities, for instance, presents both opportunities
and challenges. On one hand, these advancements offer innovative approaches to understanding and
combating misinformation. On the other hand, they underscore the urgency of developing robust
mechanisms to differentiate between authentic and fabricated texts.
3. Dataset Preprocessing
The data collection involved gathering textual data from Telegram conversations concerning COVID-19.
These texts were then manually labeled to distinguish between "critical" and "conspiracy" categories.
Two corpora were employed for this study, one in English and the other in Spanish. Each set contained
4000 entries for training purposes and an additional 1000 entries for testing [24]. In this work, k-fold
cross-validation was not implemented due to limited computational resources. Instead, 10% of the
training set was split off for validation experiments, preserving the initial class balance provided, as
shown in Table 1. The main hypothesis was that a single train-validation split could yield more
consistent model selection than averaging results over multiple folds, especially for subtask 2.
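Such a stratified 90/10 split can be sketched in a few lines of pure Python; this is a minimal illustration, not the authors' actual code (the function name and toy labels are ours):

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.1, seed=0):
    """Split example indices into train/val sets, preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    train_idx, val_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_val = round(len(idxs) * val_fraction)  # per-class validation size
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy imbalanced label list (the real corpora have 4000 entries each)
labels = ["critical"] * 90 + ["conspiracy"] * 10
train_idx, val_idx = stratified_split(labels, val_fraction=0.1)
```

In practice, `train_test_split(..., stratify=labels)` from scikit-learn achieves the same effect.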
Table 1
Statistics of both corpora.
Language Label Numerical Label Train Val Total
English Conspiracy 0 1105 274 1379
English Critical 1 2466 155 2621
Total - - 3571 429 4000
Spanish Conspiracy 0 1178 284 1462
Spanish Critical 1 2373 165 2538
Total - - 3551 449 4000
The dataset entries had two different representations: the original sentence and the sentence split
into tokens designed for subtask 2. The approach implemented was to use the original sentence
representation for subtask 1, delegating the tokenization step to each model’s tokenizer, and to use the
list of tokens for subtask 2, trying to preserve the majority of the labeled tokens after the preprocessing
and cleaning step. The following text-cleaning procedures were implemented for both tasks:
• Small combinations of numbers and letters (with lengths ranging from 2 to 4) were removed.
• Combinations of alternating letters and numbers were removed (e.g. tokens such as 1df324D
identified in URLs).
• Special words for URLs were removed.
• English contractions such as ’re or n’t were normalized by using the complete word (are and not).
• Numbers in date format and hour were tagged using the labels date, hour for English and fecha,
hora for Spanish.
• The rest of the numbers were tagged using the label number.
• Repeated strings of three or more characters were normalized (e.g. aaa to a).
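A minimal sketch of the tagging and normalization rules above, assuming simple regular expressions (the exact patterns used by the authors are not published, so these regexes are illustrative assumptions):

```python
import re

DATE_RE = re.compile(r"\d{1,4}[/.-]\d{1,2}[/.-]\d{1,4}")            # e.g. 12/05/2021
HOUR_RE = re.compile(r"\d{1,2}:\d{2}")                               # e.g. 18:30
MIXED_RE = re.compile(r"(?:[A-Za-z]+\d+|\d+[A-Za-z]+)[A-Za-z\d]*")   # e.g. 1df324D

def clean_token(token, lang="en"):
    """Return the cleaned token, a tag, or '' if the token should be dropped."""
    date_tag, hour_tag = ("date", "hour") if lang == "en" else ("fecha", "hora")
    if DATE_RE.fullmatch(token):
        return date_tag
    if HOUR_RE.fullmatch(token):
        return hour_tag
    if token.isdigit():
        return "number"
    if MIXED_RE.fullmatch(token):
        return ""  # letter/digit combinations, e.g. URL fragments
    # collapse runs of three or more repeated characters (aaa -> a)
    return re.sub(r"(.)\1{2,}", r"\1", token)
```

Contraction expansion and URL-word removal would be handled by analogous token-level rules.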
Notably, both corpora exhibit an inherent class imbalance, with a larger proportion of inputs
labeled as "critical" than as "conspiracy", as illustrated in Figure 1.
4. Methodology
The baseline model provided was a transformer-based model designed for multitask learning to address
both tasks. While this generally leads to better results, it can also make the model more complex and
difficult to train, particularly in balancing the loss of both tasks to prevent one task from negatively
impacting the performance of the other.
The proposed solution in this work involved using an ensemble of transformer-based models in the
form of several Small Language Models (SLMs) to address each task-language pair independently, thus
training them as single-task learning models using low computational resources. This methodology
facilitated the aggregation of multiple logits and aimed to improve overall performance.

Figure 1: Dataset statistics for the English (left) and Spanish (right) corpora. The first pair of charts
illustrates the class imbalance in both datasets. The second pair shows the distribution of inputs by length
(number of tokens) and class, and the third pair provides a zoomed-in version of the second pair, giving a
clearer view of the inputs that exceed the maximum length typically handled by SLMs.

The training
process consisted of developing seven distinct SLMs for each language and task. Subsequently, the
top five models for each language-task pair were selected based on specific evaluation metrics. For
subtask 1, the official evaluation metrics were the Matthews correlation coefficient (MCC) [25] and
macro F1-score, while subtask 2 was evaluated using the span-F1 metric [26].
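For reference, both subtask 1 metrics can be computed directly from the confusion-matrix counts; the following pure-Python sketch mirrors their standard definitions (in practice, libraries such as scikit-learn provide equivalent functions):

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        scores.append(2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0)
    return sum(scores) / len(scores)
```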
Figure 2 illustrates the ensemble strategy utilized in this work for subtask 1. All logits obtained by
each SLM were multiplied by a weight based on the scores of the evaluation metrics. Subsequently, the
logits were aggregated and rounded to get the final outcome of the ensemble model. The same strategy
was applied for subtask 2, where instead of getting a single outcome per SLM, a matrix Y_n ∈ R^(j×k) was
obtained and aggregated afterwards, as depicted in Figure 3. In summary, two ensemble models were
evaluated per task-language pair, one using all seven trained SLMs and another using the top five
trained SLMs. Each ensemble model employed a mean voting classifier.
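The weighted mean voting scheme can be sketched as follows; the scores and weights below are illustrative assumptions, not values from the paper:

```python
def weighted_mean_vote(scores_per_model, weights):
    """Weighted mean of per-model scores, rounded to a hard 0/1 label.

    scores_per_model[m][i]: model m's probability-like score for input i.
    weights[m]: validation-metric-based weight for model m.
    """
    total = sum(weights)
    n_inputs = len(scores_per_model[0])
    preds = []
    for i in range(n_inputs):
        avg = sum(w * s[i] for w, s in zip(weights, scores_per_model)) / total
        preds.append(round(avg))  # aggregate, then round to the final class
    return preds

# Three hypothetical SLMs scoring two inputs; weights are made up
scores = [[0.9, 0.2], [0.8, 0.4], [0.3, 0.1]]
weights = [0.91, 0.90, 0.85]
preds = weighted_mean_vote(scores, weights)
```

For subtask 2 the same aggregation would run per token and per category, averaging each entry of the per-model score matrices before rounding.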
4.1. Small Language Models employed for English Corpus
This work’s rigorous selection process led to the identification of several transformer-based models for
both subtasks within the English corpus. The Transformers library by Hugging Face provides wrappers
for sequence classification and token classification tasks. The following list provides a concise
description of the models assessed.
Figure 2: Diagram of the ensemble model approach for subtask 1. All logits obtained by each model are averaged
and rounded to get the final prediction for each input.
Figure 3: Diagram of the ensemble model approach for subtask 2. All logits obtained by each model are averaged
and rounded to get the final per-token predictions for each input.
• BERT [27]: Demonstrates a significant performance in understanding context and semantics,
making it a natural choice. The baseline provided was constructed utilizing it.
• RoBERTa [28]: Employs an optimized pretraining and can achieve better results than BERT.
• BigBird [29]: Handles long sequences through sparse attention mechanisms. The English corpus
comprises several long sequences of tokens that exceed the typical maximum length (512) accepted
by SLMs.
• Electra [30]: Utilizes a generator-discriminator architecture for enhanced efficiency, offering
robustness against adversarial attacks and enhancing generalization capabilities.
• T5 [31]: Adopts a text-to-text framework that can handle diverse tasks. Although it is a text-
generating model, it can be used as a binary classifier by adding a classification module (a
linear layer on top of the pooled output). For classification tasks, the output of the first token is
processed and classified. Hugging Face provides an implementation of this model variant.
• XLM-RoBERTa [32]: Extends RoBERTa to multiple languages, producing distinct representations
of the inputs, potentially offering a complementary perspective on the tasks.
• MDeBERTa [33]: Designs efficient multilingual representations like XLM-RoBERTa, thereby
providing another perspective of the tasks.
Table 2 displays the metric outcomes for each SLM on subtask 1 for the English corpus. Notably,
the MDeBERTa and T5 models achieved the most favorable results, outperforming the rest. Conversely,
Table 3 showcases the results for the macro span-F1 metric associated with subtask 2 on the English
corpus. Here, the multilingual model MDeBERTa and Electra emerged as the best models, while T5
exhibited comparatively poor results.
Table 2
Evaluation results for each SLM trained for subtask 1 on the English corpus
Model MCC F1-Macro F1-Conspiracy F1-Critical
XLM-RoBERTa 0.8005 0.8960 0.9256 0.8664
T5 0.8358 0.9123 0.9395 0.8851
RoBERTa 0.8204 0.9053 0.9336 0.8771
MDeBERTa 0.8347 0.9131 0.9388 0.8874
Electra 0.8292 0.9110 0.9367 0.8852
BigBird 0.8315 0.9096 0.9378 0.8814
BERT 0.8247 0.9086 0.9348 0.8824
Table 3
Evaluation results for each SLM trained for subtask 2 on the English corpus
Model Campaigner Neg Effect Objective Victim Agent Facilitator F1-Macro
BERT 0.5712 0.4983 0.3305 0.5544 0.6631 0.3434 0.4935
BigBird 0.6107 0.4967 0.3991 0.5903 0.5814 0.3852 0.5106
Electra 0.5962 0.5140 0.3651 0.6121 0.6873 0.3964 0.5285
MDeBERTa 0.6257 0.5160 0.4048 0.6118 0.6857 0.4263 0.5450
RoBERTa 0.6365 0.5205 0.3650 0.5897 0.6724 0.3561 0.5233
T5 0.4632 0.4042 0.2621 0.5344 0.5959 0.2231 0.4138
XLM-RoBERTa 0.6274 0.5093 0.3500 0.5947 0.6457 0.3433 0.5117
4.2. Models employed with Spanish corpus
In alignment with the specific demands of the Spanish corpus, a tailored selection of seven models was
employed, all implemented using the Hugging Face Transformers library. The following list provides a
description of these models:
• BETO [34]: Encompasses proficient linguistic understanding and contextual comprehension of
the Spanish language, and it served as the baseline model for the subtasks.
• Bertin [35]: Contributes to the linguistic analysis of Spanish language, providing an alternative
model for addressing linguistic nuances.
• MarIA [36]: Demonstrates proficiency and efficacy in addressing the complexities of the Spanish
language, having been trained on large amounts of Spanish text.
• TwHIN-BERT [37]: Enhances capabilities in processing linguistic structures, being tailored for
hate speech detection in Spanish, particularly on social media.
• mT5 [38]: Offers a multilingual variant of the T5 model, and enriches the analytical repertoire
available for the Spanish language. It is also a generative text model.
• XLM-RoBERTa [32]: Proposes another variant of the inputs, offering an additional multilingual
perspective on the tasks.
• MDeBERTa [33]: As a third multilingual representation, it offers valuable insights, augmenting
the analytical approach of the solution.
Table 4 presents the metric outcomes for each trained model concerning subtask 1 for the Spanish
language. Remarkably, MarIA and MDeBERTa demonstrated the most promising results, surpassing their
counterparts. On the other hand, Table 5 delineates the results for subtask 2 on the Spanish corpus.
For this language, the multilingual model MDeBERTa emerged as the leading performer, while mT5
displayed relatively poor results, mirroring the outcomes obtained by its counterpart in the English
experiments.
Table 4
Evaluation results for each SLM trained for subtask 1 on the Spanish corpus
Model MCC F1-Macro F1-Conspiracy F1-Critical
Bertin 0.6204 0.8033 0.8515 0.7552
BETO 0.6694 0.8284 0.8723 0.7844
MarIA 0.7029 0.8437 0.8862 0.8012
MDeBERTa 0.6882 0.8371 0.8803 0.7939
mT5 0.2750 0.4782 0.7362 0.2203
TwHIN-BERT 0.6539 0.8154 0.8671 0.7669
XLM-RoBERTA 0.6250 0.8113 0.8508 0.7718
Table 5
Evaluation results for each SLM trained for subtask 2 on the Spanish corpus
Model Campaigner Neg Effect Objective Victim Agent Facilitator F1-Macro
Bertin 0.6559 0.6354 0.2850 0.6239 0.5556 0.4279 0.5306
BETO 0.6785 0.6506 0.3104 0.5965 0.5314 0.4350 0.5561
MarIA 0.6740 0.6055 0.3214 0.5965 0.5314 0.4350 0.5273
MDeBERTa 0.7117 0.6554 0.3291 0.6392 0.6064 0.4988 0.5742
mT5 0.6312 0.6034 0.3474 0.5801 0.5246 0.3268 0.5022
TwHIN-BERT 0.6597 0.6369 0.3475 0.6396 0.5700 0.5015 0.5592
XLM-RoBERTA 0.6908 0.6756 0.3309 0.6367 0.5939 0.4861 0.5690
5. Results
The shared task allowed a maximum of two submissions per subtask. For our submissions, we opted to
present two ensemble models per subtask: an ensemble version comprising all seven models trained
per language-task pair, alongside another submission featuring the top five models. Table 6 provides a
comprehensive overview of the official results attained per submission for subtask 1, incorporating the
attained placement, while Table 7 delineates the results for subtask 2. The best models were selected
based on the Matthews correlation coefficient (MCC) metric for subtask 1 and the span-F1 metric for
subtask 2. Due to complications encountered during the experimentation phase,
the evaluation of the ensemble model comprising the top 5 models for Spanish was precluded. For
subtask 1, the optimal ensemble model surpassed the baseline performance for the English language.
However, the submitted ensemble model for the Spanish language did not exhibit a similar performance.
Conversely, for subtask 2, the optimal ensemble model successfully outperformed the baselines for both
languages. In this subtask, the ensemble model with five learners was the best approach for the English
language, while the ensemble model with seven learners was the best for the Spanish language. The
results obtained for the Spanish language were significantly higher than the baseline, which suggests
that the learners successfully contributed complementary information to the final solutions.
Table 6
Test results for each submission for subtask 1. The baseline is included for comparison purposes.
Language Model MCC F1-Macro F1-Conspiracy F1-Critical Rank
English Ensemble (7 models) 0.7970 0.8985 0.8674 0.9296 14/83
English Ensemble (5 models) 0.7943 0.9071 0.9080 0.9061
English Baseline-BERT 0.7964 0.8975 0.8632 0.9318
Spanish Ensemble (7 models) 0.6462 0.8231 0.7753 0.8708 29/78
Spanish Baseline-BETO 0.6681 0.8339 0.7872 0.8806
Table 7
Test results for each submission for subtask 2. The baseline is included for comparison purposes.
Language Model span-F1 span-P span-R micro-span-F1 Rank
English Ensemble (7 models) 0.5460 0.5287 0.5774 0.5133
English Ensemble (5 models) 0.5598 0.5332 0.6012 0.5287 12/28
English Baseline-BERT 0.5323 0.4684 0.6334 0.4998
Spanish Ensemble (7 models) 0.5529 0.5384 0.5785 0.5323 07/25
Spanish Ensemble (5 models) 0.5483 0.5210 0.5873 0.5383
Spanish Baseline-BETO 0.4934 0.4533 0.5621 0.4952
6. Conclusions
The ensemble model’s combination of diverse SLM architectures contributed to robustness and
generalization, thereby enhancing performance across both tasks. However, certain limitations and areas
for improvement were identified. A small fixed validation set was used, but a cross-validation strategy
might lead to better performance, especially for obtaining more accurate weights for the base models.
The ensemble used a weighted mean voting classifier that could be replaced with a more sophisticated
meta-model such as a logistic regression classifier. The single-task learning approach did not outperform
all the baseline results obtained using a multitask learning approach. The shared knowledge from both
subtasks might enhance the results and the generalization of the final predictions.
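A logistic-regression meta-model of the kind suggested above could stack the base-model scores; the following is a minimal gradient-descent sketch (all names, toy data, and hyperparameters are illustrative, not part of the submitted system):

```python
import math

def train_logistic_stacker(base_scores, labels, lr=0.5, epochs=200):
    """Fit a tiny logistic-regression meta-model over base-model scores."""
    n_models = len(base_scores[0])
    w, b = [0.0] * n_models, 0.0
    for _ in range(epochs):
        for x, y in zip(base_scores, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))    # sigmoid
            g = p - y                          # gradient of log-loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def stacker_predict(x, w, b):
    """Hard 0/1 decision from the learned meta-model."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Toy validation scores from two hypothetical base SLMs
X = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
y = [1, 1, 0, 0]
w, b = train_logistic_stacker(X, y)
```

Unlike the fixed metric-based weights, the stacker learns how much to trust each base model directly from held-out predictions.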
The disparities in performance between tasks could be attributed to the inherent complexity and
ambiguity associated with detecting different classes among texts, necessitating further exploration and
refinement of the approach’s methodologies and feature representations. By leveraging insights gleaned
from the model performance analysis, future iterations of the ensemble model can be refined to enhance
robustness and efficacy within the domain of conspiracy theories and critical thinking narratives.
Acknowledgments
This work was done with partial support from the Mexican Government through Consejo Nacional de
Humanidades Ciencias y Tecnologías (CONAHCYT) and Instituto Politécnico Nacional (IPN).
References
[1] M. R. X. Dentith, M. Orr, Secrecy and conspiracy, Episteme 15 (2018) 433–450. doi:10.1017/epi.2017.9.
[2] C. R. Sunstein, A. Vermeule, Conspiracy theories: Causes and cures, Journal of Political Philosophy
17 (2009) 202–227. URL: https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9760.2008.00325.x.
doi:10.1111/j.1467-9760.2008.00325.x.
[3] J. E. Uscinski, J. M. Parent, American Conspiracy Theories, Oxford University Press, 2014.
URL: https://doi.org/10.1093/acprof:oso/9780199351800.001.0001. doi:10.1093/acprof:oso/9780199351800.001.0001.
[4] E. Dubé, M. Vivion, N. E. MacDonald, Vaccine hesitancy, vaccine refusal and the anti-vaccine
movement: influence, impact and implications, Expert Review of Vaccines 14 (2015) 99–117.
URL: https://doi.org/10.1586/14760584.2015.964212. doi:10.1586/14760584.2015.964212. PMID: 25373435.
[5] J. T. Wilson, DPT vaccine and serious neurological illness: current status of the controversy,
Pediatrics 68 (1981) 650–651.
[6] A. Wakefield, S. Murch, A. Anthony, J. Linnell, D. Casson, M. Malik, M. Berelowitz, A. Dhillon,
M. Thomson, P. Harvey, A. Valentine, S. Davies, J. Walker-Smith, Ileal-lymphoid-nodular
hyperplasia, non-specific colitis, and pervasive developmental disorder in children, The Lancet 351
(1998) 637–641.
[7] N. Corbu, R. Buturoiu, V. Frunzaru, G. Guiu, Vaccine-related conspiracy and counter-conspiracy
narratives. Silencing effects, Communications 49 (2024) 339–360. URL: https://doi.org/10.1515/commun-2022-0022.
doi:10.1515/commun-2022-0022.
[8] G. E. Andrade, A. Hussain, Polio in Pakistan: Political, sociological, and epidemiological factors,
Cureus 10 (2018) e3502. doi:10.7759/cureus.3502.
[9] L. M. Bogart, S. T. Bird, Exploring the relationship of conspiracy beliefs about HIV/AIDS to sexual
behaviors and attitudes among African-American adults, Journal of the National Medical Association
95 (2003) 1057.
[10] P. Fourie, M. Meyer, The Politics of AIDS Denialism, Routledge, New York, 2010.
[11] G. Andrade, Medical conspiracy theories: cognitive science and implications for ethics, Medicine,
Health Care and Philosophy 23 (2020) 505–518. URL: https://doi.org/10.1007/s11019-020-09951-6.
doi:10.1007/s11019-020-09951-6.
[12] M. Frenken, A. Reusch, R. Imhoff, “just because it’s a conspiracy theory doesn’t mean they’re
not out to get you”: Differentiating the correlates of judgments of plausible versus implausible
conspiracy theories, Social Psychological and Personality Science (2024) 19485506241240506. URL:
https://doi.org/10.1177/19485506241240506. doi:10.1177/19485506241240506.
[13] D. A. Bensley, Critical thinking, intelligence, and unsubstantiated beliefs: An integrative review,
Journal of Intelligence 11 (2023). URL: https://www.mdpi.com/2079-3200/11/11/207.
doi:10.3390/jintelligence11110207.
[14] A. A. Ayele, N. Babakov, J. Bevendorff, X. Bonet Casals, B. Chulvi, D. Dementieva, A. Elnagar,
D. Freitag, M. Fröbe, D. Korenčić, M. Mayerl, D. Moskovskiy, A. Mukherjee, A. Panchenko,
M. Potthast, F. Rangel, N. Rizwan, P. Rosso, F. Schneider, A. Smirnova, E. Stamatatos, B. Stein, M. Taulé,
D. Ustalov, X. Wang, M. Wiegmann, S. M. Yimam, E. Zangerle, Overview of PAN 2024: Multi-author
writing style analysis, multilingual text detoxification, oppositional thinking analysis, and
generative AI authorship verification - condensed lab overview, in: Proceedings of the Fifteenth
International Conference of the CLEF Association CLEF-2024, Springer, 2024, pp. 3–10.
[15] D. Korenčić, B. Chulvi, X. Bonet Casals, M. Taulé, P. Rosso, F. Rangel, Overview of the oppositional
thinking analysis PAN task at CLEF 2024, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de
Herrera (Eds.), Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[16] T. Waters, Magic and the british middle classes, 1750–1900, Journal of British Studies 54 (2015)
632–653. URL: http://www.jstor.org/stable/24702123.
[17] D. F. Halpern, D. S. Dunn, Critical thinking: A model of intelligence for solving real-world
problems, Journal of Intelligence 9 (2021). URL: https://www.mdpi.com/2079-3200/9/2/22.
doi:10.3390/jintelligence9020022.
[18] B. Gjoneska, Conspiratorial beliefs and cognitive styles: An integrated look on analytic thinking,
critical thinking, and scientific reasoning in relation to (dis)trust in conspiracy theories, Frontiers
in Psychology 12 (2021). URL: https://www.frontiersin.org/journals/psychology/articles/10.3389/
fpsyg.2021.736838. doi:10.3389/fpsyg.2021.736838.
[19] A. Giachanou, B. Ghanem, P. Rosso, Detection of conspiracy propagators using psycho-linguistic
characteristics, Journal of Information Science 49 (2023) 3–17.
URL: https://doi.org/10.1177/0165551520985486. doi:10.1177/0165551520985486.
[20] G. Gorrell, E. Kochkina, M. Liakata, A. Aker, A. Zubiaga, K. Bontcheva, L. Derczynski, SemEval-2019
Task 7: RumourEval 2019: Determining rumour veracity and support for rumours, in: Proceedings
of the 13th International Workshop on Semantic Evaluation: NAACL HLT 2019, Association for
Computational Linguistics, 2019, pp. 845–854.
[21] N. Capuano, G. Fenza, V. Loia, F. D. Nota, Content-based fake news detection with machine and
deep learning: A systematic review, Neurocomputing 530 (2023) 91–103.
[22] A. Anand, T. Chakraborty, N. Park, We used neural networks to detect clickbaits: You won’t
believe what happened next!, in: Advances in Information Retrieval: 39th European Conference
on IR Research, ECIR 2017, Aberdeen, UK, April 8-13, 2017, Proceedings 39, Springer, 2017, pp.
541–547.
[23] N. Walter, J. Cohen, R. L. Holbert, Y. Morag, Fact-checking: A meta-analysis of what works and
for whom, Political Communication 37 (2020) 350–375.
[24] D. Korenčić, B. Chulvi, X. B. Casals, M. Taulé, P. Rosso, PAN24 Oppositional Thinking Analysis
[Data set] (2024). URL: https://doi.org/10.5281/zenodo.11199642. doi:10.5281/zenodo.11199642.
[25] D. Chicco, N. Tötsch, G. Jurman, The Matthews correlation coefficient (MCC) is more reliable
than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix
evaluation, BioData Mining 14 (2021) 1–22.
[26] G. Da San Martino, S. Yu, A. Barrón-Cedeño, R. Petrov, P. Nakov, et al., Fine-grained analysis
of propaganda in news articles, in: Proceedings of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th International Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), 2019, pp. 5636–5646.
[27] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers
for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[28] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,
RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[29] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula,
Q. Wang, L. Yang, et al., Big Bird: Transformers for longer sequences, Advances in Neural
Information Processing Systems 33 (2020) 17283–17297.
[30] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, ELECTRA: Pre-training text encoders as discriminators
rather than generators, arXiv preprint arXiv:2003.10555 (2020).
[31] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring
the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning
Research 21 (2020) 1–67.
[32] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott,
L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv
preprint arXiv:1911.02116 (2019).
[33] P. He, J. Gao, W. Chen, DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with
gradient-disentangled embedding sharing, arXiv preprint arXiv:2111.09543 (2021).
[34] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained BERT model and
evaluation data, arXiv preprint arXiv:2308.02976 (2023).
[35] J. de la Rosa, E. G. Ponferrada, M. Romero, P. Villegas, P. González de Prado Salas, M. Grandury,
BERTIN: Efficient pre-training of a Spanish language model using perplexity sampling,
Procesamiento del Lenguaje Natural 68 (2022) 13–23. URL: http://journal.sepln.org/
sepln/ojs/ojs/index.php/pln/article/view/6403.
[36] A. G. Fandiño, J. A. Estapé, M. Pàmies, J. L. Palao, J. S. Ocampo, C. P. Carrino, C. A. Oller,
C. R. Penagos, A. G. Agirre, M. Villegas, MarIA: Spanish language models, Procesamiento del
Lenguaje Natural 68 (2022). URL: https://upcommons.upc.edu/handle/2117/367156.
doi:10.26342/2022-68-3.
[37] X. Zhang, Y. Malkov, O. Florez, S. Park, B. McWilliams, J. Han, A. El-Kishky, TwHIN-BERT: A socially-
enriched pre-trained language model for multilingual tweet representations, arXiv preprint
arXiv:2209.07562 (2022).
[38] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C. Raffel, mT5: A
massively multilingual pre-trained text-to-text transformer, arXiv preprint arXiv:2010.11934
(2020).