DSVS at PAN 2024: Ensemble Approach of Transformer-based Language Models for Analyzing Conspiracy Theories Against Critical Thinking Narratives

Notebook for PAN at CLEF 2024

Sergio Damián1,*, Brian Herrera1, David Vázquez1, Hiram Calvo1, Edgardo Felipe-Riverón1 and Cornelio Yáñez-Márquez1

1 Centro de Investigación en Computación (CIC), Instituto Politécnico Nacional (IPN), Mexico City, Mexico

Abstract
This paper presents an analysis of ensemble models for the shared task "Conspiracy Theories Against Critical Thinking Narratives" for PAN at CLEF 2024. From a collection of Telegram conversations on COVID-19, two distinct corpora in English and Spanish were assembled and manually labeled to differentiate between "critical" and "conspiracy" texts. The study employed ensemble models, comprising seven trained transformer-based models per language-task pair, to address two key tasks: distinguishing between critical and conspiracy texts (binary classification) and detecting spans for six different categories that can be found in the texts (multi-label span classification). The ensemble models proved competitive, securing rankings that surpass the mean of all participants' results in both tasks.

Keywords
Conspiracy Theories, Critical Thinking Narratives, Multi-label Token Classification, Ensemble Model, Small Language Models

1. Introduction

Conspiracy theories (CT) are narratives that seek to explain the causes of situations or events that are significant for society by suggesting the existence of secret plans carried out by actors who abuse their power to achieve their own objectives, with no regard for depriving people of their rights, freedoms, prosperity, health or knowledge [1, 2, 3]. These narratives can cause great harm, as they can modify the behavior of people who believe in them, fostering attitudes that put both believers and other members of society at risk.
The potential risk increases when it comes to health-related conspiracy theories, as they can lead some people to make decisions that are detrimental to their well-being and that of those around them. In addition to the behavioral change in believers of these theories, another significant harm is the mistrust they generate towards various medical treatments and the decrease in trust in public health institutions and health professionals. This hinders the implementation of public health measures and the response to health emergencies. For these reasons, it is urgent to identify and address conspiracy theories to mitigate their harmful effects.

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
Email: sdamians2019@cic.ipn.mx (S. Damián); bherrerag2019@cic.ipn.mx (B. Herrera); dvazquezs2019@cic.ipn.mx (D. Vázquez); hcalvo@cic.ipn.mx (H. Calvo); edgardo@cic.ipn.mx (E. Felipe-Riverón); cyanez@cic.ipn.mx (C. Yáñez-Márquez)
URL: https://github.com/sdamians (S. Damián); https://github.com/Hiram02 (D. Vázquez); http://hiramcalvo.com/ (H. Calvo)
ORCID: 0000-0003-2836-2102 (H. Calvo)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1.1. Medical conspiracy theories

Although conspiracy theories are not limited to the field of health, health-related ones have been a persistent issue over the years, causing significant harm to the population. A clear example is the case of the smallpox vaccine, discovered by Edward Jenner in 1796, which represented a monumental advance with the potential to improve public health significantly. However, it also led to the creation of a CT [4]. It is likely that people did not properly understand how it worked, which led to the spread of rumors warning of horn growth resulting from its use. This is not the only case of conspiracy theories related to vaccines.
In fact, they have been a recurring theme. For instance, in 1981, Dr. John Wilson claimed that the DPT vaccine caused convulsions and brain damage [5]. In 1998, Andrew Wakefield published an article suggesting a link between the MMR vaccine and autism [6], although the journal that published it later retracted the article. More recently, the COVID-19 pandemic has fueled the spread of numerous conspiracy theories regarding vaccination against this virus [7].

1.2. Negative impacts of conspiracy theories

In general, the propagation of conspiracy theories can have several negative effects, among which the following stand out:
• Social Division and Polarization: They exacerbate social divisions by promoting extreme and exclusionary beliefs, hindering rational dialogue and societal cohesion.
• Dissemination of Misinformation: They contribute to spreading false and unverified information, leading to confusion and potentially harmful decisions.
• Loss of Trust in Authorities and Experts: They foster distrust towards governmental, scientific, and public health institutions, as well as towards experts in different areas.
• Psychological Impact: They induce anxiety, fear, and paranoia among believers, negatively affecting their emotional and mental well-being.
• Impaired Decision-Making: Believers may base decisions on misinformation or biased information, impeding informed and rational decision-making processes.

In the health field, conspiracy theories have had significant adverse effects. In Pakistan, for example, there is a belief that the polio vaccine was developed by the CIA to sterilize Muslim men [8], which has led many people to reject it. Another example is the theory that the U.S. government created HIV/AIDS to reduce the African-American population, a belief that is widespread in this community and has resulted in less frequent condom use [9].
Furthermore, certain sectors of society maintain mistrust towards specific drugs, alleging they inflict greater harm than the diseases they aim to cure. For instance, there exists a theory attributing the majority of deaths among AIDS patients to antiretroviral drugs. This conspiracy theory holds particular influence in sub-Saharan Africa, where it receives support from influential figures [10].

There are several reasons why conspiracy theories can spread widely. Among the most prominent ones is their propagation by celebrities through digital media [11], which causes many of their followers to start believing in them. In addition, the difficulty of conclusively establishing their falsity, together with the degree of plausibility each person attributes to them [12], contributes significantly to their dissemination.

Critical thinking can help people to better evaluate the information they receive in daily life and thus avoid fraud and harmful habits. For example, critical thinking can be useful in differentiating reliable medical information from unfounded claims, helping in decision-making about appropriate treatments and lifestyle. When a person with high levels of intelligence, but low levels of critical thinking, believes in a conspiracy theory, they can construct seemingly well-supported arguments in favor of the false information [13]. These arguments can be quickly propagated through digital media and are difficult to detect.

This year's goal at PAN 2024 is to analyze texts reflecting oppositional thinking, specifically distinguishing between conspiracy theories and critical thinking narratives. This task addresses two significant challenges for the NLP community: (subtask 1, a binary classification task) differentiating between conspiracy and critical narratives, and (subtask 2, a multi-label span classification task) identifying key elements of narratives that fuel intergroup conflict.
Making this distinction is crucial because mislabeling a text as conspiratorial when it is merely oppositional to mainstream views could push individuals who are simply questioning mainstream perspectives closer to conspiracy communities [14, 15].

2. Related Work

Conspiracy theories represent a significant danger, as they can negatively influence people's behavior, affecting trust in institutions and fostering disinformation. Intelligence is often thought of as synonymous with critical thinking; however, these terms are not the same. In reality, intelligence alone does not always translate into critical thinking. Over the course of history, there have been people with high levels of intelligence who nevertheless demonstrated a lack of critical thinking in some areas, for instance, Sir Arthur Conan Doyle, a brilliant writer who believed in spiritualism and fairies despite clear evidence to the contrary [16].

A recent study [13] explores the connections between critical thinking, intelligence and the predisposition to believe in conspiracy theories. The authors note that while intelligence can help people formulate more sophisticated arguments, it does not always protect them from false beliefs. On the other hand, critical thinkers use logical rules, standards of evidence and other criteria that must be met for the product of a thought to be considered good, making them less likely to believe in unsubstantiated claims.

Intelligence is generally associated with good cognitive processing or intellectual abilities and the potential to learn and reason well. Intelligent people tend to perform well in basic real-world domains, such as academic performance and job success, but sometimes find it difficult to adapt in other real-world situations [17]. Intelligence without critical thinking can sometimes result in more convincing arguments that support false beliefs. These persuasive arguments can mislead many people into accepting these false ideas.
In a related study [18], the impact of cognitive styles, such as analytical thinking, critical thinking, and scientific reasoning, on the propensity to believe in conspiracy theories was examined. The findings suggest that individuals who exhibit a stronger inclination towards analytical thinking and scientific reasoning are less susceptible to conspiracy theories due to their more rigorous and evidence-based approach to evaluating information.

In recent years, there has been a notable surge in the recognition and analysis of conspiracy theories. This trend mirrors the growing acknowledgment of the significant impact that misinformation and disinformation can have on societies, particularly in the age of digital interconnectedness. Research efforts [19] have increasingly focused on understanding the dynamics behind the propagation of conspiracy theories.

However, it is crucial to recognize that the identification and mitigation of conspiracy theories are part of a broader spectrum of tasks aimed at combating misinformation and preserving the integrity of information ecosystems. Alongside the detection of conspiracy theories, researchers and practitioners are also confronted with related challenges, including the identification and containment of rumors [20], the mitigation of the spread of fake news [21], the recognition of clickbait content [22] designed to manipulate user engagement, and the indispensable task of fact-checking [23].

In this contemporary landscape, where information dissemination is facilitated by sophisticated technologies and platforms, the importance of discerning false information from genuine content cannot be overstated. The rise of AI-driven text generation capabilities, for instance, presents both opportunities and challenges. On one hand, these advancements offer innovative approaches to understanding and combating misinformation.
On the other hand, they underscore the urgency of developing robust mechanisms to differentiate between authentic and fabricated texts.

3. Dataset Preprocessing

The data collection involved gathering textual data from Telegram conversations concerning COVID-19. These texts were then manually labeled to distinguish between "critical" and "conspiracy" categories. Two corpora were employed for this study, one in English and the other in Spanish. Each set contained 4000 entries for training purposes and an additional 1000 entries for testing [24].

In this work, k-fold cross-validation was not used due to limited computational resources. Instead, roughly 10% of the training set was held out for validation experiments, preserving the initial class balance provided, as shown in Table 1. The working hypothesis was that a single train-validation split would yield more consistent model behavior than averaging results over multiple folds, especially for subtask 2.

Table 1
Statistics of both corpora.

Language  Label       Numerical Label  Train  Val  Total
English   Conspiracy  0                1105   274  1379
English   Critical    1                2466   155  2621
Total     -           -                3571   429  4000
Spanish   Conspiracy  0                1178   284  1462
Spanish   Critical    1                2373   165  2538
Total     -           -                3551   449  4000

The dataset entries had two different representations: the original sentence and the sentence split into tokens, designed for subtask 2. The approach was to use the original sentence representation for subtask 1, leaving tokenization to each model's tokenizer, and to use the list of tokens for subtask 2, trying to preserve most of the labeled tokens after the preprocessing and cleaning step. The following text-cleaning procedures were applied for both tasks:
• Small combinations of numbers and letters (with lengths ranging from 2 to 4) were removed.
• Combinations of alternating letters and numbers were removed (e.g. tokens such as 1df324D identified in URLs).
• Special words for URLs were removed.
• English contractions such as ’re or n’t were normalized by using the complete word (are and not).
• Numbers in date and time formats were tagged using the labels date and hour for English and fecha and hora for Spanish.
• The rest of the numbers were tagged using the label number.
• Runs of three or more repeated characters were normalized (e.g. aaa to a).

Both corpora show an inherent class imbalance, with a larger proportion of inputs labeled as "critical" than as "conspiracy", as illustrated in Figure 1.

Figure 1: Dataset statistics for both the English (left) and Spanish (right) corpora. The first pair of charts illustrates an imbalance of classes for both datasets. The second pair denotes the distribution of inputs by length (number of tokens) and class, and the third pair provides a zoomed-in version of the second pair, allowing for a clearer view of the distribution of inputs that are longer than the maximum length typically accepted by SLMs.

4. Methodology

The baseline model provided was a transformer-based model designed for multitask learning to address both tasks. While multitask learning often leads to better results, it can also make the model more complex and difficult to train, particularly in balancing the loss of both tasks to prevent one task from negatively impacting the performance of the other.

The proposed solution in this work involved using an ensemble of transformer-based models in the form of several Small Language Models (SLMs) to address each task-language pair independently, thus training them as single-task learning models using low computational resources. This methodology facilitated the aggregation of multiple logits and aimed to improve overall performance. The training process consisted of developing seven distinct SLMs for each language and task.
Subsequently, the top five models for each language-task pair were selected based on specific evaluation metrics. For subtask 1, the official evaluation metrics were the Matthews correlation coefficient (MCC) [25] and macro F1-score, while subtask 2 was evaluated using the span-F1 metric [26].

Figure 2 illustrates the ensemble strategy utilized in this work for subtask 1. All logits obtained by each SLM were multiplied by a weight based on the scores of the evaluation metrics. Subsequently, the weighted logits were aggregated and rounded to get the final outcome of the ensemble model. The same strategy was applied for subtask 2, where instead of getting a single outcome per SLM, a matrix $Y_n \in \mathbb{R}^{j \times k}$ was obtained and aggregated afterwards, as depicted in Figure 3.

In summary, two ensemble models were evaluated per task-language pair, one using all seven trained SLMs and another using the five best trained SLMs. Each ensemble model employed a mean voting classifier.

Figure 2: Diagram of the ensemble model approach for subtask 1. All logits obtained by each model are averaged and rounded to get the final prediction for each input.

Figure 3: Diagram of the ensemble model approach for subtask 2. All logits obtained by each model are averaged and rounded to get the final prediction for each input.

4.1. Small Language Models employed for English Corpus

The selection process identified several transformer-based models for both subtasks on the English corpus. The Transformers library by Hugging Face provides wrappers for sequence classification and token classification tasks. The following list briefly describes the models assessed.
• BERT [27]: Demonstrates strong performance in understanding context and semantics, making it a natural choice. The baseline provided was built on it.
• RoBERTa [28]: Employs an optimized pretraining procedure and can achieve better results than BERT.
• BigBird [29]: Handles long sequences through sparse attention mechanisms. The English corpus comprises several long sequences of tokens that exceed the typical maximum length (512) accepted by SLMs.
• Electra [30]: Utilizes a generator-discriminator architecture for enhanced efficiency, offering robustness against adversarial attacks and enhancing generalization capabilities.
• T5 [31]: Adopts a text-to-text framework that can handle diverse tasks. Although it is a text-generating model, it can be used as a binary classifier by adding a classification module (a linear layer on top of the pooled output). For classification tasks, the output of the first token is processed and classified. Hugging Face provides an implementation of this variant.
• XLM-RoBERTa [32]: Extends RoBERTa to multiple languages, producing distinct representations of the inputs, potentially offering a complementary perspective on the tasks.
• MDeBERTa [33]: Provides efficient multilingual representations, like XLM-RoBERTa, thereby offering another perspective on the tasks.

Table 2 displays the metric outcomes for each SLM for subtask 1 on the English corpus. Notably, the MDeBERTa and T5 models achieved the most favorable results, outperforming the rest. Table 3, in turn, shows the results for the macro span-F1 metric for subtask 2 on the English corpus. Here, the multilingual model MDeBERTa and Electra emerged as the best models, while T5 obtained the lowest scores.
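The weighted aggregation described in Section 4 can be illustrated with a minimal sketch. It rests on several assumptions that are not confirmed by the text: the per-model outputs are treated as probabilities in [0, 1] (so that rounding at 0.5 is meaningful), the weights are validation scores normalized to sum to 1, and the function names normalize_weights, ensemble_binary and ensemble_spans are hypothetical. It is not the authors' implementation.

```python
from typing import List, Sequence


def normalize_weights(scores: Sequence[float]) -> List[float]:
    """Turn validation scores (e.g. MCC or span-F1) into weights that sum to 1."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("validation scores must sum to a positive value")
    return [s / total for s in scores]


def ensemble_binary(probs_per_model: Sequence[Sequence[float]],
                    scores: Sequence[float],
                    threshold: float = 0.5) -> List[int]:
    """Weighted mean vote over per-input probabilities (subtask 1 setting).

    probs_per_model[n][i] is the probability model n assigns to input i
    (assumption: outputs were already mapped to [0, 1]).
    """
    weights = normalize_weights(scores)
    n_inputs = len(probs_per_model[0])
    predictions = []
    for i in range(n_inputs):
        mean = sum(w * probs[i] for w, probs in zip(weights, probs_per_model))
        predictions.append(1 if mean >= threshold else 0)
    return predictions


def ensemble_spans(matrices: Sequence[Sequence[Sequence[float]]],
                   scores: Sequence[float],
                   threshold: float = 0.5) -> List[List[int]]:
    """Same vote applied element-wise to j x k matrices (subtask 2 setting).

    matrices[n][t][c] is model n's score for token t and category c.
    """
    weights = normalize_weights(scores)
    n_tokens, n_labels = len(matrices[0]), len(matrices[0][0])
    return [
        [1 if sum(w * m[t][c] for w, m in zip(weights, matrices)) >= threshold else 0
         for c in range(n_labels)]
        for t in range(n_tokens)
    ]


if __name__ == "__main__":
    # Three hypothetical models scoring three inputs; the weights are made-up
    # validation scores, not values reported in the paper.
    probs = [[0.9, 0.2, 0.6], [0.8, 0.4, 0.7], [0.7, 0.1, 0.3]]
    print(ensemble_binary(probs, [0.83, 0.82, 0.80]))
```

Normalizing the weights keeps the aggregated value in the same range as the individual outputs, so a single threshold can be reused for any number of base models.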
Table 2
Evaluation results for each SLM trained for subtask 1 on the English corpus

Model        MCC     F1-Macro  F1-Conspiracy  F1-Critical
XLM-RoBERTa  0.8005  0.8960    0.9256         0.8664
T5           0.8358  0.9123    0.9395         0.8851
RoBERTa      0.8204  0.9053    0.9336         0.8771
MDeBERTa     0.8347  0.9131    0.9388         0.8874
Electra      0.8292  0.9110    0.9367         0.8852
BigBird      0.8315  0.9096    0.9378         0.8814
BERT         0.8247  0.9086    0.9348         0.8824

Table 3
Evaluation results for each SLM trained for subtask 2 on the English corpus

Model        Campaigner  Neg Effect  Objective  Victim  Agent   Facilitator  F1-Macro
BERT         0.5712      0.4983      0.3305     0.5544  0.6631  0.3434       0.4935
BigBird      0.6107      0.4967      0.3991     0.5903  0.5814  0.3852       0.5106
Electra      0.5962      0.5140      0.3651     0.6121  0.6873  0.3964       0.5285
MDeBERTa     0.6257      0.5160      0.4048     0.6118  0.6857  0.4263       0.5450
RoBERTa      0.6365      0.5205      0.3650     0.5897  0.6724  0.3561       0.5233
T5           0.4632      0.4042      0.2621     0.5344  0.5959  0.2231       0.4138
XLM-RoBERTa  0.6274      0.5093      0.3500     0.5947  0.6457  0.3433       0.5117

4.2. Models employed with Spanish corpus

In line with the specific demands of the Spanish corpus, a tailored selection of seven models was employed, all implemented using the Hugging Face Transformers library. The following list describes these models:
• BETO [34]: Offers proficient linguistic understanding and contextual comprehension of the Spanish language, and it served as the baseline model for the subtasks.
• Bertin [35]: Contributes to the linguistic analysis of the Spanish language, providing an alternative model for addressing linguistic nuances.
• MarIA [36]: Demonstrates proficiency in addressing the complexities of the Spanish language, having been trained on large amounts of Spanish text.
• TwHIN-BERT [37]: Is a multilingual model pre-trained on tweets and enriched with social signals, which makes it well suited to social media text.
• mT5 [38]: Offers a multilingual variant of the T5 model, and enriches the analytical repertoire available for the Spanish language. It is also a generative text model.
• XLM-RoBERTa [32]: Produces another representation of the inputs, offering an additional multilingual perspective on the tasks.
• MDeBERTa [33]: As a third multilingual representation, it offers a further complementary view of the tasks.

Table 4 presents the metric outcomes for each trained model for subtask 1 on the Spanish corpus. MarIA and MDeBERTa obtained the most promising results, surpassing their counterparts. Table 5, in turn, reports the results for subtask 2 on the Spanish corpus. For this language, the multilingual model MDeBERTa emerged as the leading performer, while mT5 obtained the lowest scores, mirroring the outcomes obtained by its counterpart in the English experiments.

Table 4
Evaluation results for each SLM trained for subtask 1 on the Spanish corpus

Model        MCC     F1-Macro  F1-Conspiracy  F1-Critical
Bertin       0.6204  0.8033    0.8515         0.7552
BETO         0.6694  0.8284    0.8723         0.7844
MarIA        0.7029  0.8437    0.8862         0.8012
MDeBERTa     0.6882  0.8371    0.8803         0.7939
mT5          0.2750  0.4782    0.7362         0.2203
TwHIN-BERT   0.6539  0.8154    0.8671         0.7669
XLM-RoBERTa  0.6250  0.8113    0.8508         0.7718

Table 5
Evaluation results for each SLM trained for subtask 2 on the Spanish corpus

Model        Campaigner  Neg Effect  Objective  Victim  Agent   Facilitator  F1-Macro
Bertin       0.6559      0.6354      0.2850     0.6239  0.5556  0.4279       0.5306
BETO         0.6785      0.6506      0.3104     0.5965  0.5314  0.4350       0.5561
MarIA        0.6740      0.6055      0.3214     0.5965  0.5314  0.4350       0.5273
MDeBERTa     0.7117      0.6554      0.3291     0.6392  0.6064  0.4988       0.5742
mT5          0.6312      0.6034      0.3474     0.5801  0.5246  0.3268       0.5022
TwHIN-BERT   0.6597      0.6369      0.3475     0.6396  0.5700  0.5015       0.5592
XLM-RoBERTa  0.6908      0.6756      0.3309     0.6367  0.5939  0.4861       0.5690

5. Results

The shared task allowed a maximum of two submissions per subtask. For our submissions, we opted to present two ensemble models per subtask: one comprising all seven models trained per language-task pair, and another featuring the top five models.
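As an illustration of how a top-five ensemble could be assembled, the sketch below computes the MCC from binary predictions and ranks base models by their validation MCC. The ranking criterion is an assumption (the text only states that selection was based on the evaluation metrics, and subtask 2 would presumably use span-F1 instead); the helper names mcc and top_k_models are hypothetical, and the scores are the subtask 1 MCC values from Table 4.

```python
import math
from typing import Dict, List, Sequence


def mcc(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the MCC is set to 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def top_k_models(val_scores: Dict[str, float], k: int) -> List[str]:
    """Return the k model names with the highest validation score."""
    return sorted(val_scores, key=val_scores.get, reverse=True)[:k]


# Validation MCC per model for subtask 1 on the Spanish corpus (Table 4).
SPANISH_SUBTASK1_MCC = {
    "Bertin": 0.6204,
    "BETO": 0.6694,
    "MarIA": 0.7029,
    "MDeBERTa": 0.6882,
    "mT5": 0.2750,
    "TwHIN-BERT": 0.6539,
    "XLM-RoBERTa": 0.6250,
}

if __name__ == "__main__":
    # Assumption: ranking by MCC alone; the paper does not spell out the rule.
    print(top_k_models(SPANISH_SUBTASK1_MCC, k=5))
```

Under this assumed criterion, Bertin and mT5 would be the two models left out of the five-model ensemble for Spanish subtask 1.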
Table 6 summarizes the official results per submission for subtask 1, including the rank attained, while Table 7 reports the results for subtask 2. The best models were determined by the Matthews correlation coefficient (MCC) for subtask 1 and by the span-F1 metric for subtask 2. Due to complications encountered during the experimentation phase, the ensemble model comprising the top five models for Spanish could not be evaluated for subtask 1.

For subtask 1, the best ensemble model surpassed the baseline performance for the English language. However, the submitted ensemble model for the Spanish language did not surpass its baseline. Conversely, for subtask 2, the best ensemble model outperformed the baselines for both languages. In this subtask, the ensemble model with five learners was the best approach for the English language, while the ensemble model with seven learners was the best for the Spanish language. The results obtained for the Spanish language were notably higher than the baseline, which suggests that the learners contributed complementary information to the final solutions.

Table 6
Test results for each submission for subtask 1. The baseline is included for comparison purposes.

Language  Model                MCC     F1-Macro  F1-Conspiracy  F1-Critical  Rank
English   Ensemble (7 models)  0.7970  0.8985    0.8674         0.9296       14/83
English   Ensemble (5 models)  0.7943  0.9071    0.9080         0.9061
English   Baseline-BERT        0.7964  0.8975    0.8632         0.9318
Spanish   Ensemble (7 models)  0.6462  0.8231    0.7753         0.8708       29/78
Spanish   Baseline-BETO        0.6681  0.8339    0.7872         0.8806

Table 7
Test results for each submission for subtask 2. The baseline is included for comparison purposes.
Language  Model                span-F1  span-P  span-R  micro-span-F1  Rank
English   Ensemble (7 models)  0.5460   0.5287  0.5774  0.5133
English   Ensemble (5 models)  0.5598   0.5332  0.6012  0.5287         12/28
English   Baseline-BERT        0.5323   0.4684  0.6334  0.4998
Spanish   Ensemble (7 models)  0.5529   0.5384  0.5785  0.5323         07/25
Spanish   Ensemble (5 models)  0.5483   0.5210  0.5873  0.5383
Spanish   Baseline-BETO        0.4934   0.4533  0.5621  0.4952

6. Conclusions

The ensemble model's combination of diverse SLM architectures contributed to robustness and generalization, thereby enhancing performance across both tasks. However, certain limitations and areas for improvement were identified. A small fixed validation set was used, but a cross-validation strategy might lead to better performance, especially for obtaining more accurate weights for the base models. The ensemble used a weighted mean voting classifier that could be replaced by a more sophisticated meta-model, such as a logistic regression classifier.

The single-task learning approach did not outperform all the baseline results obtained using a multitask learning approach. The shared knowledge from both subtasks might enhance the results and the generalization of the final predictions. The disparities in performance between tasks could be attributed to the inherent complexity and ambiguity associated with detecting different classes among texts, which calls for further exploration and refinement of the approach's methodologies and feature representations.

By leveraging the insights from the model performance analysis, future iterations of the ensemble model can be refined to improve robustness and efficacy within the domain of conspiracy theories and critical thinking narratives.

Acknowledgments

This work was done with partial support from the Mexican Government through Consejo Nacional de Humanidades, Ciencias y Tecnologías (CONAHCYT) and Instituto Politécnico Nacional (IPN).

References

[1] M. R. X. Dentith, M. Orr, Secrecy and conspiracy, Episteme 15 (2018) 433–450.
doi:10.1017/epi.2017.9.
[2] C. R. Sunstein, A. Vermeule, Conspiracy theories: Causes and cures*, Journal of Political Philosophy 17 (2009) 202–227. URL: https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9760.2008.00325.x. doi:10.1111/j.1467-9760.2008.00325.x.
[3] J. E. Uscinski, J. M. Parent, American Conspiracy Theories, Oxford University Press, 2014. URL: https://doi.org/10.1093/acprof:oso/9780199351800.001.0001. doi:10.1093/acprof:oso/9780199351800.001.0001.
[4] M. V. Eve Dubé, N. E. MacDonald, Vaccine hesitancy, vaccine refusal and the anti-vaccine movement: influence, impact and implications, Expert Review of Vaccines 14 (2015) 99–117. URL: https://doi.org/10.1586/14760584.2015.964212. doi:10.1586/14760584.2015.964212. PMID: 25373435.
[5] J. T. Wilson, DPT vaccine and serious neurological illness: current status of the controversy, Pediatrics 68 (1981) 650–651.
[6] A. Wakefield, S. Murch, A. Anthony, J. Linnell, D. Casson, M. Malik, M. Berelowitz, A. Dhillon, M. Thomson, P. Harvey, A. Valentine, S. Davies, J. Walker-Smith, Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children, The Lancet 351 (1998) 637–641.
[7] N. Corbu, R. Buturoiu, V. Frunzaru, G. Guiu, Vaccine-related conspiracy and counter-conspiracy narratives. silencing effects, Communications 49 (2024) 339–360. URL: https://doi.org/10.1515/commun-2022-0022. doi:10.1515/commun-2022-0022.
[8] G. E. Andrade, A. Hussain, Polio in Pakistan: Political, sociological, and epidemiological factors, Cureus 10 (2018) e3502. doi:10.7759/cureus.3502.
[9] L. M. Bogart, S. T. Bird, Exploring the relationship of conspiracy beliefs about HIV/AIDS to sexual behaviors and attitudes among African-American adults, Journal of the National Medical Association 95 (2003) 1057.
[10] P. Fourie, M. Meyer, The Politics of AIDS Denialism, Routledge, New York, 2010.
[11] G.
Andrade, Medical conspiracy theories: cognitive science and implications for ethics, Medicine, Health Care and Philosophy 23 (2020) 505–518. URL: https://doi.org/10.1007/s11019-020-09951-6. doi:10.1007/s11019-020-09951-6.
[12] M. Frenken, A. Reusch, R. Imhoff, “Just because it’s a conspiracy theory doesn’t mean they’re not out to get you”: Differentiating the correlates of judgments of plausible versus implausible conspiracy theories, Social Psychological and Personality Science (2024) 19485506241240506. URL: https://doi.org/10.1177/19485506241240506. doi:10.1177/19485506241240506.
[13] D. A. Bensley, Critical thinking, intelligence, and unsubstantiated beliefs: An integrative review, Journal of Intelligence 11 (2023). URL: https://www.mdpi.com/2079-3200/11/11/207. doi:10.3390/jintelligence11110207.
[14] A. A. Ayele, N. Babakov, J. Bevendorff, X. Bonet Casals, B. Chulvi, D. Dementieva, A. Elnagar, D. Freitag, M. Fröbe, D. Korenčić, M. Mayerl, D. Moskovskiy, A. Mukherjee, A. Panchenko, M. Potthast, F. Rangel, N. Rizwan, P. Rosso, F. Schneider, A. Smirnova, E. Stamatatos, B. Stein, M. Taulé, D. Ustalov, X. Wang, M. Wiegmann, S. M. Yimam, E. Zangerle, Overview of PAN 2024: Multi-author writing style analysis, multilingual text detoxification, oppositional thinking analysis, and generative AI authorship verification - condensed lab overview, in: Proceedings of the Fifteenth International Conference of the CLEF Association CLEF-2024, Springer, 2024, pp. 3–10.
[15] D. Korenčić, B. Chulvi, X. Bonet Casals, M. Taulé, P. Rosso, F. Rangel, Overview of the oppositional thinking analysis PAN task at CLEF 2024, in: G. Faggioli, N. Ferro, P. Galuvakova, A. García Seco de Herrera (Eds.), Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[16] T. Waters, Magic and the British middle classes, 1750–1900, Journal of British Studies 54 (2015) 632–653. URL: http://www.jstor.org/stable/24702123.
[17] D. F. Halpern, D. S.
Dunn, Critical thinking: A model of intelligence for solving real-world problems, Journal of Intelligence 9 (2021). URL: https://www.mdpi.com/2079-3200/9/2/22. doi:10.3390/jintelligence9020022.
[18] B. Gjoneska, Conspiratorial beliefs and cognitive styles: An integrated look on analytic thinking, critical thinking, and scientific reasoning in relation to (dis)trust in conspiracy theories, Frontiers in Psychology 12 (2021). URL: https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2021.736838. doi:10.3389/fpsyg.2021.736838.
[19] A. Giachanou, B. Ghanem, P. Rosso, Detection of conspiracy propagators using psycho-linguistic characteristics, Journal of Information Science 49 (2023) 3–17. URL: https://doi.org/10.1177/0165551520985486. doi:10.1177/0165551520985486.
[20] G. Gorrell, E. Kochkina, M. Liakata, A. Aker, A. Zubiaga, K. Bontcheva, L. Derczynski, SemEval-2019 task 7: RumourEval 2019: Determining rumour veracity and support for rumours, in: Proceedings of the 13th International Workshop on Semantic Evaluation: NAACL HLT 2019, Association for Computational Linguistics, 2019, pp. 845–854.
[21] N. Capuano, G. Fenza, V. Loia, F. D. Nota, Content-based fake news detection with machine and deep learning: A systematic review, Neurocomputing 530 (2023) 91–103.
[22] A. Anand, T. Chakraborty, N. Park, We used neural networks to detect clickbaits: You won’t believe what happened next!, in: Advances in Information Retrieval: 39th European Conference on IR Research, ECIR 2017, Aberdeen, UK, April 8-13, 2017, Proceedings 39, Springer, 2017, pp. 541–547.
[23] N. Walter, J. Cohen, R. L. Holbert, Y. Morag, Fact-checking: A meta-analysis of what works and for whom, Political Communication 37 (2020) 350–375.
[24] D. Korenčić, B. Chulvi, X. B. Casals, M. Taulé, P. Rosso, PAN24 oppositional thinking analysis [data set] (2024). URL: https://doi.org/10.5281/zenodo.11199642. doi:10.5281/zenodo.11199642.
[25] D.
Chicco, N. Tötsch, G. Jurman, The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation, BioData Mining 14 (2021) 1–22.
[26] G. Da San Martino, S. Yu, A. Barrón-Cedeño, R. Petrov, P. Nakov, et al., Fine-grained analysis of propaganda in news articles, in: Proceedings of EMNLP-IJCNLP 2019-2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, 2019, pp. 5636–5646.
[27] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[28] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[29] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, et al., Big Bird: Transformers for longer sequences, Advances in Neural Information Processing Systems 33 (2020) 17283–17297.
[30] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, ELECTRA: Pre-training text encoders as discriminators rather than generators, arXiv preprint arXiv:2003.10555 (2020).
[31] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1–67.
[32] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv preprint arXiv:1911.02116 (2019).
[33] P. He, J. Gao, W. Chen, DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing, arXiv preprint arXiv:2111.09543 (2021).
[34] J. Cañete, G.
Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained BERT model and evaluation data, arXiv preprint arXiv:2308.02976 (2023).
[35] J. De la Rosa, E. G. Ponferrada, M. Romero, P. Villegas, P. González de Prado Salas, M. Grandury, BERTIN: Efficient pre-training of a Spanish language model using perplexity sampling, Procesamiento del Lenguaje Natural 68 (2022) 13–23. URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6403.
[36] A. G. Fandiño, J. A. Estapé, M. Pàmies, J. L. Palao, J. S. Ocampo, C. P. Carrino, C. A. Oller, C. R. Penagos, A. G. Agirre, M. Villegas, MarIA: Spanish language models, Procesamiento del Lenguaje Natural 68 (2022). URL: https://upcommons.upc.edu/handle/2117/367156#.YyMTB4X9A-0.mendeley. doi:10.26342/2022-68-3.
[37] X. Zhang, Y. Malkov, O. Florez, S. Park, B. McWilliams, J. Han, A. El-Kishky, TwHIN-BERT: A socially-enriched pre-trained language model for multilingual tweet representations, arXiv preprint arXiv:2209.07562 (2022).
[38] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C. Raffel, mT5: A massively multilingual pre-trained text-to-text transformer, arXiv preprint arXiv:2010.11934 (2020).