Overview of the Oppositional Thinking Analysis PAN Task at CLEF 2024

Damir Korenčić1, Berta Chulvi2,5, Xavier Bonet-Casals3, Mariona Taulé3, Paolo Rosso4,6,*, and Francisco Rangel5

1 Ruđer Bošković Institute, Croatia
2 Universitat de València, Spain
3 Universitat de Barcelona, Spain
4 Universitat Politècnica de València, Spain
5 Symanto Research, Spain
6 ValgrAI Valencian Graduate School and Research Network of Artificial Intelligence, Spain

Abstract
This paper describes the Oppositional Thinking Analysis task at CLEF 2024. The task focuses on analyzing conspiracy theories and critical thinking narratives, and comprises two subtasks. Subtask 1 is a binary classification task aimed at distinguishing between critical and conspiracy texts. Subtask 2 is a token classification task aimed at detecting text spans corresponding to the key elements of oppositional (critical and conspiracy) narratives. The subtasks are based on a dataset of English and Spanish COVID-19-related texts obtained from oppositional Telegram channels and labeled using a topic-agnostic annotation scheme [1]. A total of 82 teams participated in the challenge, and 17 teams published working notes papers with system descriptions. The participants employed a range of NLP methods and pushed the state-of-the-art performance on both subtasks beyond the performance of the strong baseline systems [1] that were provided.

Keywords
Conspiracy Theories, Oppositional Thinking, Computational Social Science, Natural Language Processing, Text Classification, Sequence Labeling

1. Introduction

The first edition of the Oppositional Thinking Task, held at CLEF 2024, focused on automatically distinguishing between conspiratorial narratives and critical narratives that do not convey a conspiratorial mentality. Conspiracy Theories (CTs) are causal explanations of significant events that present them as a result of covert plots orchestrated by secret, powerful, and malicious groups [2].
Since conspiracy narratives tend to convey a critical vision of mainstream policies, a common mistake, especially in the middle of a global crisis such as a pandemic or a war, is to categorize every critical narrative against the official discourse as conspiratorial. Criticism and free discussion are key values in democratic societies; however, conspiracy narratives severely weaken democratic systems because they place the ultimate agent of the crisis outside the control of our systems of governance. As a result, it is important not to confuse critical and conspiracy narratives. The interest in automating the critical-conspiracy distinction was recently highlighted by Korenčić et al. [1], who argued that, if models monitoring social media messages do not differentiate between critical and conspiratorial thinking, there is a high risk of pushing people toward conspiracy communities. The sociopsychological basis of this process lies in Social Identity Theory. Social Identity Theory (SIT) has been a cornerstone in understanding group processes and intergroup relations since its inception in the early 1970s [3]. This theory posits that individuals derive a part of their self-concept from their membership in social groups, which influences their behavior and attitudes towards in-group and out-group members [4, 5]. As a result, being considered a conspiracist when you are not could be a threat to your social identity.

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
Email: prosso@dsic.upv.es (P. Rosso)
ORCID: 0000-0003-4645-2937 (D. Korenčić); 0000-0003-1169-0978 (B. Chulvi); 0009-0003-8827-0215 (X. Bonet-Casals); 0000-0003-0089-940X (M. Taulé); 0000-0002-8922-1242 (P. Rosso); 0000-0002-6583-3682 (F. Rangel)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.
Once the subject is the target of this accusation, a way to repair this stigmatization is to join conspiracist groups that will give the social support needed to recover a positive social identity. This process is not unusual. As several authors from the field of social sciences suggest, a fully-fledged conspiratorial worldview is the final step in a progressive “spiritual journey” that sets out by questioning social and political orthodoxies [6, 7, 8]. Accordingly, the distinction between conspiratorial and critical thinking is crucial for automated content moderation: without it, there is a significant risk of driving individuals towards conspiracy communities. Specifically, mislabeling a text as conspiratorial when it merely challenges mainstream perspectives could inadvertently steer individuals who are simply questioning into the arms of conspiracy groups. Furthermore, in the area of computational linguistics, Korenčić et al. [1] have shown that conspiracist narratives and critical thinking differ in their potential social effect on public opinion discourse, with the former being significantly more associated with violent words and expressions of anger. In their corpus, the authors also labelled the key elements of oppositional narratives (goals, effects, agents, victims, and the two groups in conflict: facilitators of government decisions and campaigners against them), demonstrating that a greater level of intergroup conflict between facilitators and campaigners is associated especially with conspiracy narratives, and correlates with a greater use of violent words and the emotional manifestation of anger. Based on this recent research [1], the present task addresses two new challenges for the NLP research community: (1) to distinguish the conspiracy narrative from other oppositional narratives that do not express a conspiracy mentality (i.e., critical thinking); and (2) to identify the key elements of the oppositional narrative in online messages.
As demonstrated in [1], predictive NLP systems for these two tasks have value for computational social scientists who are interested in analyzing oppositional narratives. Therefore, it is of interest to push the performance on these tasks beyond the previously proposed NLP approaches [1]. This PAN task has attempted to achieve this goal. For the two tasks described above, we provide the XAI-DisInfodemic corpus [1], a multilingual (English and Spanish) corpus consisting of 10,000 annotated Telegram messages that focus on oppositional narratives related to the COVID-19 pandemic. For each language, a training set of 4,000 messages was provided to the participants, while the outputs of the systems were computed and evaluated using the test set consisting of 1,000 messages. These messages contain oppositional non-mainstream views on the COVID-19 pandemic, classified into two categories: critical and conspiratorial messages. Messages have been annotated at the span level with a topic-agnostic schema that distinguishes the key elements of an oppositional narrative: objectives, negative effects, agents, victims, and facilitators and campaigners (the two groups in conflict). We also provide strong baseline solutions [1]. The train and test splits of the dataset, as well as the code of the baseline systems, are freely available1.

The following sections of this paper describe the key aspects of this task. Section 2 summarizes the related work on the classification of conspiratorial narratives in NLP and on the span detection of different elements of these narratives. Section 3 presents the dataset used in this task. Section 4 describes the two subtasks proposed above, as well as the evaluation measures and baseline solutions. Section 5 presents the systems used by the participants. Section 6 analyzes the results and the systems of the participants. Finally, Section 7 contains conclusions and directions for future work.
2. Related Work

A recent literature review by Mahl et al. [9] indicates a rising interest in conspiracy theories within online environments, particularly within the Social Sciences. Approximately 80% of the research focuses on written content, with about a third using automated content analysis methods. In this section, we review research from the NLP area that is relevant to the present tasks.

1 https://github.com/dkorenci/pan-clef-2024-oppositional

2.1. Conspiracy detection in NLP

The COVID-19 pandemic has been one of the topics that has garnered the most attention in the study of conspiracy narratives since 2020. The pandemic has been fertile ground for the expansion of conspiracy theories. Among the works oriented in this direction, Uscinski et al. [10] collected a dataset of letters sent to a mainstream US publication, and labeled them as either containing a conspiracy or not. Another available corpus dedicated to conspiracy theories is the LOCO corpus [11], containing 96,743 texts from a diverse collection of mainstream and conspiracy outlets. The texts are enriched with website metadata and auto-generated topics. With more detail about the content of conspiracy theories, we find COCO, a corpus of 3,495 texts promoting COVID-19 conspiracies [12]. The texts of the COCO corpus were manually annotated with a fine-grained classification scheme encompassing conspiracy sub-topics. The problem has often been approached as a binary classification task with the goal of distinguishing conspiratorial from non-conspiratorial text. A good example are the two recent MediaEval challenges focusing on the classification of conspiracy texts [13, 14]; these tasks led to a number of approaches demonstrating that the state-of-the-art architecture is a multi-task classifier [15, 16, 17] based on CT-BERT [18].
More nuanced methodologies using fine-grained approaches, like multi-label or multi-class classifications, have provided a detailed understanding [19, 20, 13, 14] of the diffusion of conspiracies. For example, Moffitt et al. [20] developed a classifier of COVID-19 origin conspiracy theory tweets and used it for propagation analysis, applying social cybersecurity methods to analyze communities, spreaders, and characteristics of the different origin-related conspiracy theory narratives. This research found that tweets about conspiracy theories were supported by news sites with low fact-checking scores and amplified by bots that were more likely to link to prominent Twitter users than in non-conspiracy tweets. Other research in computational linguistics has dealt with different aspects related to the characteristics of the disseminators of conspiracy narratives, or has focused on the characteristics of the messages. Bessi [21] employed a text scaling method to map conspiratorial texts to personality traits and analyze these conspiracies. Giachanou et al. [19] used psychological and linguistic features to classify and analyze the social media users who spread conspiracies. Topic modeling techniques were used by other authors [22, 23] to extract and examine common themes within conspiracy texts. Levy et al. [24], taking an approach different from the problem of classifying human texts, analyzed the capacity of large language models to generate conspiracies. However, the existing research fails to differentiate between critical thinking and conspiratorial thinking, which is the main goal of this task.

2.2. Span detection in conspiracy theories

In the field of conspiracy theories, several papers have addressed the challenge of span detection. Samory and Mitra [23] utilized syntactic parsing to identify “motifs” (agent-action-target triplets) and analyze the patterns of their occurrence. Introne et al.
[25] propose a span-level scheme of six categories (event, actor, goal, action, consequence, target), and use it to analyze 236 messages from anti-vaccination forums. They distinguish between conspiracy theories and conspiratorial thinking, a category that implies only passive support for a conspiracy. This distinction is not based on annotations grounded in theory, but on the requirement that all the categories be present in a given text. However, in practice, fewer elements can convey a conspiracy theory in a very strong manner. Although this research identifies different elements of discourse, it also fails to consider the role played by intergroup conflict in the conspiracy narrative, which is addressed in the XAI-DisInfodemic corpus [1]. Holur et al. [26] focus on oppositional elements in the conspiratorial narrative, detecting the so-called insider and outsider entities within conspiracy texts by automatically labelling noun phrases. This insider and outsider schema is based on the positive or negative sentiment that each user conveys for each entity. Although this research starts a path that could lead to the consideration of the important role of intergroup conflict in conspiratorial narratives, it falls short of properly identifying this intergroup conflict, because objects and other inanimate entities that are clearly outside the social framework are also identified as insiders or outsiders. The importance of detecting intergroup conflict, as proposed by Korenčić et al. [1], stems from the growing and potentially violent participation of conspiratorial groups in political activities. This connection implies that CTs aim to strengthen group cohesion and facilitate coordinated actions [27]. Consequently, detecting crucial aspects of the narrative at the span level, such as intergroup conflict, can provide significant insights for content moderation.
3. Dataset

This task uses the XAI-DisInfodemic corpus [1], which consists of 10,000 annotated Telegram messages, 5,000 in English and 5,000 in Spanish. These messages contain oppositional, non-mainstream views on the COVID-19 pandemic, and were obtained from public Telegram channels in which users tend to post messages that oppose the mainstream discourse about the pandemic. They are classified into two categories: critical messages and conspiratorial messages. For the creation of this corpus, the authors developed an annotation scheme to differentiate between texts hinting at the existence of a conspiracy and those criticizing mainstream views on COVID-19 but without suggesting the existence of a conspiracy.

Table 1
Statistics of the text length, measured in number of words (whitespace-separated tokens), for the English and Spanish corpora: the average, the standard deviation, the minimum, the first quartile, the median, the third quartile, and the maximum.

Language | Avg. | Std. dev. | Min. | Q1 | Median | Q3  | Max.
Spanish  | 128  | 123       | 23   | 49 | 98     | 148 | 766
English  | 265  | 528       | 12   | 32 | 65     | 266 | 4,108

In addition to the annotation into the two classes, the XAI-DisInfodemic corpus offers a second annotation that presents the key elements in oppositional narratives. The tagset includes six labels which can be applied both to messages containing a conspiracy theory and messages containing critical thinking: goals, effects, agents, victims, facilitators (the group that collaborates with the mainstream authorities) and campaigners (the group that conveys the oppositional message).

Figure 1: A Conspiracy message annotated with elements of oppositional narrative: Agents (A), Facilitators (F), Campaigners (C), Victims (V), Objectives (O), Negative Effects (E).

Korenčić et al. [1] identified the following six categories of narrative elements (see Figure 1 for an example annotation of a Conspiracy message, and Figure 2 for an example annotation of a Critical message):
1. Agents (A): Those responsible for the actions and/or negative effects described in the comment. In Conspiracy, it could be the hidden power that pulls the strings (in Figure 1, “Private owned WHO”, “investors like Bill Gates”, “pharma companies” and “very evil beings”). In Critical, it could be the actors that design the mainstream public health policies (in Figure 2, “White House chief medical advisor Dr. Anthony Fauci” and “the lead of CDC director Rochelle Walensky, who questioned natural immunity”).

Figure 2: A Critical message annotated with elements of oppositional narrative: Agents (A), Facilitators (F), Campaigners (C), Victims (V), Objectives (O), Negative Effects (E).

Table 2
Statistics for the gold span-level annotations of the narrative elements. Absolute number and percentage of spans are given for each of the binary text classes and for all texts, and for each of the six narrative categories: Agents (A), Facilitators (F), Campaigners (C), Victims (V), Objectives (O), Negative Effects (E).

                 | A             | F             | C             | V             | O             | E
ES All           | 3,329 (14.0%) | 2,688 (11.3%) | 4,231 (17.8%) | 5,260 (22.2%) | 622 (2.6%)    | 7,150 (30.2%)
ES Conspiracy    | 1,361 (9.8%)  | 1,184 (8.6%)  | 2,133 (15.4%) | 3,543 (25.6%) | 23 (0.2%)     | 5,326 (38.5%)
ES Critical      | 1,968 (20.0%) | 1,504 (15.2%) | 2,098 (21.3%) | 1,717 (17.4%) | 599 (6.1%)    | 1,824 (18.5%)
EN All           | 6,411 (22.4%) | 3,462 (12.1%) | 6,416 (22.4%) | 4,433 (15.5%) | 2,073 (7.2%)  | 5,565 (19.4%)
EN Conspiracy    | 3,333 (21.1%) | 1,336 (8.5%)  | 3,839 (24.4%) | 2,734 (17.3%) | 615 (3.9%)    | 3,708 (23.5%)
EN Critical      | 3,078 (23.9%) | 2,126 (16.5%) | 2,577 (20.0%) | 1,699 (13.2%) | 1,458 (11.3%) | 1,857 (14.4%)

2. Facilitators (F): Those who collaborate with the agents and contribute to the execution of their goals. In Conspiracy, they could be governments or institutions which, either intentionally or unwittingly, collaborate with the conspirators and help the conspiracy move forward (in Figure 1, “the world governments ruled by their puppets”, “their media”, “the media” and “governments”).
In Critical, the facilitators could be healthcare workers, mass media or authority figures who abide by governmental instructions (in Figure 2, “university hospitals” and “the vaccinated work-from-home hospital administrators who are firing her for not being vaccinated”).

3. Campaigners (C): Those who oppose the mainstream narrative. In Conspiracy, those who know the truth and expose it to society at large (in Figure 1, “those awake already”). In Critical, those who oppose the enforcement of laws and/or refuse to follow health-related instructions from the authorities (in Figure 2, “Dr Martin Kulldorff”).

4. Victims (V): Those who suffer the consequences of the actions and decisions of the agents and/or the facilitators. In Conspiracy, the people who are deceived by those in power, and suffer, become ill, lose their freedom, or die as a result of a hidden plan (in Figure 1, “people”, “most people” and “regular people”). In Critical, the people who receive the negative consequences of the actions and the decisions made by those in power, and also suffer, lose their freedom, become ill, or die as a result of incorrect decisions (in Figure 2, “all nurses, doctors and other health care providers”).

5. Objectives (O): The intentions and purposes that the agents are trying to achieve. In Conspiracy, the goals of the conspirators (in Figure 1, “agenda” and “destroying us”). In Critical, the goals of public authorities, pharmaceutical companies, organizations, etc. (in Figure 2, “pushing vaccine mandates”).

6. Negative Effects (E): The negative consequences suffered by the victims as a result of the actions and decisions of those in power and/or their collaborators (in Figure 1, “the constant fear mongering” and “pay a hefty price, often with their health, lives, the loss of their loved ones”; in Figure 2, “will be fired if they do not get a Covid vaccine”).
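To make the annotation format concrete, the sketch below shows one possible way to represent a message together with its binary class label and its span-level annotations. This is an illustration only: the class, field names, and the example message are ours, not the official corpus format.

```python
from dataclasses import dataclass

# The six narrative-element labels used for span annotation.
CATEGORIES = {"AGENTS", "FACILITATORS", "VICTIMS",
              "CAMPAIGNERS", "OBJECTIVES", "NEGATIVE_EFFECTS"}

@dataclass
class SpanAnnotation:
    category: str  # one of CATEGORIES
    start: int     # index of the first character of the span
    end: int       # index one past the last character of the span

@dataclass
class Message:
    text: str
    label: str                    # "CONSPIRACY" or "CRITICAL"
    spans: list[SpanAnnotation]   # possibly empty

    def span_text(self, ann: SpanAnnotation) -> str:
        """Recover the annotated surface string from the span borders."""
        return self.text[ann.start:ann.end]

# A made-up example message (not taken from the corpus).
msg = Message(
    text="The media helps them push the agenda.",
    label="CONSPIRACY",
    spans=[SpanAnnotation("FACILITATORS", 0, 9),
           SpanAnnotation("OBJECTIVES", 26, 36)],
)
```

Here `msg.span_text(msg.spans[0])` recovers the string “The media”. Note that spans of different categories may overlap, which is why each annotation carries its own character borders rather than relying on a single token-level labeling.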
Table 2 shows the number and percentage of spans in the gold standard (GS) that have been annotated with each label, for each category (Conspiracy or Critical).

4. Task Setup

For each language, the corresponding dataset of 5,000 texts was divided into train and test sets using stratified sampling. The train set consisted of 4,000 messages, while the test set consisted of 1,000 messages. The participants had access to the train set from the start of the task, and prior to the evaluation deadline they were provided with the unlabeled test set and asked to submit their predictions. Each team was allowed to submit up to two predictions for each combination of subtask and language. The dataset, the code for building and applying the baseline systems, as well as the evaluation code and task instructions, are made available2.

Distinguishing Between Critical and Conspiratorial Messages (Subtask 1)
This is a binary classification task differentiating between (1) critical messages, i.e. those that question major decisions in the public health domain, but do not promote a conspiracist mentality [1]; and (2) conspiratorial messages, i.e. those that view the pandemic or public health decisions as a result of a malevolent conspiracy by secret, influential groups [1]. Input data consists of a set of messages, each of which is associated with one of two categories: either CONSPIRACY or CRITICAL. The evaluation metric used for this subtask is the Matthews Correlation Coefficient (MCC) [28].

Detecting Elements of Oppositional Narratives (Subtask 2)
This is a token-level classification task aimed at recognizing text spans corresponding to the key elements of oppositional narratives [1]. Input data consists of a set of messages, each of which is accompanied by a (possibly empty) list of span annotations. Each annotation corresponds to a narrative element, and is described by its borders (start and end characters), as well as its category.
There are six distinct span categories: AGENTS, FACILITATORS, VICTIMS, CAMPAIGNERS, OBJECTIVES, NEGATIVE_EFFECTS. The evaluation metric used for this subtask is the macro-averaged span-F1 [29].

4.1. Evaluation Measures

As the main criterion for evaluation in Subtask 1, we used the MCC [28]. MCC serves the same purpose as the macro-averaged F1 measure – it aggregates performance across both classes. We opted for the MCC measure since it works well on imbalanced datasets, while being reliable and less optimistic than the macro-averaged F1 [30], and comparing favorably to other alternatives [28]. For evaluation in Subtask 2, we used the span-F1 measure [29], an adapted version of the F1 measure that accounts for partially correct predictions by looking at span overlap. Specifically, a predicted span is not required to exactly match a gold standard span in terms of start and end characters. Instead, the proportion of overlapping characters is used to calculate precision and recall [29]. This approach offers a fairer evaluation in tasks with long spans and with inherent subjectivity of the span boundaries. For tasks like traditional, non-nested Named Entity Recognition (NER), where named entities are shorter and are expected to have well-defined boundaries, exact matching is a reasonable method of evaluation. As the main criterion for evaluation we used the macro-averaged span-F1, i.e., span-F1 averaged over all six span labels corresponding to the six elements of oppositional narratives described in Section 3.

2 https://github.com/dkorenci/pan-clef-2024-oppositional

4.2. Baseline Solutions

Baselines for both subtasks are based on the approaches from Korenčić et al. [1], where more details can be found. For each subtask, we took as a baseline the version based on the transformer model which resulted in the lowest performance in Korenčić et al. [1]. Hyperparameters were not changed, the models were trained on the entire train set and then applied to the test set.
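To illustrate the two evaluation measures of Section 4.1, here is a simplified sketch: MCC computed directly from a binary confusion matrix, and a character-overlap span-F1 for a single category. This is an illustration under our own simplifications – the official span-F1 implementation [29], available in the task repository, is the authoritative version.

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def span_f1(pred: list, gold: list) -> float:
    """Character-overlap F1 for one span category.

    Spans are (start, end) character offsets; partial overlap earns
    partial credit instead of requiring exact border matches.
    """
    pred_chars = {i for s, e in pred for i in range(s, e)}
    gold_chars = {i for s, e in gold for i in range(s, e)}
    if not pred_chars and not gold_chars:
        return 1.0  # nothing to predict, nothing predicted
    overlap = len(pred_chars & gold_chars)
    p = overlap / len(pred_chars) if pred_chars else 0.0
    r = overlap / len(gold_chars) if gold_chars else 0.0
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)
```

The macro-averaged span-F1 is then the mean of `span_f1` over the six narrative-element categories. For MCC, a perfect classifier yields 1, a random one approximately 0, and full disagreement -1.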
Distinguishing Critical and Conspiratorial Messages (Subtask 1)
The approach for this binary classification task is based on fine-tuning the BERT transformer model [31] from the Hugging Face3 repository, using the case-sensitive “base” version. The BETO [32] version of BERT was used for the Spanish dataset. The number of tokens was set to 256. We tuned the models for three epochs using the AdamW optimizer, a learning rate of 2e-5, a slanted triangular LR scheduler with a 10% warm-up period, a batch size of 16, and a weight decay of 0.01. All the layers of the transformers were fine-tuned. The dropout rate for the classification head was 0.1.

Detecting Elements of Oppositional Narratives (Subtask 2)
The baseline for this sequence labeling task is based on fine-tuning a transformer model with added token classification heads. To account for the possibility of overlapping spans with different categories, we used six separate per-category heads that performed BIO sequence tagging. We employed multi-task learning [33] by connecting the per-category taggers to the same transformer backbone. Multi-task learning has several advantages, such as improved regularization and implicit data augmentation [33], and the described approach was successfully deployed for a similar task of span-level skill extraction [34]. We used the same configuration and hyperparameters as in the case of Subtask 1. The exception was the number of epochs, which we increased to 10 in order to accommodate the increased task complexity. The BERT model [31] was used as the base transformer for the English dataset, while for the Spanish dataset the BETO version of BERT [32] was used.

5. Participating Systems

A total of 82 teams submitted their solutions for at least one of the tasks. The approaches included pre-neural NLP models, small transformers such as BERT [31], and Large Language Models [35].
Techniques such as Ensemble Methods [36] and Data Augmentation [37] were also used to improve performance. Another important factor was the data on which the chosen transformer models were pretrained – participants experimented with both domain-specific models such as CT-BERT [18] and multilingual models such as mBERT [38]. Most of the approaches relied on fine-tuning BERT-like transformers [31]. This is not surprising, since these models yield strong results for both classification [31] and sequence labeling [31], and since baselines based on this approach were provided to the participants. To describe the approaches based on transformer models [39], we shall use the abbreviation SLM (“Small” Language Models) for transformers with fewer than one billion parameters. For transformers with more than one billion parameters, we shall use the standard abbreviation LLM (Large Language Models).

Working Notes Submissions
A total of 17 participating systems had their working notes papers accepted. Huertas-García et al. [40] tackled Subtask 1, experimenting with a range of SLMs and with the commercial LLM Claude4. Vallecillo-Rodríguez et al. [41] experimented with the fine-tuning of two LLMs: LLaMA3-8B-instruct [42] and GPT-3.5 [43]. Hu et al. [44] used SLMs with an added BiGRU layer [45] to tackle both tasks. Damian et al. [46] approached both tasks using ensembles of mono- and multi-lingual SLMs. Sánchez-Hermosilla et al. [47] focused on Subtask 1 using a range of SLMs, data augmentation, and ensembling techniques. Zrnić [48] experimented with mono- and multilingual SLMs in order to tackle both tasks. Sahitaj et al. [49] approached Subtask 1 using SLMs and an LLM-based data augmentation technique. Gómez-Romero et al. [50] used an approach based on OpenAI Embeddings and a deep feedforward network for Subtask 1 and, in addition, performed entity masking in order to increase the models’ generality.

3 https://huggingface.co/models
4 https://www.anthropic.com/claude
Mahesh et al. [51] experimented with SLMs and non-neural approaches on Subtask 1. Zeng et al. [52] employed mono- and multi-lingual SLMs for both Subtask 1 and Subtask 2. Huang et al. [53] used SLMs for both tasks, and employed ensembling for Subtask 1. Tulbure and Coll Ardanuy [54] experimented with SLMs boosted by data augmentation and ensembling, and for Subtask 2 split the input texts into sentences. Liu et al. [55] experimented with a range of LLMs using zero-shot chain-of-thought prompts to tackle Subtask 1, and used an SLM approach for Subtask 2. Mhalgi et al. [56] approached Subtask 1 using data augmentation, non-neural classifiers, SLMs and LLMs, as well as model ensembles. Several participants essentially repeated what had been done in the baseline solution, i.e., fine-tuned and applied one or several SLMs [57, 58, 59].

Teams that did not submit working notes accounted for 65 submissions and provided a short description of their approaches. Many of these submissions were minor modifications of the provided baseline, i.e., a change of the SLM to be fine-tuned. However, a number of these teams achieved competitive results or provided useful datapoints using, for example, ensembling techniques, data and feature augmentation techniques, and non-neural NLP approaches.

6. Results and Analysis

6.1. Distinguishing Critical and Conspiracy Texts (Subtask 1)

Table 3 displays the results of the most successful teams on Subtask 1 – the teams with performance equal to or greater than the provided baseline.
Table 3
Performance of top teams, in terms of Matthews Correlation Coefficient (MCC), on Subtask 1 – binary classification of text as either conspiracy or critical.

English                          Spanish
TEAM                 MCC         TEAM                  MCC
IUCL [56]            0.8388      SINAI [41]            0.7429
AI_Fusion            0.8303      auxR                  0.7205
SINAI [41]           0.8297      RD-IA-FUN [40]        0.7028
ezio [44]            0.8212      Elias&Sergio          0.6971
hinlole [53]         0.8198      AI_Fusion             0.6872
Zleon [48]           0.8195      zhengqiaozeng [52]    0.6871
virmel               0.8192      virmel                0.6854
inaki [47]           0.8149      trustno1              0.6848
yeste                0.8124      Zleon [48]            0.6826
auxR                 0.8088      ojo-bes               0.6817
Elias&Sergio         0.8034      tulbure [54]          0.6722
theateam             0.8031      sail [50]             0.6719
trustno1             0.7983      nlpln [55]            0.6681
DSVS [46]            0.7970      baseline-BETO         0.6681
ojo-bes              0.7969
sail [50]            0.7969
RD-IA-FUN [40]       0.7965
baseline-BERT        0.7964

Results for English
The top IUCL team [56] employed the DeBERTa model [60] fine-tuned on an augmented dataset comprising the Subtask 1 dataset and the conspiracy-labeled examples from the LOCO corpus [11] (ca. 16,000 examples were selected). The AI_Fusion team came a close second, simply by relying on the fine-tuned ELECTRA model [61]. A close third was the SINAI team [41], which used the fine-tuned LLaMA3-8B-instruct LLM [42] as a solution. Additionally, their experiments demonstrated that fine-tuned LLMs outperform the LLM-based zero-shot approaches by a large margin [41]. The rest of the top-performing models on English based their approaches on SLMs, with several teams using techniques such as ensembling and data augmentation. The Covid-twitter-BERT model [18], used by the teams ezio [44], hinlole [53], Zleon [48], and inaki [47], seems to be a successful transformer model for this use-case. Some teams with competitive results used standard transformer models: the theateam, trustno1, and ojo-bes teams used standard RoBERTa [62], while the virmel team used BERT [31] and the yeste team relied on the ELECTRA model [61]. Two fully multilingual approaches performed competitively, those of the auxR and RD-IA-FUN [40] teams.
Both approaches were based on a multilingual transformer trained on joint English and Spanish data. The auxR team employed the Twitter-XLM-RoBERTa-large model, a derivative of the XLM-RoBERTa model [63] domain-adapted using Twitter data, while the RD-IA-FUN [40] team used the multilingual-e5-large model [64], a derivative of XLM-RoBERTa. The Elias&Sergio team used monolingual RoBERTa, but fine-tuned the model using the Spanish dataset translated to English (in addition to the English dataset). Notably different was the approach of the sail team [50], who used OpenAI Embeddings5 in combination with a deep feed-forward neural network for fine-tuning. Additionally, they pre-processed the texts by replacing named entities with entity classes such as ’PERSON’, in order to “enhance the model’s generalization capabilities” [50]. They showed that, for Subtask 1, the masked model performs better than the non-masked one.

Results for Spanish
Many of the teams that did well on Spanish also achieved top results on English. For these teams, we briefly describe the differences between the two approaches, and refer the reader to the English section of Subtask 1 for details. Top performance was obtained by the SINAI team [41], which relied on LLMs. In contrast to English, the fine-tuned GPT-3.5 model [43] outperformed LLaMA3-8B-instruct [42] by a large margin, yielding the best overall solution. The second and third positions are held by the two fully multilingual approaches of the auxR and RD-IA-FUN teams [40], which also performed well on English. Interestingly, five out of the six following teams (Elias&Sergio, AI_Fusion, zhengqiaozeng, virmel, trustno1, Zleon) employed standard SLM fine-tuning with PlanTL-GOB-ES/roberta-base-bne [65] as the base model. The exception is the zhengqiaozeng team [52], which relied on the multilingual XLM-RoBERTa model. The tulbure team [54] relied on an ensemble of three Spanish SLMs.
The sail team [50] used the same approach as for English, based on multilingual OpenAI Embeddings. The nlpln team [55] surpassed the baseline using an approach unconventional in the context of this challenge: zero-shot prompting of LLMs combined with the chain-of-thought prompting technique [66]. We note that the same approach scored competitively on the English classification subtask, achieving an MCC of 0.7844 (see Appendix A). The nlpln team [55] tested a number of LLMs, including GPT, Claude, and Gemini, on the full training set. The DeepSeek V2 model [67], a large mixture-of-experts LLM, achieved the best results. Surprisingly, the results on the test data proved this model to be relatively competitive with fine-tuned LLMs.

Analysis

The results of the top teams suggest that the most successful English transformer-based models are the DeBERTa model [60], the ELECTRA model [61], and the large LLaMA3-8B-instruct LLM [42]. The Covid-twitter-BERT [18] model was used by a number of high-performing teams, suggesting that pre-training on social media data likely benefits performance. However, both BERT [31] and RoBERTa [62] were shown to perform competitively. The performance edge obtained by the IUCL team [56] suggests that the LOCO conspiracy corpus [11] is a useful resource for boosting conspiracy-related classifiers in other use-cases. In Spanish, the choice of model seems to be more important: many of the best teams used the Spanish 'Maria' RoBERTa model [65], trained exclusively on data crawled from the web, while none of the top teams employed either the BETO [32] or BERTIN [68] models. Moreover, the top three teams employed either fine-tuned LLMs [41] (GPT-3.5 [43]) or multilingual models [40, 63]. These teams, especially the top one based on LLMs, outperformed the others by a significant margin.
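The zero-shot chain-of-thought setup described above can be sketched as follows, assuming an OpenAI-style chat-message format; the prompt wording is our illustration, not the nlpln team's actual prompt:

```python
# Hedged sketch of zero-shot chain-of-thought classification prompting.
# The message dicts follow the common {"role", "content"} chat convention.

def build_cot_prompt(text: str) -> list:
    system = (
        "You classify COVID-19-related Telegram messages. A CONSPIRACY text "
        "alleges a secret plot by powerful malicious groups; a CRITICAL text "
        "questions official policies without alleging such a plot."
    )
    user = (
        f"Text: {text}\n\n"
        "Think step by step: who is blamed, is a hidden plot alleged, and is "
        "the criticism aimed at policy or at secret actors? "
        "Then answer with exactly one label: CONSPIRACY or CRITICAL."
    )
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

messages = build_cot_prompt("The new mandates make no sense to me.")
print(messages[0]["role"], messages[1]["role"])  # system user
```

The final label would be parsed from the model's reply; requiring a single closing label after the reasoning steps makes this parsing robust.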
Interestingly, none of the participants used RoBERTuito [69], a model pretrained on Spanish social media text. It would be interesting to perform ablation studies in both languages in order to measure the influence of both architectural improvements and the choice of the pretraining dataset on performance. As for the application of LLMs [35], the results on English show no large difference between fine-tuned LLMs and fine-tuned SLMs. We therefore hypothesize that the superiority of fine-tuned GPT-3.5 [43] on Spanish is due to the pre-training data (GPT-3.5 has probably "seen" many more social media texts than the Spanish SLMs). The results of the nlpln team [55] demonstrate the competitiveness, in both languages, of the DeepSeek V2 model [67] in combination with chain-of-thought prompting [66]. This approach therefore seems to be a good way to quickly bootstrap a conspiracy vs. critical classifier for other use-cases and other supported languages. The approach of Sahitaj et al. [49], based on using LLM-generated elaborations of a text's context and argumentation as additional input for classification, might prove beneficial for improving LLM-based zero-shot prompting. A number of teams opted to use non-neural text classifiers, such as a linear SVM [70] or Random Forest [71], in combination with tf-idf- or n-gram-based features. The average score of these approaches is 0.7080 MCC for English, and 0.5814 MCC for Spanish. The baseline systems [1] were based on BERT [31] and BETO [32], respectively, for the English and Spanish datasets. These models were chosen as the baseline because they yielded the weakest performance in Korenčić et al. [1]. The best performance, corresponding to the state-of-the-art before this challenge, was obtained with the DeBERTaV3 [72] and 'BERTIN' RoBERTa [68] models. When these models were applied to the train-test split of the challenge, MCC scores of 0.8259 and 0.6681 were obtained, respectively, for English and Spanish.
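A non-neural baseline of the kind mentioned above, tf-idf features with a linear SVM, can be sketched with scikit-learn; the toy training texts and classifier settings below are illustrative, not any team's actual configuration:

```python
# Minimal tf-idf + linear SVM text classifier (illustrative data and settings).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "a secret elite planned the virus to control us",
    "they are hiding the cure from the public on purpose",
    "the new restrictions are disproportionate and poorly justified",
    "i disagree with the mandate but there is no hidden plot",
]
labels = ["CONSPIRACY", "CONSPIRACY", "CRITICAL", "CRITICAL"]

# Unigram and bigram tf-idf features feed a linear SVM classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["a hidden group is hiding the truth to control us"])[0])
```

Such pipelines train in seconds and offer interpretable feature weights, which helps explain their popularity despite the roughly 0.1 MCC gap to fine-tuned transformers reported above.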
The score of DeBERTaV3 represents an improvement over BERT. Even with this improvement, the participants managed to improve upon the state-of-the-art performance.

6.2. Detecting Elements of the Oppositional Narratives (Subtask 2)

Table 4 contains the results of the most successful teams on Subtask 2 – the teams with performance equal to or greater than that of the provided baseline.

Results for English

The most successful team, tulbure [54], relied on a combination of preprocessing techniques and data augmentation. While the provided baseline used multi-task learning to account for overlapping spans of different categories [1], Tulbure and Coll Ardanuy [54] opted to use a single model for all the span categories and modified the data accordingly. Additionally, each Telegram text was segmented into sentences, which were used as examples for learning. This solved the problem of texts longer than the maximum input length supported by the transformer. Data augmentation was performed by "replacing words in the texts by synonyms or semantically-related words", and the RoBERTa model [62] was used as the base model. As the remaining teams mostly relied on modifying the multi-task sequence labeling approach of the baseline [1], this will be the assumed default approach; only where another approach was used will the difference be described. The second-placed team, Zleon [48], used a large variant of RoBERTa [62] and increased the model's maximum sequence length to 512. The third-placed team, hinlole [53], used Covid-twitter-BERT [18] as the base model.
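The sentence-segmentation step used by the top team can be sketched as follows; the regex splitter and word-count length proxy are simplifications (a real system would use a proper sentence splitter and measure length in the model tokenizer's subword tokens):

```python
import re

def segment(text: str, max_words: int = 64) -> list:
    """Split a long post into sentence groups that fit a length budget."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sent in sentences:
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and len(" ".join(current + [sent]).split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks

post = "First claim here. Second claim follows! Is this a question? Final remark."
print(segment(post, max_words=6))
# ['First claim here. Second claim follows!', 'Is this a question? Final remark.']
```

Each chunk then becomes a separate training example, so no span annotation is truncated by the transformer's maximum input length.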
English                                 Spanish
TEAM                      span-F1       TEAM                      span-F1
tulbure [54]              0.6279        tulbure [54]              0.6129
Zleon [48]                0.6089        Zleon [48]                0.5875
hinlole [53]              0.5886        AI_Fusion                 0.5777
oppositional_opposition   0.5866        CHEEXIST                  0.5621
AI_Fusion                 0.5805        virmel                    0.5616
virmel                    0.5742        miqarn                    0.5603
miqarn                    0.5739        DSVS [46]                 0.5529
TargaMarhuenda            0.5701        TargaMarhuenda            0.5364
ezio [44]                 0.5694        Elias&Sergio              0.5151
zhengqiaozeng [52]        0.5666        hinlole [53]              0.4994
Elias&Sergio              0.5627        baseline-BETO             0.4934
DSVS [46]                 0.5598
CHEEXIST                  0.5524
rfenthusiasts             0.5479
ALC-UPV-JD-2              0.5377
baseline-BERT             0.5323

Table 4: Performance of top teams, in terms of the span-F1 metric [29] (macro-averaged over span labels), on Subtask 2 – token classification of span-level narrative elements.

The oppositional_opposition team used the DistilBERT model [73] in combination with Conditional Random Fields [74]. Interestingly, the same type of model was used for Subtask 2 in Spanish, but achieved a very low result (see Table 10 in Appendix A), suggesting overfitting or a failure to converge. The AI_Fusion team used the RoBERTa model [62] and selected the best model across the 50 fine-tuning epochs. The virmel team used the RoBERTa model with the maximum sequence length set to 512. The zhengqiaozeng team [52] employed the RoBERTa model, while the ALC-UPV-JD-2 team relied on the small ALBERT model [75]. The miqarn team used the multilingual mBERT model [38], trained on the datasets in both languages. This approach also performed well on the Spanish dataset. The TargaMarhuenda team used the RoBERTa model, and added pre-computed POS tags as input by concatenating them with the model's token embeddings to form the input to the transformer's initial layer. The Elias&Sergio team used a similar approach, but concatenated one-hot POS vectors with the token representations of the final layer of the transformer to construct the input to the token classification head.
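The POS-feature idea described above can be sketched in plain Python; the dimensions and tag set are illustrative, and a real system would concatenate learned transformer hidden states, not random vectors:

```python
import random

N_TAGS, HIDDEN = 5, 8  # illustrative POS tag-set size and hidden width

def one_hot(tag_id: int, n: int = N_TAGS) -> list:
    """One-hot encode a POS tag id."""
    return [1.0 if i == tag_id else 0.0 for i in range(n)]

def augment(token_reprs: list, pos_ids: list) -> list:
    """Append each token's one-hot POS vector to its final-layer representation."""
    return [h + one_hot(t) for h, t in zip(token_reprs, pos_ids)]

# Stand-in for final-layer token representations of a 3-token input.
tokens = [[random.random() for _ in range(HIDDEN)] for _ in range(3)]
features = augment(tokens, [0, 2, 4])
print(len(features), len(features[0]))  # 3 tokens, 8 + 5 = 13 dims each
```

The token classification head then operates on the widened vectors, so the POS signal reaches the classifier without retraining the transformer's embedding layer.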
The ezio team [44] modified the multi-task approach using a "BiGRU LSTM", a bidirectional recurrent network based on gated recurrent units [45]. Instead of using simple per-task classification heads, each task was assigned both a task-specific recurrent network and a task-specific classification head. Covid-twitter-BERT [18] was used as the base model. The DSVS team [46] created an ensemble of token classifiers based on different SLMs such as BERT, RoBERTa, and ELECTRA, and performed "logit averaging" to obtain the final predictions. The CHEEXIST team used the Fake-News-Bert-Detect model, a domain-adapted version of RoBERTa. Additionally, they replaced the final classification layer with a shallow neural network. The rfenthusiasts team used the DeBERTaV3 model [72] and performed data augmentation by replacing characters in the text. The same approach, when used in combination with the XLM-RoBERTa model [63], did not work well on the Spanish dataset.

Results for Spanish

All of the teams that achieved top results on the Spanish dataset also did so on the English dataset. Therefore, here we only briefly describe the differences, which mostly pertain to a different choice of transformer model. As for English, the majority of approaches relied on the multi-task sequence labeling approach of the baseline [1]. The same two teams, tulbure and Zleon, took first and second place, as on the English dataset. Both relied on the same respective approaches they used on English, with the difference of using the Spanish 'Maria' RoBERTa model [65]. The AI_Fusion team, placed third, relied on the XLM-RoBERTa model [63], while the virmel team relied on the Spanish 'BERTIN' RoBERTa model [68]. The CHEEXIST team used the 'Maria' RoBERTa model [65]. The miqarn team used a single mBERT model [38] fine-tuned on both datasets, and achieved good results on Spanish. The DSVS team's [46] ensemble approach also achieved good results on the Spanish dataset.
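The logit-averaging step can be sketched as follows: per-token logits from several fine-tuned models are averaged before taking the argmax label (the logit values below are made up for illustration):

```python
def ensemble_predict(per_model_logits: list) -> int:
    """Average the per-class logits from several models and return the argmax class."""
    n = len(per_model_logits)
    n_classes = len(per_model_logits[0])
    avg = [sum(m[i] for m in per_model_logits) / n for i in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)

# Three models score one token over three span labels; the average favours label 1.
logits = [[0.2, 2.0, 0.1], [1.5, 0.9, 0.3], [0.1, 1.8, 0.2]]
print(ensemble_predict(logits))  # 1
```

Averaging logits rather than hard labels lets a confident model outvote two weakly uncertain ones, which is a common motivation for this ensembling choice [36].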
The ensemble consisted of a number of Spanish and multilingual models [46]. The two approaches based on using POS tags as additional model input, those of the TargaMarhuenda and Elias&Sergio teams, relied on the Spanish RoBERTa model. The hinlole team [53] relied on the Spanish BETO model [32].

Analysis

The system that clearly outperformed the others in both languages was that of the tulbure team [54]. Its sentence-level processing of texts shows that the signals for inferring the elements of oppositional narratives are largely sentence-local. It would be interesting to perform ablation studies to determine how much data augmentation influences performance in contrast to sentence segmentation. Further improvements might be achieved by using multi-task learning and transformers other than RoBERTa, as well as other data augmentation techniques, possibly based on LLMs. The competitive results of the Zleon team [48] and several other teams relying on the multi-task baseline approach show its effectiveness in combination with an improved choice of backbone SLM and an increased maximum sequence length. Covid-twitter-BERT [18], used by the second- and third-placed teams, seems to be a successful choice for English. Performance on Subtask 2 seems to be less influenced by the choice of transformer model, especially in the case of Spanish. Concretely, a larger variety of models appears among the top teams and, in the case of Spanish, all three families of models (BETO [32], BERTIN [68], and 'Maria' [65]) are represented. The approach of the miqarn team, based on the multilingual mBERT model [38], worked well for both languages and could be a good approach for inferring the elements of oppositional narratives in other languages, especially under-resourced ones. The baseline systems [1] were based on the BERT [31] and BETO [32] models, respectively, for the English and Spanish datasets.
They were chosen since they yielded the weakest performance in Korenčić et al. [1]. The top performance, corresponding to the state-of-the-art before this challenge, was obtained with the DeBERTaV3 [72] and BERTIN [68] models. When these models were applied to the train-test split of the challenge, span-F1 scores of 0.5786 and 0.5369 were obtained, respectively, for English and Spanish. These scores represent an improvement over the baseline, but even so, the participants managed to significantly raise the state-of-the-art performance on the task.

7. Conclusions

The Oppositional Thinking Analysis PAN Task presented the NLP community with two subtasks: distinguishing between critical and conspiratorial messages, and detecting elements of oppositional narratives. These subtasks are of interest to computational social scientists interested in text-based analysis of oppositional thinking [1]. A total of 82 teams participated in the challenge, while 17 teams provided working notes papers. The teams devised a range of solutions, the most successful of which exceeded the previous state-of-the-art [1] on both subtasks. The new solutions have the potential to facilitate researchers in applying the topic-agnostic annotation scheme proposed in Korenčić et al. [1] to new corpora. For Subtask 1, the most successful submitted English system [56] relied on augmentation using the large news conspiracy corpus LOCO [11]. The best result for Spanish was achieved using a fine-tuned GPT-3.5 [41]. The multilingual approach of Huertas-García et al. [40] also proved competitive. The LLM-based zero-shot approach of Liu et al. [55] achieved results competitive with supervised baselines on Subtask 1 and demonstrated a cost-effective way to bootstrap conspiracy vs. critical classifiers for new use-cases. The experiments also point to the need to create better small-scale transformer models for Spanish, as the solutions that work best on the Spanish dataset rely either on LLMs or on multilingual SLMs.
For Subtask 2, the top system in both languages relied on a combination of data augmentation by word replacement and sentence-level processing [54]. Most of the other systems relied on improving the provided baseline solution by changing the underlying transformer model or by modifying the training procedure. There are many possible directions for creating even better-performing systems. Crafting new domain-specific SLMs would probably be beneficial, as demonstrated by the effectiveness of Covid-twitter-BERT [18] on both subtasks. Bearing in mind the difficulty of creating high-quality annotated data, further work on LLM-based zero- and few-shot approaches would be beneficial for practitioners. Similarly, multilingual approaches adaptable to new languages with few annotated examples [76] would also be an interesting and potentially effective direction to pursue. If the topic-agnostic annotation scheme [1] used for this task is applied to create new labeled corpora, it would be interesting to use these corpora for benchmarking the approach of Gómez-Romero et al. [50], which focuses on the generalization capabilities of the models.

Acknowledgments

The shared task on Oppositional Thinking Analysis was organised in the framework of the project XAI-DisInfodemics: eXplainable AI for disinformation and conspiracy detection during infodemics (MICIN PLEC2021-007681), funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR. The work of Damir Korenčić and Berta Chulvi was conducted while at Universitat Politècnica de València.

References

[1] D. Korenčić, B. Chulvi, X. Bonet, M. Taulé, A. Toselli, P. Rosso, What distinguishes conspiracy from critical narratives? A computational analysis of oppositional discourse, Expert Systems (2024). doi:10.1111/exsy.13671.
[2] K. M. Douglas, R. M. Sutton, What are conspiracy theories? A definitional approach to their correlates, consequences, and communication, Annual Review of Psychology 74 (2023) 271–298.
URL: https://doi.org/10.1146/annurev-psych-032420-031329.
[3] H. Tajfel, J. C. Turner, An integrative theory of intergroup relations, Psychology of intergroup relations (1979) 33–47.
[4] R. Brown, Social identity theory: past achievements, current problems and future challenges, European Journal of Social Psychology 30 (2000) 745–778. doi:10.1002/1099-0992(200011/12)30:6<745::AID-EJSP24>3.0.CO;2-O.
[5] M. A. Hogg, Social identity theory (2016). doi:10.1007/978-3-319-29869-6_1.
[6] R. M. Sutton, K. M. Douglas, Rabbit hole syndrome: Inadvertent, accelerating, and entrenched commitment to conspiracy beliefs, Current Opinion in Psychology 48 (2022) 101462. URL: https://www.sciencedirect.com/science/article/pii/S2352250X2200183X. doi:10.1016/j.copsyc.2022.101462.
[7] E. Funkhouser, A tribal mind: Beliefs that signal group identity or commitment, Mind & Language 37 (2022) 444–464. URL: https://onlinelibrary.wiley.com/doi/abs/10.1111/mila.12326. doi:10.1111/mila.12326.
[8] B. Franks, A. Bangerter, M. W. Bauer, M. Hall, M. C. Noort, Beyond "monologicality"? Exploring conspiracist worldviews, Frontiers in Psychology 8 (2017). URL: https://www.frontiersin.org/articles/10.3389/fpsyg.2017.00861. doi:10.3389/fpsyg.2017.00861.
[9] D. Mahl, M. S. Schäfer, J. Zeng, Conspiracy theories in online environments: An interdisciplinary literature review and agenda for future research, New Media & Society (2022). URL: https://doi.org/10.1177/14614448221075759. doi:10.1177/14614448221075759.
[10] J. E. Uscinski, J. Parent, B. Torres, Conspiracy Theories are for Losers, 2011. URL: https://papers.ssrn.com/abstract=1901755, APSA 2011 Annual Meeting Paper.
[11] A. Miani, T. Hills, A. Bangerter, LOCO: The 88-million-word language of conspiracy corpus, Behavior Research Methods (2021) 1–24.
[12] J. Langguth, D. T.
Schroeder, P. Filkuková, S. Brenner, J. Phillips, K. Pogorelov, CoCo: an annotated Twitter dataset of COVID-19 conspiracy theories, Journal of Computational Social Science (2023) 1–42.
[13] K. Pogorelov, D. T. Schroeder, S. Brenner, J. Langguth, FakeNews: Corona Virus and Conspiracies Multimedia Analysis Task at MediaEval 2021, in: Working Notes Proceedings of the MediaEval 2021 Workshop, Bergen, Norway and Online, 2021.
[14] K. Pogorelov, D. T. Schroeder, S. Brenner, A. Maulana, J. Langguth, Combining tweets and connections graph for FakeNews detection at MediaEval 2022, in: Proceedings of the MediaEval 2022 Workshop, Bergen, Norway and Online, 12–13 January 2023, 2023.
[15] Y. Peskine, G. Alfarano, I. Harrando, P. Papotti, R. Troncy, Detecting COVID-19-related conspiracy theories in tweets, in: MediaEval 2021, MediaEval Benchmarking Initiative for Multimedia Evaluation Workshop, 13–15 December 2021, 2021.
[16] Y. Peskine, P. Papotti, R. Troncy, Detection of COVID-19-Related Conspiracy Theories in Tweets using Transformer-Based Models and Node Embedding Techniques, in: Working Notes Proceedings of the MediaEval 2022 Workshop, Bergen, Norway and Online, 2023.
[17] D. Korenčić, I. Grubišić, A. H. Toselli, B. Chulvi, P. Rosso, Tackling Covid-19 Conspiracies on Twitter using BERT Ensembles, GPT-3 Augmentation, and Graph NNs, in: Working Notes Proceedings of the MediaEval 2022 Workshop, Bergen, Norway and Online, 2023. URL: https://2022.multimediaeval.com/paper8969.pdf.
[18] M. Müller, M. Salathé, P. E. Kummervold, COVID-Twitter-BERT: A natural language processing model to analyse COVID-19 content on Twitter, Frontiers in Artificial Intelligence 6 (2023). URL: https://www.frontiersin.org/articles/10.3389/frai.2023.1023281. doi:10.3389/frai.2023.1023281.
[19] A. Giachanou, B. Ghanem, P. Rosso, Detection of conspiracy propagators using psycho-linguistic characteristics, Journal of Information Science 49 (2021) 3–17. doi:10.1177/0165551520985486.
[20] J. D. Moffitt, C.
King, K. M. Carley, Hunting conspiracy theories during the COVID-19 pandemic, Social Media + Society 7 (2021). doi:10.1177/20563051211043212.
[21] A. Bessi, Personality traits and echo chambers on Facebook, Computers in Human Behavior 65 (2016) 319–324. URL: https://www.sciencedirect.com/science/article/pii/S0747563216305817. doi:10.1016/j.chb.2016.08.016.
[22] C. Klein, P. Clutton, V. Polito, Topic Modeling Reveals Distinct Interests within an Online Conspiracy Forum, Frontiers in Psychology 9 (2018). URL: https://www.frontiersin.org/articles/10.3389/fpsyg.2018.00189.
[23] M. Samory, T. Mitra, 'The Government Spies Using Our Webcams': The Language of Conspiracy Theories in Online Discussions, Proceedings of the ACM on Human-Computer Interaction 2 (2018) 1–24. URL: https://dl.acm.org/doi/10.1145/3274421. doi:10.1145/3274421.
[24] S. Levy, M. Saxon, W. Y. Wang, Investigating Memorization of Conspiracy Theories in Text Generation, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics, Online, 2021, pp. 4718–4729. URL: https://aclanthology.org/2021.findings-acl.416. doi:10.18653/v1/2021.findings-acl.416.
[25] J. Introne, A. Korsunska, L. Krsova, Z. Zhang, Mapping the Narrative Ecosystem of Conspiracy Theories in Online Anti-vaccination Discussions, in: International Conference on Social Media and Society, Association for Computing Machinery, 2020, pp. 184–192. URL: https://dl.acm.org/doi/10.1145/3400806.3400828. doi:10.1145/3400806.3400828.
[26] P. Holur, T. Wang, S. Shahsavari, T. Tangherlini, V. Roychowdhury, Which side are you on? Insider-Outsider classification in conspiracy-theoretic social media, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 4975–4987. URL: https://aclanthology.org/2022.acl-long.341. doi:10.18653/v1/2022.acl-long.341.
[27] P.
Wagner-Egger, A. Bangerter, S. Delouvée, S. Dieguez, Awake together: Sociopsychological processes of engagement in conspiracist communities, Current Opinion in Psychology 47 (2022) 101417. URL: https://www.sciencedirect.com/science/article/pii/S2352250X22001385. doi:10.1016/j.copsyc.2022.101417.
[28] D. Chicco, N. Tötsch, G. Jurman, The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation, BioData Mining 14 (2021) 13. URL: https://doi.org/10.1186/s13040-021-00244-z. doi:10.1186/s13040-021-00244-z.
[29] G. Da San Martino, S. Yu, A. Barrón-Cedeño, R. Petrov, P. Nakov, Fine-Grained Analysis of Propaganda in News Articles, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 5636–5646. URL: https://aclanthology.org/D19-1565. doi:10.18653/v1/D19-1565.
[30] D. Chicco, G. Jurman, The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation, BMC Genomics 21 (2020) 6. URL: https://doi.org/10.1186/s12864-019-6413-7. doi:10.1186/s12864-019-6413-7.
[31] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[32] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish Pre-trained BERT Model and Evaluation Data, 2023. URL: http://arxiv.org/abs/2308.02976.
arXiv:2308.02976.
[33] S. Ruder, An Overview of Multi-Task Learning in Deep Neural Networks, 2017. URL: http://arxiv.org/abs/1706.05098. arXiv:1706.05098.
[34] M. Zhang, K. Jensen, S. Sonniks, B. Plank, SkillSpan: Hard and soft skill extraction from English job postings, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Seattle, United States, 2022, pp. 4962–4984. URL: https://aclanthology.org/2022.naacl-main.366. doi:10.18653/v1/2022.naacl-main.366.
[35] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, Y. Du, C. Yang, Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu, P. Liu, J.-Y. Nie, J.-R. Wen, A survey of large language models, 2023. URL: https://arxiv.org/abs/2303.18223. arXiv:2303.18223.
[36] T. G. Dietterich, Ensemble methods in machine learning, in: Multiple Classifier Systems, Springer Berlin Heidelberg, Berlin, Heidelberg, 2000, pp. 1–15.
[37] C. Shorten, T. M. Khoshgoftaar, B. Furht, Text data augmentation for deep learning, Journal of Big Data 8 (2021) 101.
[38] T. Pires, E. Schlinger, D. Garrette, How multilingual is multilingual BERT?, in: A. Korhonen, D. Traum, L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 4996–5001. URL: https://aclanthology.org/P19-1493. doi:10.18653/v1/P19-1493.
[39] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017.
URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[40] Á. Huertas-García, C. Martí-González, J. Muñoz, E. Ambite, Small Language Models and Large Language Models in Oppositional thinking analysis: Capabilities and Biases and Challenges, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024.
[41] M. Vallecillo-Rodríguez, M. Martín-Valdivia, A. Montejo-Ráez, SINAI at PAN 2024 Oppositional Thinking Analysis: Exploring the fine-tuning performance of LLMs, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024.
[42] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, 2023. URL: https://arxiv.org/abs/2302.13971. arXiv:2302.13971.
[43] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1877–1901. URL: https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
[44] Q. Hu, Z. Han, J. Peng, M. Guo, C. Liu, An Oppositional Thinking Analysis Method Using BERT-based Model with BiGRU, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024.
[45] K. Cho, B.
van Merriënboer, D. Bahdanau, Y. Bengio, On the properties of neural machine translation: Encoder–decoder approaches, in: Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, 2014, pp. 103–111.
[46] S. Damian, B. Herrera-Gonzalez, D. Vazquez-Santana, H. Calvo, E. Felipe-Riverón, C. Yáñez-Márquez, DSVS at PAN 2024: Ensemble Approach of Large Language Models for Analyzing Conspiracy Theories Against Critical Thinking Narratives, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024.
[47] I. Sánchez-Hermosilla, A. Panizo Lledot, D. Camacho, A Study on NLP Model Ensembles and Data Augmentation Techniques for Separating Critical Thinking from Conspiracy Theories in English Texts, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024.
[48] L. Zrnić, Conspiracy theory detection using transformers with multi-task and multilingual approaches, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024.
[49] A. Sahitaj, P. Sahitaj, S. Mohtaj, S. Möller, V. Schmitt, Towards a Computational Framework for Distinguishing Critical and Conspiratorial Texts by Elaborating on the Context and Argumentation with LLMs, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024.
[50] J. Gómez-Romero, S. González-Silot, A. Montoro-Montarroso, M. Molina-Solana, E. Martínez Cámara, Detection of conspiracy-related messages in Telegram with anonymized named entities, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S.
de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024. [51] S. Mahesh, S. Divakaran, K. Girish, S. Lakshmaiah, Binary Battle: Leveraging ML and TL Models to Distinguish between Conspiracy Theories and Critical Thinking, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024. [52] Z. Zeng, Z. Han, J. Ye, Y. Tan, H. Cao, Z. Li, R. Huang, A Conspiracy Theory Text Detection Method based on RoBERTa and XLM-RoBERTa Models, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024. [53] J. Huang, Z. Han, R. Zhu, M. Guo, K. Sun, Conspiracy Theory Text Classification Based on CT-BERT and BETO Models, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024. [54] A. Tulbure, M. Coll Ardanuy, Conspiracy vs critical thinking using an ensemble of transformers with data augmentation techniques, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024. [55] B. Liu, Z. Han, H. Cao, An Approach to Classifying Conspiratorial and Critical Public Health Narratives, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024. [56] S. Mhalgi, S. Pulipaka, S. Kübler, IUCL at PAN 2024: Using Data Augmentation for Conspiracy Theory Detection, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024. [57] P. Balasundaram, K. Swaminathan, O. Sampath, P. 
Km, Oppositional Thinking Analysis: Conspiracy Theories vs Critical Thinking Narratives, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024.
[58] A. Albladi, C. Seals, Detection of Conspiracy vs. Critical Narratives and Their Elements using NLP, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024.
[59] D. Espinosa, G. Sidorov, E. Ricárdez-Vázquez, Using BERT to Identify Conspiracy Theories, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024.
[60] P. He, X. Liu, J. Gao, W. Chen, DeBERTa: Decoding-enhanced BERT with disentangled attention, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=XPZIaotutsD.
[61] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, ELECTRA: Pre-training text encoders as discriminators rather than generators, 2020. URL: https://arxiv.org/abs/2003.10555. arXiv:2003.10555.
[62] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019. arXiv:1907.11692.
[63] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, 2020. URL: https://arxiv.org/abs/1911.02116. arXiv:1911.02116.
[64] L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder, F. Wei, Multilingual E5 text embeddings: A technical report, 2024. URL: https://arxiv.org/abs/2402.05672. arXiv:2402.05672.
[65] A. Gutiérrez-Fandiño, J. Armengol-Estapé, M. Pàmies, J. Llop-Palao, J. Silveira-Ocampo, C. P. Carrino, C. Armentano-Oller, C. Rodriguez-Penagos, A. Gonzalez-Agirre, M.
Villegas, Maria: Spanish language models, Procesamiento del Lenguaje Natural (2022) 39–60. URL: https://doi.org/ 10.26342/2022-68-3. doi:10.26342/2022-68-3. [66] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, 2023. URL: https://arxiv.org/abs/2201.11903. arXiv:2201.11903. [67] DeepSeek-AI, A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, C. Dengr, C. Ruan, D. Dai, D. Guo, D. Yang, D. Chen, D. Ji, E. Li, F. Lin, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Xu, H. Yang, H. Zhang, H. Ding, H. Xin, H. Gao, H. Li, H. Qu, J. L. Cai, J. Liang, J. Guo, J. Ni, J. Li, J. Chen, J. Yuan, J. Qiu, J. Song, K. Dong, K. Gao, K. Guan, L. Wang, L. Zhang, L. Xu, L. Xia, L. Zhao, L. Zhang, M. Li, M. Wang, M. Zhang, M. Zhang, M. Tang, M. Li, N. Tian, P. Huang, P. Wang, P. Zhang, Q. Zhu, Q. Chen, Q. Du, R. J. Chen, R. L. Jin, R. Ge, R. Pan, R. Xu, R. Chen, S. S. Li, S. Lu, S. Zhou, S. Chen, S. Wu, S. Ye, S. Ma, S. Wang, S. Zhou, S. Yu, S. Zhou, S. Zheng, T. Wang, T. Pei, T. Yuan, T. Sun, W. L. Xiao, W. Zeng, W. An, W. Liu, W. Liang, W. Gao, W. Zhang, X. Q. Li, X. Jin, X. Wang, X. Bi, X. Liu, X. Wang, X. Shen, X. Chen, X. Chen, X. Nie, X. Sun, X. Wang, X. Liu, X. Xie, X. Yu, X. Song, X. Zhou, X. Yang, X. Lu, X. Su, Y. Wu, Y. K. Li, Y. X. Wei, Y. X. Zhu, Y. Xu, Y. Huang, Y. Li, Y. Zhao, Y. Sun, Y. Li, Y. Wang, Y. Zheng, Y. Zhang, Y. Xiong, Y. Zhao, Y. He, Y. Tang, Y. Piao, Y. Dong, Y. Tan, Y. Liu, Y. Wang, Y. Guo, Y. Zhu, Y. Wang, Y. Zou, Y. Zha, Y. Ma, Y. Yan, Y. You, Y. Liu, Z. Z. Ren, Z. Ren, Z. Sha, Z. Fu, Z. Huang, Z. Zhang, Z. Xie, Z. Hao, Z. Shao, Z. Wen, Z. Xu, Z. Zhang, Z. Li, Z. Wang, Z. Gu, Z. Li, Z. Xie, Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model, 2024. URL: https://arxiv.org/abs/2405.04434. arXiv:2405.04434. [68] J. D. l. Rosa, E. G. Ponferrada, M. Romero, P. Villegas, P. G. d. P. Salas, M. 
Grandury, BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling, Procesamiento del Lenguaje Natural 68 (2022) 13–23. URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/ article/view/6403, number: 0. [69] J. M. Pérez, D. A. Furman, L. A. Alemany, F. Luque, Robertuito: a pre-trained language model for social media text in spanish, 2022. URL: https://arxiv.org/abs/2111.09453. arXiv:2111.09453. [70] T. Joachims, Text categorization with support vector machines: Learning with many relevant features, in: C. Nédellec, C. Rouveirol (Eds.), Machine Learning: ECML-98, Springer Berlin Heidelberg, Berlin, Heidelberg, 1998, pp. 137–142. [71] L. Breiman, Random forests, Machine learning 45 (2001) 5–32. [72] P. He, J. Gao, W. Chen, DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing, in: International Conference on Learning Representations, 2023. URL: https://openreview.net/forum?id=sE7-XhLxHA. [73] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter, 2020. URL: https://arxiv.org/abs/1910.01108. arXiv:1910.01108. [74] J. D. Lafferty, A. McCallum, F. C. N. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, in: Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2001, p. 282–289. [75] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, Albert: A lite bert for self- supervised learning of language representations, 2020. URL: https://arxiv.org/abs/1909.11942. arXiv:1909.11942. [76] F. D. Schmidt, I. Vulić, G. Glavaš, Don’t stop fine-tuning: On training regimes for few-shot cross- lingual transfer with multilingual language models, in: Y. Goldberg, Z. Kozareva, Y. 
Zhang (Eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Associ- ation for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 10725–10742. URL: https://aclanthology.org/2022.emnlp-main.736. doi:10.18653/v1/2022.emnlp-main.736. TASK 1 - ENGLISH POSITION TEAM MCC F1-MACRO F1-CONSPIRACY F1-CRITICAL 1 IUCL [56] 0.8388 0.9194 0.8947 0.9441 2 AI_Fusion 0.8303 0.9147 0.8866 0.9429 3 SINAI [41] 0.8297 0.9149 0.8886 0.9412 4 ezio [44] 0.8212 0.9097 0.8792 0.9402 5 hinlole [53] 0.8198 0.9098 0.8811 0.9386 6 Zleon [48] 0.8195 0.9096 0.8804 0.9388 7 virmel 0.8192 0.9092 0.8793 0.9391 8 inaki [47] 0.8149 0.9072 0.8770 0.9374 9 yeste 0.8124 0.9057 0.8746 0.9368 10 auxR 0.8088 0.9043 0.8739 0.9347 11 Elias&Sergio 0.8034 0.9012 0.8687 0.9338 12 theateam 0.8031 0.8999 0.8650 0.9347 13 trustno1 0.7983 0.8991 0.8675 0.9307 14 DSVS [46] 0.7970 0.8985 0.8674 0.9296 15 sail [50] 0.7969 0.8978 0.8687 0.9268 16 ojo-bes 0.7969 0.8981 0.8648 0.9314 17 RD-IA-FUN [40] 0.7965 0.8977 0.8636 0.9317 baseline-BERT 0.7964 0.8975 0.8632 0.9318 18 aish_team [58] 0.7917 0.8944 0.8580 0.9309 19 rfenthusiasts 0.7902 0.8948 0.8605 0.9291 20 Dap_upv 0.7898 0.8944 0.8593 0.9294 21 oppositional_opposition 0.7894 0.8935 0.8571 0.9300 22 miqarn 0.7881 0.8938 0.8593 0.9283 23 CHEEXIST 0.7875 0.8932 0.8576 0.9287 24 tulbure [54] 0.7872 0.8917 0.8536 0.9297 25 XplaiNLP [49] 0.7871 0.8922 0.8550 0.9294 26 TheGymNerds 0.7854 0.8923 0.8567 0.9278 27 nlpln [55] 0.7844 0.8922 0.8580 0.9263 28 RalloRico 0.7771 0.8879 0.8559 0.9198 29 LasGarcias 0.7758 0.8855 0.8447 0.9263 30 zhengqiaozeng [52] 0.7758 0.8866 0.8476 0.9256 31 ALC-UPV-JD-2 0.7725 0.8860 0.8491 0.9230 32 LorenaEloy 0.7713 0.8847 0.8455 0.9239 33 lnr-alhu 0.7708 0.8853 0.8488 0.9219 34 NACKO 0.7692 0.8838 0.8446 0.9230 35 paranoia-pulverizers 0.7680 0.8838 0.8462 0.9215 36 DiTana 0.7653 0.8806 0.8490 0.9123 37 FredYNed 0.7643 0.8806 0.8392 0.9220 38 dannuchihaxxx [59] 0.7643 0.8801 0.8377 
0.9224 39 lnr-detectives 0.7631 0.8806 0.8472 0.9141 40 TargaMarhuenda 0.7617 0.8807 0.8424 0.9190 41 Trainers 0.7596 0.8797 0.8412 0.9182 Table 5 Results and rankings of the teams participating on Task 1 – binary classification of text as either conspiracy or critical, for English texts. Performance metrics are: Matthews correlation coefficient, macro-averaged F1, and per-class binary F1’s. A. Appendix: Detailed Results TASK 1 - ENGLISH (cont.) POSITION TEAM MCC F1-MACRO F1-CONSPIRACY F1-CRITICAL 42 thetaylorswiftteam 0.7577 0.8755 0.8302 0.9208 43 locasporlnr 0.7575 0.8787 0.8399 0.9174 44 lnr-adri 0.7552 0.8759 0.8326 0.9192 45 TokoAI 0.7542 0.8767 0.8363 0.9172 46 ede 0.7539 0.8769 0.8384 0.9155 47 lnr-verdnav 0.7529 0.8746 0.8308 0.9185 48 lnr-dahe 0.7488 0.8736 0.8308 0.9163 49 epistemologos 0.7486 0.8742 0.8341 0.9143 50 lucia&ainhoa 0.7473 0.8733 0.8316 0.9150 51 pistacchio 0.7414 0.8678 0.8200 0.9155 52 lnr-BraulioPaula 0.7393 0.8658 0.8165 0.9152 53 Marc_Coral 0.7392 0.8663 0.8176 0.9150 54 Ramon&Cajal 0.7284 0.8633 0.8169 0.9096 55 lnr-lladrogal 0.7253 0.8603 0.8106 0.9100 56 lnr-fanny-nuria 0.7253 0.8594 0.8082 0.9106 57 MarcosJavi 0.7190 0.8583 0.8097 0.9069 58 lnr-cla 0.7168 0.8573 0.8085 0.9061 59 lnr-jacobantonio 0.7168 0.8573 0.8085 0.9061 60 MUCS [51] 0.7162 0.8538 0.7994 0.9082 61 lnr-aina-julia 0.7157 0.8574 0.8102 0.9046 62 LaDolceVita 0.7072 0.8519 0.8000 0.9037 63 alopfer 0.7056 0.8518 0.8012 0.9023 64 lnr-luqrud 0.7056 0.8518 0.8012 0.9023 65 LNR-JoanPau 0.7051 0.8426 0.7793 0.9058 66 lnr-carla 0.7000 0.8476 0.7932 0.9020 67 lnr-Inetum 0.6981 0.8328 0.7617 0.9039 68 lnr-antonio 0.6852 0.8300 0.7598 0.9002 69 LluisJorge 0.6784 0.8382 0.7830 0.8934 70 anselmo-team 0.6725 0.8341 0.7752 0.8930 71 lnr-pavid 0.5959 0.7974 0.7297 0.8651 72 LNRMADME 0.5469 0.7717 0.6914 0.8521 73 lnr-mariagb_elenaog 0.5069 0.7250 0.5966 0.8534 74 LNR_08 0.4429 0.6834 0.5276 0.8391 75 Kaprov [57] 0.3700 0.6240 0.4224 0.8255 76 lnr_cebusqui 0.0482 0.4760 0.1847 0.7674 
77 jtommor 0.0403 0.5167 0.3312 0.7023 78 eledu -0.4598 0.2350 0.2740 0.1960 79 david-canet -0.6310 0.1632 0.1883 0.1381 80 lnr-guilty -0.6595 0.1433 0.2247 0.0619 81 lnrANRI -0.7551 0.1072 0.1474 0.0670 82 ROCurve -0.8009 0.0884 0.1112 0.0656 Table 6 Results and rankings of the teams participating on Task 1 – binary classification of text as either conspiracy or critical, for English texts. Performance metrics are: Matthews correlation coefficient, macro-averaged F1, and per-class binary F1’s. TASK 1 - SPANISH POSITION TEAM MCC F1-MACRO F1-CONSPIRACY F1-CRITICAL 1 SINAI [41] 0.7429 0.8705 0.8319 0.9091 2 auxR 0.7205 0.8572 0.8112 0.9032 3 RD-IA-FUN 0.7028 0.8497 0.8035 0.8960 4 Elias&Sergio 0.6971 0.8485 0.8087 0.8884 5 AI_Fusion 0.6872 0.8419 0.7931 0.8908 6 zhengqiaozeng [52] 0.6871 0.8417 0.7925 0.8909 7 virmel 0.6854 0.8426 0.8022 0.8831 8 trustno1 0.6848 0.8400 0.7895 0.8906 9 Zleon [48] 0.6826 0.8410 0.7955 0.8865 10 ojo-bes 0.6817 0.8395 0.8026 0.8764 11 tulbure [54] 0.6722 0.8293 0.7699 0.8887 12 sail [50] 0.6719 0.8299 0.7713 0.8884 13 nlpln [55] 0.6681 0.8339 0.7872 0.8806 baseline-BETO 0.6681 0.8339 0.7872 0.8806 14 pistacchio 0.6678 0.8327 0.7822 0.8833 15 rfenthusiasts 0.6656 0.8255 0.7643 0.8868 16 XplaiNLP [49] 0.6622 0.8274 0.7708 0.8840 17 yeste 0.6609 0.8291 0.7770 0.8812 18 oppositional_opposition 0.6601 0.8274 0.7724 0.8825 19 epistemologos 0.6562 0.8264 0.7728 0.8801 20 miqarn 0.6562 0.8264 0.7728 0.8801 21 theateam 0.6557 0.8252 0.7695 0.8810 22 ezio [44] 0.6535 0.8242 0.7683 0.8801 23 lucia&ainhoa 0.6524 0.8260 0.7765 0.8754 24 TargaMarhuenda 0.6516 0.8240 0.7692 0.8787 25 TokoAI 0.6516 0.8240 0.7692 0.8787 26 paranoia-pulverizers 0.6494 0.8246 0.7762 0.8730 27 NACKO 0.6467 0.8232 0.7739 0.8726 28 ALC-UPV-JD-2 0.6467 0.8227 0.7705 0.8748 29 DSVS [46] 0.6462 0.8231 0.7753 0.8709 30 RD-IA-FUN 0.6445 0.8160 0.7523 0.8796 31 locasporlnr 0.6437 0.8216 0.7709 0.8723 32 DiTana 0.6377 0.8187 0.7677 0.8696 33 lnr-BraulioPaula 0.6358 0.8173 0.7731 
0.8615 34 Dap_upv 0.6306 0.8115 0.7493 0.8737 35 TheGymNerds 0.6306 0.8106 0.7470 0.8743 36 MUCS [51] 0.6293 0.8060 0.7363 0.8756 37 LasGarcias 0.6247 0.8122 0.7594 0.8649 38 lnr-dahe 0.6196 0.8066 0.7437 0.8694 39 lnr-adri 0.6194 0.8060 0.7422 0.8698 40 hinlole [53] 0.6192 0.8048 0.7391 0.8706 41 RalloRico 0.6105 0.8018 0.7370 0.8666 42 lnr-aina-julia 0.6103 0.7978 0.7264 0.8692 43 lnr-verdnav 0.6101 0.7991 0.7298 0.8684 44 thetaylorswiftteam 0.6066 0.8025 0.7436 0.8613 45 lnr-alhu 0.6024 0.7991 0.7358 0.8624 46 lnr-luqrud 0.6010 0.7945 0.7237 0.8654 47 lnr-lladrogal 0.5967 0.7942 0.7256 0.8627 48 ede 0.5965 0.7967 0.7341 0.8593 49 Fred&Ned 0.5931 0.7940 0.7283 0.8597 50 LaDolceVita 0.5921 0.7818 0.6981 0.8656 51 LNR-JoanPau 0.5920 0.7916 0.7218 0.8614 Table 7 Results and rankings of the teams participating on Task 1 – binary classification of text as either conspiracy or critical, for Spanish texts. Performance metrics are: Matthews correlation coefficient, macro-averaged F1, and per-class binary F1’s. TASK 1 - SPANISH (cont.) 
POSITION TEAM MCC F1-MACRO F1-CONSPIRACY F1-CRITICAL 52 anselmo-team 0.5899 0.7860 0.7085 0.8634 53 Ramon&Cajal 0.5858 0.7916 0.7281 0.8552 54 lnr-fanny-nuria 0.5813 0.7874 0.7181 0.8567 55 lnr-antonio 0.5736 0.7816 0.7071 0.8561 56 LluisJorge 0.5690 0.7750 0.6929 0.8571 57 lnr-cla 0.5651 0.7788 0.7055 0.8520 58 lnr-jacobantonio 0.5651 0.7788 0.7055 0.8520 59 lnr-pavid 0.5569 0.7771 0.7089 0.8453 60 alopfer 0.5520 0.7727 0.6984 0.8470 61 LNRMADME 0.5490 0.7704 0.6937 0.8471 62 lnr-carla 0.5484 0.7686 0.6890 0.8482 63 LorenaEloy 0.5433 0.7621 0.6751 0.8492 64 CHEEXIST 0.5379 0.5995 0.5621 0.5456 65 lnr-guilty 0.5273 0.7620 0.6880 0.8360 66 eledu 0.5057 0.7263 0.6098 0.8429 67 lnr-mariagb_elenaog 0.4966 0.7325 0.6271 0.8379 68 dannuchihaxxx [59] 0.4727 0.7310 0.6382 0.8238 69 lnr-detectives 0.4029 0.6734 0.6509 0.6960 70 LNR_08 0.0608 0.4771 0.2000 0.7542 71 jtommor 0.0105 0.5051 0.3813 0.6288 72 lnr-Inetum 0.0000 0.3880 0.0000 0.7760 73 Marc_Coral 0.0000 0.2679 0.5359 0.0000 74 MarcosJavi -0.0389 0.3887 0.0054 0.7720 75 lnr_cebusqui -0.4112 0.2481 0.3466 0.1496 76 david-canet -0.5058 0.2114 0.3029 0.1199 77 lnrANRI -0.6146 0.1766 0.1939 0.1593 78 ROCurve -0.6457 0.1628 0.1770 0.1485 Table 8 Results and rankings of the teams participating on Task 1 – binary classification of text as either conspiracy or critical, for Spanish texts. Performance metrics are: Matthews correlation coefficient, macro-averaged F1, and per-class binary F1’s. 
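For reference, the Task 1 ranking metrics (MCC, per-class binary F1, macro-averaged F1) can all be derived from the binary confusion matrix. The sketch below is a minimal, self-contained illustration; the labels and predictions shown are invented for the example and are not taken from the task data.

```python
# Sketch of the Task 1 ranking metrics, computed from a binary confusion matrix.
import math

def confusion(y_true, y_pred, positive):
    """Count TP/FP/FN/TN with `positive` as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def f1(tp, fp, fn):
    """Binary F1 = 2*TP / (2*TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient: the primary ranking metric."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Illustrative labels only (not task data).
y_true = ["conspiracy", "critical", "critical", "conspiracy", "critical"]
y_pred = ["conspiracy", "critical", "conspiracy", "conspiracy", "critical"]

tp, fp, fn, tn = confusion(y_true, y_pred, "conspiracy")
f1_conspiracy = f1(tp, fp, fn)        # F1-CONSPIRACY column
f1_critical = f1(tn, fn, fp)          # F1-CRITICAL: classes swap roles
f1_macro = (f1_conspiracy + f1_critical) / 2
score = mcc(tp, fp, fn, tn)           # MCC column used for ranking
```

Unlike accuracy or macro-F1, MCC stays near zero for trivial majority-class predictors, which is why it serves as the primary ranking criterion on a class-imbalanced dataset.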
TASK 2 - ENGLISH
POSITION  TEAM  span-F1  span-P  span-R  micro-span-F1
1  tulbure [54]  0.6279  0.5859  0.6790  0.6120
2  Zleon [48]  0.6089  0.5537  0.6881  0.5856
3  hinlole [53]  0.5886  0.5243  0.6834  0.5571
4  oppositional_opposition  0.5866  0.5347  0.6586  0.5344
5  AI_Fusion  0.5805  0.5585  0.6082  0.5437
6  virmel  0.5742  0.5235  0.6477  0.5540
7  miqarn  0.5739  0.5184  0.6462  0.5325
8  TargaMarhuenda  0.5701  0.5161  0.6477  0.5437
9  ezio [44]  0.5694  0.5229  0.6340  0.5389
10  zhengqiaozeng [52]  0.5666  0.5122  0.6485  0.5421
11  Elias&Sergio  0.5627  0.5149  0.6364  0.5248
12  DSVS [46]  0.5598  0.5332  0.6012  0.5287
13  CHEEXIST  0.5524  0.4767  0.6845  0.5299
14  rfenthusiasts  0.5479  0.5381  0.5666  0.5408
15  ALC-UPV-JD-2  0.5377  0.4643  0.6562  0.4956
baseline-BERT  0.5323  0.4684  0.6334  0.4998
16  Dap_upv  0.5272  0.4617  0.6297  0.4973
17  aish_team [58]  0.5213  0.4181  0.7456  0.2571
18  SINAI [41]  0.4582  0.5553  0.4279  0.4571
19  Trainers  0.3382  0.5124  0.2609  0.2858
20  nlpln [55]  0.3339  0.5286  0.3303  0.2710
21  ROCurve  0.2996  0.3154  0.3031  0.3425
22  TokoAI  0.2760  0.1870  0.6119  0.2677
23  DiTana  0.2756  0.5259  0.1947  0.2599
24  TheGymNerds  0.2070  0.2076  0.2127  0.2329
25  epistemologos  0.1709  0.1286  0.3244  0.1201
26  theateam  0.1503  0.1401  0.1652  0.0387
27  LaDolceVita  0.0726  0.2040  0.0453  0.0630
28  kaprov [57]  0.0150  0.0261  0.0165  0.0600
Table 9
Results and rankings of the teams participating in Task 2 – token classification of span-level narrative elements, for English texts. Performance metrics are: span-F1 (macro-averaged over span labels), span-precision, span-recall, and micro-averaged span-F1 [29].
TASK 2 - SPANISH
POSITION  TEAM  span-F1  span-P  span-R  micro-span-F1
1  tulbure [54]  0.6129  0.6159  0.6129  0.6108
2  Zleon [48]  0.5875  0.5439  0.6474  0.5939
3  AI_Fusion  0.5777  0.5437  0.6189  0.5843
4  CHEEXIST  0.5621  0.5379  0.5995  0.5456
5  virmel  0.5616  0.4963  0.6584  0.5620
6  miqarn  0.5603  0.5117  0.6273  0.5618
7  DSVS [46]  0.5529  0.5384  0.5785  0.5323
8  TargaMarhuenda  0.5364  0.5128  0.5710  0.5385
9  Elias&Sergio  0.5151  0.4864  0.5533  0.5231
10  hinlole [53]  0.4994  0.4530  0.5740  0.4890
baseline-BETO  0.4934  0.4533  0.5621  0.4952
11  Dap_upv  0.4914  0.4555  0.5474  0.4917
12  zhengqiaozeng [52]  0.4903  0.4507  0.5494  0.4874
13  ALC-UPV-JD-2  0.4885  0.4509  0.5458  0.4683
14  ezio [44]  0.4869  0.4623  0.5229  0.4947
15  nlpln [55]  0.4672  0.5174  0.4426  0.2961
16  rfenthusiasts  0.4666  0.5104  0.4341  0.4697
17  SIANI  0.4151  0.4630  0.4054  0.4781
18  TheGymNerds  0.3984  0.3621  0.4483  0.5024
19  DiTana  0.3004  0.4490  0.2362  0.3117
20  ROCurve  0.2649  0.2706  0.2627  0.3562
21  TokoAI  0.1878  0.1189  0.5659  0.1739
22  epistemologos  0.1657  0.1906  0.1864  0.1534
23  LaDolceVita  0.1056  0.1158  0.0975  0.1321
24  theateam  0.0994  0.1051  0.0962  0.0358
25  oppositional_opposition  0.0037  0.0349  0.0022  0.0014
Table 10
Results and rankings of the teams participating in Task 2 – token classification of span-level narrative elements, for Spanish texts. Performance metrics are: span-F1 (macro-averaged over span labels), span-precision, span-recall, and micro-averaged span-F1 [29].
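The Task 2 span metrics compare predicted and gold text spans of the same narrative-element category and macro-average the per-category F1 scores. The sketch below illustrates one simplified variant based on character overlap; the official evaluation [29] differs in its exact matching scheme, and the category names used here (AGENT, EFFECT) are placeholders, not necessarily the task's label set.

```python
# Simplified character-overlap span-F1 sketch. Spans are (label, start, end) tuples
# with a half-open [start, end) character range. Illustrative only: the official
# Task 2 metric [29] uses a different span-matching scheme.

def span_chars(spans, label):
    """Set of character offsets covered by spans of the given label."""
    chars = set()
    for lab, start, end in spans:
        if lab == label:
            chars.update(range(start, end))
    return chars

def span_f1(gold, pred, labels):
    """Per-label character-overlap F1, macro-averaged over span labels."""
    scores = []
    for label in labels:
        g, p = span_chars(gold, label), span_chars(pred, label)
        overlap = len(g & p)
        precision = overlap / len(p) if p else 0.0
        recall = overlap / len(g) if g else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f)
    return sum(scores) / len(scores)

# Hypothetical labels and spans for illustration.
gold = [("AGENT", 0, 10), ("EFFECT", 20, 30)]
pred = [("AGENT", 0, 5), ("EFFECT", 20, 30)]
macro_span_f1 = span_f1(gold, pred, ["AGENT", "EFFECT"])
```

Overlap-based scoring is what lets partially correct spans earn credit, which is visible in the tables where systems with modest span-precision can still reach competitive span-F1 through high span-recall.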