=Paper=
{{Paper
|id=Vol-3740/paper-284
|storemode=property
|title=SINAI at PAN 2024 Oppositional Thinking Analysis: Exploring the Fine-Tuning Performance
of Large Language Models
|pdfUrl=https://ceur-ws.org/Vol-3740/paper-284.pdf
|volume=Vol-3740
|authors=María Estrella Vallecillo-Rodríguez,María Teresa Martín-Valdivia,Arturo Montejo-Ráez
|dblpUrl=https://dblp.org/rec/conf/clef/RodriguezMM24
}}
==SINAI at PAN 2024 Oppositional Thinking Analysis: Exploring the Fine-Tuning Performance
of Large Language Models==
SINAI at PAN 2024 Oppositional Thinking Analysis:
Exploring the Fine-Tuning Performance of Large Language
Models
Notebook for PAN at CLEF 2024
María Estrella Vallecillo-Rodríguez1 , María Teresa Martín-Valdivia1 and
Arturo Montejo-Ráez1
1
Computer Science Department, SINAI, CEATIC, Universidad de Jaén, 23071, Spain
Abstract
This article describes the participation of the SINAI research group in the shared task “Oppositional Thinking
Analysis: Conspiracy theories vs critical thinking narratives” in CLEF 2024. This task is composed of 2 subtasks
subtask 1 which consists of a binary classification between critical and conspiracy texts and subtask 2 which
consists of a token-level classification of the element of the oppositional narrative. The proposed system for both
subtasks consists of the use of LLMs (LLaMA3 or GPT-3.5) where we apply an instruction tuned for the specific
subtask. We think that these types of models have more knowledge and can reason to distinguish each type of
text or elements of the texts and the instruction tuned will potentiate this, helping the models to distinguish
between the classes. In the final leaderboard, our proposal obtained 3rd and 1st place for task 1 in English and
Spanish respectively. In subtask 2 our systems reached the 18th position for English and 17th for Spanish.
Keywords
Large Language Models, QLoRA, Zero-Shot Learning, Oppositional Thinking Analysis,
1. Introduction
Nowadays, social networks are the most widely used means of communication by people. In them,
users share various aspects of their lives, express their opinions, ideas, and even share current news.
The problem is that not all the news published on social networks are true and users sometimes just by
reading them are already spreading them on the network, without stopping to check the information.
One type of message that is harmful to the social networking community is conspiracy theories, defined
by the European Union [1] as: "The belief that certain events or situations are secretly manipulated
behind the scenes by powerful forces with negative intentions". These theories are harmful because
they can generate serious consequences in society, such as spreading distrust in public institutions or
scientific information, feeding discrimination, and justifying hate crimes, among other consequences.
However, there are other types of texts that can be found in social networks known as critical thinking
narratives. In them, users express their opinions, sometimes argued and sometimes based on events
that have happened to them or to acquaintances. A more concrete definition of what critical thinking
narratives are is provided by the Oxford dictionary [2] “the process of analyzing information in order
to make a logical decision about the extent to which you believe something to be true or false”. If this
critical thinking issues a judgment that opposes the main idea, we will be talking about an oppositional
critical thinking narrative.
These two narratives explained above are challenging to distinguish, especially for language models
that analyze social network content. Therefore, the organizers of the shared task “Oppositional Thinking
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
*
Corresponding author.
†
These authors contributed equally.
$ mevallec@ujaen.es (M. E. Vallecillo-Rodríguez); maite@ujaen.es (M. T. Martín-Valdivia); amontejo@ujaen.es
(A. Montejo-Ráez)
0000-0001-7140-6268 (M. E. Vallecillo-Rodríguez); 0000-0002-2874-0401 (M. T. Martín-Valdivia); 0000-0002-8643-2714
(A. Montejo-Ráez)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR
ceur-ws.org
Workshop ISSN 1613-0073
Proceedings
Analysis” in the PAN Lab [3] of CLEF 2024 propose 2 subtasks. The first subtask is to distinguish
the conspiratorial narrative from other oppositional narratives that do not express a conspiratorial
mindset (i.e., critical thinking). This is a binary classification task between two classes (CRITICAL or
CONSPIRATIVE). The second subtask is to identify the key elements of a narrative that fuels intergroup
conflict in oppositional thinking in online messages. This task is a token-level classification task in
which models have to recognize text spaces corresponding to such key elements of oppositionalist
narratives (AGENT, FACILITATOR, VICTIM, CAMPAIGNER, TARGET, NEGATIVE_EFFECT). These
two subtasks are proposed to be applied in English and Spanish messages. The data of the task is
extracted from the Telegram social network and is related to the COVID-19 pandemic.
Our proposal consists of the use of generative LLMs such as GPT-3.5 and LLaMA3-8B-instruct that
are trained with instructions to detect conspiratorial and critical texts as well as the different elements
of these narratives. To train the models we want to apply QLoRA, which is a method to train LLMs
efficiently, and the OpenAI API. We think that this adaptation to the task will be crucial to help the
models learn the differences between the different classes. With this proposal, we intend to study how
the training of the models affects the achievement of the objectives of the proposed subtasks, and what
differences exist between the size of the models, and their performance. There are recent studies that
focus on the study of conspiracy theories in social networks, such as [4] in which the authors create a
dataset that includes accounts dedicated to conspiracy theories and a control group of randomly selected
users. They then perform a comparative analysis of the topics covered, profile characteristics, and
behaviors. Using machine learning algorithms and features from bot, troll, and linguistic literature, they
successfully classified conspiracy theory users with high accuracy. In contrast, other studies attempt to
use and analyze the performance of generative LLMs to detect conspiratorial texts. Diab et al. [5] tries
to address the detection of conspiracy theories by training a BERT model and then compare with the
performance of the GPT model without applying any training to it. Their study finds that GPT fails to
apply logical reasoning. However, other studies such as [6] which focuses on detecting conspiratorial
Telegram messages in German, show a comparison between applying supervised tuning approaches
(BERT models) and instruction-based approaches (LLaMA2, GPT-3.5 and GPT-4), which require little or
no additional training data. Their work shows that both approaches can be used effectively, highlighting
that among the highest results is GPT-4 with Zero-Shot Learning (ZSL) instruction and including
a definition of what a conspiracy theory is. Peskine et al. [7] attempt to generate definitions from
examples and use them for zero classification of fine-grained multi-label conspiracy theory. They show
that improving class label definitions has a direct consequence on subsequent classification results. This
makes us think that it is very important to refine the instruction we give to the model. Some studies
analyze how well instruction-based models perform if they adjust a task. An example of such studies is
[8] in which they use a LLaMA model containing emotional information and apply training based on
different instructions (emotion recognition, sentiment, and conspiracy theories). Their results show
that this model largely outperforms several open-source domain-general LLMs.
The remainder of the paper is organized as follows: Section 2 presents an overview of important
details about the proposed system for the shared task. The used data and the methodology followed to
achieve the goal of the task are described in Section 3. In Section 4 we show the results obtained in
our experiments during the development phase and the evaluation phase. Finally, we conclude with a
discussion in Section 5.
2. System overview
The developed system to achieve Oppositional Thinking Analysis shared task [9] at CLEF 2024 is
described in this section.
To achieve both subtasks we want to study how LLMs such as LLaMA3 or GPT-3.5 which are
generative models can be adapted to a classification task as proposed in this shared task. In addition
we want to study whether the differences between the size of the models influenced the classification
of each text. For this reason, we plan to apply an instruction-based training of the models. The first
step of this method is to create a good instruction or prompt in which the models show good results in
pre-training tests. To do this, we provide different examples to the models and ask the selected models
what are the differences between critical and conspiratorial texts for subtask 1 and a definition of each
element of oppositional narratives for subtask 2. We will feed the prompt with the information these
models give us in their response, as we believe this information will help the model to detect each type
of text or element. The used prompt are presented in Appendix A. To train the GPT-3.5 model, we use
the OpenAI API with 1 epoch and to train LLaMA we use a method called QLoRA [10]. This approach
facilitated a faster and more affordable process as it significantly reduced the hardware requirements.
The model was loaded in 4 bits with the quantization data type NF4. As computational data type bf16
was used. Finally, LoRA update matrices were applied to the linear layers of the model. The LoRA rank
was set to 16, the scaling factor (LoRA alpha) to 64 and the dropout to 0.05. We used a learning rate of
2e-4 and 1 example for the batch size, 10 epochs with an early stop of 3 epochs.
Furthermore, as we can see in Section 3.1 the dataset for subtask 2 is unbalanced, especially in the
Spanish dataset where the class ‘OBJECTIVE’ appears fewer times. Since we do not have enough
instances for the model to learn and considering this class inserts noise in the training of the models,
we exclude this class during the training.
3. Experimental setup
3.1. Data
The dataset of this shared task [11] is composed of 10,000 messages of Telegram written in English
and Spanish. These messages are related to the COVID-19 pandemic and labelled according to the
annotation scheme for Task 1 and Task 2. The labels of Subtask 1 are CONSPIRACY and CRITICAL and
the labels that we can associate to the span texts of the Subtask 2 are AGENT, FACILITATOR, VICTIM,
CAMPAIGNER, OBJECTIVE, NEGATIVE_EFFECT. The dataset is divided into two splits, the first to
train the developed systems and the second to test these systems.
The distribution of the train split of the dataset for Subtask 1: Distinguishing between critical and
conspiracy texts, and Subtask 2: Detecting elements of the oppositional narratives are presented in
Figures 1 and 2, respectively. We can observe that, for Subtask 1, the majority class is ‘CRITICAL’
although the dataset is not very unbalanced. Since Subtask 2 is a token-level classification, each text
can have multiple labels and the same label can be repeated for each text instance. So, for Subtask 2, we
can observe two figures, the first one (Subfigure 2a) represents the number of the texts in the dataset
where the labels are. The label ‘X’ represents the texts where no label appears for the task. The second
figure (Subfigure 2b) represents the number of times each label appears in the dataset. In each figure,
we can see unbalanced data, for example, the minority class is ‘OBJECTIVE’ and for Spanish, this label
appears in only 338 texts and fewer occurrences only 493 in front of 898 times of the same class appear
in the English dataset with 1602 occurrences.
Figure 1: Distribution of the different classes presented in the Oppositional Thinking Analysis dataset for
Subtask 1 (Distinguishing between critical and conspiracy texts).
(a) Number of instances of each class. (b) Total of ocurrences of each class.
Figure 2: Distribution and number of occurrences of the different classes presented in the Oppositional Thinking
Analysis dataset for Subtask 2 (Detecting elements of the oppositional narratives).
To carry out the experiments proposed in Section 3. We will divide the training set provided by the
task organizers into three subsets. One to train the models, another to perform validation of our system
during training, and finally a test one to evaluate how well our systems perform and select the best
experiments to submit the final results to the task. The partitioning performed was done in a stratified
way to maintain the same percentage of labels in the partitions created. The number of text instances
of each class in the dataset for each subset can be seen in Table 1.
Table 1
Distribution of the number of instances of each class in the splits created to run our experiments.
Spanish English
Subtask Label
Train Validation Test Train Validation Test
CRITICAL 2284 127 127 2359 131 131
Subtask 1 CONSPIRACY 1316 73 73 1241 69 69
Total 3600 200 200 3600 200 200
AGENT 919 51 51 1843 102 103
FACILITATOR 953 53 53 954 53 53
VICTIM 1804 100 100 1536 86 85
Subtask 2 CAMPAIGNER 1293 72 72 1975 110 109
OBJETIVE 304 17 17 808 45 45
NEGATIVE_EFFECT 1747 97 97 1738 96 97
Total 3607 194 198 3594 201 205
3.2. Experiments and Selected Models
To achieve the goal of the Oppositional Thinking Analysis shared task, we selected the following models:
LLaMA3-8B-instruct [12], and GPT-3.5[13]. With these models, we want to study how the training of
the models affects the achievement of the objectives of the proposed subtasks, and what differences
exist between the size of the models, and their performance. Moreover, we propose two experiments for
each task. Each experiment has a different configuration. The proposed experiments are the following:
• Baseline. This experiment employs the use of the model with a prompt strategy based on ZSL,
providing reasoning for its responses. Our goal is to establish a reference experiment to evaluate
the effectiveness of the proposed systems. In this case, we selected the GPT-3.5 model because we
consider that a model with more parameters has more knowledge of the task and will be able to
distinguish between the different classes of both subtasks without previous knowledge of them.
• Fine-tuning. This experiment applies techniques for efficient instruction learning of LLMs. To
apply this experiment we are going to select the LLaMA3-8B-instruct model which is an open
model and we have full control of its parameters and the GPT-3.5 model which belongs to a
company and its use is not free. It also has more restricted parameters. To train LLaMA we will use
a technique for efficient learning of LLMs called QLoRA (Quantified Low-Rank Adaptation) [10].
This method accelerates the training process and makes it more accessible. QLoRA enables us to
train models with a large number of hyperparameters using minimal hardware resources. This is
achieved by not requiring the training of all model parameters and through the quantization of
the numbers used during the training process. In this experiment, we load the selected model
with 4 bits with the quantization data type NF4. The computational data type bf16 will be used.
The LoRA update matrices were applied to the linear layers of the model. The LoRA rank is to
be set to 16, the scale factor (LoRA alpha) to 64, and the dropout to 0.05. In addition, we used a
learning rate of 2e-4, 1 example for the batch size, and 10 epochs with an early stop of 3 epochs.
In the case of the GPT-3.5 model, we will use the openAI API and train the model with 1 epoch.
For subtask 2, as seen in Section 3.1, we have unbalanced data, so we will propose two variants of
this experiment:
– FT_all: Using all the labels in the dataset.
– FT_withoutObjective: Excluding the minority class. This class is ‘OBJECTIVE’.
Because the use of GPT-3.5 is not free, to train the GPT-3.5 model for subtask 2 we only apply the
fine tuning of the best variant of LLaMA training for subtask 2. As we can see in Section 4.1 the
best variants for each language are to use all labels for English and to exclude the OBJECTIVE
class for Spanish.
4. Results
In this section, we present the results obtained by the system developed as part of our participation in
the “Oppositional thinking analysis” task. To evaluate our systems, we use the official metrics given by
the organizers. Specifically, the MCC metric [14] that is a single-value classification metric which helps
to summarize the confusion matrix or an error matrix. The MCC ranges between -1 and +1. A coefficient
of +1 represents perfect prediction, 0 represents average random prediction and -1 represents inverse
prediction. Moreover, for this task, the macro F1 score (harmonic mean of precision and recall for a
more balanced summarization of model performance) and the specific macro F1 score for each class are
provided. For subtask 2, the span F1 score [15] is used as the official metric. This metric calculates F1
measures per each class of the dataset and for each span identified. In addition, the organizers provide
span recall and precision and the micro span F1 score. The experiments are conducted in two phases,
the development phase, where we select the best models, and the evaluation phase where we evaluate
the selected models and choose the best model to appear in the leaderboard of the evaluation campaign.
4.1. Development Phase
In order to select the best model for each subtask we trained the models selected in Section 3.2 with a
subset of the train split provided by the organizers and evaluated them with other subsets of the train
split. The results obtained in the development phase are shown in Tables 2, and 3.
Table 2 shows the results obtained in the experiments proposed above for subtask 1. As can be seen,
the fine-tuning of LLaMA-8B-instruct shows promising results in all the metrics evaluated and obtains
the best result when applied to English data. Its performance in Spanish is good, although it does not
outperform the fine-tuning of GPT-3.5 model. This may be because LLaMA-8B-instruct does not have
as extensive knowledge of Spanish as GPT-3.5, having been trained with more English data. In addition,
the GPT-3.5 model shows greater consistency across languages by obtaining very similar results in
both languages. If we look at the performance of the GPT-3.5 model with the ZSL experiment we see a
big difference with the fitted models, as it does not achieve such promising results. This highlights the
importance of adjusting the models to improve their performance in the proposed task.
The results of subtask 2 are shown in Table 3. For this subtask, we can see how LLaMA3-8B-instruct
shows better results for English when trained with all classes compared to when trained without taking
the minority class into account. However, for Spanish, the results of training with all classes show
Table 2
Results of the different experiments for Subtask 1 (distinguishing between critical and conspiracy texts) on test
split of the train set of Oppositional Thinking Analysis. The selected model for the evaluation phase is shown in
bold.
Model Experiment Lang. MCC F1-macro F1-conspiracy F1-critical
EN 0.7874 0.8930 0.8571 0.9288
LLaMA3-8B-instruct FT_all
ES 0.6413 0.8204 0.7692 0.8716
EN 0.7345 0.8672 0.8261 0.9084
FT_all
ES 0.7156 0.8552 0.8088 0.9015
GPT-3.5
EN 0.4322 0.7014 0.6506 0.7521
ZSL
ES 0.4301 0.7143 0.6286 0.8000
lower performance than expected, being even below the ZSL strategy where the GPT-3.5 model has
not been fitted to the task. Probably because having a very underrepresented class with few examples
inserts noise during the model training process. For that reason, if we remove the minority class
(OBJECTIVE) from Spanish we get a result more similar to what we would expect. As in the previous
task, LLaMA3-8B-intruct performs better in English and GPT fits better in Spanish. If we look at the
ZSL experiment we can see that this strategy is not effective for the task compared to model fitting,
demonstrating the need to train the models to understand the differences between the different classes
we have and where in the text they may appear.
Model Experiment Lang. span-P span-R span-F1 micro-span-F1
EN 0.4810 0.4481 0.4461 0.4977
FT_withoutObjective
ES 0.4801 0.4068 0.4155 0.4862
LLaMA3-8B-instruct
EN 0.6013 0.4802 0.5140 0.5136
FT_all
ES 0.2500 0.0024 0.0048 0.0050
FT_all EN 0.5225 0.4059 0.4532 0.4801
FT_withoutObjective ES 0.4806 0.3907 0.4282 0.5349
GPT-3.5
EN 0.4246 0.1007 0.1563 0.1493
ZSL
ES 0.3740 0.7228 0.1147 0.1363
Table 3
Results of the different experiments for Subtask 2 (detecting elements of the oppositional narratives) on test split
of the train set of Oppositional Thinking Analysis. The selected model for the evaluation phase is shown in bold.
4.2. Evaluation Phase
In the evaluation phase, we use the trained models of the development phase and evaluate them on the
test set provided by the organizers. The systems submitted and their results for each run in subtasks 1
and 2 are presented in Tables 4, and 5 respectively.
Regarding subtask 1, because the LLaMA3-8B-instruct adjustment obtained the best results for English
and because it was free, we decided to send these results in the 2 runs. In Spanish, we decided to
send on one side the adjusted LLaMA3-8B-instruct model and on the other GPT-3.5. As can be seen
in Table 4 we can see how the performance of the adjusted GPT-3.5 model outperforms the adjusted
LLaMA3-8B-instruct model. This is not surprising since the same thing happened in the development
phase and may be due to the fact that the prior knowledge that GPT has about Spanish is higher than
that of LLaMA3-8B-instruct.
Table 4
Results of the different proposed strategies for Subtask 1 (distinguishing between critical and conspiracy texts)
on Oppositional Thinking Analysis test set. The selected model for the leaderboard is shown in bold.
Run Model Experiment Lang. MCC F1-macro F1-conspiracy F1-critical
EN 0.8297 0.9149 0.8886 0.9412
Run 1 LLaMA3-8B-instruct FT_all
ES 0.6780 0.8363 0.7841 0.8886
LLaMA3-8B-instruct EN 0.8297 0.9149 0.8886 0.9412
Run 2 FT_all
GPT-3.5 ES 0.7429 0.8705 0.8319 0.9091
The results of the systems submitted for subtask 2 can be seen in Table 5. For each submission we were
allowed for this task we decided to submit an adjusted model with all classes for English and removing
the minority class for Spanish. Since the differences between GPT-3.5 Spanish and LLaMA3-8B-instruct
are minimal, we decided not to make combinations between these models for Spanish and to send the
predictions made by each model for Spanish and English. As can be seen in this table, the model that
best fits the task is LLaMA3-8B-instruct, probably because it has been trained with more epochs than
GPT-3.5 and the task is somewhat more complex than the first one, since we have to choose between 6
classes and the parts in which it appears.
Table 5
Results of the different proposed strategies for Subtask 2 (detecting elements of the oppositional narratives) on
Oppositional Thinking Analysis 2024 test set. The selected model for the leaderboard is shown in bold.
Run Model Experiment Language span-P span-R span-F1 micro-span-F1
FT_all English 0.5342 0.4243 0.4723 0.4945
Run 1 GPT-3.5
FT_withoutObjective Spanish 0.4487 0.3674 0.4024 0.5149
FT_all English 0.5553 0.4279 0.4582 0.4571
Run 2 LLaMA3-8B-instruct
FT_withoutObjective Spanish 0.4630 0.4054 0.4151 0.4781
Finally, we want to emphasize that the results obtained in both tasks by LLaMA3-8B-instruct are
striking due to the large difference between the number of parameters that LLaMA3-8B-instruct has
in comparison with GPT-3.5. This makes us think that as long as we have quality data and that they
are representative of the classes, it is not so important to select models that are very large, since by
training them a little more epochs we can obtain very similar and even better results to those with a
large number of parameters.
4.3. Error Analysis
For each subtask, we present an error analysis of the final selected models in the test split used during
our development phase.
For the first task, in Table 6 what we can see how difficult it is to recognize each of the labeled classes.
For example in the first text for Spanish (id 9256) we can see how the comment is a criticism of the
decision to change the brand at the time of putting the third dose of a vaccine, but also has part of
conspiracy to say that to kill all carry the same thing, so the model assign the CONSPIRACY class. In
the second example for Spanish (id 9076) we see a typical sentence of conspiracy theories ( “they try to
make us believe”), but the model is not able to detect it and thinks that it is more oriented to criticize
how the different COVID variants are created. On the other hand, if we look at the English texts, we can
see how just the conspiracy title of a thread of conversations where opinions are going to be exposed,
already helps the model to classify it as CONSPIRACY instead of CRITICAL (id 151). Moreover, in the
second English text (id 177), the purpose of the message is a conspiracy, but LLaMA3-8B-instruct model
labels it as critical, probably because it thinks that is spreading an opinion of something that has been
said in a podcast like AlexJonesShow.
On the other hand, in the texts related to Subtask 2, we find examples such as the ones shown in
Table 7. If we look at the Spanish example (id 4263) as we have removed the OBJECTIVE class from the
Spanish model, the model should not predict anything. However, it predicts various CAMPAIGNERS
that are not even entities that promote something in the conspiracy. This suggests that the model is
hesitant to recognize these types of entities. In the case of the English text, we can see how it is difficult
for the model to recognize the negative effects that do not carry negations or negative words such as
death. We can also see how it confuses the class CAMPAIGNER with FACILITATOR as in the case of
“the " " scientific clerisy " " ”.
Table 6
GPT-3.5 model for Spanish and LLaMA3-8B-instruct model for English error analysis for Subtask 1 (Distinguishing
between critical and conspiracy texts). Examples of predictions from the test split created from the train split of
Oppositional Thinking Analysis shared task dataset.
Model Lang. Id. Text Gold Label predicted Label
9256 AHORA TE DICEN QUE SI LA TERCERA DOSIS ES DE UNA MARCA DISTINTA A LAS PRIMERAS ... CRITICAL CONSPIRACY
GPT-3.5 ES
ENTONCES ES MÁS EFICAZ ( PARA MATAR QUERRÁN DECIR , TODAS LLEVAN LO MISMO ) https ://
www . infosalus . com / asistencia / noticia - administrar - tercera - dosis - vacuna - covid - 19 - compania -
diferente - dos - primeras - eficaz - 20220425145746 . html
(NOW THEY TELL YOU THAT IF THE THIRD DOSE IS OF A DIFFERENT BRAND THAN THE FIRST ... THEN
IT IS MORE EFFECTIVE (TO KILL, THEY MEAN, THEY ALL CARRY THE SAME STUFF). https :// www .
infosalus . com / asistencia / noticia - administrar - tercera - dosis - vacuna - covid - 19 - compania - diferente -
dos - primeras - eficaz - 20220425145746 . html)
9076 Son los vacunados los que generan las variantes y los que contagian a los no vacunados , y no al revés CONSPIRACY CRITICAL
como intentan hacernos creer
(It is the vaccinated who generate the variants and who infect the unvaccinated, and not the other way around
as they try to make us believe.)
LLaMA3 151 What Else Could They Have Lied to You About ? Tune into my conversation on Radical , with Maajid CRITICAL CONSPIRACY
EN
8B-instruct Nawaz ... –> drtesslawrie . substack . com / p / on - what - else - could - they - have - lied I ’m delighted to
share this wonderful conversation I had recently with Maajid Nawaz . Maajid is , amongst many things ,
a podcaster , an author with his own Substack here , and he was also a host at this year ’s Better Way
Conference . I really enjoyed speaking with him — he asks good questions — and we covered not just
health but also the nefarious aims of the World Economic Forum and Big Pharma , the need for us to take
control of our own health and also how to positively and practically prepare for challenging times ahead .
Watch it here , and I hope you enjoy it . Have a wonderful Sunday , Tess Follow Me : –> @ audreywest
177 # AlexJonesShow : It ’s Official ! mRNA Covid Vaccines Are Euthanizing Thousands of Old People CONSPIRACY CRITICAL
Worldwide ! - https :// ifw . io / hw8 Get Live Broadcast Alerts ! - Text : ’ SHOW ’ To : ( 833 ) 470 - 0222
$ 50 Off Alexapure Pro Water Filtration System : https :// www . infowarsstore . com / alexapure - pro -
water - filtration - system
Table 7
LLaMA3-8B-instruct model error analysis for Subtask 2 (Detecting elements of the oppositional narratives).
Examples of predictions from the test split created from the train split of Oppositional Thinking Analysis shared
task dataset.
Lang. Id. Text Gold Labels predicted Labels
ES 4263 ¿ LAS VACUNAS COVID INSTALARON « CARGAS ÚTILES » DE MARBURG {’text’: ’El uso de señales externas para con- {’text’: ’MARBURG’,’chars’: 52-59, ’category’:
QUE SERÁN LIBERADAS POR SEÑALES 5 G ? En la vacuna COVID - 19 se trolar implantes neurales usando nanotec- ’CAMPAIGNER’},
instaló nanotecnología que transportaba cargas útiles de virus quiméricos . No nología’, ’category’: ’OBJECTIVE’, ’chars’: {’text’: ’cargas útiles [...] quiméricos’,’chars’:
es ciencia ficción . El uso de señales externas para controlar implantes neurales 225-307, ’english_text’: ” } 165-198, ’category’: ’CAMPAIGNER’, ’en-
usando nanotecnología está bien descrito en patentes y literatura médica . https glish_text’: ’chimeric [...] payloads’,},
:// ejercitoremanente . com / 2022 / 04 / 26 / las - vacunas - covid - instalaron - {’text’: ’El uso de [...] literatura médica’,’chars’:
cargas - utiles - de - marburg - que - seran - liberadas - por - senales - 5 g / 225-358, ’category’: ’CAMPAIGNER’, ’en-
(DID THE COVID VACCINES INSTALL MARBURG “PAYLOADS” TO BE RE- glish_text’: ’The use [...] medical literature’},
LEASED BY 5 G SIGNALS ? In the COVID - 19 vaccine was installed nanotech- {’text’: ’ejercitoremanente’,’chars’: 371-388,
nology carrying chimeric virus payloads . It is not science fiction . The use of ’category’: ’CAMPAIGNER’},
external signals to control neural implants using nanotechnology is well described {’text’: ’cargas - utiles - de - marburg’,’chars’:
in patents and medical literature . https :// ejercitoremanente . com / 2022 / 04 / 26 451-481, ’category’: ’CAMPAIGNER’}
/ las - vacunas - covid - instalaron - cargas - utiles - de - marburg - que - seran -
liberadas - por - senales - 5 g / )
EN 11360 " Stanford professor who challenged lockdowns and ’ scientific clerisy ’ declares {’text’: "Stanford professor who [...] clerisy ’", {’text’: ’a medical [...] Stanford Univer-
academic freedom ’ dead ’ - FOX NEWS After his life became a " " living hell " ’category’: ’CAMPAIGNER’, ’chars’: 2-72}, sity’,’chars’: 264-306, ’category’: ’CAM-
" for challenging coronavirus lockdown orders and the " " scientific clerisy " " {’text’: "scientific clerisy ’", ’category’: ’FACIL- PAIGNER’},
during the pandemic , a medical professor at Stanford University claims that " ITATOR’, ’chars’: 52-72}, {’text’: ’academic [...] dead’,’chars’: 323-347,
" academic freedom is dead . " " SOURCE @ TheGreatResetTimes Follow us : {’text’: ’his life [...] " " living hell " "’, ’category’: ’category’: ’NEGATIVE_EFFECT’},
Telegram | Chat Group | Twitter " ’NEGATIVE_EFFECT’, ’chars’: 125-162}, {’text’: ’TheGreatResetTimes’,’chars’: 363-381,
{’text’: ’the " " scientific clerisy " "’, ’category’: ’category’: ’CAMPAIGNER’},
’FACILITATOR’, ’chars’: 211-241}, {’text’: ’a {’text’: ’the " " scientific [...] the pan-
medical [...] Stanford University’, ’category’: demic’,’chars’: 211-261, ’category’: ’CAM-
’CAMPAIGNER’, ’chars’: 264-306}, PAIGNER’}
{’text’: ’TheGreatResetTimes’, ’category’:
’CAMPAIGNER’, ’chars’: 363-381}
5. Conclusion
This paper presents the participation of SINAI research group in the Oppositional Thinking Analysis
shared task at CLEF 2024. In the two subtasks, we explore how different fine-tuned LLMs (GPT-3.5
and LLaMA3-8B-instruct) perform using previous knowledge. For the first subtask, we have seen that
GPT-3.5 model works better for Spanish than LLaMA3-8B-instruct model when fine-tuned to the task,
while LLaMA3-8B-instruct performs better for English. In the second subtask, we found that LLaMA3-
8B-instruct achieved better results than GPT-3.5 in both languages. We conclude that, in general,
fine-tuning LLMs is effective for conducting oppositional thinking analysis tasks, especially when
the number of classes is fewer. Furthermore, the good performance obtained by LLaMA3-8B-instruct
demonstrates that it is not always necessary to use larger models; rather, we need models trained with
quality data and given well-constructed input prompts so that they can effectively understand the task
at hand. As future work, we plan to further analyze the misclassification of each class and provide the
model with a complete definition to help in its reasoning. Additionally, since the detection of critical
thinking is subjective, we aim to study how the classification of models is affected by texts with lower
agreement among annotators and whether annotators’ sociodemographic characteristics influence
their reasoning. Finally, we want to investigate if the models are overfitted to the task data and if they
perform well with other datasets.
Acknowledgments
This work has been partially supported by Project CONSENSO (PID2021-122263OB-C21), Project
MODERATES (TED2021-130145B-I00), and Project SocialTox (PDC2022-133146-C21) funded by
MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR.
References
[1] European Comission, Identifying conspiracy theories, https://commission.
europa.eu/strategy-and-policy/coronavirus-response/fighting-disinformation/
identifying-conspiracy-theories_en, Publication date unknown. Accessed: 13/06/2024.
[2] Oxford Learner’s Dictionaries, Definition of critical thinking, https://www.
oxfordlearnersdictionaries.com/definition/english/critical-thinking?q=critical+thinking,
Publication date unknown. Accessed: 13/06/2024.
[3] A. A. Ayele, N. Babakov, J. Bevendorff, X. B. Casals, B. Chulvi, D. Dementieva, A. Elnagar, D. Freitag,
M. Fröbe, D. Korenčić, M. Mayerl, D. Moskovskiy, A. Mukherjee, A. Panchenko, M. Potthast,
F. Rangel, N. Rizwan, P. Rosso, F. Schneider, A. Smirnova, E. Stamatatos, B. Stein, M. Taulé,
D. Ustalov, X. Wang, M. Wiegmann, S. M. Yimam, E. Zangerle, Overview of PAN 2024: Multi-
Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis,
and Generative AI Authorship Verification - Condensed Lab Overview, in: Experimental IR
Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International
Conference of the CLEF Association CLEF-2024, Lecture Notes in Computer Science, Springer,
Berlin Heidelberg New York, 2024.
[4] M. Gambini, S. Tardelli, M. Tesconi, The anatomy of conspiracy theorists: Unveiling traits using a
comprehensive twitter dataset, Comput. Commun. 217 (2024) 25–40. URL: https://doi.org/10.1016/
j.comcom.2024.01.027. doi:10.1016/j.comcom.2024.01.027.
[5] A. Diab, R. Nefriana, Y.-R. Lin, Classifying conspiratorial narratives at scale: False alarms and
erroneous connections, Proceedings of the International AAAI Conference on Web and Social
Media 18 (2024) 340–353. URL: https://ojs.aaai.org/index.php/ICWSM/article/view/31318. doi:10.
1609/icwsm.v18i1.31318.
[6] M. Pustet, E. Steffen, H. Mihaljević, Detection of conspiracy theories beyond keyword bias in
german-language telegram using large language models, 2024. arXiv:2404.17985.
[7] Y. Peskine, D. Korenčić, I. Grubisic, P. Papotti, R. Troncy, P. Rosso, Definitions matter: Guiding GPT
for multi-label classification, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for
Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore,
2023, pp. 4054–4063. URL: https://aclanthology.org/2023.findings-emnlp.267. doi:10.18653/v1/
2023.findings-emnlp.267.
[8] Z. Liu, B. Liu, P. Thompson, K. Yang, S. Ananiadou, Conspemollm: Conspiracy theory detection
using an emotion-based large language model, 2024. arXiv:2403.06765.
[9] D. Korenčić, B. Chulvi, X. B. Casals, M. Taulé, P. Rosso, F. Rangel, Overview of the oppositional
thinking analysis pan task at clef 2024, in: G. Faggioli, N. Ferro, P. Galuvakova, A. G. S. de Herrera
(Eds.), Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, 2024.
[10] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, Qlora: Efficient finetuning of quantized
llms, 2023. arXiv:2305.14314.
[11] D. Korenčić, B. Chulvi, X. Bonet Casals, M. Taulé, P. Rosso, Pan24 oppositional thinking analysis,
2024. URL: https://doi.org/10.5281/zenodo.11199642. doi:10.5281/zenodo.11199642.
[12] AI@Meta, Llama 3 model card (2024). URL: https://github.com/meta-llama/llama3/blob/main/
MODEL_CARD.md.
[13] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan,
P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child,
A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray,
B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Lan-
guage models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan,
H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Asso-
ciates, Inc., 2020, pp. 1877–1901. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/
1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
[14] D. Chicco, N. Tötsch, G. Jurman, The matthews correlation coefficient (mcc) is more reliable
than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix
evaluation, BioData Mining 14 (2021). doi:10.1186/s13040-021-00244-z.
[15] G. Da San Martino, S. Yu, A. Barrón-Cedeño, R. Petrov, P. Nakov, Fine-grained analysis of
propaganda in news article, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019
Conference on Empirical Methods in Natural Language Processing and the 9th International Joint
Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational
Linguistics, Hong Kong, China, 2019, pp. 5636–5646. URL: https://aclanthology.org/D19-1565.
doi:10.18653/v1/D19-1565.
A. Used Prompt
The prompts used for our experiment with ZSL and for tuning the selected models are presented in
Table 8.
Subtask Prompt
Subtask 1 You are an expert in the classification of critical and conspiratorial texts. Your task is to identify these
CRITICAL and CONSPIRACY texts.
CRITICAL messages criticize decisions made by an individual, a group of people, or a committee of experts.
They may also expose personal concerns or opinions on an issue or decisions that have been made over
time and are contradictory. Moreover, they make a claim about the theme, without delving into complex
or implausible theories
CONSPIRACY messages, on the other hand, see decisions as the result of a malevolent conspiracy by
secret and influential groups. There are some differences between CRITICAL and CONSPIRACY messages:
1. Degree of Speculation: CRITICAL texts may contain unsubstantiated personal claims, but CONSPIRACY
texts often go further by proposing complex and implausible theories. These theories lack solid evidence
and are based on extreme speculation.
2. Level of Alarmism: CRITICAL texts may use alarming language. CONSPIRACY texts tend to be even
more sensationalist and apocalyptic. They often include claims of impending catastrophic events or the
existence of an ‘imminent danger’ that only the ‘awakened’ can see.
3. Global Conspiracy Tone: CRITICAL texts suggest specific concerns while CONSPIRACY texts often
address much broader issues, such as the existence of a ‘secret world government’ or the manipulation of
reality by unknown entities.
Now you are going to receive a TEXT and based on everything explained above, argue your response,
reasoning step by step, and put at the end of your answer the keyword ‘LABEL’ with the assigned class
(CRITICAL or CONSPIRACY).
TEXT: " "
Subtask 2 You are an expert in detecting elements of the texts. Since conspiracy narratives are a special type of
causal explanation, your task consists in the recognition of text spans corresponding to the key elements
of a text.
Step 1: Identify all of the negative effects mentioned in the text and relate them to the oppositional
narrative. A negative effect is a harmful consequence or negative impact related to conspiracy theories or
critical aspects. Put these negative effects in the same form that they appear in the text in different lines
with the keyword “NEGATIVE_EFFECT”.
Step 2: Identify if there is an explicitly stated objective of the oppositional narrative. An explicit objective
refers to a clear and direct statement outlining the goal or purpose of the narrative being presented. This
objective is typically stated overtly within the text, providing insight into what the proponents of the
narrative are trying to achieve or promote. Put these objectives in the same form that they appear in the
text in different lines with the keyword “OBJECTIVE”.
Step 3: Identify if there are victims of the oppositional texts. A victim is a specific individual or group that
is negatively affected by the negative effects identified in step 1, harmful actions or policies described in
the text. Put all victims with the keyword “VICTIM”.
Step 4: Identify if there are conspirators in the text. A conspirator refers to the entity responsible for
planning, executing, or supporting the main action or policy being discussed in the text. Moreover, a
conspirator is responsible for the NEGATIVE_EFECTS Put all the conspirators identified with the keyword
“AGENT”.
Step 5: Identify if there is any facilitator in the text. A facilitator is a collaborator or entity that supports
the agents in executing the main actions or policies discussed in the text. They assist in the achievement
of the objectives outlined by the conspirators, often playing a role in enabling or promoting the negative
effects on the victims. Put all the facilitators identified with the keyword “FACILITATOR”.
Step 6: Detect the campaigners that appear in the text. A campaigner is an entity or someone who
unmasks the conspiracy agenda, opposes the conspiracy narrative, and works to expose or challenge it.
Moreover, a campaigner actively opposing the mainstream narrative and promoting his own opinion. Put
all the campaigners identified with the keyword “CAMPAIGNERS”.
Please answer each step with the exact part of the text and explain your answer for each step. If there is
not a specific and clear element, do not provide it.
TEXT: " "
Table 8
Used prompt for each subtask of the Oppositional Thinking Analysis shared task.