Arthur Schopenhauer at Touché 2024: Multi-Lingual Text Classification Using Ensembles of Large Language Models

Notebook for the Touché Lab at CLEF 2024

Hamza Yunis

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
email: hamza.uns88@gmail.com (H. Yunis)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
This paper describes the approach submitted by Team Arthur Schopenhauer to Task 1 of the Touché lab at CLEF 2024. The goal of this task is twofold: detecting human values in texts (Subtask 1) and recognizing whether these values are attained or constrained (Subtask 2). The approach described in this paper simplifies Subtask 1 by restricting the detected values to at most one value per text. It also simplifies Subtask 2 by handling it separately from Subtask 1; that is, human values and attainment are detected independently of each other. This simplification strategy proved successful: the submitted approach was ranked 2nd among the participating teams' best submissions (a single team can make multiple submissions) in Subtask 1 and 1st in Subtask 2. The described simplification results in two text-classification tasks, which are handled by fine-tuning and ensembling multiple BERT-based models.

Keywords
Touché, Human Value Detection, BERT, Large Language Models, Ensembling

1. Introduction
The decisions that a human individual makes are affected by the values held by that individual [1]. Human values also affect an individual's attitudes towards various issues and, by extension, the arguments that they express in writing [2]. Task 4 of SemEval-2023 [3] had the goal of identifying the human values behind textual arguments. The subject of this paper is Task 1 (Human Value Detection) of the Touché [4] lab (hosted at CLEF 2024), which in turn consists of two subtasks: Subtask 1 has the goal of identifying the human values that a specific piece of text references, while the goal of Subtask 2 is to recognize whether these values are attained or constrained in the text. For example, both of the following texts (obtained from the task's dataset) reference the value "Universalism: concern". However, this value is attained in the first text and constrained in the second:

"Widely considered one of the darkest days of the Troubles, relatives of the victims have met regularly to mourn their loss and campaign for justice."

"We were hoping that we would get recourse to justice for our dead family members and that hasn't happened."

This paper describes the approach submitted by Team Arthur Schopenhauer to the aforementioned lab. The approach achieved the 9th best score in Subtask 1 (all higher-performing approaches were submitted by a single team, which means Team Arthur Schopenhauer was ranked 2nd among teams), and it achieved the best score in Subtask 2. As the previous example demonstrates, detecting human values in texts is challenging and difficult to tackle with classical NLP methods. For this reason, our approach relies on modern BERT-based architectures, which have demonstrated strong performance on natural-language-understanding tasks [5].

2. Background
The dataset provided by the organizers consists of the training set (labeled), the validation set (labeled), and the test set (not labeled).
It stems from the ValuesML project, which is itself part of a broader JRC initiative aimed at a deeper understanding of values and identities [6]. Once the models of our approach had been developed, they were applied to the test set to predict its labels. The predicted labels were then submitted to the organizers via the TIRA platform [7] for evaluation.

The labeled part of the dataset contains zero-one labels for 19 human values, with two label columns per value (corresponding to constrained and attained), totaling 38 columns. A text may constrain or attain a specific value, but not both. However, for some texts it is unclear whether the referenced value is attained or constrained, in which case both columns corresponding to that value are filled with 0.5. The dataset contains texts in 9 languages. In addition, the organizers provide automated English translations of the non-English texts. However, due to concerns regarding the accuracy of these translations, our approach uses the original texts and relies on multi-lingual language models.

3. System Overview
Our submitted approach tackles the two described subtasks independently. Furthermore, the labeled data was divided into English and non-English texts, and a different set of models was fine-tuned for each part. For the final submission, the test set was split in the same manner and the appropriate models were applied to each part.

3.1. Task Simplification
This section describes how the original subtasks were transformed in order to simplify the model fine-tuning process.

3.1.1. Simplifying Subtask 1
Subtask 1, in its given form, corresponds to a multi-label classification problem, because a single text may refer to multiple human values; that is, a single data instance may belong to multiple classes simultaneously. However, preliminary data analysis showed that approximately 94% of the labeled texts have either one label or no label. Therefore, for the sake of simplicity, it was decided to restrict the fine-tuning process to these instances, which turns the problem into a single-label classification problem. This simplification required introducing a no-label class for texts that have no label.

3.1.2. Simplifying Subtask 2
Subtask 2 was tackled independently of Subtask 1, which means the models for Subtask 2 were fine-tuned to predict a given text's attainment regardless of the human value that the text references. Accordingly, the simplified version of Subtask 2 corresponds to a single-label classification problem with two classes, namely attained and constrained.

3.2. Data Preprocessing
The major steps of data preprocessing are shown in Figure 1. The training and validation sets are first merged; uninformative data is then filtered out of the merged set; the dataset is reshaped to reflect the task simplification described in Section 3.1; and finally a new validation set is created in which the different strata of the labeled data are proportionally represented.

Figure 1: Data preprocessing pipeline (merge the original training and validation sets, filter out uninformative data, reshape the dataset, and create a new train-validation split).

3.2.1. Filtering Out Uninformative Data
After merging the training and validation sets, the following rows were removed from the merged set with the help of the pandas [8] library (a sketch of this step follows the list):
• Rows with duplicate texts (only the first occurrence was kept).
• Rows with more than one human value label (in accordance with the task simplification from Section 3.1.1).
• Rows whose text contains two words or fewer (such texts are believed to be noisy).
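Assuming the merged dataset is a pandas DataFrame with a column named Text and one pair of attained/constrained label columns per human value, this filtering step can be sketched roughly as follows; the column names are illustrative placeholders rather than the official ones.

```python
import pandas as pd

def filter_uninformative_rows(df: pd.DataFrame,
                              value_columns: dict[str, tuple[str, str]]) -> pd.DataFrame:
    """Drop duplicate texts, multi-value rows, and very short texts (Section 3.2.1)."""
    # 1. Rows with duplicate texts: keep only the first occurrence.
    df = df.drop_duplicates(subset="Text", keep="first")

    # 2. Rows that reference more than one human value. A value counts as
    #    referenced if either of its two columns (attained / constrained) is
    #    non-zero, so a 0.5/0.5 pair still counts as a single value.
    referenced = sum(
        (df[attained].gt(0) | df[constrained].gt(0)).astype(int)
        for attained, constrained in value_columns.values()
    )
    df = df[referenced <= 1]

    # 3. Rows whose text has two words or fewer (treated as noise).
    df = df[df["Text"].str.split().str.len() > 2]

    return df.reset_index(drop=True)
```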
3.2.2. Reshaping the Dataset
To reflect the task simplification described in Section 3.1, the original 38 label columns were replaced by the following 2 columns:

hv_value: a numeric code for the human value referenced by the text (including no-label).
attainment: a numeric code for the attainment (constrained, attained, or unknown). The unknown code is assigned to texts that do not have a human value label, or for which the attainment was unclear in the original dataset.

In addition, rows labeled with the human value Humility were removed from the dataset. The reason for this additional filtering is that such rows are rare in the dataset, and initial experiments showed that the fine-tuned models could not predict Humility with any useful accuracy.

3.2.3. Creating a New Split
The last step of data preprocessing was creating a new train-validation split. The validation set was created using a proportional allocation strategy with a sampling rate of 0.1, whereby each stratum is defined by a combination of language and label; for example, all rows with language "EN" and label "Conformity: interpersonal" form one stratum. The split was produced with the function train_test_split from scikit-learn [9] using the fixed random state 66 (unrelated to the random seeds used when fine-tuning the models), so that the validation set can be reproduced in different Python sessions.
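A minimal sketch of this splitting step, assuming the reshaped DataFrame carries the language in a column named Language (the column name is an assumption; hv_value is the label column introduced above):

```python
from sklearn.model_selection import train_test_split

def make_split(df):
    # Each stratum is one (language, human value) combination, so both the
    # languages and the labels are proportionally represented in the validation set.
    strata = df["Language"].astype(str) + "_" + df["hv_value"].astype(str)
    train_df, val_df = train_test_split(
        df,
        test_size=0.1,        # proportional allocation with a sampling rate of 0.1
        stratify=strata,
        random_state=66,      # fixed so the split is reproducible across sessions
    )
    return train_df.reset_index(drop=True), val_df.reset_index(drop=True)
```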
3.3. Fine-Tuning the Models
For both subtasks, the approach relies on the pretrained models microsoft/deberta-v2-xxlarge [10] for English texts and FacebookAI/xlm-roberta-large [11] for non-English texts, both obtained from the Hugging Face Hub [12]. The process of producing the fine-tuned models is depicted in Figure 2.

Figure 2: Conceptual overview of the fine-tuning process (the training and validation sets are split by language; a pretrained deberta-v2-xxlarge model is fine-tuned on the English part, and a pretrained xlm-roberta model is fine-tuned on the non-English part).

For Subtask 1, bagging [13] was applied using two four-model ensembles, one for each language subset. For Subtask 2, only one model was fine-tuned for each language subset, because using multiple models offered no improvement in predictive performance during experimentation. The classification heads of the fine-tuned models had 19 outputs for Subtask 1 (the Humility class was removed from the original 19 classes and the no-label class was added) and 2 outputs for Subtask 2. It should be noted that the models for Subtask 2 were fine-tuned only on data with known attainment; that is, rows with the unknown value in the attainment column were excluded.

Our approach applies the commonly used cross-entropy loss function [14]. However, due to the observed class imbalance in Subtask 1, the use of the weighted cross-entropy loss function [15] was also considered. Our experiments showed that the weighted cross-entropy loss function (using inverse class frequencies as weights) delivers higher performance for some low-frequency classes, but lower performance overall; therefore, a combination of weighted and non-weighted cross-entropy loss functions was used in each ensemble.

All ten models of our approach were fine-tuned using the same train-validation split, but with different hyperparameters, as described in Table 1. The remaining hyperparameters are listed in Appendix A. Fine-tuning was performed using PyTorch [16] directly, rather than via the Hugging Face Trainer API. During fine-tuning, checkpointing was used, and the model checkpoint with the best macro F1-score was kept.

Table 1: Overview of the fine-tuned models used in the submitted approach. For the remaining hyperparameters, see Appendix A.

Model Name   Languages     Architecture         Random Seed   Loss Function
-- Subtask 1 --
Model 1      English       deberta-v2-xxlarge   66            Cross-Entropy
Model 2      English       deberta-v2-xxlarge   66            Weighted Cross-Entropy
Model 3      English       deberta-v2-xxlarge   67            Cross-Entropy
Model 4      English       deberta-v2-xxlarge   67            Weighted Cross-Entropy
Model 5      Non-English   xlm-roberta          66            Cross-Entropy
Model 6      Non-English   xlm-roberta          66            Weighted Cross-Entropy
Model 7      Non-English   xlm-roberta          67            Cross-Entropy
Model 8      Non-English   xlm-roberta          67            Weighted Cross-Entropy
-- Subtask 2 --
Model 9      English       deberta-v2-xxlarge   66            Cross-Entropy
Model 10     Non-English   xlm-roberta          66            Cross-Entropy
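The two loss-function variants listed in Table 1 can be constructed roughly as in the following sketch; the exact normalization of the class weights is not specified in this paper, so the inverse-frequency formula below is an assumption.

```python
import torch
import torch.nn as nn

def make_loss(train_labels: torch.Tensor, num_classes: int, weighted: bool) -> nn.Module:
    """Plain or weighted cross-entropy loss, with inverse class frequencies as weights."""
    if not weighted:
        return nn.CrossEntropyLoss()
    counts = torch.bincount(train_labels, minlength=num_classes).float()
    weights = counts.sum() / counts.clamp(min=1.0)  # inverse frequency per class
    return nn.CrossEntropyLoss(weight=weights)
```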
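A condensed sketch of the fine-tuning loop with checkpointing on the best macro F1-score; it assumes a Hugging Face sequence-classification model and data loaders that yield input_ids, attention_mask, and labels, which is a simplification of the actual training code.

```python
import torch
from sklearn.metrics import f1_score

def fine_tune(model, loss_fn, train_loader, val_loader, optimizer, num_epochs, ckpt_path):
    best_f1 = -1.0
    for epoch in range(num_epochs):
        # Training pass.
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            logits = model(input_ids=batch["input_ids"],
                           attention_mask=batch["attention_mask"]).logits
            loss = loss_fn(logits, batch["labels"])
            loss.backward()
            optimizer.step()

        # Validation pass: keep only the checkpoint with the best macro F1-score.
        model.eval()
        preds, golds = [], []
        with torch.no_grad():
            for batch in val_loader:
                logits = model(input_ids=batch["input_ids"],
                               attention_mask=batch["attention_mask"]).logits
                preds.extend(logits.argmax(dim=-1).tolist())
                golds.extend(batch["labels"].tolist())
        macro_f1 = f1_score(golds, preds, average="macro")
        if macro_f1 > best_f1:
            best_f1 = macro_f1
            torch.save(model.state_dict(), ckpt_path)
    return best_f1
```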
3.4. Ensembling Strategy
Ensembling is relevant only to Subtask 1, because for Subtask 2 only one model is used per language subset. Each of the models in Table 1 produces a predicted label, along with a probability for that label (for details on extracting predictions from neural network outputs, see https://www.learnpytorch.io/02_pytorch_classification/). One common way to combine such predictions is soft voting (see https://machinelearningmastery.com/voting-ensembles-with-python/). Our approach adjusts the original soft voting strategy by employing the concept of a safe prediction, for want of a better term, which denotes a prediction whose probability exceeds a certain threshold. With this definition of a safe prediction, ensembling was performed using Algorithm 1 (pruned soft voting). The rationale behind this algorithm is as follows: if one of the predictions is safe while the others are not, it should be chosen as the final prediction, regardless of the remaining predictions.

Algorithm 1: Pruned Soft Voting
Input: a sequence S of pairs (l_1, p_1), ..., (l_n, p_n) of predicted labels for one data instance, coupled with their prediction probabilities; a probability threshold T.
Output: one final label prediction.
if there exists at least one probability p_i in S such that p_i >= T then
    return the final label prediction obtained by applying soft voting only to those pairs (l_i, p_i) in S with p_i >= T
else
    return the final label prediction obtained by applying soft voting to the entire sequence S
end if

The threshold used in the final submission was obtained by repeatedly applying Algorithm 1 to the validation set with different thresholds and selecting the threshold that produced the best macro F1-score. For the English ensemble, the optimal threshold was 0.44; for the non-English ensemble, it was 0.49. Table 2 compares pruned soft voting with ordinary soft voting on the validation set and shows that pruned soft voting offers a marginal improvement. It should be noted, however, that since the threshold for pruned soft voting was optimized on the validation set itself, the evaluation scores of pruned soft voting are guaranteed to be at least as high as those of soft voting, because soft voting is equivalent to pruned soft voting with threshold 0.

One point to consider when evaluating the ensembling strategies is that the no-label class (see Section 3.1.1) is included in the calculation of the macro F1-score. This class is not used in the final evaluation by the organizers, so for each evaluation that we performed, a corresponding adjusted F1-score that excludes the no-label class was also calculated.

Table 2: Results of applying the final trained ensembles to the validation set using different voting strategies. The adjusted F1-scores exclude the no-label class (see Section 3.1.1).

Languages     F1-score        F1-score              Adjusted F1-score   Adjusted F1-score
              (soft voting)   (pruned soft voting)  (soft voting)       (pruned soft voting)
English       0.4012          0.4405                0.3799              0.4211
Non-English   0.3963          0.4036                0.3788              0.3867
Combined      0.3989          0.4077                0.3807              0.3902
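Algorithm 1 can be rendered compactly in Python as follows, assuming each ensemble member contributes its predicted label together with that label's probability; the threshold sweep described above then simply evaluates this function on the validation set for a grid of candidate thresholds.

```python
from collections import defaultdict

def soft_vote(pairs):
    """Ordinary soft voting over (label, probability) pairs: sum the
    probabilities per label and return the label with the highest total."""
    scores = defaultdict(float)
    for label, prob in pairs:
        scores[label] += prob
    return max(scores, key=scores.get)

def pruned_soft_vote(pairs, threshold):
    """Pruned soft voting (Algorithm 1): if any prediction is 'safe'
    (probability >= threshold), vote only among the safe predictions."""
    safe = [(label, prob) for label, prob in pairs if prob >= threshold]
    return soft_vote(safe if safe else pairs)

# Example: three models predict for one instance. With threshold 0.44 no
# prediction is safe, so ordinary soft voting applies and "Tradition" wins.
print(pruned_soft_vote([("Tradition", 0.35), ("Face", 0.31), ("Tradition", 0.28)], 0.44))
```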
4. Results
Upon submitting the approach, the fine-tuned models were applied to the test set and the results were exported to a .tsv file in the required format, which was submitted to the organizers. As texts with two words or fewer are believed to be noisy, the Subtask 1 models were not applied to these texts; instead, no-label was predicted for them directly. The evaluation results reported by the organizers are shown in Table 3 and Table 4.

The reported F1-score for Subtask 1 (0.35) is significantly lower than the adjusted F1-score (0.3902) produced during our evaluation on the validation set (Table 2). This was expected, for three reasons:
1. The Humility label was not included when calculating the F1-scores in our own evaluations. Since our submission never predicts the Humility label, the F1-score for this label was 0 when our submission was evaluated by the organizers, which reduces the overall macro-averaged F1-score.
2. The filtering described in Section 3.2.1 was not applied to the test set. In particular, the test set does contain texts with multiple labels, which our approach cannot handle.
3. The validation set was used for checkpointing during fine-tuning; therefore, the models are overfitted to the validation set to a certain degree.

5. Conclusion and Future Work
This paper presented the approach of Team Arthur Schopenhauer to Task 1 of the Touché lab at CLEF 2024. The main idea of the approach is to simplify the given subtasks: Subtask 1 is simplified by eliminating the possibility of detecting multiple human values in a single text, and Subtask 2 is simplified by ignoring any dependence between the referenced human values and their attainment. The source code for the approach is available under the following link: https://github.com/h-uns/clef2024-human-value-detection.

For future work, there are two notable areas of experimentation for improving the submitted approach:
• Using larger or newer model architectures than the ones used in the approach.
• Developing separate, specialized models that each detect only a subset of the human values, rather than all 19 values. Reducing the number of detectable human values is expected to improve the training efficiency of each model. In addition, combining such models can facilitate detecting multiple human values in a single text.

Table 3: Achieved F1-score (×100) on the test dataset for Subtask 1, for our submission (valueeval24-arthur-schopenhauer), the BERT baseline (valueeval24-bert-baseline-en), and the random baseline (valueeval24-random-baseline). Both baselines used the automatic translation to English.

Value                        Ours   BERT baseline   Random baseline
All                          35     24              06
Self-direction: thought      12     00              02
Self-direction: action       24     13              07
Stimulation                  33     24              05
Hedonism                     35     16              02
Achievement                  40     32              11
Power: dominance             37     27              08
Power: resources             47     35              10
Face                         24     08              03
Security: personal           38     24              04
Security: societal           46     40              14
Tradition                    49     46              03
Conformity: rules            50     42              11
Conformity: interpersonal    19     00              03
Humility                     00     00              00
Benevolence: caring          32     18              05
Benevolence: dependability   31     22              04
Universalism: concern        46     37              09
Universalism: nature         60     55              04
Universalism: tolerance      27     02              02

Table 4: Achieved F1-score (×100) on the test dataset for Subtask 2, for our submission (valueeval24-arthur-schopenhauer), the BERT baseline (valueeval24-bert-baseline-en), and the random baseline (valueeval24-random-baseline). Both baselines used the automatic translation to English.

Value                        Ours   BERT baseline   Random baseline
All                          83     81              52
Self-direction: thought      77     83              51
Self-direction: action       83     79              47
Stimulation                  85     86              54
Hedonism                     88     88              52
Achievement                  87     84              53
Power: dominance             73     77              55
Power: resources             84     80              53
Face                         80     74              52
Security: personal           82     84              52
Security: societal           84     81              50
Tradition                    78     78              54
Conformity: rules            80     78              53
Conformity: interpersonal    79     79              49
Humility                     74     87              45
Benevolence: caring          91     89              53
Benevolence: dependability   89     86              56
Universalism: concern        86     85              52
Universalism: nature         85     81              49
Universalism: tolerance      81     78              56

Acknowledgments
The approaches [17] and [18] from SemEval-2023 provided a valuable starting point for this approach.

References
[1] S. H. Schwartz, J. Cieciuch, M. Vecchione, E. Davidov, R. Fischer, C. Beierlein, A. Ramos, M. Verkasalo, J.-E. Lönnqvist, K. Demirutku, et al., Refining the Theory of Basic Individual Values, Journal of Personality and Social Psychology 103 (2012). doi:10.1037/a0029393.
[2] J. Kiesel, M. Alshomary, N. Handke, X. Cai, H. Wachsmuth, B. Stein, Identifying the Human Values behind Arguments, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022), Association for Computational Linguistics, 2022, pp. 4459–4471. doi:10.18653/v1/2022.acl-long.306.
[3] J. Kiesel, M. Alshomary, N. Mirzakhmedova, M. Heinrich, N. Handke, H. Wachsmuth, B. Stein, SemEval-2023 Task 4: ValueEval: Identification of Human Values behind Arguments, in: R. Kumar, A. K. Ojha, A. S. Doğruöz, G. D. S. Martino, H. T. Madabushi (Eds.), 17th International Workshop on Semantic Evaluation (SemEval 2023), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 2287–2303. doi:10.18653/v1/2023.semeval-1.313.
[4] J. Kiesel, Ç. Çöltekin, M. Heinrich, M. Fröbe, M. Alshomary, B. D. Longueville, T. Erjavec, N. Handke, M. Kopp, N. Ljubešić, K. Meden, N. Mirzakhmedova, V. Morkevičius, T. Reitis-Münstermann, M. Scharfbillig, N. Stefanovitch, H. Wachsmuth, M. Potthast, B. Stein, Overview of Touché 2024: Argumentation Systems, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2024.
[5] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
[6] M. Scharfbillig, L. Smillie, D. Mair, M. Sienkiewicz, J. Keimer, R. Pinho Dos Santos, H. Vinagreiro Alves, E. Vecchione, L. Scheunemann, Values and Identities - a Policymaker's Guide, Technical Report KJ-NA-30800-EN-N, European Commission's Joint Research Centre, Luxembourg, 2021. doi:10.2760/349527.
[7] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast, Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: J. Kamps, L. Goeuriot, F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236–241. doi:10.1007/978-3-031-28241-6_20.
[8] The pandas development team, pandas-dev/pandas: Pandas, 2020. URL: https://doi.org/10.5281/zenodo.3509134. doi:10.5281/zenodo.3509134.
[9] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[10] P. He, X. Liu, J. Gao, W. Chen, DeBERTa: Decoding-enhanced BERT with Disentangled Attention, in: 9th International Conference on Learning Representations (ICLR 2021), OpenReview.net, 2021. URL: https://openreview.net/forum?id=XPZIaotutsD.
[11] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised Cross-lingual Representation Learning at Scale, in: D. Jurafsky, J. Chai, N. Schluter, J. R. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), ACL, 2020, pp. 8440–8451. doi:10.18653/V1/2020.ACL-MAIN.747.
[12] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, HuggingFace's Transformers: State-of-the-Art Natural Language Processing, 2020. arXiv:1910.03771.
[13] L. Breiman, Bagging Predictors, Machine Learning 24 (1996) 123–140.
[14] A. Mao, M. Mohri, Y. Zhong, Cross-Entropy Loss Functions: Theoretical Analysis and Applications, 2023. URL: https://arxiv.org/abs/2304.07288. arXiv:2304.07288.
[15] T. H. Phan, K. Yamamoto, Resolving Class Imbalance in Object Detection with Weighted Cross Entropy Losses, arXiv preprint arXiv:2006.01413 (2020).
[16] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An Imperative Style, High-Performance Deep Learning Library, 2019. arXiv:1912.01703.
[17] D. Schroter, D. Dementieva, G. Groh, Adam-Smith at SemEval-2023 Task 4: Discovering Human Values in Arguments with Ensembles of Transformer-based Models, in: A. K. Ojha, A. S. Doğruöz, G. Da San Martino, H. Tayyar Madabushi, R. Kumar, E. Sartori (Eds.), Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 532–541. URL: https://aclanthology.org/2023.semeval-1.74. doi:10.18653/v1/2023.semeval-1.74.
[18] G. Balikas, John-Arthur at SemEval-2023 Task 4: Fine-Tuning Large Language Models for Arguments Classification, in: A. K. Ojha, A. S. Doğruöz, G. Da San Martino, H. Tayyar Madabushi, R. Kumar, E. Sartori (Eds.), Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 1428–1432. URL: https://aclanthology.org/2023.semeval-1.197. doi:10.18653/v1/2023.semeval-1.197.
A. Hyperparameters

Table 5: Overview of the hyperparameters used in fine-tuning the models.

Hyperparameter            Value
Number of Epochs          10 for Subtask 1 (non-English); 12 for all other models
Batch Size                8 for deberta-v2-xxlarge; 16 for xlm-roberta
Optimizer                 AdamW
Learning Rate             1 × 10^-6
Learning Rate Scheduler   Constant
Weight Decay              0.0 for the base models; 0.01 for the classification heads
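The split weight-decay setting in Table 5 can be realized with AdamW parameter groups, as in the following sketch; the assumption that the classification head's parameter names start with "classifier" matches common Hugging Face naming but is not stated in this paper.

```python
from torch.optim import AdamW

def build_optimizer(model):
    # Separate the classification head from the base encoder so that each
    # part receives its own weight-decay value (Table 5).
    head_params = [p for n, p in model.named_parameters() if n.startswith("classifier")]
    base_params = [p for n, p in model.named_parameters() if not n.startswith("classifier")]
    return AdamW(
        [
            {"params": base_params, "weight_decay": 0.0},   # base model
            {"params": head_params, "weight_decay": 0.01},  # classification head
        ],
        lr=1e-6,  # constant learning rate, no scheduler warm-up or decay
    )
```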