<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Arthur Schopenhauer at Touché 2024: Multi-Lingual Text Classification Using Ensembles of Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hamza Yunis</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>This paper describes the submitted approach of Team Arthur Schopenhauer to Task 1 of the Touché lab at CLEF 2024. The goal of this task is twofold: detecting human values in texts (Subtask 1), and recognizing whether these values are attained or constrained (Subtask 2). The approach described in this paper simplifies Subtask 1 by restricting the detected values in a text to a maximum of one value. It also simplifies Subtask 2 by handling it separately from Subtask 1; that is, human values and attainment are detected independently of each other. This simplification strategy proved successful, as the submitted approach was ranked 2nd among the participating teams' best submissions (a single team can make multiple submissions) in Subtask 1 and was ranked 1st in Subtask 2. The described simplification results in two text-classification tasks, which are handled by fine-tuning and ensembling multiple BERT-based models.</p>
      </abstract>
      <kwd-group>
        <kwd>Touché</kwd>
        <kwd>Human Value Detection</kwd>
        <kwd>BERT</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Ensembling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>
        The dataset provided by the organizers consists of a labeled training set, a labeled validation
set, and an unlabeled test set. It stems from the ValuesML project, itself part of a
broad JRC initiative that aims to provide deep insight into values and identities [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Once the models
of our approach were developed, they were applied to the test set to predict its labels. The
predicted labels were then submitted to the organizers via the TIRA platform [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] for evaluation.
      </p>
      <p>The labeled part of the dataset contains zero-one labels for 19 human values, with two label
columns for each value corresponding to constrained and attained, totaling 38 columns. A text
may constrain or attain a specific value, but not both. However, there are texts where it is
unclear whether the referenced value is attained or constrained, in which case both columns
corresponding to the value are filled with 0.5.</p>
      <p>The dataset contains texts from 9 languages. In addition, the organizers provide automated
English translations for non-English texts. However, due to concerns regarding the accuracy
of the translations, our approach uses the original texts and relies on multi-lingual language
models.</p>
    </sec>
    <sec id="sec-3">
      <title>3. System Overview</title>
      <p>Our submitted approach tackles the two described subtasks independently. Furthermore, the
labeled datasets were divided into English and non-English texts, and a different set of models
was fine-tuned for each. Upon applying the fine-tuned models to the test set for the final
submission, the texts in the test set were split in the same manner and the appropriate models
were applied to each part.</p>
      <sec id="sec-3-1">
        <title>3.1. Task Simplification</title>
        <p>This section describes how the original subtasks were transformed in order to simplify the model
fine-tuning process.</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. Simplifying Subtask 1</title>
          <p>Subtask 1, in its given form, corresponds to a multi-label classification problem, because a single
text may refer to multiple human values; that is, a single data instance may belong to multiple
classes simultaneously. However, preliminary data analysis showed that approximately 94% of
the labeled texts have either one label or no label. Therefore, for the sake of simplicity, it was
decided to restrict the fine-tuning process to these instances, which turns the problem into a
single-label classification problem. This simplification required introducing the no-label class for
texts that have no label.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Simplifying Subtask 2</title>
          <p>Subtask 2 was tackled independently of Subtask 1, which means the models of Subtask 2
were fine-tuned to predict a given text’s attainment, regardless of the human value that the
text references. Accordingly, the simplified version of Subtask 2 corresponds to a single-label
classification problem with two classes, namely attained and constrained.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data Preprocessing</title>
        <p>The major steps of data preprocessing are shown in Figure 1. It begins by merging the
training and validation sets; then unuseful data is removed from the merged set, after which the
dataset is reshaped to reflect the task simplification described in Section 3.1, and finally a new
validation set is created within which the different strata of the labeled data are proportionally
represented.</p>
        <p>[Figure 1: The data preprocessing pipeline: the original training and validation sets are merged; unuseful data is filtered out; the dataset is reshaped; and a new train-validation split (new training set and new validation set) is created.]</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. Filtering Out Unuseful Data</title>
          <p>
            After merging the training and validation sets, the following rows were removed from the merged
set with the help of the pandas [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] library:
• Rows with duplicate texts (first occurrence kept).
• Rows with more than one label (in accordance with task simplification from Section 3.1.1).
• Rows with two words or less (believed to be noisy).
          </p>
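The three filtering rules above can be sketched with pandas as follows. This is a minimal illustration, not the submission's actual code: the column name "text" and the pairing of the 38 label columns into per-value (attained, constrained) tuples are assumptions about the dataset's schema.

```python
import numpy as np
import pandas as pd

def filter_unuseful_rows(df: pd.DataFrame, value_pairs: list) -> pd.DataFrame:
    """value_pairs: one (attained_col, constrained_col) tuple per human value (assumed schema)."""
    # Rule 1: drop rows with duplicate texts, keeping the first occurrence.
    df = df.drop_duplicates(subset="text", keep="first")
    # Rule 2: keep only rows referencing at most one human value; a value counts
    # as referenced if either of its two columns is nonzero (0.5 or 1).
    referenced = np.column_stack([(df[a] > 0) | (df[c] > 0) for a, c in value_pairs])
    df = df[referenced.sum(axis=1) <= 1]
    # Rule 3: drop rows with two words or less (believed to be noisy).
    df = df[df["text"].str.split().str.len() > 2]
    return df.reset_index(drop=True)
```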
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Reshaping the Dataset</title>
          <p>To reflect the task simplification described in Section 3.1, the original 38 label columns were
replaced by the following 2 columns:
• hv_value: a numeric code for the human value referenced by the text (including no-label).
• attainment: a numeric code for attainment (constrained, attained, or unknown). The unknown
code is assigned to texts that do not have a human value label, or for which the attainment
was unclear in the original dataset.</p>
          <p>In addition, rows with the human value Humility were removed from the dataset. The reason for
this additional filtering is that such rows are rare in the dataset and, after initial experiments,
the fine-tuned models could not predict Humility with any accuracy.</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.3. Creating a New Split</title>
          <p>
            The last step of data preprocessing was creating a new train-validation split. The validation set
was created using the proportional allocation strategy with a sampling rate of 0.1, whereby each
stratum is specified by a combination of language and label, for example all rows with language
“EN” and label “Conformity: interpersonal” form one stratum. Splitting was achieved using the
function train_test_split from scikit-learn [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ] using the fixed random state 66 (not related
to the random seed used when fine-tuning the models). This way, the validation set could be
reproduced in different Python sessions.
          </p>
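The split can be sketched as below. This is a hedged illustration of the described strategy, not the actual submission code: the column names "language" and "hv_value" are assumed, while the sampling rate of 0.1, stratification on the (language, label) combination, and the fixed random state 66 come from the description above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def make_split(df: pd.DataFrame):
    # Each stratum is a combination of language and label,
    # e.g. all rows with language "EN" and a given hv_value code.
    strata = df["language"] + "_" + df["hv_value"].astype(str)
    # Proportional allocation: 10% of each stratum goes to the validation
    # set; the fixed random_state makes the split reproducible.
    train_df, val_df = train_test_split(
        df, test_size=0.1, stratify=strata, random_state=66
    )
    return train_df, val_df
```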
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Fine-Tuning the Models</title>
        <p>
          For both subtasks, the approach relies on the pretrained models microsoft/deberta-v2-xxlarge
[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] for English texts and FacebookAI/xlm-roberta-large [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] for non-English texts, both obtained
from the Hugging Face Hub [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. The process of producing the fine-tuned models is depicted in
Figure 2.
        </p>
        <p>[Figure 2: The fine-tuning pipeline: the training and validation sets are split by language; the English training and validation sets are used to fine-tune the pretrained deberta-v2-xxlarge model, and the non-English training and validation sets are used to fine-tune the pretrained xlm-roberta model, yielding an English fine-tuned model and a non-English fine-tuned model.]</p>
        <p>
          For Subtask 1, bagging [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] was applied using two four-model ensembles, one for each language
subset. For Subtask 2, only one model was fine-tuned for each language subset, because using
multiple models offered no improvement in predictive performance during experimentation.
The classification heads of the fine-tuned models had 19 outputs for Subtask 1 (the original 19
classes minus Humility, plus the no-label class) and 2 outputs for Subtask 2. It should be noted
that the models for Subtask 2 were fine-tuned only on the data with known attainment; that is,
rows with the unknown value in the attainment column were excluded.
        </p>
        <p>
          Our approach applies the commonly used cross-entropy loss function [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. However, due to
observed class imbalance in Subtask 1, the use of the weighted cross-entropy loss function [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] was
contemplated. Our experiments showed that using the weighted cross-entropy loss function (using
inverse class frequencies as weights) delivers higher performance for some low-frequency classes,
but lower performance overall; therefore, a combination of both weighted and non-weighted
cross-entropy loss functions was used in each ensemble.
        </p>
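A minimal PyTorch sketch of the weighted variant, using inverse class frequencies as weights as described above (the helper name and the exact normalization are illustrative assumptions, not the submission's actual code):

```python
import torch
import torch.nn as nn

def weighted_ce_loss(train_labels: torch.Tensor, num_classes: int) -> nn.CrossEntropyLoss:
    # Count occurrences of each class in the training labels.
    counts = torch.bincount(train_labels, minlength=num_classes).float()
    # Inverse class frequency: rare classes receive larger weights.
    weights = counts.sum() / counts.clamp(min=1)
    return nn.CrossEntropyLoss(weight=weights)
```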
        <p>All ten models of our approach were fine-tuned using the same train-validation split, but
with different hyperparameters, as described in Table 1. The remaining hyperparameters are
described in Appendix A.</p>
        <p>
          Fine-tuning was performed using PyTorch [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] directly, rather than the Hugging Face Trainer
API. During fine-tuning, checkpointing was used, so the model checkpoint with the best F1-score
(macro) was kept.
        </p>
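The checkpointing logic can be sketched as follows. This is a simplified illustration under stated assumptions: `train_one_epoch` and `predict` are hypothetical placeholders for the actual training and validation-inference steps, which are not shown in the paper.

```python
import torch
from sklearn.metrics import f1_score

def fine_tune(model, train_one_epoch, predict, val_labels, num_epochs, path="best.pt"):
    """Keep only the checkpoint with the best macro F1-score on the validation set."""
    best_f1 = -1.0
    for _ in range(num_epochs):
        train_one_epoch(model)              # one pass over the training set
        val_preds = predict(model)          # predicted labels on the validation set
        f1 = f1_score(val_labels, val_preds, average="macro")
        if f1 > best_f1:
            best_f1 = f1
            torch.save(model.state_dict(), path)  # overwrite with the best checkpoint
    return best_f1
```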
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Ensembling Strategy</title>
        <p>Ensembling is relevant only to Subtask 1, because for Subtask 2, only one model is used with
each language subset.</p>
        <p>Each of the models in Table 1 produces a predicted label, along with a probability of that
label (for details on extracting predictions from neural network outputs, see
https://www.learnpytorch.io/02_pytorch_classification/). One common way to ensemble
predictions is soft voting. Our approach adjusts the original soft voting strategy by employing
the concept of a safe prediction, for want of a better term, which denotes a prediction whose
probability exceeds a certain threshold. With this definition, ensembling was achieved using
Algorithm 1 (pruned soft voting). The rationale behind this algorithm is as follows: if one of
the predictions is safe while the others are not, then it should be chosen as the final prediction,
regardless of the remaining predictions.</p>
        <p>The threshold used in the final submission was obtained by repeatedly applying Algorithm 1
with a different threshold to the validation set and selecting the threshold that produced the best
macro F1-score. The optimal threshold was 0.44 for the English ensemble and 0.49 for the
non-English ensemble.</p>
        <p>Table 2 displays a performance comparison between pruned soft voting and ordinary soft
voting using the validation set and shows that pruned soft voting offers a marginal improvement.
However, it should be noted that, since the threshold for pruned soft voting was optimized using
the validation set itself, the evaluation scores of pruned soft voting will be at least as high as
those of soft voting, because soft voting is equivalent to pruned soft voting with threshold 0.</p>
        <p>One point to consider when evaluating the ensembling strategies is that the no-label class (see
Section 3.1.1) is included in the calculation of the macro F1-score. This class will not be used in
the final evaluation of the approach by the organizers, so for each evaluation that we performed,
a corresponding adjusted F1-score, which does not include the no-label class, was calculated.
(For details on soft voting, see https://machinelearningmastery.com/voting-ensembles-with-python/.)</p>
        <p>Algorithm 1: Pruned Soft Voting
Input: a sequence S of pairs (v_1, p_1), ..., (v_n, p_n) of predicted labels for one data instance,
coupled with their prediction probabilities, and a probability threshold t
Output: one final label prediction
if there exists at least one probability p_i in S such that p_i ≥ t then
    return the final label prediction by applying soft voting only to those pairs (v_i, p_i) in S with p_i ≥ t
else
    return the final label prediction by applying soft voting to the entire sequence S
end if</p>
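A minimal Python sketch of pruned soft voting as just described (the array layout and function name are illustrative assumptions, not the submission's actual code): each row of `probs` is one model's class-probability distribution for a single instance, and a model's prediction is safe if its top probability meets the threshold.

```python
import numpy as np

def pruned_soft_vote(probs: np.ndarray, threshold: float) -> int:
    """probs: shape (n_models, n_classes); returns the final predicted class index."""
    top = probs.max(axis=1)        # each model's prediction probability
    safe = top >= threshold        # which predictions count as "safe"
    # Vote only among the safe predictions if any exist, otherwise among all.
    voters = probs[safe] if safe.any() else probs
    # Ordinary soft voting: average the distributions and take the argmax.
    return int(voters.mean(axis=0).argmax())
```

With threshold 0 every prediction is safe, so the function reduces to ordinary soft voting, matching the equivalence noted above.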
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>Upon submitting the approach, the fine-tuned models were applied to the test set and the results
were exported to a .tsv file in the required format, which was submitted to the organizers. As
texts with two words or less are believed to be noisy, the models of Subtask 1 were not applied
to these texts; instead, the no-label class was predicted for them directly.</p>
      <p>The evaluation results that were reported by the organizers are shown in Table 3 and Table
4. The reported F1-score for Subtask 1 (0.35) is significantly lower than the adjusted F1-score
(0.3902) produced during our evaluation using the validation set (Table 2). This was expected
for three reasons:
1. The Humility label was not included when calculating the F1-scores in our evaluations.</p>
      <p>Since our submission never predicts the Humility label, the F1-score for this label was 0
when our submission was evaluated by the organizers, thus reducing the overall
macro-averaged F1-score.
2. The filtering described in Section 3.2.1 was not applied to the test set. In particular, the
test set does contain texts with multiple labels, which our approach cannot handle.
3. In the process of fine-tuning the models, the validation set was used for checkpointing;
therefore, the models are overfit to the validation set to some degree.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>This paper presented the approach of Team Arthur Schopenhauer to Task 1 of the Touché lab
at CLEF 2024. The main idea of the approach is simplifying the given subtasks. It simplifies
Subtask 1 by eliminating the possibility of detecting multiple human values in a single text,
and simplifies Subtask 2 by eliminating the possibility of detecting any dependence between
referenced human values and attainment in a text. The source code for the approach is available
at the following link: https://github.com/h-uns/clef2024-human-value-detection.</p>
      <p>For future work, there are two notable areas of experimentation for improving the submitted
approach:
• Using larger or newer model architectures than the ones used in the approach.
• Developing separate, specialized models that detect only certain subsets of human values,
rather than all 19 values. Reducing the number of detectable human values is expected
to improve the training efficiency of each model. In addition, combining such models can
facilitate detecting multiple human values in a single text.</p>
      <p>[Table: per-value scores on the English subset for the submission valueeval24-arthur-schopenhauer, compared with the valueeval24-bert-baseline-en and valueeval24-random-baseline submissions; the per-value column headers were garbled in the source and could not be recovered.]</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>
        The approaches [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] from SemEval-2023 provided a valuable kickstart for this approach.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cieciuch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vecchione</surname>
          </string-name>
          , E. Davidov,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Beierlein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Verkasalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-E.</given-names>
            <surname>Lönnqvist</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Demirutku</surname>
          </string-name>
          , et al.,
          <source>Refining the Theory of Basic Individual Values, Journal of personality and social psychology 103</source>
          (
          <year>2012</year>
          ). doi:
          <volume>10</volume>
          .1037/a0029393.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alshomary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Handke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wachsmuth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Identifying the Human Values behind Arguments</article-title>
          , in: S. Muresan,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Villavicencio (Eds.),
          <article-title>60th Annual Meeting of the Association for Computational Linguistics (ACL</article-title>
          <year>2022</year>
          ), Association for Computational Linguistics,
          <year>2022</year>
          , pp.
          <fpage>4459</fpage>
          -
          <lpage>4471</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2022</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>306</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alshomary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mirzakhmedova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heinrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Handke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wachsmuth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          , SemEval
          <article-title>-2023 Task 4: ValueEval: Identification of Human Values behind Arguments</article-title>
          , in: R.
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>A. K.</given-names>
          </string-name>
          <string-name>
            <surname>Ojha</surname>
            ,
            <given-names>A. S.</given-names>
          </string-name>
          <string-name>
            <surname>Doğruöz</surname>
            ,
            <given-names>G. D. S.</given-names>
          </string-name>
          <string-name>
            <surname>Martino</surname>
          </string-name>
          , H. T. Madabushi (Eds.),
          <source>17th International Workshop on Semantic Evaluation (SemEval</source>
          <year>2023</year>
          ),
          <article-title>Association for Computational Linguistics</article-title>
          , Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>2287</fpage>
          -
          <lpage>2303</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .semeval-
          <volume>1</volume>
          .
          <fpage>313</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          , Ç. Çöltekin,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heinrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alshomary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. D.</given-names>
            <surname>Longueville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Erjavec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Handke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kopp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ljubešić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Meden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mirzakhmedova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Morkevičius</surname>
          </string-name>
          , T. ReitisMünstermann, M. Scharfbillig,
          <string-name>
            <given-names>N.</given-names>
            <surname>Stefanovitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wachsmuth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          , Overview of Touché 2024:
          <article-title>Argumentation Systems</article-title>
          , in: L.
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Mulhem</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Quénot</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Schwab</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Soulier</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. M. D. Nunzio</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF</source>
          <year>2024</year>
          ), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          , CoRR abs/
          <year>1810</year>
          .04805 (
          <year>2018</year>
          ). URL: http://arxiv. org/abs/
          <year>1810</year>
          .04805. arXiv:
          <year>1810</year>
          .04805.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Scharfbillig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Smillie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mair</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sienkiewicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Keimer</surname>
          </string-name>
          , R. Pinho Dos Santos,
          <string-name>
            <given-names>H. Vinagreiro</given-names>
            <surname>Alves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Vecchione</surname>
          </string-name>
          , L. Scheunemann, Values and Identities - a
          <string-name>
            <surname>Policymaker's Guide</surname>
          </string-name>
          ,
          <source>Technical Report KJ-NA-30800-EN-N, European Commission's Joint Research Centre, Luxembourg</source>
          ,
          <year>2021</year>
          . doi:
          <volume>10</volume>
          .2760/349527.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>M.</given-names> <surname>Fröbe</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Wiegmann</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Kolyada</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Grahm</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Elstner</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Loebe</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Hagen</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Stein</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Potthast</surname></string-name>,
          <article-title>Continuous Integration for Reproducible Shared Tasks with TIRA.io</article-title>,
          in: <string-name><given-names>J.</given-names> <surname>Kamps</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Goeuriot</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Crestani</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Maistro</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Joho</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Davis</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Gurrin</surname></string-name>,
          <string-name><given-names>U.</given-names> <surname>Kruschwitz</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Caputo</surname></string-name> (Eds.),
          <source>Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023)</source>,
          Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York,
          <year>2023</year>, pp. <fpage>236</fpage>-<lpage>241</lpage>.
          doi:10.1007/978-3-031-28241-6_20.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>The pandas development team</string-name>,
          <article-title>pandas-dev/pandas: Pandas</article-title>,
          <year>2020</year>.
          URL: https://doi.org/10.5281/zenodo.3509134. doi:10.5281/zenodo.3509134.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          ,
          <string-name><given-names>E.</given-names> <surname>Duchesnay</surname></string-name>,
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          )
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name><given-names>W.</given-names> <surname>Chen</surname></string-name>,
          <article-title>DeBERTa: Decoding-enhanced BERT with disentangled attention</article-title>,
          in: <source>9th International Conference on Learning Representations (ICLR 2021)</source>,
          OpenReview.net, <year>2021</year>.
          URL: https://openreview.net/forum?id=XPZIaotutsD.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          ,
          <string-name><given-names>E.</given-names> <surname>Grave</surname></string-name>,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Unsupervised cross-lingual representation learning at scale</article-title>
          ,
          in: <string-name><given-names>D.</given-names> <surname>Jurafsky</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Chai</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Schluter</surname></string-name>,
          <string-name><given-names>J. R.</given-names> <surname>Tetreault</surname></string-name> (Eds.),
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020)</source>,
          ACL, <year>2020</year>, pp. <fpage>8440</fpage>-<lpage>8451</lpage>.
          doi:10.18653/v1/2020.acl-main.747.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Delangue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cistac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Louf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Funtowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shleifer</surname>
          </string-name>
          ,
          <string-name><given-names>P.</given-names> <surname>von Platen</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Ma</surname></string-name>,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jernite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Plu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gugger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Drame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lhoest</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rush</surname>
          </string-name>
          ,
          <article-title>HuggingFace's Transformers: State-of-the-art natural language processing</article-title>,
          <year>2020</year>. arXiv:1910.03771.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Breiman</surname>
          </string-name>
          ,
          <article-title>Bagging predictors</article-title>,
          <source>Machine Learning</source>
          <volume>24</volume>
          (
          <year>1996</year>
          )
          <fpage>123</fpage>
          -
          <lpage>140</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mohri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <article-title>Cross-entropy loss functions: Theoretical analysis and applications</article-title>,
          <year>2023</year>. URL: https://arxiv.org/abs/2304.07288. arXiv:2304.07288.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Phan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yamamoto</surname>
          </string-name>
          ,
          <article-title>Resolving class imbalance in object detection with weighted cross entropy losses</article-title>,
          arXiv preprint arXiv:2006.01413 (<year>2020</year>).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Paszke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bradbury</surname>
          </string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Chanan</surname></string-name>,
          <string-name>
            <given-names>T.</given-names>
            <surname>Killeen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gimelshein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Antiga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Desmaison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Köpf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>DeVito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Raison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tejani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chilamkurthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Steiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chintala</surname>
          </string-name>
          ,
          <article-title>PyTorch: An imperative style, high-performance deep learning library</article-title>,
          <year>2019</year>. arXiv:1912.01703.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D.</given-names>
            <surname>Schroter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Groh</surname></string-name>,
          <article-title>Adam-Smith at SemEval-2023 task 4: Discovering human values in arguments with ensembles of transformer-based models</article-title>,
          in: <string-name><given-names>A. K.</given-names> <surname>Ojha</surname></string-name>,
          <string-name><given-names>A. S.</given-names> <surname>Doğruöz</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Da San Martino</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Tayyar Madabushi</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Kumar</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Sartori</surname></string-name> (Eds.),
          <source>Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)</source>,
          Association for Computational Linguistics, Toronto, Canada,
          <year>2023</year>, pp. <fpage>532</fpage>-<lpage>541</lpage>.
          URL: https://aclanthology.org/2023.semeval-1.74. doi:10.18653/v1/2023.semeval-1.74.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>G.</given-names>
            <surname>Balikas</surname>
          </string-name>
          ,
          <article-title>John-arthur at SemEval-2023 task 4: Fine-tuning large language models for arguments classification</article-title>,
          in: <string-name><given-names>A. K.</given-names> <surname>Ojha</surname></string-name>,
          <string-name><given-names>A. S.</given-names> <surname>Doğruöz</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Da San Martino</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Tayyar Madabushi</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Kumar</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Sartori</surname></string-name> (Eds.),
          <source>Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)</source>,
          Association for Computational Linguistics, Toronto, Canada,
          <year>2023</year>, pp. <fpage>1428</fpage>-<lpage>1432</lpage>.
          URL: https://aclanthology.org/2023.semeval-1.197. doi:10.18653/v1/2023.semeval-1.197.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>