=Paper=
{{Paper
|id=Vol-3740/paper-338
|storemode=property
|title=Philo of Alexandria at Touché: A Cascade Model Approach to Human Value Detection
|pdfUrl=https://ceur-ws.org/Vol-3740/paper-338.pdf
|volume=Vol-3740
|authors=Víctor Yeste,Mariona Coll-Ardanuy,Paolo Rosso
|dblpUrl=https://dblp.org/rec/conf/clef/YesteAR24
}}
==Philo of Alexandria at Touché: A Cascade Model Approach to Human Value Detection==
Notebook for the Touché Lab at CLEF 2024
Víctor Yeste 1,2,*, Mariona Coll-Ardanuy 1 and Paolo Rosso 1,3
1 PRHLT Research Center, Universitat Politècnica de València, 46022, Valencia, Spain
2 Universidad Europea de Valencia, 46010, Valencia, Spain
3 Valencian Graduate School and Research Network of Artificial Intelligence (ValgrAI)
Abstract
This paper describes our contribution to the Human Value Detection shared task at CLEF 2024. Our submitted
system approaches the task of human value detection and attainment using a sequence of two models: a multi-
label text classifier based on DeBERTa is used first to predict the human values present in the text. Then, a
follow-up natural language inference binary classifier based on DeBERTa is applied to discern whether the values
that are present in the text are attained or constrained. This cascade model approach improves the granularity
of text classification. Our approach outperforms all baselines, achieving a Macro F1-score of 0.28 on sub-task 1
(human value detection) and a Macro F1-score of 0.82 on sub-task 2 (value attainment prediction).
Keywords
human value detection, text classification, multi-label classification
1. Introduction
The task of human value detection involves applying natural language processing to identify whether
human values are present in texts, and to determine whether such values appear as attained or con-
strained. These values have been ordered in a circular motivational continuum by Schwartz et al. (2012)
[1], in which 19 values were defined based on their compatible and conflicting motivations, expression
of self-protection vs. growth, and personal vs. social focus.
The Human Value Detection task at CLEF 2024 (ValueEval’24) [2] consists of two sub-tasks: the first
is to detect the presence or absence of each of these 19 values, while the second is to detect whether
the value is attained or constrained. The dataset provided for both tasks consists of approximately 3,000
human-annotated texts of between 400 and 800 words, created by the ValuesML project [3]. The data
is provided at the sentence level (44,758 sentences for training, 14,904 for validation, and 14,569 kept
for testing), and each sentence is annotated in a multi-label setting using a single-level taxonomy of
38 labels that expresses the attained and constrained versions of each human value. As the original
dataset is multilingual and contains texts in several languages, automatically translated English versions
of the training, validation, and test datasets were provided for every team that wished to create an
approach without a multilingual perspective.
The present work introduces a cascade model approach consisting of two consecutive models: a multi-
label text classifier used to predict which of the 19 human values are present in the text, followed by a
binary classifier that treats the question of whether a value is attained or constrained as a stance
classification problem, in which both the text and the value are passed as input and the expected output
is whether the value appears as attained or constrained. Our approach outperforms all the baselines
provided by the organizers, including a baseline based on BERT. This paper includes a detailed system
overview, the experiments we performed, the results and discussion, and some conclusions and
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
vicyesmo@upv.es (V. Yeste); mcoll@prhlt.upv.es (M. Coll-Ardanuy); prosso@dsic.upv.es (P. Rosso)
https://victoryeste.com (V. Yeste)
ORCID: 0000-0002-3660-8347 (V. Yeste); 0000-0001-8455-7196 (M. Coll-Ardanuy); 0000-0002-8922-1242 (P. Rosso)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
future studies that could continue this work. The code for the proposed system, as well as for all our
experiments, is available on GitHub.1
2. System Overview
This section presents our cascade model approach, in which one model is dedicated to each of the
proposed sub-tasks and the two are combined to produce the prediction in the required format. Our
approach uses the texts automatically translated into English. The system performs detection and
stance classification of the predefined set of human values and consists of two subsystems: one for
detecting the presence of each human value, and another for establishing the stance (whether the
sentence attains or constrains it) of each human value. Each subsystem is fine-tuned separately, in both
cases using a DeBERTa model2 [4] as base, for the task of sequence classification using the HuggingFace
implementation.3
• Subsystem 1: its primary function is to identify the presence of human values within sentences.
By combining the ‘attained’ and ‘constrained’ labels into a single presence indicator, it streamlines
the multi-label classification task, simplifying it to a binary decision (presence vs. absence) for each
of the 19 human values. The model for this subsystem is available at HuggingFace.4
• Subsystem 2: it receives the outputs of subsystem 1 and classifies the stance towards each
present human value as a binary classification (attained vs. constrained). This subsystem transforms
the sentence dataset into premise-hypothesis pairs, where each sentence is the premise, a value
is the hypothesis, and the ‘attained’ and ‘constrained’ labels are the stance. The model for this
subsystem is available at HuggingFace.5
Given that subsystem 1 focuses on detecting the presence of human values in the text, and subsystem 2
focuses on the stances towards each detected human value, this cascade model approach improves the
granularity of text classification. As can be seen in the Results section, it also enhances the performance
of the final predictions.
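As a minimal sketch of the cascade logic (the two classifier functions below are hypothetical stand-ins for the fine-tuned DeBERTa subsystems, not the actual models):

```python
# Sketch of the cascade: subsystem 1 detects which of the 19 values are
# present in a sentence; subsystem 2 then classifies the stance (attained
# vs. constrained) of each detected value. `detect_values` and
# `classify_stance` are hypothetical stand-ins for the two fine-tuned
# DeBERTa classifiers.

def cascade_predict(sentence, values, detect_values, classify_stance,
                    threshold=0.5):
    """Return {value: stance} for the values detected in `sentence`."""
    presence = detect_values(sentence)  # {value: probability of presence}
    stances = {}
    for value in values:
        if presence.get(value, 0.0) >= threshold:
            # Premise-hypothesis pair: the sentence is the premise,
            # the value is the hypothesis.
            stances[value] = classify_stance(sentence, value)
    return stances
```

In this sketch, the second classifier is only ever called for values the first classifier predicts as present, which is how the system is conceived (the submission-format adjustment described in Section 3.2 relaxes this).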
3. Experiments
Experiments were carried out on Google Colab with Python 3.10.12 on an Nvidia Tesla GPU, with 12.7 GB
of system RAM and 15 GB of GPU RAM. HuggingFace Transformers [5] was used as the framework
for all the experiments in this study. Training was designed for flexibility and performance, and
evaluation metrics were calculated upon training completion, validating against the task validation
dataset. F1-scores for each label and a macro-averaged F1-score were used to evaluate each experiment,
enabling a comprehensive analysis of individual and overall effectiveness.
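For illustration, the evaluation metric can be computed in plain Python as follows (a minimal sketch; in practice a library implementation such as scikit-learn's would typically be used):

```python
def f1_per_label(y_true, y_pred, num_labels):
    """Per-label F1 for multi-label data given binary indicator vectors."""
    scores = []
    for j in range(num_labels):
        tp = sum(t[j] and p[j] for t, p in zip(y_true, y_pred))
        fp = sum((not t[j]) and p[j] for t, p in zip(y_true, y_pred))
        fn = sum(t[j] and (not p[j]) for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the label
        # never occurs in either the gold data or the predictions.
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

def macro_f1(y_true, y_pred, num_labels):
    """Unweighted mean of the per-label F1 scores."""
    return sum(f1_per_label(y_true, y_pred, num_labels)) / num_labels
```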
3.1. Preliminary Experiments
Our initial experiments involved using a single-model approach to classify each text into
the predefined set of human value stance labels (i.e., the 38 labels determining whether the
sentence attains or constrains each of the 19 human values). The objective was to leverage
the powerful features of well-known transformer models for this purpose, and to determine
which was best suited for the task. We experimented with the following pre-trained
models: google-bert/bert-base-uncased [6],6 FacebookAI/roberta-base7
1 https://github.com/VictorMYeste/touche-human-value-detection
2 https://huggingface.co/microsoft/deberta-base
3 https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForSequenceClassification
4 https://huggingface.co/VictorYeste/deberta-based-human-value-detection
5 https://huggingface.co/VictorYeste/deberta-based-human-value-stance-detection
6 https://huggingface.co/google-bert/bert-base-uncased
7 https://huggingface.co/FacebookAI/roberta-base
[7], microsoft/deberta-base8 [4], google/electra-base-discriminator9 [8] and
xlnet-base-cased10 [9]. These pre-trained models were initialized for sequence classification
and, for task 1, configured for the multi-label classification setting.
Each selected model was fine-tuned on the task training dataset and validated on the task validation
dataset. The sentences were tokenized using each model’s specific tokenizer from HuggingFace
Transformers. All models were fine-tuned with a batch size of 8, for 5 training epochs, a learning rate
of 2e-5, and a weight decay of 0.01. A linear learning rate scheduler with 0 warmup steps was used
on BERT and RoBERTa, using Adam as the optimizer and incorporating weight decay directly to
improve regularization and prevent overfitting. The final model, DeBERTa, was selected because it
produced the highest macro F1-score on the training and validation datasets.
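The shared fine-tuning configuration can be summarized as follows (a sketch collecting the hyperparameters reported above; in practice these would be passed to a HuggingFace `TrainingArguments` object, and the warmup/scheduler settings were reported only for BERT and RoBERTa):

```python
# Hyperparameters shared across the fine-tuned candidate models, as
# reported above. This is an illustrative summary, not the exact
# configuration object used in the experiments.
FINE_TUNING_CONFIG = {
    "per_device_train_batch_size": 8,
    "num_train_epochs": 5,
    "learning_rate": 2e-5,
    "weight_decay": 0.01,       # applied directly via the optimizer
    "lr_scheduler_type": "linear",
    "warmup_steps": 0,          # reported for BERT and RoBERTa
}
```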
3.2. System Experiments
Our cascade model approach was developed by fine-tuning two DeBERTa models in sequence,
thereby realizing the division of the challenge into two sub-tasks. In both cases, we used the same
experimental settings described in the preliminary experiments section.
First, we transformed each pair of attained and constrained labels into a presence label, understanding
presence as an OR operation between the two labels. The DeBERTa model was fine-tuned for multi-label
classification of the 19 human values, trained on the task training dataset and validated on the task
validation dataset to evaluate the effectiveness of this subsystem alone at detecting the presence of
human values. This step answers the first sub-task of the challenge with significantly reduced
complexity, as the output space is 19-dimensional instead of 38-dimensional, which translates into a
smaller number of possible label combinations.
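This label transformation can be sketched as follows (the label-name format here is illustrative and may differ from the actual dataset fields; the core idea is simply an OR over each attained/constrained pair):

```python
def to_presence_labels(stance_labels):
    """Collapse 38 'value attained'/'value constrained' indicators into
    19 presence indicators via a logical OR over each pair.

    `stance_labels` maps e.g. 'Hedonism attained' -> 0/1. The naming
    convention is illustrative, not the dataset's actual field names."""
    presence = {}
    for label, active in stance_labels.items():
        value = label.rsplit(" ", 1)[0]  # strip ' attained'/' constrained'
        presence[value] = presence.get(value, 0) | int(active)
    return presence
```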
Second, subsystem 2 receives the results of subsystem 1 as inputs and applies a natural language
inference approach, where each sentence is considered a premise, each human value label a hypothesis,
and “attained” and “constrained” the class labels. With this technique, the model tries to determine a
logical entailment relationship between each pair of sequences. This inference establishes the stance
of the sentence toward each human value, which answers sub-task 2 of the proposed challenge.
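The transformation of sentence-level annotations into premise-hypothesis pairs can be sketched as follows (field names are illustrative, not the dataset's actual schema):

```python
def build_nli_dataset(examples):
    """Flatten sentence-level annotations into premise-hypothesis pairs
    for subsystem 2.

    `examples` is a list of (sentence, {value: stance}) items, where the
    stance is 'attained' or 'constrained'. Each human value present in a
    sentence yields one pair: the sentence is the premise and the value
    is the hypothesis. Field names here are illustrative."""
    pairs = []
    for sentence, annotations in examples:
        for value, stance in annotations.items():
            pairs.append({"premise": sentence,
                          "hypothesis": value,
                          "label": stance})
    return pairs
```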
Finally, it is important to note that, in order to adjust the predictions of our cascade approach to
the format required by the shared task, we had to make one small modification to our system. While our
system is conceived to apply the second model only to those values found to be present in the text,
the format required to participate in both tasks11 meant that, in order to produce our results file, we
applied the subsystem 2 model to every sentence-value pair, rather than only to the values predicted
to be present in the sentence. To ensure that values detected as absent remain below the 0.5 threshold
used by the evaluator to determine that a value is not present, in those cases in which the value was
not predicted by the first model, we multiply the second model’s prediction score by the first model’s
prediction score divided by two.
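This adjustment can be sketched as a small scoring function (a sketch under the assumptions above; the function name and signature are illustrative). Since the presence score is below 0.5 whenever the value is not predicted, and the stance score is at most 1, the damped result stays below 0.25 and therefore under the evaluator's threshold:

```python
def adjusted_stance_score(presence_score, stance_score, threshold=0.5):
    """Combine the two model scores for the submission file.

    If subsystem 1 predicts the value as present (score >= threshold),
    the subsystem 2 score is kept as-is; otherwise it is damped by
    presence_score / 2, which keeps the result below 0.25 and thus
    under the evaluator's 0.5 presence threshold."""
    if presence_score >= threshold:
        return stance_score
    return stance_score * presence_score / 2
```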
4. Results
In our preliminary experiments, our models were trained and evaluated with the provided training
and validation datasets, generating an individual F1-score for every human value label and an overall
Macro F1-score, which were used to compare the effectiveness of the different models. The model with
8 https://huggingface.co/microsoft/deberta-base
9 https://huggingface.co/google/electra-base-discriminator
10 https://huggingface.co/xlnet/xlnet-base-cased
11 Only one file had to be submitted for both tasks, with 38 columns for the 38 labels (i.e., 19 human value pairs). Task 1
was evaluated based on the sum of the values in the attained and constrained columns of each value (which should be larger
than 0.5 if the value is present), and task 2 was evaluated based on which of the two columns (‘attained’ or ‘constrained’)
had the larger value. The organizers recommended avoiding setting the same number for both attained and constrained,
even if our system predicted that the value was not referenced in the text.
Table 1
Achieved F1-score (shown ×100) of each submission on the test dataset for subtask 1. A ✓ indicates that the
submission used the automatic translation to English. Submissions: (a) philo-of-alexandria (our approach, ✓);
(b) valueeval24-bert-baseline-en (✓); (c) valueeval24-random-baseline; (d) valueeval24-random-baseline (✓).
Submissions (b)–(d) are baselines.

Value                        (a)   (b)   (c)   (d)
All                          28    24    06    06
Self-direction: thought      08    00    02    02
Self-direction: action       22    13    07    07
Stimulation                  27    24    05    05
Hedonism                     31    16    02    02
Achievement                  35    32    11    11
Power: dominance             31    27    08    08
Power: resources             34    35    10    10
Face                         17    08    04    03
Security: personal           33    24    05    04
Security: societal           40    40    13    14
Tradition                    47    46    03    03
Conformity: rules            42    42    11    11
Conformity: interpersonal    09    00    03    03
Humility                     00    00    00    00
Benevolence: caring          21    18    04    05
Benevolence: dependability   28    22    04    04
Universalism: concern        40    37    09    09
Universalism: nature         57    55    04    04
Universalism: tolerance      21    02    02    02
the highest effectiveness was found to be DeBERTa with a Macro F1-Score of 0.20. However, while
DeBERTa presented the highest Macro F1-score, some models achieved higher individual F1-scores
for some human values: BERT was better on ‘tradition attained’; RoBERTa on ‘achievement attained’,
‘security: societal constrained’, ‘universalism: concern attained’, and ‘universalism: nature attained’;
Electra on ‘power: dominance attained’, ‘power: resources constrained’, ‘security: societal attained’,
‘universalism: concern attained’, and ‘universalism: concern constrained’; and XLNet on ‘power:
resources attained’, ‘power: resources constrained’, ‘security: societal attained’, ‘conformity: rules
constrained’, ‘benevolence: dependability attained’, ‘universalism: concern attained’, and ‘universalism:
concern constrained’. These results could indicate that using a different model for each human value
could be an interesting approach. As DeBERTa was selected as the best overall model, our system was
developed using two cascade DeBERTa models.
Table 1 shows the results of our system for subtask 1. As can be seen, our system outperforms all
baselines, including the BERT-based baseline, which it surpasses by 0.04 in terms of Macro F1-score. It
is interesting to note that both our approach and the BERT baseline generally perform similarly well
on the same values (such as ‘security: societal’, ‘tradition’, ‘conformity: rules’, and ‘universalism:
nature’), and similarly poorly on others (such as ‘self-direction: thought’, ‘conformity: interpersonal’,
and ‘humility’), while some other values show significant increases with our approach (such as
‘universalism: tolerance’ and ‘face’). Overall, our approach matches or outperforms the BERT baseline
for all values except ‘power: resources’.
Table 2 shows the results of our system for subtask 2. While our approach outperforms the BERT
baseline, the F1-score is only slightly higher (0.82 versus 0.81). Our approach outperforms the BERT
baseline on only 12 of the 19 possible values. Our model is best at predicting ‘hedonism’ and ‘benevolence:
caring’, and significantly worse than the baseline at predicting ‘humility’, a value on which our first
model also performed poorly.
5. Conclusions
This work proposes a system to address the challenge’s sub-tasks related to human value detection. Our
approach uses cascade DeBERTa models, where the first detects the presence of each human value and
the second detects whether the sentence attains or constrains the human values present in it. This
approach improves on the baseline’s effectiveness on the test dataset by 0.04 Macro F1 on sub-task 1
and by 0.01 on sub-task 2. These models were trained on a subset of 44,758 sentences in English,
validated on a subset of 14,904 sentences, and tested on a separate subset of 14,569 sentences.
Table 2
Achieved F1-score (shown ×100) of each submission on the test dataset for subtask 2. A ✓ indicates that the
submission used the automatic translation to English. Submissions: (a) philo-of-alexandria (our approach, ✓);
(b) valueeval24-bert-baseline-en (✓); (c) valueeval24-random-baseline; (d) valueeval24-random-baseline (✓).
Submissions (b)–(d) are baselines.

Value                        (a)   (b)   (c)   (d)
All                          82    81    53    52
Self-direction: thought      85    83    55    51
Self-direction: action       80    79    49    47
Stimulation                  85    86    52    54
Hedonism                     91    88    54    52
Achievement                  86    84    52    53
Power: dominance             79    77    56    55
Power: resources             80    80    56    53
Face                         78    74    50    52
Security: personal           85    84    48    52
Security: societal           80    81    54    50
Tradition                    82    78    50    54
Conformity: rules            77    78    54    53
Conformity: interpersonal    78    79    55    49
Humility                     77    87    61    45
Benevolence: caring          93    89    55    53
Benevolence: dependability   89    86    51    56
Universalism: concern        84    85    48    52
Universalism: nature         83    81    51    49
Universalism: tolerance      79    78    51    56
Future work could involve implementing a separate detection model for each human value, adapting
each model to the value’s characteristics depending on which model performs best in each case.
Considering the complexity and subtlety of this task, adding linguistic and statistical features to the
texts could enrich their context and improve the effectiveness of the models.
Acknowledgments
Work for this paper was conducted as part of the PhD Program in Computer Science at the Universitat
Politècnica de València. The work of Mariona Coll Ardanuy and Paolo Rosso was funded by the research
project FairTransNLP, grant PID2021-124361OB-C31, funded by MCIN/AEI/10.13039/501100011033 and
by ERDF, EU A way of making Europe.
References
[1] S. H. Schwartz, J. Cieciuch, M. Vecchione, E. Davidov, R. Fischer, C. Beierlein, A. Ramos, M. Verkasalo,
J.-E. Lönnqvist, K. Demirutku, et al., Refining the theory of basic individual values, Journal of
personality and social psychology 103 (2012) 663.
[2] J. Kiesel, Ç. Çöltekin, M. Heinrich, M. Fröbe, M. Alshomary, B. D. Longueville, T. Erjavec, N. Handke,
M. Kopp, N. Ljubešić, K. Meden, N. Mirzakhmedova, V. Morkevičius, T. Reitis-Münstermann,
M. Scharfbillig, N. Stefanovitch, H. Wachsmuth, M. Potthast, B. Stein, Overview of Touché 2024:
Argumentation Systems, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. D.
Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets
Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Confer-
ence of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science, Springer, Berlin
Heidelberg New York, 2024.
[3] The ValuesML Team, Touché24-ValueEval, 2024. doi:10.5281/zenodo.10663363.
[4] P. He, X. Liu, J. Gao, W. Chen, Deberta: Decoding-enhanced bert with disentangled attention, in:
International Conference on Learning Representations, 2020.
[5] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Fun-
towicz, et al., Transformers: State-of-the-art natural language processing, in: Proceedings of the
2020 conference on Empirical Methods in Natural Language Processing: system demonstrations,
2020, pp. 38–45.
[6] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers
for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019
Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational
Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423.
[7] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,
Roberta: A robustly optimized bert pretraining approach, 2019. arXiv:1907.11692.
[8] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, Electra: Pre-training text encoders as discriminators
rather than generators, in: International Conference on Learning Representations, 2019.
[9] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, Q. V. Le, Xlnet: Generalized autoregressive
pretraining for language understanding, Advances in neural information processing systems 32
(2019).