=Paper=
{{Paper
|id=Vol-3740/paper-324
|storemode=property
|title=Edward Said at Touché: Human Value Detection Using Transformers and Upsampling
|pdfUrl=https://ceur-ws.org/Vol-3740/paper-324.pdf
|volume=Vol-3740
|authors=Aisha Nur Aydin,Shaden Shaar,Claire Cardie
|dblpUrl=https://dblp.org/rec/conf/clef/AydinSC24
}}
==Edward Said at Touché: Human Value Detection Using Transformers and Upsampling==
Notebook for the Touché Lab at CLEF 2024
Aisha Nur Aydin, Shaden Shaar and Claire Cardie
Cornell University
Abstract
In this paper, we tackle both subtasks of the Human Value Detection shared task at Touché, which aims to classify dialogue utterances into one or more of 19 human values determined by Schwartz's Refined Theory of Basic Individual Values. We fine-tune models such as DeBERTa and RoBERTa with an F1-based loss to handle the multi-label setting. We additionally test different sampling strategies to address data imbalance. We find that by training on the English-translated utterances, we beat the baselines by at least 2 F1 points on both subtasks.
Keywords
human-value-classification, Touché, CLEF
1. Introduction
The Human Value Detection task aims to identify the values humans express through their words. The task covers eight languages, so values must be identified in a multilingual context. Once a value is identified, it is further classified as attained, constrained, or neither. The first subtask asks whether a sentence contains a specific human value; the second subtask identifies whether the sentence attains or constrains a given set of human values [1].
We fine-tuned pre-trained RoBERTa and DeBERTa models on data from the ValuesML dataset.1 We observed that class imbalance in the data could hurt classification of the human values, so we tried various upsampling configurations to address this issue. We found that upsampling the lowest-performing categories fourfold yields the best results. As in other tasks (e.g., emotion detection [2]), we use only English models and train on the English and English-translated utterances. Using the same model for both tasks, we were able to beat all the proposed baselines.
2. Background
Schwartz's Refined Theory of Basic Individual Values defines nineteen human values [3]. The taxonomy covers a wide range of values, and since human values can be addressed implicitly, it can be challenging to identify which value is expressed in a given text [4]. In the 2023 human value detection task, each argument consisted of three components: a conclusion, a stance, and a premise.
Here is an example argument from the 2023 Human Values Task:
Conclusion: We should ban human cloning
Stance: in favor of
Premise: We should ban human cloning as it will only cause huge issues when you have a bunch of the same humans running around all acting the same.
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
$ ana72@cornell.edu (A. N. Aydin); ss2753@cornell.edu (S. Shaar); ctc9@cornell.edu (C. Cardie)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 ValuesML project
This year, the task takes a sentence as input and outputs its human values together with whether each is (even partially) attained, constrained, or neither. A value is attained if the text supports it and constrained if the text hinders it. This required us to treat the problem as a multi-class, multi-label classification problem with 38 labels: each of the 19 human values in either its attained or constrained form.
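As an illustration (a sketch, not the paper's actual code), the 38-label space can be built from the 19 values in Schwartz's taxonomy:

```python
# Illustrative sketch (not from the paper's codebase): constructing the
# 38-label space from the 19 Schwartz values, each paired with an
# "attained" and a "constrained" form.
VALUES = [
    "Self-direction: thought", "Self-direction: action", "Stimulation",
    "Hedonism", "Achievement", "Power: dominance", "Power: resources",
    "Face", "Security: personal", "Security: societal", "Tradition",
    "Conformity: rules", "Conformity: interpersonal", "Humility",
    "Benevolence: caring", "Benevolence: dependability",
    "Universalism: concern", "Universalism: nature", "Universalism: tolerance",
]

# Each value contributes two labels, so 19 * 2 = 38 classes in total.
LABELS = [f"{value} {polarity}" for value in VALUES
          for polarity in ("attained", "constrained")]
```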
An example sentence with a human value is the following. The input is:
"Young women examining their options for third level education have been urged to consider
careers in science, technology, engineering or maths (STEM)."
The value referred to in this sentence (the output) is Achievement Attained.
These sentences are provided by the ValuesML dataset, which contains texts and their respective human values drawn from news articles and political texts. The data consists of 3000 texts in eight languages; 20% is used for validation, 20% for testing, and the remaining 60% for training.
3. System Overview
For all our experiments, we collapse the multilingual dataset into a single English one by keeping only the English utterances. We use RoBERTa-Large, specifically "FacebookAI/roberta-large", as the base model for our submission. We first experimented with the original RoBERTa and DeBERTa models but found their F1 scores on the validation set lower than those of the BERT baseline provided by the task organizers. We then fine-tuned RoBERTa-Large and DeBERTa-Large and observed improved results. Despite the improvement, the overall F1 score still appeared to be affected by class imbalance: human values that appeared less frequently in the dataset had noticeably lower F1 scores than those that appeared more frequently.
We attempted multiple upsampling configurations and settled on one in which the human values with the lowest performance metrics were upsampled by a factor of four.
To determine which human values to upsample, we fine-tuned the RoBERTa-Large model on the training data and analyzed the evaluation results. If a value's subtask-1 F1 score was 0.15 or less, we upsampled both the attained and constrained forms of that value. We also examined the subtask-2 metrics to decide whether to upsample only one of the two forms: if the recall of one form was 50% or less of its counterpart's, we upsampled only the underperforming form. For example, if the recall of the attained form of a human value was at most half that of the constrained form, we upsampled only the attained form, and vice versa. Based on these metrics and the unevenness of the dataset, we upsampled the following values by a factor of four:
• Self-direction: thought constrained
• Self-direction: action constrained
• Humility attained
• Humility constrained
• Face attained
• Face constrained
• Benevolence: caring constrained
• Benevolence: dependability constrained
• Universalism: tolerance attained
• Universalism: tolerance constrained
• Conformity: interpersonal attained
• Conformity: interpersonal constrained
• Tradition constrained
• Power: dominance constrained
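The selection heuristic described above can be sketched as follows. This is an illustrative re-implementation, not our exact code; the metric dictionaries (`subtask1_f1`, `recall`) are hypothetical stand-ins for the evaluation output:

```python
# Illustrative sketch of the upsampling selection rule: values with a
# subtask-1 F1 of at most `f1_threshold` are candidates; if one polarity's
# subtask-2 recall is at most half of its counterpart's, only that
# underperforming polarity is upsampled, otherwise both are.
def select_for_upsampling(subtask1_f1, recall, f1_threshold=0.15):
    selected = []
    for value, f1 in subtask1_f1.items():
        if f1 > f1_threshold:
            continue  # value performs well enough; no upsampling
        r_att = recall[(value, "attained")]
        r_con = recall[(value, "constrained")]
        if r_att <= 0.5 * r_con:
            selected.append((value, "attained"))
        elif r_con <= 0.5 * r_att:
            selected.append((value, "constrained"))
        else:
            selected.extend([(value, "attained"), (value, "constrained")])
    return selected
```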
Figure 1 illustrates the impact of upsampling on the dataset, showing the distribution of data before
and after upsampling.
4. Experimental Setup
Figure 1: Comparison of data distribution before and after upsampling. Left: before upsampling; right: after upsampling.

To fine-tune the model, we set the learning rate to 2e-5, used a warm-up ratio of 0.2, set the batch size to 8, and trained for four epochs. We set the random seed to 42, used the AdamW optimizer, and used a linear scheduler. We used one A100 GPU to run the experiments for the final submitted models. Our
experiments used pre-trained RoBERTa [5] and DeBERTa [6] models. For evaluation, we use Precision,
Recall, and the macro F1-score. We fine-tuned four models: RoBERTa-Large and DeBERTa-Large on the upsampled data, and RoBERTa-Large and DeBERTa-Large on the original data with no upsampling. For our final submission, we chose the model fine-tuned on the upsampled data with RoBERTa-Large as the base, as it had the best metrics among all four models on the validation dataset. All four models can be found on Hugging Face.2
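The linear schedule with warm-up used above can be sketched as follows; this is an illustrative re-implementation (not the exact scheduler code) under the stated hyperparameters:

```python
# Illustrative sketch of a linear learning-rate schedule with warm-up:
# the rate ramps linearly from 0 to base_lr over the first `warmup_ratio`
# fraction of training, then decays linearly back to 0 by the final step.
def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.2):
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / warmup_steps       # linear warm-up
    # linear decay from base_lr (end of warm-up) to 0 (last step)
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)
```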
Table 1
Achieved F1 -score of each submission on the test dataset for subtask 1. A ✓ indicates that the submission used
the automatic translation to English. Baseline submissions shown in gray.
F1-score columns: Benevolence: dependability · Conformity: interpersonal · Universalism: tolerance · Self-direction: thought · Universalism: concern · Universalism: nature · Self-direction: action · Benevolence: caring · Power: dominance · Conformity: rules · Security: personal · Security: societal · Power: resources · Achievement · Stimulation · Hedonism · Tradition · Humility · Face · All
Submission EN
muted-glacier-2024-05-07-02-06-56 (RoBERTa large with Upsampling) ✓ 28 05 17 11 15 25 31 34 16 32 41 45 44 06 05 10 23 41 57 27
other-models-2024-07-05-04-47-24 (DeBERTa large with Upsampling) ✓ 26 03 14 26 18 30 12 20 21 16 43 52 46 06 07 09 13 36 58 19
other-models-2024-07-05-04-47-48 (DeBERTa large no Upsampling) ✓ 26 00 17 07 09 38 29 27 19 24 44 48 45 00 00 18 11 42 58 11
other-models-2024-07-05-04-48-29 (RoBERTa large no Upsampling) ✓ 23 00 12 04 26 27 26 18 07 18 41 39 44 00 00 16 04 39 57 06
valueeval24-bert-baseline-en ✓ 24 00 13 24 16 32 27 35 08 24 40 46 42 00 00 18 22 37 55 02
valueeval24-random-baseline 06 02 07 05 02 11 08 10 04 05 13 03 11 03 00 04 04 09 04 02
5. Results
The upsampling methods resulted in a 4% increase in the macro F1 score for subtask 1 and a 2% increase for subtask 2. Of the upsampled human values, "Benevolence: caring" and
2 https://huggingface.co/collections/aishanur/human-value-detection-668c4548607e863cc5cebd58
Table 2
Achieved F1 -score of each submission on the test dataset for subtask 2. A ✓ indicates that the submission used
the automatic translation to English. Baseline submissions shown in gray.
F1-score columns: Benevolence: dependability · Conformity: interpersonal · Universalism: tolerance · Self-direction: thought · Universalism: concern · Universalism: nature · Self-direction: action · Benevolence: caring · Power: dominance · Conformity: rules · Security: personal · Security: societal · Power: resources · Achievement · Stimulation · Hedonism · Tradition · Humility · Face · All
Submission EN
muted-glacier-2024-05-07-02-06-56 (RoBERTa large with Upsampling) ✓ 83 77 82 85 88 88 79 80 77 84 84 85 80 80 76 90 86 85 85 78
other-models-2024-07-05-04-47-24 (DeBERTa large with Upsampling) ✓ 84 81 80 84 92 89 77 82 78 85 86 89 81 76 68 92 89 86 84 79
other-models-2024-07-05-04-47-48 (DeBERTa large no Upsampling) ✓ 85 81 83 85 90 90 81 82 76 86 85 84 81 85 87 93 88 86 83 86
other-models-2024-07-05-04-48-29 (RoBERTa large no Upsampling) ✓ 84 81 83 86 93 89 80 80 78 86 84 84 81 87 84 90 89 85 82 84
valueeval24-bert-baseline-en ✓ 81 83 79 86 88 84 77 80 74 84 81 78 78 79 87 89 86 85 81 78
valueeval24-random-baseline 53 55 49 52 54 52 56 56 50 48 54 50 54 55 61 55 51 48 51 51
"Tradition" were the only ones whose subtask-1 F1 score did not improve. The other upsampled values showed improved F1 scores, some marginally and others substantially. For subtask 2, the upsampled values performed only marginally better, except for "Self-direction: thought" and "Humility", which had worse metrics, and "Benevolence: dependability" and "Universalism: tolerance", which stayed the same.
After the task deadline, we evaluated the other three models on the test data: DeBERTa-Large fine-tuned on the upsampled data, and DeBERTa-Large and RoBERTa-Large fine-tuned on the original training data. The results comparing all four models are in Table 1 and Table 2. The RoBERTa-Large model with upsampling performs better than the other three on subtask 1 but is the worst-performing model on subtask 2. The best-performing model for subtask 2 is DeBERTa-Large without upsampling.
6. Conclusion and Future Work
We were able to beat the BERT baselines by incorporating an F1-based loss function and upsampling lower-performing categories. This suggests that the data benefits from knowledge transfer across the different categories. In the future, we would also like to work with the other languages directly, as relying only on the English translations may have missed cultural nuances expressed in the original language.
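For reference, one common differentiable surrogate for the F1-based objective mentioned above looks like the following sketch; the exact loss formulation used in our experiments may differ.

```python
# Illustrative sketch of a soft (differentiable) macro-F1 loss: counts
# are computed from predicted probabilities rather than hard decisions,
# and the loss is one minus the mean per-label soft F1.
# probs[i][k]: predicted probability of label k for example i;
# targets[i][k]: gold label, 0 or 1.
def soft_macro_f1_loss(probs, targets, eps=1e-8):
    n_labels = len(probs[0])
    f1_sum = 0.0
    for k in range(n_labels):
        tp = sum(p[k] * t[k] for p, t in zip(probs, targets))
        fp = sum(p[k] * (1 - t[k]) for p, t in zip(probs, targets))
        fn = sum((1 - p[k]) * t[k] for p, t in zip(probs, targets))
        f1_sum += 2 * tp / (2 * tp + fp + fn + eps)
    # Minimising (1 - mean soft F1) pushes the mean soft F1 toward 1.
    return 1.0 - f1_sum / n_labels
```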
References
[1] J. Kiesel, Ç. Çöltekin, M. Heinrich, M. Fröbe, M. Alshomary, B. D. Longueville, T. Erjavec, N. Handke,
M. Kopp, N. Ljubešić, K. Meden, N. Mirzakhmedova, V. Morkevičius, T. Reitis-Münstermann,
M. Scharfbillig, N. Stefanovitch, H. Wachsmuth, M. Potthast, B. Stein, Overview of Touché 2024:
Argumentation Systems, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. D.
Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets
Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Confer-
ence of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science, Springer, Berlin
Heidelberg New York, 2024.
[2] S. Hassan, S. Shaar, K. Darwish, Cross-lingual emotion detection, in: N. Calzolari, F. Béchet,
P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo,
J. Odijk, S. Piperidis (Eds.), Proceedings of the Thirteenth Language Resources and Evaluation
Conference, European Language Resources Association, Marseille, France, 2022, pp. 6948–6958.
URL: https://aclanthology.org/2022.lrec-1.751.
[3] S. H. Schwartz, J. Cieciuch, M. Vecchione, E. Davidov, R. Fischer, C. Beierlein, A. Ramos, M. Verkasalo,
J.-E. Lönnqvist, K. Demirutku, et al., Refining the Theory of Basic Individual Values, Journal of
personality and social psychology 103 (2012). doi:10.1037/a0029393.
[4] J. Kiesel, M. Alshomary, N. Mirzakhmedova, M. Heinrich, N. Handke, H. Wachsmuth, B. Stein,
SemEval-2023 Task 4: ValueEval: Identification of Human Values behind Arguments, in: R. Kumar,
A. K. Ojha, A. S. Doğruöz, G. D. S. Martino, H. T. Madabushi (Eds.), 17th International Workshop on
Semantic Evaluation (SemEval 2023), Association for Computational Linguistics, Toronto, Canada,
2023, pp. 2287–2303. doi:10.18653/v1/2023.semeval-1.313.
[5] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,
RoBERTa: A robustly optimized BERT pretraining approach (2019). arXiv:1907.11692.
[6] P. He, X. Liu, J. Gao, W. Chen, DeBERTa: Decoding-enhanced BERT with disentangled attention
(2020). arXiv:2006.03654.