Eric Fromm at Touché: Prompts vs Fine-Tuning for Human Value Detection

Notebook for the Touché Lab at CLEF 2024

Ranjan Mishra (2), Meike Morren (1)
(1) Department of Marketing, School of Business and Economics, Vrije Universiteit Amsterdam
(2) Tinbergen Institute, Netherlands

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
674757rm@eur.nl (R. Mishra); meike.morren@vu.nl (M. Morren)
ORCID: 0000-0000-0000-0000 (R. Mishra); 0000-0001-6350-356X (M. Morren)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Human values are notoriously difficult to predict as they are often nuanced in nature, culturally embedded, and vary across geographies. Generative Large Language Models (LLMs) have become very powerful at mimicking how people use language, including value-laden content. We explore the opportunities of supervised fine-tuning and prompt engineering of LLMs to better perform a downstream task such as finding value-laden content in text. We compare fine-tuning, which relies heavily on labeled data, to the more flexible approach of prompt engineering, which requires little or no labeled data at all. Our goal in this paper is three-fold: 1) assess the capabilities of closed source (GPT-3.5, GPT-4o and Gemini) versus open source (Llama3) LLMs, 2) analyse the influence of domain-specific information by comparing fine-tuning with prompts, and 3) compare multi-label with single-label approaches.

Keywords

Generative AI, Prompt Engineering, Supervised Fine-Tuning

1. Introduction

In recent years, Generative AI (GenAI) has established itself as the state of the art in the field of Natural Language Processing (NLP) by enabling the creation of highly sophisticated models. These models, often referred to as large language models (LLMs), are trained on vast amounts of data, allowing them to deliver state-of-the-art performance on a wide range of NLP tasks across different domains. Given their flexibility in training, these LLMs can be used in different ways. One can fine-tune a pre-trained LLM using Supervised Fine-Tuning (SFT) with a curated dataset for a specific language modeling task. This yields an improved version of the model that performs the particular downstream task better. A more flexible and efficient alternative to fine-tuning is prompt engineering, where prompts in a specific format are issued to the LLM to generate a desired response. An effective prompt design can mitigate the need for extensive fine-tuning [1]. In our paper, we compare the two aforementioned approaches and their influence on the prediction of human values, and we compare this effect across different open and closed source models. We present our results at the CLEF Touché workshop [2].

2. Background

Higher-order constructs such as human values are likely picked up by transformer models 1 . One way to effectively use these pre-trained transformer models to capture nuances in texts containing human values is through fine-tuning. It involves taking a pre-trained model and adapting it to a specific language modeling task by further training on a smaller task-specific dataset. The idea is that, through fine-tuning, the new model retains the general linguistic features from the pre-training phase while improving performance on the specific task by adjusting the model's parameters.
1 https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html

Fine-tuning can be done in various ways, including supervised, unsupervised and semi-supervised approaches. For our case, we use supervised fine-tuning (SFT), which involves further training the model on a labelled dataset to adapt it to a new task. SFT involves selecting a relevant pre-trained model for the task, preparing a labeled dataset tailored to the task, and then extracting a new model with adjusted parameters that captures the specific nuances of the task. The main advantage of SFT is the improved performance on the specific task while also being highly resource efficient, requiring less data and computational capacity than training a new model from scratch.

In our paper, we use four models: two from OpenAI (GPT-3.5 and GPT-4o), Gemini-1.0-pro, and Llama3-70B-Instruct, all of which are based on the transformer architecture introduced by Vaswani et al. (2017) [3]. The main advantage of this architecture is its attention mechanism, which allows the models to focus selectively on different parts of the input text, enabling them to capture long-range dependencies and contextual relationships more effectively than previous architectures such as RNNs and LSTMs. This results in better handling of complex language structures and understanding of nuanced meanings. An extensive overview of the performance of these models across different tasks can be studied in their technical reports. Our choice of these particular models is influenced by our familiarity and domain knowledge, as well as the prospect of comparing closed source (GPT-3.5, GPT-4o, Gemini-1.0-pro) and open source (Llama3-70b-instruct) models.

To maximise the benefit from these capabilities of the LLMs, we integrate descriptions of human values directly into the prompts. By including the information about all possible values in the prompt but instructing the model to report only one value per sentence, we ensure that the model assigns a single value to each sentence. Prompting is quite sensitive to the information provided, meaning even small changes in prompts can lead to significantly different results, which emphasizes the importance of an effective prompt design [1]. Therefore, we focus on how the information provided in the prompts influences the prediction of human values. The informed zero-shot multi-label (ML) prompt (see Appendix A.1) includes both the task of identifying human values and the descriptions of the values given in the coding manual [4]. We also use single-label (SL) prompting, where we give a description of only one value (see Appendix A.2), which allows us to obtain multiple values per sentence. Prompting also allows us to give examples along with this description, which forms a bridge between fine-tuning (giving many examples) and zero-shot prompting (giving only a description). We apply few-shot SL prompting by carefully selecting examples from the training set so that the model learns to distinguish between positive and negative examples for each value (see Appendix A.3). After some experimentation, we settled on 3 positive and 3 negative examples. The example sentences are selected from a dataset of sentences based on the words they have in common with the value-labeled sentences. Before matching sentences based on words, we remove the words that are common across all sentences (see Appendix B.2). For the negative examples, we select sentences that are a) randomly drawn from the sentences not annotated with the focal value, b) annotated with a related value adjacent in the circle, or c) annotated with an opposed value. This way, we aim to show the model targeted information on what does and does not constitute a value.
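To make the few-shot single-label setup concrete, the sketch below shows how such a prompt could be assembled and sent to a chat model. It is a minimal illustration assuming the openai Python client; the helper names, the example sentences, and the choice to pass the sentence as a separate user turn are our own placeholders for illustration, not the exact pipeline used in the experiments.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable

def build_few_shot_prompt(value, description, positives, negatives):
    """Assemble a single-label few-shot prompt in the style of Appendix A.3."""
    lines = [
        f"Assess if the text relates to {value}: {description}. "
        "Return 1 if it does, 0 if not. Here are some examples:"
    ]
    lines += [f"{sentence} : 1" for sentence in positives]  # positive examples
    lines += [f"{sentence} : 0" for sentence in negatives]  # negative examples
    return "\n".join(lines)

def classify(sentence, prompt, model="gpt-3.5-turbo"):
    """Ask the model for a 0/1 decision on a single sentence."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": sentence},
        ],
    )
    answer = response.choices[0].message.content.strip()
    return 1 if answer.startswith("1") else 0

# Placeholder value description and example sentences, purely for illustration.
prompt = build_few_shot_prompt(
    "SELF-DIRECTION-THOUGHT",
    "Freedom to cultivate one's own ideas and abilities",
    positives=["Citizens should be free to form their own opinions."] * 3,
    negatives=["The new tax rules take effect next year."] * 3,
)
print(classify("Pupils should be encouraged to question what they read.", prompt))
```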
3. Approach

From the training set, we selected sentences to be used as examples in prompts as well as labeled data for fine-tuning. Sentences with fewer than 15 characters are excluded from this selection, as they are less likely to be informative about a human value, reducing the training set to 44123 sentences. When we also remove the sentences labeled with 0.5, which might be less clear, our final training dataset contains 42210 sentences. Next, we remove stopwords (we augmented the nltk list with 124 words, see Appendix B.1), connector words (gensim), numbers (written both alphabetically and numerically), and tokens shorter than 2 characters. We keep hyphenated words as well as nouns, adjectives, and adverbs. On this subset we run a phrase model to identify frequently co-occurring words. From this final vocabulary of 24172 tokens, we identify the most frequent words occurring across all sentences (see Appendix B.2). Excluding these overall common words, we search per value for the most frequent words and match the negative and positive examples based on these words (a sketch of this selection step is given at the end of this section).

To explore various approaches to zero-shot and few-shot prompting and compare them with fine-tuning, we select a subset from the validation sample. For prompting, we selected at most 600 sentences per value, of which 300 were positive examples; the other 300 were divided among 4 sets of negative examples (2 sets of random negative examples, 1 set of related negative examples, and 1 set of opposed negative examples). If there were fewer than 300 positive examples, we selected all positive examples and matched them with an equally sized set of negative examples (divided across the random, opposed and related values). Since the negative examples could be labeled for values other than the focal value, the total subset contained more than 300 positive examples for some values. In total we have .. sentences in the validation subset used for testing (see Appendix A.3 and A.1). All of our models are tested on these subsamples from the original validation set.

To fine-tune the models, we used the training set to select sentences. We used the same approach as above, but with at most 240 positive examples per value (for SL), or 20 positive examples (for ML), to reduce the computational resources needed. This caps the dataset used for fine-tuning at 480 sentences. Again, we tested the models on the subsamples from the validation set.

For fine-tuning Gemini, we convert the training data into a jsonl format and use the Vertex AI API to initiate and run a fine-tuning job. When completed, the job returns evaluation metrics for the training data, which include the training loss, the token accuracy at a training step, and the number of predicted tokens at a training step 2 . These metrics can be visualised both via an API call and in the Vertex AI Dashboard. For fine-tuning with OpenAI using the Davinci model, we used the 480 sentences and labels with hyphens in between (so self-direction-thought). This resulted in fewer random responses, but there were still responses such as 'self-direction-direction-thought', '-thoughtominityetal' or 'freedom-dominance-th'.

After performing the fine-tuning jobs for both single- and multi-label, we evaluate their performance on the validation set.
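The example-selection step described above can be summarised in code. The following is a simplified sketch assuming the nltk stopword list and a gensim phrase model, as mentioned in the text; the extra stopwords, the thresholds, and the tiny corpus are placeholders, and the part-of-speech filtering and number-word removal are omitted for brevity.

```python
from collections import Counter

from gensim.models.phrases import Phrases
from gensim.parsing.preprocessing import STOPWORDS as GENSIM_STOPWORDS
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

# Stand-in for the 124 manually added stopwords (Appendix B.1).
EXTRA_STOPWORDS = {"also", "however"}
STOP = set(stopwords.words("english")) | set(GENSIM_STOPWORDS) | EXTRA_STOPWORDS

def clean(sentence):
    """Lowercase, drop stopwords, numbers and very short tokens (POS filter omitted)."""
    tokens = [t.lower().strip(".,;:!?") for t in sentence.split()]
    return [t for t in tokens if t not in STOP and not t.isdigit() and len(t) >= 2]

# Tiny illustrative corpus; the real pipeline runs over the whole training set.
sentences = [
    "Citizens should be free to form their own opinions.",
    "Free citizens form opinions about economic policy.",
]
corpus = [clean(s) for s in sentences]

# Phrase model that merges frequently co-occurring words into single tokens.
phrases = Phrases(corpus, min_count=1, threshold=1)
corpus = [phrases[doc] for doc in corpus]

# Words that are frequent across all sentences are excluded before the per-value
# frequency counts used to match positive and negative examples.
counts = Counter(token for doc in corpus for token in doc)
overall_common = {w for w, _ in counts.most_common(2)}
per_value_candidates = [w for w, _ in counts.most_common() if w not in overall_common]
print(per_value_candidates[:5])
```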
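For the Gemini fine-tuning job, the labeled sentences have to be written to a JSONL file before a Vertex AI supervised tuning job can be started. The sketch below illustrates this conversion under stated assumptions: the prompt wording and the input_text/output_text field names are placeholders for illustration, and the exact schema expected by the tuning API is given in the documentation referenced in footnote 2.

```python
import json

# Illustrative (sentence, label) pairs; our fine-tuning set contains at most 480 sentences.
train_examples = [
    ("Citizens should be free to form their own opinions.", "self-direction-thought"),
    ("The new tax rules take effect next year.", "neutral"),
]

with open("sft_train.jsonl", "w", encoding="utf-8") as f:
    for sentence, label in train_examples:
        record = {
            # Field names are assumed for illustration only; the exact schema expected
            # by the Vertex AI supervised tuning job is described in the documentation
            # referenced in footnote 2.
            "input_text": f"Assess which value relates to the text: {sentence}",
            "output_text": label,
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# The resulting file is then uploaded to Cloud Storage and passed to a Vertex AI
# supervised tuning job for gemini-1.0-pro; evaluation metrics (training loss, token
# accuracy) can be inspected afterwards via the API or the Vertex AI dashboard.
```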
4. Results

We present our validation set results and discuss the influence of different choices on model performance: 1) open vs closed source, and 2) fine-tuning vs prompting using single- and multi-label approaches. As can be seen in Table 1, our best performing model is the open source Llama3-70b-instruct with an overall F1-score of 0.70, 6 points higher than the best performing closed source models (gemini-1.0-pro SFT (SL) and GPT-4o few-shot (SL)). This signifies that even open-access models can deliver state-of-the-art performance on a task like human value detection that requires a nuanced and contextual understanding of language.

In comparing fine-tuning to prompting, we first analyse the single-label setting. Here, single-label fine-tuning of Gemini roughly matches the performance of the best performing prompting approaches, with Llama3 being the exception. This highlights that creating a fine-tuned model per single label and aggregating the models to predict all labels might give good results. However, prompting still seems to be the best performing approach for predicting single labels. In contrast, the multi-label approaches are the worst performing for both fine-tuning and prompting. For fine-tuning, this may be caused by a lack of sufficient training data for each value, preventing the model from properly learning the nuances in them. For prompting, we think this can be partly explained by the fact that these language models do not robustly make use of information presented in long input contexts [5]. We also see that, compared to zero-shot, few-shot approaches lead to a slight gain in performance, indicating the usefulness of including positive and negative examples for each value in prompt design. For GPT, it seems that the more recent models are more effective, with a higher F1 score for GPT-4o compared to GPT-3.5 (we only tested this for zero-shot single label). Adding positive and negative examples to the prompt increases the performance. When, on top of the examples, we add the context of the sentence (i.e. the three sentences preceding the labeled sentence in the text), the performance deteriorates slightly. Note that Llama3 has only been used in a zero-shot setting due to time constraints.

2 https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini-use-supervised-tuning

Table 1
Achieved F1-score of each trained model on the validation dataset for subtask 1. A ✓ in the EN column indicates that the submission used the automatic translation to English.
Columns (left to right): Validation Subset (model), EN, then F1-scores for All, Self-direction: thought, Self-direction: action, Stimulation, Hedonism, Achievement, Power: dominance, Power: resources, Face, Security: personal, Security: societal, Tradition, Conformity: rules, Conformity: interpersonal, Humility, Benevolence: caring, Benevolence: dependability, Universalism: concern, Universalism: nature, Universalism: tolerance.

GPT-3.5 zero-shot (ML) ✓ 38 32 33 42 59 69 32 32 38 32 31 63 30 33 33 32 32 33 32 32
GPT-4o zero-shot (ML) ✓ 48 38 38 44 54 64 52 46 36 59 49 55 36 35 38 49 35 56 79 37
GPT-3.5 Supervised Fine Tuning (SFT) (ML) ✓ 42 41 38 39 48 49 47 41 38 48 46 46 49 35 28 38 39 46 53 40
GPT-3.5 zero-shot (SL) ✓ 57 47 58 59 48 61 61 50 40 55 59 70 57 62 56 39 53 47 69 75
GPT-3.5 few-shot (SL) ✓ 63 41 53 71 72 62 64 59 59 58 57 76 67 59 60 55 61 66 78 75
GPT-4o few-shot (SL) ✓ 64 45 62 67 67 60 71 59 57 60 56 78 73 67 61 58 61 61 81 74
GPT-3.5 context zero-shot (SL) ✓ 58 48 57 64 46 62 66 35 29 55 60 71 70 64 56 39 57 71 73 72
GPT-3.5 context few-shot (SL) ✓ 62 45 52 72 76 62 43 54 54 60 58 74 68 61 57 53 61 78 73 73
gemini-1.0-pro Supervised Fine Tuning (SFT) (SL) ✓ 64 57 51 12 77 69 61 68 73 68 68 84 67 52 66 67 54 65 84 70
gemini-1.0-pro Supervised Fine Tuning (SFT) (ML) ✓ 21 15 13 05 35 32 23 24 05 35 14 38 33 08 22 22 10 17 24 39
llama3-70b-instruct zero-shot (SL) ✓ 70 49 67 67 61 75 76 72 75 65 69 85 73 70 58 75 75 76 91 78
llama3-70b-instruct zero-shot (ML) ✓ 26 12 24 17 24 37 23 13 14 25 19 50 38 00 36 25 17 24 52 48

Table 2
Achieved F1-score of each submission on the test dataset for subtask 1. A ✓ in the EN column indicates that the submission used the automatic translation to English. Baseline submissions shown in gray. Columns (Submission (test set), EN, F1-scores) as in Table 1.

GPT3.5 few shot (SL) ✓ 23 08 12 13 20 27 18 27 12 15 32 31 33 07 03 19 19 35 50 11
GPT-4o informed zero-shot (ML) ✓ 25 15 10 10 18 25 18 09 24 21 30 46 33 09 15 26 15 41 55 20
valueeval24-bert-baseline-en ✓ 24 00 13 24 16 32 27 35 08 24 40 46 42 00 00 18 22 37 55 02

If we zoom in on the values, we see that Llama3 performs well across all values, while the other generative LLMs perform worse. For instance, GPT-3.5 has a much lower F1 score across the board, except for tradition and universalism-nature. However, Llama3 also outperforms GPT-3.5 here, with very impressive F1 scores of 85 and 91, respectively. Some values are notoriously difficult to predict, such as self-direction: thought; even Llama3 was unable to achieve an F1 score higher than .49. Surprisingly, fine-tuning with Gemini proved to be very successful here and obtained an F1 of .57 for this value. We can hypothesize that for some values, such as self-direction: thought, fine-tuning leads to a better result as the model learns the nuances in the value through sufficient training examples, whereas for others an effective prompt design seems to give the best results. This also highlights the importance of combining these two approaches to achieve an overall better result. Table 2 shows that our ML model performs slightly better than the baseline model on the test set (C.4).
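The per-value scores reported in Appendices C-E can be computed along the following lines. This is a minimal sketch assuming binary 0/1 gold and predicted labels per value and macro averaging over the two classes; it is our reading of how such tables could be produced, not the exact evaluation script used for the experiments.

```python
from statistics import mean
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def score_value(y_true, y_pred):
    """Per-value scores in the format of the appendix tables (macro averaging assumed)."""
    return {
        "F1": f1_score(y_true, y_pred, average="macro"),
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "Recall": recall_score(y_true, y_pred, average="macro"),
        "N": len(y_true),
    }

# Toy gold/predicted labels for one value; in practice the predictions come from the
# single-label prompts or the fine-tuned models on the validation subset.
gold = [1, 0, 1, 1, 0, 0]
pred = [1, 0, 0, 1, 0, 1]
print(score_value(gold, pred))

# The Mean / "All" row averages the per-value scores over the 19 values.
per_value_f1 = [0.49, 0.67, 0.91]  # placeholder values
print(mean(per_value_f1))
```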
As the SL predictions took a lot of time (for GPT-3.5, about 3-4 hours per value for the single-label model, and for Llama3 about 7-8 hours per value), we only include the results of our best performing GPT-3.5 model on the test set: the few-shot single-label prompt. Contrary to our expectations, this prompt did not do much better than our previous multi-label prompt using GPT-4o. The value self-direction: thought was not completely finished, which could explain its low F1 score here, but even without this one value, we do not see an improvement of SL-GPT3.5 over ML-GPT4o. Despite these results, the single-label few-shot model outperformed the multi-label model in our validation subsets. The most likely reason is that our validation subsets have very different distributions of values and words than the test set.

5. Discussion

In our paper, we looked at the capabilities of open and closed source models as well as the influence of fine-tuning and prompting with different single- and multi-label approaches. Based on our validation set, prompting gives the best results when trying to predict human values from text, signifying the importance of an effective prompt design, with few-shot approaches showing a slight gain in performance compared to zero-shot approaches. Given the limited time, further research could focus on the textual similarities and differences between the test set and our validation subset. Given enough computational capacity, we could also evaluate SL prompting approaches with GPT-4o, run Llama3 SL on the entire test set, and finally compare SL SFT for OpenAI with ML SFT and note the gain in performance, or lack thereof.

References

[1] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J. Spencer-Smith, D. C. Schmidt, A prompt pattern catalog to enhance prompt engineering with ChatGPT, arXiv preprint arXiv:2302.11382 (2023).
[2] J. Kiesel, Ç. Çöltekin, M. Heinrich, M. Fröbe, M. Alshomary, B. D. Longueville, T. Erjavec, N. Handke, M. Kopp, N. Ljubešić, K. Meden, N. Mirzakhmedova, V. Morkevičius, T. Reitis-Münstermann, M. Scharfbillig, N. Stefanovitch, H. Wachsmuth, M. Potthast, B. Stein, Overview of Touché 2024: Argumentation Systems, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2024.
[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[4] M. Scharfbillig, L. Smillie, D. Mair, M. Sienkiewicz, J. Keimer, R. Pinho Dos Santos, H. Vinagreiro Alves, E. Vecchione, L. Scheunemann, Values and Identities - a Policymaker's Guide, Technical Report KJ-NA-30800-EN-N, European Commission's Joint Research Centre, Luxembourg, 2021. doi:10.2760/349527.
[5] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, P. Liang, Lost in the middle: How language models use long contexts, Transactions of the Association for Computational Linguistics 12 (2024) 157-173.

A. Appendix: Prompts

A.1. Multi-Label

Assess which value relates to text. Follow description below in format VALUE: description.
SELF-DIRECTION–THOUGHT: Freedom to cultivate one's own ideas and abilities
SELF-DIRECTION–ACTION: Freedom to determine one's own actions
STIMULATION: Excitement, novelty, and change
HEDONISM: Pleasure and sensuous gratification
ACHIEVEMENT: Success according to social standards
POWER–DOMINANCE: Power through exercising control over people
POWER–RESOURCES: Power through control of material and social resources
FACE: Security and power through maintaining one's public image and avoiding humiliation
SECURITY–PERSONAL: Safety in one's immediate environment
SECURITY–SOCIETAL: Safety and stability in the wider society
TRADITION: Maintaining and preserving cultural, family, or religious traditions
CONFORMITY–RULES: Compliance with rules, laws, and formal obligations
CONFORMITY–INTERPERSONAL: Avoidance of upsetting or harming other people
HUMILITY: Recognizing one's insignificance in the larger scheme of things
BENEVOLENCE–DEPENDABILITY: Being a reliable and trustworthy member of the in-group
BENEVOLENCE–CARING: Devotion to the welfare of in-group members
UNIVERSALISM–CONCERN: Commitment to equality, justice, and protection for all people
UNIVERSALISM–NATURE: Preservation of the natural environment
UNIVERSALISM–TOLERANCE: Acceptance and understanding of those who are different from oneself
Return VALUE. If text reflects no value, return NEUTRAL.

A.2. Single Label

Assess if the text relates to UNIVERSALISM–TOLERANCE: Acceptance and understanding of those who are different from oneself. Return 1 if it does, 0 if not.

A.3. Few shot

Assess if the text relates to SELF–DIRECTION–THOUGHT: Freedom to cultivate one's own ideas and abilities. Return 1 if it does, 0 if not. Here are some examples:
Haimov explains that it is important for the child to be involved in the process, so that he understands that even if he is headed for a certain institution, sometimes it is not the right step for him. : 1
President Donald Trump says the US Supreme Court has not properly addressed mass election fraud. : 1
Stabilize eco-bonuses and support efficient district heating for upgrading and decarbonization of public and private heritage buildings.: 0
People who wanted to obtain information on the issue accelerated their research.: 0
This series of experiments is the first step in a multi-year experiment program of the Ministry of Defense (the directorate for research and development of the military and technological infrastructure - AB) and the defense industries to develop a land and air laser system to deal with threats at different ranges at high powers.: 0

B. Appendix: Words

B.1. Stopwords
B.2. Common words

Face All texts Hedonism Stimulation Achievement Power: resources Security: personal Power: dominance Self-direction: action Self-direction: thought
people children right development good last Israel market political water new school different order fun number power economic state safe time President Trump public really company Russian gas apology treatment country education important change moment companies Russia energy party way years already political technology children market US economy media security y year party issue education love already EU Russia Russian body government right several energy food system President euros public Mineral first freedom idea innovation speech way military production campaign important European free things young still well Ukraine money Prime beneficial l Minister state researchers business Many Israel police EU image good many information decision opportunities home good sanctions investment part risk countries action President work true work control way fact place even group name research little able political well never school world well way possible day best pressure sector role home also EU research future happy percent Turkey companies Ministry health

Humility Tradition Security: societal Conformity: rules Benevolence: caring Universalism: nature Universalism: concern Universalism: tolerance Conformity: interpersonal Benevolence: dependability
security Israel law EU everyone children together social energy different measures cultural rules relations much support support children climate differences order family Court Greece day family cooperation education green racism social children EU meeting important education Israel women emissions diversity health state decision talks humble companies President rights renewable society Israel national court cooperation American families well refugees use today police history case states night better NATO system change meeting system God legal Turkish situation child relations citizens areas together energy part order Turkey team important Turkey support environmental course protection Jewish right agreement thankful citizens solidarity right global political crisis Allah public never grateful students good work sustainable differently economic education work interest season health way school gas tolerance public heritage state chance anything workers members opportunities production issue state faith Ministry everyone whole free EU young development peace necessary language already together lot social Prime public carbon discrimination

C. Appendix: OpenAI
C.1. Zero-shot ML

Value F1 Accuracy Precision Recall N
Self-direction: thought 0.329032 0.500000 0.245192 0.500000 208
Self-direction: action 0.333663 0.501946 0.746032 0.501946 505
Stimulation 0.421372 0.541637 0.700067 0.541637 509
Hedonism 0.592675 0.644231 0.792135 0.644231 104
Achievement 0.692830 0.704829 0.729844 0.704829 583
Power: dominance 0.325175 0.500000 0.240933 0.500000 579
Power: resources 0.325581 0.500000 0.241379 0.500000 580
Face 0.389513 0.523870 0.668259 0.523870 163
Security: personal 0.326923 0.500000 0.242857 0.500000 105
Security: societal 0.313953 0.500000 0.228814 0.500000 590
Tradition 0.635952 0.677326 0.798913 0.677326 337
Conformity: rules 0.306147 0.500000 0.220613 0.500000 587
Conformity: interpersonal 0.331210 0.500000 0.247619 0.500000 105
Humility 0.333333 0.500000 0.250000 0.500000 40
Benevolence: caring 0.327212 0.500000 0.243176 0.500000 403
Benevolence: dependability 0.327623 0.500000 0.243631 0.500000 314
Universalism: concern 0.330317 0.500000 0.246622 0.500000 592
Universalism: nature 0.329567 0.500000 0.245787 0.500000 356
Universalism: tolerance 0.327869 0.500000 0.243902 0.500000 82
Mean 0.384208 0.531255 0.398725 0.531255 354.842105

C.2. Zero-shot ML (GPT4o)

Value F1 Accuracy Precision Recall N
Self-direction: thought 0.389098 0.528302 0.752475 0.528302 208
Self-direction: action 0.382206 0.523276 0.715813 0.523276 505
Stimulation 0.444883 0.552972 0.701315 0.552972 509
Hedonism 0.548611 0.615385 0.782609 0.615385 104
Achievement 0.641937 0.665530 0.710124 0.665530 583
Power: dominance 0.523553 0.572276 0.612637 0.572276 579
Power: resources 0.458125 0.523571 0.542262 0.523571 580
Face 0.365027 0.511822 0.623428 0.511822 163
Security: personal 0.590541 0.637800 0.732252 0.637800 105
Security: societal 0.499145 0.575752 0.645444 0.575752 590
Tradition 0.550340 0.616032 0.755430 0.616032 337
Conformity: rules 0.359992 0.513519 0.575959 0.513519 587
Conformity: interpersonal 0.351852 0.509434 0.750000 0.509434 105
Humility 0.386602 0.525000 0.756410 0.525000 40
Benevolence: caring 0.493737 0.576481 0.696442 0.576481 403
Benevolence: dependability 0.354696 0.512422 0.746774 0.512422 314
Universalism: concern 0.561806 0.619680 0.740923 0.619680 592
Universalism: nature 0.791186 0.793496 0.800323 0.793496 356
Universalism: tolerance 0.378788 0.523810 0.750000 0.523810 82
Mean 0.477480 0.573503 0.704769 0.573503 354.842105
C.3. Supervised Finetuning ML

Value F1 Accuracy Precision Recall N
Self-direction: thought 0.412444 0.472346 0.454635 0.472346 208
Self-direction: action 0.375717 0.511179 0.573454 0.511179 505
Stimulation 0.391322 0.484725 0.463307 0.484725 509
Hedonism 0.487179 0.519231 0.525641 0.519231 104
Achievement 0.493900 0.528693 0.538008 0.528693 583
Power: dominance 0.474970 0.503548 0.504364 0.503548 579
Power: resources 0.416076 0.508095 0.519784 0.508095 580
Face 0.383482 0.485919 0.460247 0.485919 163
Security: personal 0.337805 0.489651 0.406863 0.489651 105
Security: societal 0.484343 0.513368 0.515700 0.513368 590
Tradition 0.462425 0.467741 0.466663 0.467741 337
Conformity: rules 0.489845 0.507557 0.507981 0.507557 587
Conformity: interpersonal 0.351852 0.509434 0.750000 0.509434 105
Humility 0.285714 0.400000 0.222222 0.400000 40
Benevolence: caring 0.377595 0.471914 0.433128 0.471914 403
Benevolence: dependability 0.396299 0.500832 0.502480 0.500832 314
Universalism: concern 0.469198 0.496210 0.495305 0.496210 592
Universalism: nature 0.533835 0.537948 0.539046 0.537948 356
Universalism: tolerance 0.403372 0.510119 0.532381 0.510119 82
Mean 0.422493 0.495711 0.495327 0.495711 354.842105

C.4. Zero-shot SL

Value F1 Accuracy Precision Recall N
Self-direction: thought 0.475630 0.545320 0.591760 0.545320 208
Self-direction: action 0.581450 0.612597 0.654453 0.612597 505
Stimulation 0.586714 0.623173 0.677015 0.623173 509
Hedonism 0.484685 0.576923 0.770833 0.576923 104
Achievement 0.616815 0.649464 0.709813 0.649464 583
Power: dominance 0.614849 0.615789 0.615852 0.615789 579
Power: resources 0.506604 0.527143 0.531476 0.527143 580
Face 0.401412 0.529895 0.681777 0.529895 163
Security: personal 0.550000 0.610022 0.712781 0.610022 105
Security: societal 0.594653 0.601389 0.602546 0.601389 590
Tradition 0.709248 0.729035 0.794491 0.729035 337
Conformity: rules 0.573897 0.584789 0.585867 0.584789 587
Conformity: interpersonal 0.628830 0.659833 0.733563 0.659833 105
Humility 0.563636 0.625000 0.785714 0.625000 40
Benevolence: caring 0.399468 0.627181 0.446013 0.418121 403
Benevolence: dependability 0.534250 0.587951 0.652088 0.587951 314
Universalism: concern 0.473877 0.714247 0.482888 0.476164 592
Universalism: nature 0.690581 0.708903 0.761707 0.708903 356
Universalism: tolerance 0.753754 0.758929 0.771875 0.758929 82
Mean 0.565282 0.625662 0.661185 0.602128 354.842105
C.5. Few-shot SL

Value F1 Accuracy Precision Recall N
Self-direction: thought 0.417733 0.492323 0.484873 0.492323 208
Self-direction: action 0.534062 0.535459 0.535731 0.535459 505
Stimulation 0.715059 0.715224 0.719304 0.715224 509
Hedonism 0.720430 0.730769 0.770833 0.730769 104
Achievement 0.628770 0.647462 0.675343 0.647462 583
Power: dominance 0.645333 0.655932 0.687710 0.655932 579
Power: resources 0.559652 0.562381 0.563124 0.562381 580
Face 0.590816 0.604367 0.617737 0.604367 163
Security: personal 0.584018 0.617102 0.661250 0.617102 105
Security: societal 0.573713 0.596991 0.609961 0.596991 590
Tradition 0.764715 0.771529 0.793786 0.771529 337
Conformity: rules 0.675029 0.678283 0.675832 0.678283 587
Conformity: interpersonal 0.595561 0.599057 0.603175 0.599057 105
Humility 0.605003 0.625000 0.656740 0.625000 40
Benevolence: caring 0.556410 0.581805 0.601835 0.581805 403
Benevolence: dependability 0.612154 0.621321 0.630588 0.621321 314
Universalism: concern 0.658162 0.678858 0.728490 0.678858 592
Universalism: nature 0.788282 0.790734 0.798026 0.790734 356
Universalism: tolerance 0.754785 0.758333 0.765931 0.758333 82
Mean 0.630510 0.645417 0.662119 0.645417 354.842105

C.6. Few-shot SL (GPT4o)

Value F1 Accuracy Precision Recall N
Self-direction: thought 0.448520 0.546430 0.646784 0.546430 208
Self-direction: action 0.620388 0.650339 0.709828 0.650339 505
Stimulation 0.674399 0.689271 0.718614 0.689271 509
Hedonism 0.672856 0.701923 0.813253 0.701923 104
Achievement 0.600360 0.639865 0.712416 0.639865 583
Power: dominance 0.716752 0.717760 0.717817 0.717760 579
Power: resources 0.591579 0.601071 0.607982 0.601071 580
Face 0.578354 0.625602 0.714617 0.625602 163
Security: personal 0.600137 0.636710 0.699629 0.636710 105
Security: societal 0.561422 0.571412 0.573649 0.571412 590
Tradition 0.788847 0.795278 0.819482 0.795278 337
Conformity: rules 0.730796 0.739653 0.738435 0.739653 587
Conformity: interpersonal 0.670071 0.697932 0.789236 0.697932 105
Humility 0.615385 0.650000 0.734375 0.650000 40
Benevolence: caring 0.583410 0.628944 0.710506 0.628944 403
Benevolence: dependability 0.610371 0.644014 0.705616 0.644014 314
Universalism: concern 0.619124 0.656119 0.744511 0.656119 592
Universalism: nature 0.819764 0.821310 0.825753 0.821310 356
Universalism: tolerance 0.746444 0.761310 0.815374 0.761310 82
Mean 0.644683 0.672365 0.726204 0.672365 354.842105
C.7. Context zero-shot SL

Value F1 Accuracy Precision Recall N
Self-direction: thought 0.479109 0.550222 0.604604 0.550222 208
Self-direction: action 0.572290 0.600430 0.631968 0.600430 505
Stimulation 0.646528 0.678114 0.752841 0.678114 509
Hedonism 0.467637 0.567308 0.768041 0.567308 104
Achievement 0.625524 0.656131 0.714782 0.656131 583
Power: dominance 0.662955 0.665090 0.666266 0.665090 579
Power: resources 0.354369 0.542738 0.364602 0.361825 580
Face 0.296793 0.547741 0.505519 0.365161 163
Security: personal 0.550000 0.610022 0.712781 0.610022 105
Security: societal 0.601666 0.605324 0.605051 0.605324 590
Tradition 0.706533 0.726004 0.788529 0.726004 337
Conformity: rules 0.705278 0.714757 0.714418 0.714757 587
Conformity: interpersonal 0.644893 0.669086 0.726874 0.669086 105
Humility 0.563636 0.625000 0.785714 0.625000 40
Benevolence: caring 0.393815 0.625444 0.453447 0.416962 403
Benevolence: dependability 0.574656 0.612309 0.663650 0.612309 314
Universalism: concern 0.718488 0.722763 0.733290 0.722763 592
Universalism: nature 0.732942 0.744815 0.785144 0.744815 356
Universalism: tolerance 0.727657 0.735119 0.753205 0.735119 82
Mean 0.580251 0.642022 0.670038 0.611918 354.842105

C.8. Context few-shot SL

Value F1 Accuracy Precision Recall N
Self-direction: thought 0.454714 0.515908 0.527778 0.515908 208
Self-direction: action 0.517328 0.519965 0.520295 0.519965 505
Stimulation 0.722974 0.723089 0.727319 0.723089 509
Hedonism 0.760369 0.769231 0.815972 0.769231 104
Achievement 0.624210 0.644229 0.673806 0.644229 583
Power: dominance 0.433450 0.659516 0.461945 0.439677 579
Power: resources 0.548872 0.552381 0.553282 0.552381 580
Face 0.540455 0.555271 0.562351 0.555271 163
Security: personal 0.596154 0.626362 0.669426 0.626362 105
Security: societal 0.585175 0.606366 0.618702 0.606366 590
Tradition 0.746617 0.753717 0.774514 0.753717 337
Conformity: rules 0.683777 0.687529 0.684939 0.687529 587
Conformity: interpersonal 0.608245 0.609035 0.610235 0.609035 105
Humility 0.573333 0.600000 0.633333 0.600000 40
Benevolence: caring 0.535228 0.570406 0.596237 0.570406 403
Benevolence: dependability 0.618975 0.623290 0.626965 0.623290 314
Universalism: concern 0.649815 0.675845 0.741064 0.675845 592
Universalism: nature 0.785372 0.787972 0.795740 0.787972 356
Universalism: tolerance 0.730263 0.733929 0.740809 0.733929 82
Mean 0.616596 0.642844 0.649195 0.631274 354.842105

D. Appendix: GEMINI
D.1. Supervised Fine Tuning Gemini SL

Value F1 Accuracy Precision Recall N
Self-direction: thought 0.567442 0.557143 0.559633 0.575472 210
Self-direction: action 0.512097 0.528265 0.531381 0.494163 513
Stimulation 0.125874 0.521073 0.750000 0.068702 522
Hedonism 0.769231 0.798077 0.897436 0.673077 104
Achievement 0.690722 0.700000 0.712766 0.670000 600
Power: dominance 0.617594 0.641414 0.669261 0.573333 594
Power: resources 0.677054 0.620000 0.588670 0.796667 600
Face 0.608187 0.588957 0.590909 0.626506 163
Security: personal 0.731707 0.688679 0.652174 0.833333 106
Security: societal 0.680982 0.653333 0.630682 0.740000 600
Tradition 0.837349 0.843023 0.868750 0.808140 344
Conformity: rules 0.666667 0.673333 0.680556 0.653333 600
Conformity: interpersonal 0.524272 0.533333 0.540000 0.509434 105
Humility 0.666667 0.725000 0.846154 0.550000 40
Benevolence: caring 0.671264 0.652068 0.640351 0.705314 411
Benevolence: dependability 0.544170 0.598131 0.631148 0.478261 321
Universalism: concern 0.656881 0.688333 0.730612 0.596667 600
Universalism: nature 0.846939 0.833795 0.786730 0.917127 361
Universalism: tolerance 0.705882 0.695122 0.697674 0.714286 82
Mean 0.640672 0.654745 0.678099 0.649180 370.421053

D.2. Supervised Fine Tuning Gemini ML

Value F1 Accuracy Precision Recall N
Self-direction: thought 0.146341 0.938596 0.214286 0.111111 27
Self-direction: action 0.125000 0.901754 0.142857 0.111111 36
Stimulation 0.051282 0.935088 0.200000 0.029412 34
Hedonism 0.347826 0.947368 0.571429 0.250000 32
Achievement 0.325000 0.905263 0.333333 0.317073 41
Power: dominance 0.231579 0.871930 0.215686 0.250000 44
Power: resources 0.240964 0.889474 0.208333 0.285714 35
Face 0.057143 0.942105 0.111111 0.038462 26
Security: personal 0.354839 0.929825 0.282051 0.478261 23
Security: societal 0.141176 0.871930 0.125000 0.162162 37
Tradition 0.384615 0.943860 0.322581 0.476190 21
Conformity: rules 0.337662 0.910526 0.282609 0.419355 31
Conformity: interpersonal 0.088889 0.928070 0.068966 0.125000 16
Humility 0.222222 0.963158 0.333333 0.166667 18
Benevolence: caring 0.226415 0.928070 0.187500 0.285714 21
Benevolence: dependability 0.105263 0.940351 0.111111 0.100000 20
Universalism: concern 0.170213 0.931579 0.166667 0.173913 23
Universalism: nature 0.242424 0.956140 0.250000 0.235294 17
Universalism: tolerance 0.392157 0.945614 0.384615 0.400000 25
Mean 0.211047 0.924172 0.229270 0.223080 27.889

E. Appendix: LLAMA3

E.1. Zero Shot SL

Value F1 Accuracy Precision Recall N
Self-direction: thought 0.49 0.62 0.76 0.36 210
Self-direction: action 0.67 0.70 0.75 0.61 513
Stimulation 0.67 0.73 0.84 0.56 522
Hedonism 0.61 0.72 1.00 0.44 104
Achievement 0.75 0.72 0.69 0.83 600
Power: dominance 0.76 0.73 0.71 0.82 579
Power: resources 0.72 0.61 0.57 0.96 580
Face 0.75 0.76 0.80 0.71 163
Security: personal 0.65 0.69 0.77 0.56 106
Security: societal 0.69 0.57 0.54 0.96 590
Tradition 0.85 0.85 0.88 0.82 337
Conformity: rules 0.73 0.64 0.59 0.94 587
Conformity: interpersonal 0.70 0.70 0.71 0.70 105
Humility 0.58 0.68 0.82 0.45 40
Benevolence: caring 0.75 0.72 0.71 0.79 403
Benevolence: dependability 0.75 0.73 0.71 0.80 314
Universalism: concern 0.76 0.73 0.69 0.86 592
Universalism: nature 0.91 0.90 0.85 0.97 356
Universalism: tolerance 0.78 0.78 0.82 0.74 82
Mean 0.705 0.716 0.748 0.709 384.68
E.2. Zero Shot ML

Value F1 Accuracy Precision Recall N
Self-direction: thought 0.120000 0.920145 0.125000 0.115385 26
Self-direction: action 0.240000 0.931034 0.428571 0.166667 36
Stimulation 0.173913 0.931034 0.307692 0.121212 33
Hedonism 0.242424 0.954628 0.800000 0.142857 28
Achievement 0.368421 0.912886 0.388889 0.350000 40
Power: dominance 0.238806 0.907441 0.285714 0.205128 39
Power: resources 0.133333 0.929220 0.300000 0.085714 35
Face 0.137931 0.954628 0.400000 0.083333 24
Security: personal 0.247619 0.856624 0.158537 0.565217 23
Security: societal 0.193548 0.773140 0.127119 0.405405 37
Tradition 0.500000 0.952813 0.419355 0.619048 21
Conformity: rules 0.380952 0.952813 0.666667 0.266667 30
Conformity: interpersonal 0.000000 0.967332 0.000000 0.000000 15
Humility 0.363636 0.974592 0.666667 0.250000 16
Benevolence: caring 0.254545 0.925590 0.200000 0.350000 20
Benevolence: dependability 0.166667 0.963702 0.500000 0.100000 20
Universalism: concern 0.244444 0.876588 0.164179 0.478261 23
Universalism: nature 0.520000 0.956443 0.393939 0.764706 17
Universalism: tolerance 0.488889 0.958258 0.550000 0.440000 25
Mean 0.263954 0.926258 0.362228 0.289979 26.736842