<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Predicting Human Preferences using a Multi-head BERT Classifier</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Filip F. Andresen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Håkon L. Hyrve</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sander S. Løvaas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Informatics, University of Oslo</institution>
          ,
          <addr-line>Gaustadalléen 23B, 0373 Oslo</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This document presents our approach and findings for the "Preference Prediction and Explanation" shared task from the 2025 ELOQUENT lab, which centres on automatically judging the quality of LLM-generated texts across five criteria: relevance, naturalness, truthfulness, safety, and overall quality. We experiment with two main modeling strategies: a small classifier trained on the UltraFeedback dataset using a multi-headed BERT architecture, and a Direct Preference Optimization (DPO) model fine-tuned on the Tulu 3 SFT dataset with LoRA. Our results show that the classifier performs best, while the DPO-based model yields marginal improvements over the baseline for select criteria. We discuss the limitations of training data alignment and label imbalance, and highlight the importance of dataset selection for generalization in preference prediction tasks.</p>
      </abstract>
      <kwd-group>
<kwd>Human preference prediction</kwd>
        <kwd>Direct preference optimization</kwd>
        <kwd>LLM-as-judge</kwd>
        <kwd>Classification</kwd>
        <kwd>LLM</kwd>
        <kwd>NLP</kwd>
        <kwd>ML</kwd>
        <kwd>Language technology</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>In this paper, we explore different solutions for the first subtask. An example prompt with its
corresponding responses and criteria preferences can be seen in Figure 1. It should be noted that the
shared task uses accuracy as its overall scoring metric; as such, we use this metric in the current
work as well.</p>
      <p>One of the primary challenges in this task is the limited availability of task-specific training data,
which constrains the performance of models and results in outcomes that closely resemble baseline
performance. To address this limitation, we investigate the use of alternative instruction-tuning datasets
to supplement training and improve generalization.</p>
      <p>Additionally, we explore two distinct modelling approaches for preference prediction: (1) a Direct
Preference Optimization (DPO) strategy that learns from pairwise comparisons, and (2) a classifier-based
approach that directly predicts human preferences based on response features.</p>
      <p>Our codebase can be found in our repo2.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
Reinforcement learning with human feedback (RLHF) is used to align language models with human
preferences, and has had a positive effect on chat models such as ChatGPT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Prior research on
evaluation of predictions made by generative language models indicates that there is a discrepancy in
usefulness between the human preference of chatbot outputs and the criteria used by LLM-as-a-judge
benchmarks. Zheng et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] conclude that strong LLMs like GPT-4 can achieve an agreement rate of
over 80% on human preferences, suggesting that there is a basis for using LLMs for evaluation.
      </p>
      <p>
        In their paper, Lambert et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
] present the RewardBench dataset, which covers the criteria of chat,
reasoning, and safety, and is used to benchmark the performance of reward models. The study shows a
difference in performance across current reward models, covering models of different sizes, from 400
million to 70 billion parameters, and models trained either as classifiers or with Direct Preference
Optimization (DPO).
      </p>
      <p>
Another recent contribution to the field of LLM-as-a-judge is the UltraFeedback dataset [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This
is a large-scale, diversified AI feedback dataset which consists of generated completions to prompts and
scores annotated automatically by a GPT-4 model. Cui et al. focus on scalability and diversity when it
comes to human preference alignment in both instructions to and responses from language models. One
of their findings is that the agreement is higher between GPT-4 and the majority preference of the
human annotators than between GPT-4 and any single annotator. They explain this by the GPT-4
annotator generalizing well over aggregate human preferences in its predictions.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
<p>In this section, we outline the datasets relevant to the task and our experiments. Since task-specific
training data is limited, we opted to use two instruction-tuning datasets, namely UltraFeedback and
Tulu 3 SFT, alongside the task validation set.</p>
      <sec id="sec-3-1">
        <title>3.1. Task validation and test set</title>
        <p>
          A human-annotated dataset has been made by the organizers of the ELOQUENT CLEF shared task of
preference prediction and explanation [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. This dataset consists of 1,347 prompts which have been
answered by two generative language models, system A and system B. The answer pairs to each prompt
have in turn been evaluated by a human annotator, who labeled which response they prefer with
respect to the criteria relevance, naturalness, truthfulness, safety, and overall quality. If both responses
are deemed good, or both bad, the annotator can also label them as such, giving a total of four
possible labels for each criterion.
        </p>
        <p>The dataset is partitioned into a development split and a test split, with 99 and 1,248 items respectively.
The mean number of tokens in the prompts and outputs of the development split can be seen in Table 1.
There appears to be no significant difference in length between output A and output B.</p>
        <p>Furthermore, an overview of the label distribution is presented in Table 2. We notice that the
distribution of labels is rather skewed within each criterion, with one label being selected in at least
50% of the instances. For the criteria dominated by ‘both good’, this plausibly reflects the data domain
in general, indicating that most LLM outputs are both safe and truthful. The disparity between the ‘A’
and ‘B’ labels should, however, just be due to randomness in the ordering of the responses.</p>
        <p>The gold labels of the test data are withheld until the shared task is completed and are as such not
presented.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. UltraFeedback</title>
        <p>
          We trained our classifier using the UltraFeedback dataset from OpenBMB [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. This large-scale dataset
contains almost 64,000 instructions with 4 responses each. Each of these responses has in turn been
assessed on 4 quality criteria, totaling over 1 million labels. This feedback was automatically given by a
GPT-4 model.
        </p>
<p>The fields for each completion are defined as follows:
• Instruction following: LLMs should respond to humans without deviating from the requirements.
• Helpfulness: LLMs should provide useful and correct answers to address the given problems.
• Truthfulness: LLMs’ output should be grounded in the instructions and real-world knowledge, and avoid introducing any self-contradiction.
• Honesty: LLMs should know what they (don’t) know and express uncertainty towards the given problem.</p>
<p>In addition, the dataset has a separate Overall quality field, which scores the overall quality of the
completion. Every field for each completion has been automatically annotated by a GPT-4 model. The
dataset is thus to a large degree synthetic, in both text generation and annotation.</p>
<p>Contrary to the task dataset, these completions are scored numerically and considered one at a time.
To adapt these data into a usable training set, we had to convert them to the format of the task validation
set. This was done by forming pairs of completions for each prompt. We then extracted the corresponding
scores for the completions and compared each with its pair counterpart to establish whether one should
be favored over the other (i.e., be labeled A or B) or whether they were both good or both bad (Both or Neither).</p>
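<p>To make this conversion concrete, the following is a minimal sketch of the pairing logic for a single criterion. The score scale, the ‘good’ threshold, and the tie handling are illustrative assumptions rather than our exact rules.</p>
<preformat>
# Minimal sketch of the pairwise conversion for a single criterion. The score
# scale, the GOOD_THRESHOLD cut-off, and the tie handling are illustrative
# assumptions, not the exact rules used in our pipeline.
from itertools import combinations

GOOD_THRESHOLD = 6  # assumed cut-off on UltraFeedback's 1-10 rating scale

def pair_label(score_a: float, score_b: float) -> str:
    """Map two per-completion scores to a task-style preference label."""
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    # Tied scores: judge both completions jointly on quality.
    return "Both" if score_a >= GOOD_THRESHOLD else "Neither"

def make_pairs(prompt: str, completions: list[dict]) -> list[dict]:
    """Turn n scored completions for one prompt into n*(n-1)/2 labeled pairs."""
    return [
        {
            "prompt": prompt,
            "response_a": left["text"],
            "response_b": right["text"],
            "label": pair_label(left["score"], right["score"]),
        }
        for left, right in combinations(completions, 2)
    ]
</preformat>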
<p>As the two datasets did not contain the same fields, we experimented with which annotations
from the UltraFeedback dataset to use as proxies for the preference categories in the task set. We
found that the Truthfulness field, which both datasets share, worked well as a proxy, and we used the
UltraFeedback scores for this field directly for the corresponding target category. Although
UltraFeedback also has an Overall field for its completions, we found that averaging this with the
Helpfulness field improved results. For the remaining categories, the choice was less clear. We did
some testing to determine which fields to use for our target categories, both with single stand-in
fields and with combinations of multiple fields. We concede that this testing was not extensive
and that we mostly proceeded by intuition, ultimately choosing the average score of Instruction
following and Helpfulness for the Relevance category and Honesty for the Safety category. We found no good
proxy for Naturalness and as such labeled every pair as A being better.</p>
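<p>The resulting mapping from task criteria to UltraFeedback proxy fields can be summarized as a small lookup table, sketched below; the field keys and the averaging helper are hypothetical names for illustration.</p>
<preformat>
# Proxy fields we settled on for each task criterion (illustrative field keys).
CRITERION_PROXIES = {
    "relevance":    ["instruction_following", "helpfulness"],  # averaged
    "truthfulness": ["truthfulness"],                          # shared field
    "safety":       ["honesty"],
    "overall":      ["overall_score", "helpfulness"],          # averaged
    # "naturalness": no usable proxy; every pair was labeled "A".
}

def proxy_score(annotations: dict, criterion: str) -> float:
    """Average the proxy field scores standing in for one task criterion."""
    fields = CRITERION_PROXIES[criterion]
    return sum(annotations[f] for f in fields) / len(fields)
</preformat>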
        <p>Table 3 shows the final distributions of preferences in our converted dataset. We notice that the
majority classes are consistent with the validation set, but the percentage distributions of the labels differ.
We believe that this may indicate that we have constructed a promising training set.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Tulu 3 SFT</title>
        <p>
          We fine-tuned our decoder model using the Tulu 3 SFT dataset
(https://hf.co/datasets/allenai/tulu-3-sft-mixture) from Ai2 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. This is a large-scale instruction-tuning dataset consisting of more than 900,000 examples,
drawn from sources such as NoRobots (https://hf.co/datasets/HuggingFaceH4/no_robots) and CoCoNot
(https://hf.co/datasets/allenai/coconot). It is designed to improve the model’s ability to follow
instructions and provides prompts varied in complexity and form. The dataset contains the fields
Prompt, Chosen, and Rejected, which specify which of two completions to the prompt was ruled
preferable. Both responses to each prompt have been generated using a variety of models, such as
GPT-4 and the instruct versions of Gemma, Llama, and Mistral. The preferred response was selected
using an LLM judge.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology &amp; Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Baseline</title>
        <p>
          The task description proposes a baseline solution where the 8 billion parameter instruction-tuned model
from Meta’s Llama 3 herd [
          <xref ref-type="bibr" rid="ref9">9</xref>
] is prompted for its evaluation. The results of our run of this baseline
model can be found in Table 4, where we also report the expected score from always predicting the
majority preference label for each category. We include the majority-label scores based on the assumption
that the validation data give a representative picture of the test data, but we acknowledge that this might
not be the case, especially for the columns A and B, as the placement of each completion is randomized.
        </p>
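<p>Conceptually, the baseline can be sketched as below: the instruction-tuned model is asked, per criterion, which response is better. The prompt wording here is our illustrative assumption; the actual prompt is defined in the ELOQUENT lab repository.</p>
<preformat>
# Illustrative sketch of the generative baseline: prompt the instruction-tuned
# model to judge one criterion for one response pair. The prompt wording is an
# assumption; the actual prompt is defined in the ELOQUENT lab repository.
from transformers import pipeline

judge = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",
)

def judge_pair(prompt: str, response_a: str, response_b: str, criterion: str) -> str:
    messages = [{
        "role": "user",
        "content": (
            f"Prompt: {prompt}\n\nResponse A: {response_a}\n\n"
            f"Response B: {response_b}\n\n"
            f"Which response is better with respect to {criterion}? "
            "Answer with exactly one of: A, B, both good, both bad."
        ),
    }]
    out = judge(messages, max_new_tokens=8, do_sample=False)
    # Recent transformers versions return the chat with the reply appended.
    return out[0]["generated_text"][-1]["content"].strip()
</preformat>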
<p>As we can see from the results in Table 4, for two of the categories the generative approach performs
worse than randomly guessing a label. Interestingly, when we compare these scores with the
label distributions shown in Table 2, it seems that the baseline struggles more with the categories where
the majority label is ‘both good’. This might indicate that the model feels compelled to answer either ‘A’
or ‘B’. Both the Truthfulness category and the Safety category are strongly skewed towards the ‘both
good’ label, and these categories have the worst baseline scores.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Direct Preference Optimization</title>
        <p>
          We also tried an approach using Direct Preference Optimization (DPO) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. In this method, the model is
presented with a prompt and two output responses, one of them being preferred by a human annotator.
The model is then trained to align with the human preference. Unlike the task at hand, the outputs are
not chosen on the basis of any specific criterion. However, we hypothesized that by aligning the model
with general human preference, it may indirectly improve at predicting individual criteria as well.
        </p>
<p>For this task, we used the Tulu 3 SFT dataset. We fine-tuned the Llama-3.1-8B-Instruct model
(https://hf.co/meta-llama/Llama-3.1-8B-Instruct), which is the same model used in the baseline. Due to
the size of the model and our limited resources, we opted to use Low-Rank Adaptation (LoRA) [11]. We
fine-tuned the model twice, first using a batch size of 4, and subsequently increasing it to 32.</p>
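<p>For reference, the following is a minimal sketch of this fine-tuning setup using the TRL and PEFT libraries. The hyperparameters shown (LoRA rank and scaling, learning rate, epochs) are illustrative assumptions rather than our exact configuration, and the column handling for the dataset is simplified.</p>
<preformat>
# Minimal sketch of DPO fine-tuning with LoRA via TRL/PEFT. Hyperparameters
# are illustrative assumptions, and API details vary between TRL versions.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# DPOTrainer expects "prompt", "chosen" and "rejected" columns; the Tulu 3
# fields may need renaming to match.
dataset = load_dataset("allenai/tulu-3-sft-mixture", split="train")

peft_config = LoraConfig(
    r=16, lora_alpha=32,                  # assumed LoRA rank and scaling
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

args = DPOConfig(
    output_dir="llama31-8b-dpo-lora",
    per_device_train_batch_size=4,        # second run: effective batch of 32
    learning_rate=5e-6,                   # assumed
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,           # older TRL versions: tokenizer=
    peft_config=peft_config,              # LoRA adapter; no separate frozen reference copy needed
)
trainer.train()
</preformat>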
<p>With the fine-tuned models, the predictions were generated using the same code as the baseline
approach, which we adapted from the ELOQUENT lab GitHub repository
(https://github.com/eloquent-lab/eloquent-lab). Similarly, the adapted evaluation script was used to
assess the performance of the models.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Classifier</title>
<p>We also implemented a classifier approach. We based this classifier on the uncased base BERT model
(https://huggingface.co/google-bert/bert-base-uncased) [12]. As we had five categories with four possible
labels each, we built a multi-headed, multiclass classifier, where each head is responsible for one of the
five categories. Each of these five heads is a linear layer.</p>
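<p>The architecture can be sketched as follows: a shared BERT encoder whose [CLS] representation feeds five independent linear heads, each predicting one of the four labels for its criterion. This is a minimal sketch; the head naming and dropout rate are our illustrative choices.</p>
<preformat>
# Sketch of the multi-head classifier: a shared BERT encoder with one linear
# head (4 labels: A, B, Both, Neither) per criterion. Head naming is ours.
import torch
import torch.nn as nn
from transformers import AutoModel

CRITERIA = ["relevance", "naturalness", "truthfulness", "safety", "overall"]

class MultiHeadPreferenceClassifier(nn.Module):
    def __init__(self, model_name: str = "google-bert/bert-base-uncased",
                 num_labels: int = 4, dropout: float = 0.1):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.dropout = nn.Dropout(dropout)
        self.heads = nn.ModuleDict(
            {c: nn.Linear(hidden, num_labels) for c in CRITERIA}
        )

    def forward(self, input_ids, attention_mask):
        # The [CLS] token embedding summarizes the prompt + response pair.
        cls = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]
        cls = self.dropout(cls)
        return {c: head(cls) for c, head in self.heads.items()}
</preformat>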
<p>The training of this model consisted of passing each training pair through the BERT model, extracting
the [CLS] token representation, and calculating a loss for each head, which in turn updates that head’s
weights. The training data for this approach was the UltraFeedback data, converted to the task format
as laid out in Section 3.2.</p>
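<p>A simplified training step, reusing the model and CRITERIA list sketched above, then looks as follows; summing the per-head cross-entropy losses before a single backward pass is our reading of the procedure described above.</p>
<preformat>
# Simplified training step reusing the model and CRITERIA list sketched above:
# one cross-entropy loss per head, summed before a single backward pass.
import torch.nn.functional as F

def training_step(model, optimizer, batch):
    logits = model(batch["input_ids"], batch["attention_mask"])
    loss = sum(F.cross_entropy(logits[c], batch["labels"][c]) for c in CRITERIA)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
</preformat>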
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and discussion</title>
      <p>In this section, we present the results for both the DPO-finetuned model and the classifier. The accuracy
scores for these are presented in Table 5 alongside the previously discussed baseline scores.</p>
      <sec id="sec-5-1">
        <title>5.1. Direct Preference Optimization</title>
<p>The results of the DPO fine-tuning are presented in Table 6. Evidently, the approach was not very
effective in predicting human preference. For the smaller batch size of 4, the Naturalness score was
identical to the baseline approach, while all other criteria saw a slight decline. The model trained with
the larger batch size of 32 performed slightly better, with Naturalness and Truthfulness scores exceeding
those of the baseline, although Relevance remained lower.</p>
<p>None of the scores obtained diverged significantly from the baseline. It is conceivable that the slight
differences between the three models are merely a result of randomness. As such, the slightly
increased Naturalness and Truthfulness scores of the latter model cannot be confidently attributed to
the effectiveness of this approach.</p>
<p>The reason for the failure of our attempt is unclear: whether it is (1) that the method is
ill-suited, (2) that training was inadequate, or (3) that our hypothesis that general fine-tuning would
transfer to individual criteria does not hold. We believe it to be one or both of the latter two.</p>
<p>To further explore this method, one could try different models, possibly one of smaller
size, and fine-tune the whole model without using LoRA. More extensive hyper-parameter tuning could
also be conducted, as our experimentation has been limited.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Classifier</title>
<p>The classifier ended up producing the best results for all categories. The scores reported in Table 5
are the results of both using the training data in different ways, as discussed in Section 3.2, and doing
some hyper-parameter tuning, the results of which can be found in Table 7. As we see from these results,
neither the Naturalness nor the Safety score changed, even as we varied the hyper-parameters.
For Naturalness, this was expected, as we had labeled all training data as A, as seen in Table 3. For the
Safety score, we were more surprised, as the training data was distributed across all labels.</p>
<p>We were concerned that the relatively high performance of the classifier indicated overfitting, as
many of the classifier’s scores were close to the majority-label shares from Table 2. A closer examination
of the predictions revealed that 61 out of 99 response pairs were labeled ‘both good’ for all fields
except Naturalness, which was always predicted as ‘A’. Further inspection of the predictions with
other labels revealed no clear favoritism towards longer responses. Out of 36 ‘A’ label predictions, 23
were from pairs where response A was shorter than response B. However, out of 28 ‘B’ label
predictions, none were from pairs where response B was shorter than its A counterpart. We theorize that
these results are due to our model’s short context length, which we discuss further in Section 6.5. Still,
the fact that the classifier would occasionally assign labels other than the majority class shows that the
model was not a mode predictor. Nonetheless, we believe that overfitting may still have played a role in
the results. We explore this issue further in the following section.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Limitations and future work</title>
      <p>We experimented with several approaches, but were mostly unsuccessful in these endeavors. These
failures can broadly be attributed to one or a combination of the following pitfalls:</p>
      <sec id="sec-6-1">
        <title>6.1. Lack of data</title>
<p>Although both datasets used for training contain substantial amounts of preference evaluations, neither
was based on the same criteria as our current task. The UltraFeedback dataset included some common
criteria, but not all. Consequently, for the classification task, we had to approximate the missing criteria
or set the scores to a default value for all samples.</p>
<p>Moreover, the validation set for the task is relatively small, comprising only 99 samples. Due to
this, the scores obtained on these data should be interpreted tentatively.</p>
<p>We believe that our results using UltraFeedback show that there is potential in repurposing
data for new tasks. For example, to mitigate the lack of negative examples in the Safety part of the
training data, we considered using the CrowS-Pairs dataset, which consists of response pairs that are
labeled as more or less harmful. Another idea was to use the aforementioned NoRobots dataset, whose
responses could be seen as a gold standard, since they are human-made and of high quality.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Out-of-domain generalization</title>
<p>Since the models represented in the UltraFeedback and Tulu 3 SFT datasets might not be the same
as those behind the development data, there might be an out-of-domain generalization effect. This may
have introduced a domain shift and potentially decreased our system’s performance.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Model choice</title>
        <p>Our experiments with the BERT-based classifier and the DPO-based system were each executed
using only one model. The BERT model has 110 million parameters, making it a notably small model. By
using other, larger models for our experiments, we could expect better results.</p>
      </sec>
      <sec id="sec-6-4">
        <title>6.4. Overfitting</title>
<p>As previously mentioned, the official shared task uses accuracy as its main measurement. This led us
to focus mainly on accuracy when developing our models, but we often saw that they converged towards
the majority-label scores, implying that the models were overfitted to our validation data. Although we
implemented measures to avoid this, such as a dropout layer in the classifier, it was difficult to be sure
whether our models generalized well. This is again linked to the lack of testing data, as the validation
set seemed too small to split further into a validation set and a test set.</p>
      </sec>
      <sec id="sec-6-5">
        <title>6.5. Context length</title>
<p>During most of the development phase of the classifier, we worked with the uncased BERT model. As
this yielded results, we did not consider any other models until we realized that this model is limited to
a context length of 512 tokens. As we were feeding the model texts consisting of both the prompt and
the two completions, this context length would in many cases be too short to attend to both completions:
an average input consists of about 699 tokens, as per Table 1. This means that the input was in many
cases truncated, and we suspect that this led to answer A being favored.</p>
<p>When we eventually realized this, we attempted other models as well. The choice of context length is,
however, also a choice of inference time. We got a functioning classifier using the Longformer model
[13], which supports document lengths of up to 4,096 tokens. However, at that point, this solution was
too slow given the time available to us. We also attempted truncating answers A and B equally,
but this did not yield any significant improvement in accuracy.</p>
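<p>The equal truncation we attempted can be sketched as follows: both responses receive the same share of whatever token budget remains after the prompt. The exact budgeting around special tokens is an illustrative assumption.</p>
<preformat>
# Sketch of equal-budget truncation: responses A and B get the same share of
# the token budget left after the prompt. Budget accounting is illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
MAX_LEN = 512

def encode_pair(prompt: str, response_a: str, response_b: str) -> list[int]:
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    # Reserve room for the [CLS] token and three [SEP] tokens.
    budget = MAX_LEN - len(prompt_ids) - 4
    per_response = max(budget // 2, 0)
    a_ids = tokenizer(response_a, add_special_tokens=False)["input_ids"][:per_response]
    b_ids = tokenizer(response_b, add_special_tokens=False)["input_ids"][:per_response]
    sep, cls = tokenizer.sep_token_id, tokenizer.cls_token_id
    return [cls] + prompt_ids + [sep] + a_ids + [sep] + b_ids + [sep]
</preformat>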
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Summary</title>
<p>This paper has presented our work aimed at training a model to predict human preference for
machine-generated text. We have attempted both an encoder model with multiple classification heads
and a decoder model fine-tuned using Direct Preference Optimization. Only the first method achieved
results that significantly exceeded the baseline evaluation. However, as our scores closely resemble
those obtained by simply predicting the majority label, it remains unclear whether the model has been
appropriately trained. We have laid out a suite of possible explanations for this, along with further work
we would pursue given more time. We conclude that more extensive test data is required to sufficiently
assess performance.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
<p>In this paper, generative AI tools, namely the services ChatGPT and GPT UiO, have been used by the
authors as writing assistants to structure and fill in some tables, as well as tools for grammatical
suggestions. Generative AI has not been used to produce text in its entirety, nor have entire passages
produced by AI been included in the paper. The authors take full responsibility for the claims, references,
and findings of this paper.</p>
    </sec>
    <sec id="sec-9">
      <title>Appendix: Results on the test split</title>
      <p>After the initial deadline for the WNNLP-2025 papers, the ELOQUENT lab shared task was concluded.
As such, we were able to test our system against the test split. The results of this can be seen in Table 8.</p>
<p>As we see, the scores on the test split are lower, but on the whole the system performs quite
consistently between the two splits, with two exceptions, Naturalness and Overall, which are both
down by 20%.</p>
<p>For Naturalness, this was expected, as the development score was surprisingly high. As we found no
good proxy for this field in our training data, all training pairs were labeled as A. With random guessing
we would expect a score of 25%; but since the label A happened to make up a large proportion of the
development-split pairs for Naturalness, the development score was somewhat artificially inflated. In the
test data, this label does not make up such a large proportion, and the score is accordingly closer to the
25% we expected.</p>
<p>The drop in the Overall score was, however, slightly surprising. To assess this discrepancy, we
would have to inspect the differences between the development and test data further.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Artemova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bojar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Engels</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mikhailov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Šindelář</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Velldal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Øvrelid</surname>
          </string-name>
          , Overview of eloquent 2025:
          <article-title>shared tasks for evaluating generative language model quality</article-title>
          , in: J.
          <string-name>
            <surname>C. de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Piroi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ), Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mikhailov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Butenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Artemova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Øvrelid</surname>
          </string-name>
          , E. Velldal,
          <article-title>Overview of the Preference Prediction Task at the ELOQUENT 2025 lab for evaluating generative language model quality</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.), Working Notes of CLEF 2025 -
          <article-title>Conference and Labs of the Evaluation Forum, CEUR-</article-title>
          <string-name>
            <surname>WS</surname>
          </string-name>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Lambert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Pyatkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Morrison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Miranda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chandu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dziri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <source>RewardBench: Evaluating Reward Models for Language Modeling</source>
          ,
          <year>2024</year>
          . URL: http://arxiv.org/abs/2403.13787. doi:
          <volume>10</volume>
          .48550/arXiv.2403.13787, arXiv:
          <fpage>2403</fpage>
          .13787 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          , W.-L. Chiang,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Stoica</surname>
          </string-name>
          ,
          <article-title>Judging llm-as-a-judge with mt-bench and chatbot arena</article-title>
          , in: A.
          <string-name>
            <surname>Oh</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Naumann</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Globerson</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Saenko</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Hardt</surname>
          </string-name>
          , S. Levine (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>36</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          , Inc.,
          <year>2023</year>
          , pp.
          <fpage>46595</fpage>
          -
          <lpage>46623</lpage>
          . URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/ 91f18a1287b398d378ef22505bf41832-Paper-Datasets_and_Benchmarks.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ni</surname>
          </string-name>
          , G. Xie,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          , M. Sun,
          <source>UltraFeedback: Boosting Language Models with Scaled AI Feedback</source>
          ,
          <year>2024</year>
          . URL: http://arxiv.org/ abs/2310.01377. doi:
          <volume>10</volume>
          .48550/arXiv.2310.01377, arXiv:
          <fpage>2310</fpage>
          .01377 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Artemova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bojar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mikhailov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Velldal</surname>
          </string-name>
          , L. Øvrelid,
          <article-title>Eloquent clef shared tasks for evaluation of generative language model quality, 2025 edition</article-title>
          , in: C.
          <string-name>
            <surname>Hauf</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Macdonald</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Jannach</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Kazai</surname>
            ,
            <given-names>F. M.</given-names>
          </string-name>
          <string-name>
            <surname>Nardini</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Pinelli</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Silvestri</surname>
          </string-name>
          , N. Tonellotto (Eds.),
          <source>Advances in Information Retrieval</source>
          , Springer Nature Switzerland, Cham,
          <year>2025</year>
          , pp.
          <fpage>366</fpage>
          -
          <lpage>372</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ding</surname>
          </string-name>
          , G. Yao,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          , M. Sun,
          <article-title>UltraFeedback: Boosting Language Models with High-</article-title>
          quality
          <string-name>
            <surname>Feedback</surname>
          </string-name>
          (
          <year>2023</year>
          ). URL: https://openreview.net/forum?id= pNkOx3IVWI.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Lambert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Morrison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Pyatkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ivison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Brahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J. V.</given-names>
            <surname>Miranda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dziri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Graf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Bras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Tafjord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wilhelm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Soldaini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dasigi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          , Tulu 3:
          <string-name>
            <given-names>Pushing</given-names>
            <surname>Frontiers in Open Language Model Post-Training</surname>
          </string-name>
          ,
          <year>2025</year>
          . URL: http://arxiv.org/abs/2411.15124. doi:
          <volume>10</volume>
          .48550/arXiv.2411.15124, arXiv:
          <fpage>2411</fpage>
          .15124 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Grattafiori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jauhri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Dahle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Letman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mathur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schelten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaughan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          , et al.,
          <source>The llama 3 herd of models</source>
          ,
          <year>2024</year>
          . URL: https: //arxiv.org/abs/2407.21783. arXiv:
          <volume>2407</volume>
          .
          <fpage>21783</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rafailov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sharma</surname>
          </string-name>
          , E. Mitchell,
          <string-name>
            <surname>C. D. Manning</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Ermon</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Finn</surname>
          </string-name>
          ,
          <article-title>Direct preference optimization: Your language model is secretly a reward model</article-title>
          , in: A.
          <string-name>
            <surname>Oh</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Naumann</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Globerson</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Saenko</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Hardt</surname>
          </string-name>
          , S. Levine (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , vol-
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>