<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating the Performance of the Finetuned Quantized Llama 3 Based on Relevance, Truthfulness, Naturalness, and Safety</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rohit R. Gunti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abebe Rorissa</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The University of Tennessee, School of Information Sciences</institution>
          ,
          <addr-line>Knoxville, 1345 Circle Park Drive, Suite 412</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>The University of Tennessee, School of Information Sciences</institution>
          ,
          <addr-line>Knoxville, 1345 Circle Park Drive, Suite 451</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This study compares the performance of a quantized, finetuned Llama 3 model to that of a more advanced baseline model as part of our participation in the preference prediction task at Eloquent 2025. Performance is evaluated on five criteria: (a) relevance, (b) naturalness, (c) truthfulness, (d) safety, and (e) the overall quality of the model’s judgment between two responses. Among our major findings is that optimization techniques such as quantization produce useful results. Specifically, the finetuned Llama 3 is better at judging individual qualities such as safety, truthfulness, and relevance than the baseline model.</p>
      </abstract>
      <kwd-group>
        <kwd>Naturalness</kwd>
        <kwd>Truthfulness</kwd>
        <kwd>Safety</kwd>
        <kwd>Overall Quality</kwd>
        <kwd>Quantization</kwd>
        <kwd>Llama 3</kwd>
        <kwd>Llama 3.1</kwd>
        <kwd>BERT</kwd>
        <kwd>ROUGE</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Prior works submitted to the Eloquent Lab 2024 primarily focused on evaluating the quality of
system responses and task reports [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The authors of those submissions addressed the training costs,
configuration, and resources utilized during evaluations [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5 ref6 ref7">2, 3, 4, 5, 6, 7</xref>
        ]. Little attention is given in
the extant literature to system optimization and reproducibility for interdisciplinary scholars. Only
a few studies avoid expensive processing to keep the system computationally light [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. That study
utilizes a relatively small 7B-parameter model with few-shot inference (no fine-tuning) and even employs a
quantized version of the Large Language Model (LLM) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Moreover, its focus is on a single task
of detecting hallucinations in LLMs, specifically leveraging prompt engineering techniques for this
purpose. Organizations have consistently emphasized the importance of optimization. Even before
the AI surge, some non-profit organizations, such as libraries, notably used optimization techniques
to analyze how applications ran in real-time and adjust their structure for better performance [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
Similarly, other studies focused on practical application and used ANN optimization techniques to
enhance system performance [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. There is some evidence that ML optimization techniques can lead
to performance improvement [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Given the billions of parameters on which an LLM is trained, reducing the
computational load of LLMs is essential for organizations with limited budgets,
allowing them to fine-tune a model for multiple tasks before deployment. There is ongoing exploration of
training LLMs with optimization techniques [
        <xref ref-type="bibr" rid="ref12 ref13 ref14">12, 13, 14</xref>
        ].
      </p>
      <p>
        Therefore, this study contributes to these ongoing exploratory efforts by participating in the Preference
Prediction task organized by the Eloquent Lab 2025 and reporting the optimized methodology and findings
of the Unsloth-finetuned Llama 3 [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ]. The preference prediction task tests the capability of the system, in this case a
finetuned LLM, to predict human preferences. In the initial stage (development stage), the
task offers a validation dataset with human-annotated preferences and explanations for participants
to develop their systems. Later, in the test stage, the developed systems (finetuned LLMs) are expected
to make the judgment on human preference between the two LLM responses concerning five criteria:
relevance, naturalness, truthfulness, safety, and overall quality. In addition to the predictions, there is
also a second sub-task that expects the predicted human preference along with a justifiable explanation
for each of the five criteria. In evaluating the predictions (first sub-task), accuracy is computed by
comparing the predictions of each participant’s system with the ground-truth labels collected via a
human annotation platform such as Toloka. In evaluating the associated explanations, a surprise
LLM (e.g., Gemma or similar) is used for semantic assessments (e.g., ROUGE-L, BERTScore). In this
paper, we share our findings on preference prediction assessments monitored by Eloquent 2025 [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ].
      </p>
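      <p>Since explanations are scored with semantic metrics such as ROUGE-L, the following minimal sketch illustrates how an LCS-based ROUGE-L F-score can be computed. It is an illustration only, not the official Eloquent scorer.</p>

```python
# Minimal ROUGE-L F-score via longest common subsequence (LCS) of tokens.
# Illustration only -- the official evaluation may tokenize and weight
# differently (e.g., with a beta parameter favoring recall).
def rouge_l(reference: str, candidate: str) -> float:
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    m, n = len(ref), len(cand)
    # dp[i][j] = LCS length of ref[:i] and cand[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if ref[i] == cand[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / n, lcs / m
    return 2 * precision * recall / (precision + recall)
```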
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>
        The following three sections describe our methods: (1) data collection, (2) data preprocessing, and (3)
finetuning. Lastly, the findings section includes the training results and the shared results evaluated by the
Preference Prediction task (2025) committee [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <sec id="sec-2-0">
        <title>2.1. Data</title>
        <p>
          The dataset (2025 validation data) used for finetuning the Llama 3 model was provided by the preference
prediction committee [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. The dataset contains 99 JSON items, where each item resembles the Alpaca
format but carries additional fields beyond instruction, input, and output. This study uses the dataset
for finetuning so that the LLM generates better responses to a user instruction based on multiple quality
criteria: relevance, truthfulness, naturalness, and safety. Hence, the validation dataset is referred
to as the finetune dataset.
        </p>
      </sec>
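      <p>For illustration, a single finetune-dataset item might look like the following. Only the instruction, input, and output fields are named above, so the per-criterion fields shown here are assumptions about the dataset’s additional fields.</p>

```python
# Hypothetical illustration of one finetune-dataset item. The Alpaca trio
# (instruction, input, output) is stated in the text; the per-criterion
# preference fields below are assumed, not taken from the actual dataset.
import json

item = {
    "instruction": "Which assistant response do humans prefer, A or B?",
    "input": "Response A: ...\nResponse B: ...",
    "output": "A",
    # assumed extra annotation fields beyond the Alpaca trio:
    "relevance": "A",
    "naturalness": "B",
    "truthfulness": "A",
    "safety": "A",
    "overall_quality": "A",
}

# Round-trip through JSON, as the dataset is distributed as JSON items.
restored = json.loads(json.dumps(item))
```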
      <sec id="sec-2-1">
        <title>2.2. Finetune</title>
        <p>Llama 3 was finetuned using a Low-Rank Adaptation (LoRA) approach for efficient memory usage.
Finetuning involves customizing the model to generate nuanced responses based on the finetune dataset.
Before finetuning, the dataset is prepared using the Llama 3-specific prompt template. Each entry
in the finetuning dataset contains an instruction, input, and output so that Llama 3 can
follow structured guidance and generate relevant responses.</p>
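        <p>As a sketch of the preparation step described above, the following shows how an Alpaca-style entry could be rendered with the Llama 3 chat template. The exact template used in our Unsloth pipeline may differ; the special tokens are those of the public Llama 3 prompt format.</p>

```python
# Sketch: render one Alpaca-style entry (instruction, input, output) as a
# Llama 3 training prompt. Special tokens follow the Llama 3 chat format.
LLAMA3_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "{instruction}\n\n{input}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "{output}<|eot_id|>"
)

def format_example(example: dict) -> str:
    """Render one finetune-dataset item as a Llama 3 training prompt."""
    return LLAMA3_TEMPLATE.format(
        instruction=example.get("instruction", ""),
        input=example.get("input", ""),
        output=example.get("output", ""),
    )

text = format_example({
    "instruction": "Judge which response is more relevant.",
    "input": "Response A: ... Response B: ...",
    "output": "A",
})
```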
      </sec>
      <sec id="sec-2-2">
        <title>2.3. LoRA Configuration</title>
        <p>The Llama 3 model, loaded in 4-bit quantization for efficiency, is set up using a specific training
configuration. Several experiments were conducted to track the training loss and keep it minimal.
To supervise the finetuning, the SFT trainer is enabled. In this study, the SFT trainer configuration,
along with the LoRA setup under which the training loss is observed to be minimal, is referred to as the optimal
training configuration, as shown in Table 2.</p>
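        <p>The training setup described above can be sketched as follows. Since the exact values are deferred to Table 2, every hyperparameter below is a common default, not the optimal configuration reported in this study.</p>

```python
# Illustrative configuration sketch only: 4-bit loading plus LoRA adapters,
# as described in the text. All numeric values are common defaults, not the
# "optimal training configuration" from Table 2. Requires transformers/peft.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(load_in_4bit=True)  # 4-bit quantized loading

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # assumed checkpoint name
    quantization_config=bnb_config,
)

lora_config = LoraConfig(  # low-rank adapters keep trainable memory small
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# The adapted model is then passed to an SFT trainer (e.g., trl's SFTTrainer)
# with the formatted finetune dataset; training loss is tracked across runs
# to select the configuration where it is minimal.
```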
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Findings</title>
      <p>In this section, we include two kinds of evaluation, each comparing baseline scores and finetuned
Llama 3 scores, as shown in Tables 2 and 3. The scores in Table 2 follow a systematic approach to evaluate
whether the baseline model’s judgement aligns with human judgement for each criterion. The
baseline or finetuned model is presented with two AI-generated responses (A and B) along with the given
instruction (test data). The test data contain a total of 1,248 items. The model then judges which assistant’s
answer is better for each criterion and saves the result in TSV format. The results show the accuracy
of the baseline model evaluated against the human-annotated answers (validation dataset)
for each criterion. The scores in the middle columns for the baseline Llama 3.1, when judged individually
for each criterion, indicate low alignment with the human annotations (validation). However, the score
is notably higher for overall quality (42.42) than for the individual judgments, indicating that the baseline
Llama 3.1 performs better at holistic judgment but lacks individual qualities such as safety (13.13) and
truthfulness (11.11). In the case of the finetuned model trained on the validation data (human annotations),
the scores from the preference prediction committee reveal that the individual judgements of the finetuned
Llama 3 model are higher than those of the more advanced baseline Llama 3.1. This indicates that
quantized finetuning has improved qualities such as safety (48.96), relevance (39.98), and truthfulness
(38.62), which are major concerns. However, the overall quality (33.01) of the finetuned Llama 3
underperforms compared to the baseline Llama 3.1 model. This score indicates that the finetuned
model lacks an effective balancing factor when making an overall judgment. By addressing
limitations such as finetuning on a small dataset (99 examples) and limited baseline comparisons (only
with Llama 3.1), future work can enhance performance on preference prediction tasks.</p>
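      <p>The accuracy computation described above can be sketched as follows. The criterion names and the A/B label format are assumptions based on the task description, not the official evaluation script.</p>

```python
# Minimal sketch of the per-criterion accuracy evaluation: the percentage of
# items where the model's A/B choice matches the human-annotated label.
CRITERIA = ["relevance", "naturalness", "truthfulness", "safety", "overall_quality"]

def per_criterion_accuracy(predictions, ground_truth):
    """Accuracy (in percent) per criterion over aligned item lists."""
    scores = {}
    for criterion in CRITERIA:
        matches = sum(
            pred[criterion] == gold[criterion]
            for pred, gold in zip(predictions, ground_truth)
        )
        scores[criterion] = 100.0 * matches / len(ground_truth)
    return scores

# Tiny worked example with two items:
pred = [
    {"relevance": "A", "naturalness": "B", "truthfulness": "A", "safety": "A", "overall_quality": "A"},
    {"relevance": "B", "naturalness": "B", "truthfulness": "A", "safety": "B", "overall_quality": "B"},
]
gold = [
    {"relevance": "A", "naturalness": "A", "truthfulness": "A", "safety": "A", "overall_quality": "B"},
    {"relevance": "B", "naturalness": "B", "truthfulness": "B", "safety": "B", "overall_quality": "B"},
]
scores = per_criterion_accuracy(pred, gold)  # relevance and safety: 100.0; others: 50.0
```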
      <sec id="sec-3-1">
        <title>3.1. Sample output</title>
        <p>The sample below is one of the responses generated by the finetuned Llama and reflects an entire
entry from the original TSV output file. The sample output is formatted in code syntax for better
readability, demonstrating the finetuned Llama’s predictions and their associated explanations between
two generative LLM partial responses (LLM A and B), evaluated across multiple criteria: relevance,
naturalness, truthfulness, safety, and overall quality. Each criterion assesses a different aspect of the
responses. For instance, the relevance criterion evaluates whether the answer adequately addresses the
user’s original query. Similarly, the naturalness criterion judges how human-like the response feels,
truthfulness assesses factual accuracy, and safety examines the avoidance of harmful content. Finally,
overall quality provides a generalized evaluation encompassing all these aspects. This structured
breakdown aids in systematically determining how well the LLM aligns with nuanced human preferences,
supporting the transparency and interpretability of the AI decision-making process.</p>
        <preformat>
{
  Id: 001
  Instruction: What chemicals does L-theanine affect?
  Response of LLM A:
    L-theanine, an amino acid primarily found in tea leaves ...
  Response of LLM B:
    L-theanine, a non-protein amino acid found in green tea ...
  Criteria
  {
    Relevance
    {
      Finetuned Llama prediction: A
      Finetuned Llama Explanation: Assistant A provides a concise, structured
      summary directly addressing the user’s specific query about chemical
      effects, clearly outlining multiple chemicals influenced by L-theanine,
      making it immediately relevant.
    }
    Naturalness
    {
      Finetuned Llama prediction: B
      Finetuned Llama Explanation: Assistant B adopts a conversational and
      engaging tone with natural formatting, using bullet points and
      highlighting key chemicals, resembling human-produced explanatory
      content.
    }
    Truthfulness
    {
      Finetuned Llama prediction: A
      Finetuned Llama Explanation: Assistant A is preferred for truthfulness
      because it provides balanced statements and explicitly notes when
      evidence is preliminary or ongoing, reflecting an accurate
      representation of the current scientific consensus.
    }
    Safety
    {
      Finetuned Llama prediction: A
      Finetuned Llama Explanation: Assistant A clearly states the safety and
      current limitations of scientific studies, avoiding any exaggerated
      claims or implications, thus presenting a safer and more responsible
      communication.
    }
    Overall Quality
    {
      Finetuned Llama prediction: A
      Finetuned Llama Explanation: Assistant A, overall, provides a
      comprehensive, factual, balanced, and user-aligned answer, effectively
      meeting user expectations across multiple evaluation dimensions.
    }
  }
}
        </preformat>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>Our findings are competitive in the preference prediction 2025 task, ranking in second position (Team
UTK). Since the study focuses on observing whether the quantized, finetuned Llama 3 provides better
and more useful results without compromising performance, further performance enhancement is out of the
study’s scope. However, the training insights (configurations) and findings will serve as evidence for
exploring the potential of quantized finetuned models.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We would like to thank the University of Tennessee, Knoxville’s High Performance &amp; Scientific
Computing Team for providing us with access to the Nvidia H100 GPU for finetuning and evaluating the
Llama 3 and Llama 3.1 models.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Grammarly and Copilot in order to check grammar
and spelling and to cross-check content. After using these tools/services, the authors reviewed and edited
the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dürlich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gogoulou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Guillou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nivre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Talman</surname>
          </string-name>
          ,
          <article-title>Eloquent clef shared tasks for evaluation of generative language model quality</article-title>
          ,
          <source>in: European Conference on Information Retrieval</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>459</fpage>
          -
          <lpage>465</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Talman</surname>
          </string-name>
          ,
          <article-title>Eloquent 2024-topical quiz task</article-title>
          , in: Conference and
          <article-title>Labs of the Evaluation Forum, CEUR-WS</article-title>
          . org,
          <year>2024</year>
          , pp.
          <fpage>687</fpage>
          -
          <lpage>690</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Dürlich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gogoulou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Guillou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nivre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zahra</surname>
          </string-name>
          ,
          <article-title>Overview of the clef-2024 eloquent lab: Task 2 on hallucigen</article-title>
          ,
          <source>in: 25th Working Notes of the Conference and Labs of the Evaluation Forum, CLEF 2024. Grenoble. 9 September 2024 through 12 September</source>
          <year>2024</year>
          , volume
          <volume>3740</volume>
          ,
          <string-name>
            <surname>CEUR-WS</surname>
          </string-name>
          ,
          <year>2024</year>
          , pp.
          <fpage>691</fpage>
          -
          <lpage>702</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dürlich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gogoulou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Talman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zahra</surname>
          </string-name>
          ,
          <article-title>Eloquent 2024-robustness task</article-title>
          , in: Conference and
          <article-title>Labs of the Evaluation Forum, CEUR-WS</article-title>
          . org,
          <year>2024</year>
          , pp.
          <fpage>703</fpage>
          -
          <lpage>707</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Neralla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. B.</given-names>
            <surname>de Vroe</surname>
          </string-name>
          ,
          <article-title>Evaluating poro-34b-chat and mistral-7b-instruct-v0. 1: Llm system description for eloquent at clef 2024</article-title>
          , Working Notes of CLEF (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Simonsen</surname>
          </string-name>
          ,
          <source>Experimental report on robustness task- eloquent lab</source>
          <year>2024</year>
          , Working Notes of CLEF (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A. T. M.</given-names>
            <surname>Bui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. F.</given-names>
            <surname>Brech</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hußfeldt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Jennert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ullrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Breuer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. N.</given-names>
            <surname>Khasmakhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schaer</surname>
          </string-name>
          ,
          <article-title>The two sides of the coin: Hallucination generation and detection with llms as evaluators for llms</article-title>
          ,
          <source>arXiv preprint arXiv:2407.09152</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Siino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Tinnirello</surname>
          </string-name>
          ,
          <article-title>Gpt hallucination detection through prompt engineering</article-title>
          , Working Notes of CLEF (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kistler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Franz</surname>
          </string-name>
          ,
          <article-title>Continuous program optimization: A case study</article-title>
          ,
          <source>ACM Transactions on Programming Languages and Systems (TOPLAS) 25</source>
          (
          <year>2003</year>
          )
          <fpage>500</fpage>
          -
          <lpage>548</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N. C.</given-names>
            <surname>Fei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. M.</given-names>
            <surname>Mehat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kamaruddin</surname>
          </string-name>
          ,
          <article-title>Practical applications of taguchi method for optimization of processing parameters for plastic injection moulding: a retrospective review</article-title>
          ,
          <source>International Scholarly Research Notices</source>
          <year>2013</year>
          (
          <year>2013</year>
          )
          <fpage>462174</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Arkabaev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rahimov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdullaev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Padmanaban</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Salmanov</surname>
          </string-name>
          ,
          <article-title>Modelling and analysis of optimization algorithms</article-title>
          ,
          <source>Jurnal Ilmiah Ilmu Terapan Universitas Jambi</source>
          <volume>9</volume>
          (
          <year>2025</year>
          )
          <fpage>161</fpage>
          -
          <lpage>177</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Afzal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Q.</given-names>
            <surname>Quek</surname>
          </string-name>
          ,
          <article-title>Large language model as a catalyst: A paradigm shift in base station siting optimization</article-title>
          ,
          <source>IEEE Transactions on Cognitive Communications and Networking</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. C.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <article-title>Autonomous multi-objective optimization using large language model</article-title>
          ,
          <source>IEEE Transactions on Evolutionary Computation</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Orlm: A customizable framework in training large models for automated optimization modeling</article-title>
          ,
          <source>Operations Research</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mikhailov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Butenko</surname>
          </string-name>
          ,
          <article-title>Overview of the preference prediction task at the eloquent 2025 lab for evaluating generative language model quality</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Artemova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bojar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Engels</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mikhailov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Šindelář</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Velldal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Øvrelid</surname>
          </string-name>
          , Overview of eloquent 2025:
          <article-title>Shared tasks for evaluating generative language model quality</article-title>
          , in:
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mothe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          (Eds.),
          <source>Experimental IR meets multilinguality, multimodality, and interaction: Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF 2025)</source>
          , Springer,
          <year>2025</year>
          , pp.
          <fpage>53</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>