<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ELOQUENT Sensemaking Task: LLMs in the Evaluator Role</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kateryna Lutsai</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matyáš Thér</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jonáš Venc</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ondřej Bojar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics</institution>
          ,
          <addr-line>Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Charles University, Faculty of Science</institution>
          ,
          <addr-line>Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes our participation in the ELOQUENT Sensemaking Task (2025), focusing on the “Evaluator” role. The task challenges language models to prepare, take, or rate an exam based on provided learning materials. We detail our approach to developing an Evaluator system that scores answers given the materials, a question, and a candidate answer. This involved selecting appropriate large language models (LLMs), designing effective prompts, and conducting initial experiments to refine our methodology. Our work explores the capabilities of LLMs to constrain their knowledge to the given materials and assesses their reliability in understanding and evaluating textual information. We present the results of our experiments, including the performance of different models and prompting strategies, and discuss the challenges encountered, such as handling large contexts and the limitations of automated evaluation.</p>
      </abstract>
      <kwd-group>
        <kwd>question answering</kwd>
        <kwd>LLM</kwd>
        <kwd>text understanding</kwd>
        <kwd>prompt engineering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>[Figure: Overview of the Sensemaking pipeline. A Teacher model generates questions from the learning materials (e.g., “When was...”, “Define what is...”), a Student model produces answers (e.g., “It was...”), and an Evaluator model, backed by a database, assigns scores (1-100) with explanations (e.g., “Everything is correct.”, “Incorrect since...”).]</p>
    </sec>
    <sec id="sec-2">
      <title>Methods and Materials</title>
      <sec id="sec-2-1">
        <title>Input Data and Task Definition</title>
        <p>The input data for the Evaluator system, as defined by the task, consists of three main components:
1. Source Text (Context): plain text, potentially large (up to 35k tokens in our experiments),
derived from diverse sources such as books, presentations, and articles. The source texts provided
by the task organizers span various domains, including university lectures, textbooks, and audio
transcripts.
2. Questions: Provided by the original authors of the material or generated by “Teacher” systems
based on the source text.
3. Answers: Generated by “Student” systems in response to the questions, using only the provided
source text as a reference.</p>
        <p>While the task organizers used several languages in input texts, questions and answers, they also
provided us with versions automatically translated to English. We thus assume that all the texts,
questions and answers are in English. The Evaluator system’s role is to assess the quality of each answer
with respect to its corresponding question and the given context.</p>
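        <p>As a minimal sketch, the three input components above can be grouped into one record per evaluation; the class and field names here are our own illustrative choice, not part of the task specification:</p>

```python
from dataclasses import dataclass

@dataclass
class EvaluationInstance:
    """One unit of work for the Evaluator: context + question + candidate answer."""
    context: str   # source text, potentially large (up to ~35k tokens in our runs)
    question: str  # authored by the material's authors or generated by a "Teacher"
    answer: str    # produced by a "Student" system using only the context

inst = EvaluationInstance(context="...", question="When was X?", answer="In 1920.")
print(inst.question)  # → When was X?
```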
      </sec>
      <sec id="sec-2-2">
        <title>Output Format</title>
        <p>The required output for the Evaluator system is a JSON object. For each input question-answer pair, the
system must output a score and an explanation. Specifically, as detailed in the task documentation, the
expected output is a JSON object containing:
• score: An integer between 0 and 100, representing the quality of the answer (100 being the best).
• explanation: A brief string justifying the assigned score.</p>
        <p>For official task submissions, the output is a JSON dictionary where keys indicate the input file location,
and values are lists of integer scores for each evaluated answer.</p>
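        <p>A per-answer output object with the two required fields can be sanity-checked as follows; the validation helper itself is our own illustration, only the field names and the 0-100 integer range come from the task specification:</p>

```python
import json

def validate_evaluation(raw: str) -> dict:
    """Parse one Evaluator output object and check the required fields."""
    obj = json.loads(raw)
    score = obj["score"]
    # The task requires an integer score in [0, 100] and a textual justification.
    if not (isinstance(score, int) and 0 <= score <= 100):
        raise ValueError(f"score out of range: {score!r}")
    if not isinstance(obj["explanation"], str):
        raise ValueError("explanation must be a string")
    return obj

example = '{"score": 85, "explanation": "Mostly correct, minor omissions."}'
result = validate_evaluation(example)
print(result["score"])  # → 85
```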
      </sec>
      <sec id="sec-2-3">
        <title>Model Selection and Experimental Setup</title>
        <p>
          Our initial experiments used the llama3.3 [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] model to generate synthetic training data for the Evaluator.
We constructed a dataset from QA pairs, expanding it with incorrect answers (e.g., by shifting the
answer to a previous or next item in the sequence) to create a 1:1 ratio of correct and incorrect responses.
The llama3.3 model was prompted to evaluate these pairs, with the input context composed of the
previous, current (relevant), and next items in the data sequence. The requested output should be a
JSON object containing a score.
        </p>
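        <p>The answer-shifting scheme described above can be sketched as follows; the data layout and field names are simplified assumptions for illustration:</p>

```python
def build_eval_dataset(qa_pairs):
    """Expand QA pairs with mismatched answers for a 1:1 correct/incorrect ratio.

    Each incorrect example keeps the question but pairs it with the answer of
    the next item in the sequence (wrapping around), mirroring our shifting scheme.
    """
    dataset = []
    n = len(qa_pairs)
    for i, (question, answer) in enumerate(qa_pairs):
        dataset.append({"question": question, "answer": answer, "label": "correct"})
        shifted = qa_pairs[(i + 1) % n][1]  # answer taken from the next item
        dataset.append({"question": question, "answer": shifted, "label": "incorrect"})
    return dataset

pairs = [("When was X founded?", "In 1920."), ("Define Y.", "Y is a metric.")]
data = build_eval_dataset(pairs)
print(len(data))  # → 4
```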
        <p>However, these preliminary experiments revealed significant limitations: the model often assigned
zero scores to correct answers and sometimes produced malformed JSON outputs. These findings led us
to reconsider our approach, moving away from custom dataset creation and older models, and instead
focusing on prompt engineering with more advanced models.</p>
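        <p>Because older models sometimes returned malformed JSON, a defensive parsing step helps in practice; the fallback below is our own illustrative workaround, not part of the task pipeline specification:</p>

```python
import json
import re

def extract_json_object(text: str):
    """Try strict JSON parsing first, then fall back to the first {...} span."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fallback: grab the outermost braces, tolerating surrounding chatter.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

noisy = 'Sure! Here is my verdict: {"score": 40, "explanation": "Partially correct."}'
print(extract_json_object(noisy)["score"])  # → 40
```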
        <p>
          Based on these insights, we selected Gemma3 (27b) [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] and Qwen3 (30b) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] as our primary models.
Both support a 128k token context window, making them suitable for handling large source texts.
Gemma3 is a decoder-only Transformer with sliding window attention and a function-calling head for
structured output, while Qwen3 is a mixture of experts (MoE) Transformer featuring
“Thinking/Non-thinking” modes. All models were run using ollama on a cluster equipped with NVIDIA A30 or RTX
A4000 GPUs.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>Prompt Engineering</title>
        <p>Prompt design was a critical factor in achieving reliable and structured outputs. We adopted a two-part
prompt structure:</p>
        <p>Listing 1: System Prompt (passed as the system argument)
You are a fair teacher who grades students’ answers. Evaluate the
quality of the Answer specifically in response to the Question
considering the Context provided. Format your entire response as a
single JSON object containing ’score’ (an integer between 0 and 100,
where 100 is best) and ’explanation’ (a string briefly justifying
the score).</p>
        <p>Listing 2: User Prompt (passed as the prompt argument)
Question: &lt;question&gt;
Answer: &lt;answer&gt;
And given the following context: &lt;text_fragments&gt;</p>
        <p>In the context of ollama.generate method, the system prompt (Listing 1) sets the overall behavior
and role of the model for the session, ensuring that responses are consistent and formatted as required.
The user prompt (Listing 2) provides the specific input for each evaluation instance, supplying the
question, answer, and relevant context to be assessed.</p>
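        <p>Putting the two prompts together, a single evaluation call through the ollama Python client looks roughly like this; the default model name and the helper functions are illustrative assumptions (our runs used Gemma3 27b and Qwen3 30b):</p>

```python
import json

SYSTEM_PROMPT = (
    "You are a fair teacher who grades students' answers. Evaluate the quality "
    "of the Answer specifically in response to the Question considering the "
    "Context provided. Format your entire response as a single JSON object "
    "containing 'score' (an integer between 0 and 100, where 100 is best) and "
    "'explanation' (a string briefly justifying the score)."
)

def build_user_prompt(question: str, answer: str, context: str) -> str:
    """Assemble the per-instance user prompt (Listing 2)."""
    return (f"Question: {question}\n"
            f"Answer: {answer}\n"
            f"And given the following context: {context}")

def evaluate(question: str, answer: str, context: str, model: str = "gemma3:27b"):
    """Score one answer; assumes a local ollama server with the model pulled."""
    import ollama
    response = ollama.generate(
        model=model,
        system=SYSTEM_PROMPT,                              # Listing 1
        prompt=build_user_prompt(question, answer, context),  # Listing 2
        format="json",  # ask ollama to constrain the output to valid JSON
    )
    return json.loads(response["response"])
```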
        <p>The key elements we considered during prompt engineering included:
• Clearly defining the order of information (whether to present the task description or the inputs
first).
• Explicitly stating the output format requirements (JSON with exactly two fields: score and
explanation).
• Emphasizing knowledge boundaries (e.g., instructing the model to rely only on the provided
context).</p>
        <p>This careful prompt design was essential to ensure that the models produced outputs in the required
format and focused their evaluation solely on the provided context.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>We processed subsets of the development set using the selected models, Gemma3 and Qwen3, to
generate scores and explanations for each question-answer pair. The following examples illustrate the
evaluation process and the models’ ability to provide structured, context-aware feedback:</p>
      <p>Listing 3: Input Example: Relevant Answer
Question: What is a result of misplacing punctuation marks in
machine translation?
Answer: What is a result of misplacing punctuation marks in machine
translation is not discussed in the provided text.</p>
      <p>Context: Computation Graphs For our example neural network from
Section ... book published in year 2020.</p>
      <p>Listing 4: Model Output: Relevant Answer
{
"score": 100,
"explanation": "The given answer is correct, the text does not
mention misplacing punctuation marks in machine translation."
}</p>
      <p>Listing 5: Input Example: Unrelated Answer
Question: What was the structure of trade in the Roman Empire?
Answer: In hospitals, dehydration is commonly treated with infusions.</p>
      <p>Context: CHAPTER OUTLINE 7.1 The Daily Life of a Roman Family 7.2
Slavery in the Roman Empire 7.3 The Roman Economy: Trade, Taxes,
and Conquest ... Jewish population during the imperial period.</p>
      <p>Listing 6: Model Output: Unrelated Answer
{
"score": 10,
"explanation": "The answer is entirely unrelated to the question.
It is the full text of a chapter on the Roman Empire. There is
no attempt to answer the question at all. Therefore the score
is the lowest possible."
}</p>
      <p>These examples illustrate the models’ capability to follow the structured output format and provide
reasonable scores and explanations based on the provided context and the task instructions. However, a
key finding was the difficulty of selecting “the right” model when multiple models of similar quality are
available, especially with large and detailed contexts, which is why simple eyeballing is not an option.
Furthermore, fine-tuning models was deemed inefficient compared to prompt-engineering the newest
models, particularly due to the lack of time and suitably styled training datasets.</p>
      <p>Our early efforts to create a custom training dataset by generating synthetic incorrect answers—such
as by shifting answers between unrelated QA pairs—highlighted further limitations. In particular, older
models like llama3.3 often failed to differentiate correct from incorrect answers and frequently returned
malformed or overly simplistic outputs. For that reason, we shifted our attention toward zero-shot
prompting with more advanced models.</p>
    </sec>
    <sec id="sec-4">
      <title>Related works</title>
      <p>
        Question Answering (QA) is a well-established area in Natural Language Processing. Traditional QA
systems often rely on large-scale knowledge bases or web corpora to extract answers [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In contrast,
context-based or reading comprehension QA tasks present models with a specific document and require
them to answer questions using only the provided information [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. These tasks require models to
locate and synthesize information solely from the given document(s), which aligns closely with the
ELOQUENT Sensemaking task’s objective of testing whether LLMs can limit their knowledge to the
provided materials. The Evaluator role, in particular, touches upon aspects of automated assessment
and answer scoring, which has parallels in educational technology and peer review systems. This
framing shares common ground with recent work in automatic answer grading and explanation-based
assessment, such as in the domain of Automatic Short Answer Grading (ASAG) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>Participating in the ELOQUENT Sensemaking task, specifically in the Evaluator role, highlighted several
challenges and provided us with useful insights. Manually evaluating the nuances of answers against
extensive source texts is inherently difficult and time-consuming, underscoring the need for robust
automated evaluation methods. However, creating such automated evaluators is also challenging,
especially when aiming for human-like judgment.</p>
      <p>A significant hurdle is the availability of suitable datasets for task-specific fine-tuning. While
general-purpose LLMs are powerful, adapting them to the precise requirements of a specialized evaluation
task without extensive, tailored training data relies heavily on prompt engineering. Our experiments
showed that newer models with large context windows, combined with careful prompt design, can
achieve promising results in scoring answers based on provided contexts. Nevertheless, the process of
selecting the best model and refining prompts remains an empirical endeavor. The findings suggest
that prompt-engineering with state-of-the-art models is currently a more pragmatic approach than
fine-tuning for such specific, limited-duration tasks, especially when appropriately styled training data
is scarce.</p>
      <p>Our code is available on GitHub: https://github.com/K4TEL/llm-sensemaking</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>Thanks to the Institute of Formal and Applied Linguistics (ÚFAL) at Charles University, Faculty of
Mathematics and Physics (MFF), for providing access to the HPC cluster with GPU nodes, which
was essential for running the experiments with large language models. This work has also received
funding from the Project OP JAK Mezisektorová spolupráce Nr. CZ.02.01.01/00/23_020/0008518 named
“Jazykověda, umělá inteligence a jazykové a řečové technologie: od výzkumu k aplikacím.”</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4 and Gemini 2.5 for grammar and
spelling checks and text style adjustments. After using these tools and services, the authors reviewed and
edited the content as needed and take full responsibility for the content of the publication.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Šindelář</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bojar</surname>
          </string-name>
          ,
          <article-title>Overview of the Sensemaking Task at the ELOQUENT 2025 Lab: LLMs as Teachers, Students and Evaluators</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum</source>
          , CEUR-WS,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Meta AI</surname>
          </string-name>
          ,
          <source>Llama 3.3 70B instruction-tuned model</source>
          ,
          <year>2024</year>
          . URL: https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Gemma</given-names>
            <surname>Team</surname>
          </string-name>
          , Gemma 3,
          <year>2025</year>
          . URL: https://goo.gle/Gemma3Report.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Qwen</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <source>Qwen3 technical report</source>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2505.09388.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fisch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bordes</surname>
          </string-name>
          ,
          <article-title>Reading wikipedia to answer open-domain questions</article-title>
          ,
          <source>ACL</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Rajpurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lopyrev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>SQuAD: 100,000+ questions for machine comprehension of text</article-title>
          ,
          <source>EMNLP</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Burrows</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Automated essay scoring: A survey of the state of the art</article-title>
          ,
          <source>IEEE Transactions on Learning Technologies</source>
          <volume>8</volume>
          (
          <year>2015</year>
          )
          <fpage>107</fpage>
          -
          <lpage>121</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>