<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ELOQUENT Sensemaking Task: LLMs in the Evaluator Role</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kateryna Lutsai</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matyáš Thér</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jonáš Venc</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ondřej Bojar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics</institution>
          ,
          <addr-line>Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Charles University, Faculty of Science</institution>
          ,
          <addr-line>Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes our participation in the ELOQUENT Sensemaking Task (2025), focusing on the “Evaluator” role. The task challenges language models to prepare, take, or rate an exam based on provided learning materials. We detail our approach to developing an Evaluator system that scores answers given the materials, a question, and a candidate answer. This involved selecting appropriate large language models (LLMs), designing effective prompts, and conducting initial experiments to refine our methodology. Our work explores the capabilities of LLMs to constrain their knowledge to the given materials and assesses their reliability in understanding and evaluating textual information. We present the results of our experiments, including the performance of different models and prompting strategies, and discuss the challenges encountered, such as handling large contexts and the limitations of automated evaluation.</p>
      </abstract>
      <kwd-group>
        <kwd>question answering</kwd>
        <kwd>LLM</kwd>
        <kwd>text understanding</kwd>
        <kwd>prompt engineering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>[Figure: Overview of the Sensemaking pipeline. A Teacher model generates questions from the learning materials (e.g., “When was...”, “Define what is...”), a Student model produces answers (e.g., “It was...”), and an Evaluator model, backed by a database, assigns scores (1-100) with explanations (e.g., “Everything is correct.”, “Incorrect since...”).]</p>
    </sec>
    <sec id="sec-2">
      <title>Methods and Materials</title>
      <sec id="sec-2-1">
        <title>Input Data and Task Definition</title>
        <p>The input data for the Evaluator system, as defined by the task, consists of three main components:
1. Source Text (Context): plain text, potentially large (up to 35k tokens in our experiments),
derived from diverse sources such as books, presentations, and articles. The source texts provided
by the task organizers span various domains, including university lectures, textbooks, and audio
transcripts.
2. Questions: Provided by the original authors of the material or generated by “Teacher” systems
based on the source text.
3. Answers: Generated by “Student” systems in response to the questions, using only the provided
source text as a reference.</p>
        <p>While the task organizers used several languages in input texts, questions and answers, they also
provided us with versions automatically translated to English. We thus assume that all the texts,
questions and answers are in English. The Evaluator system’s role is to assess the quality of each answer
with respect to its corresponding question and the given context.</p>
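        <p>As a minimal sketch, the three input components above can be grouped into one record per evaluation; the class and field names here are our own illustrative choice, not part of the task specification:</p>

```python
from dataclasses import dataclass

@dataclass
class EvaluationInstance:
    """One unit of work for the Evaluator: context + question + candidate answer."""
    context: str   # source text, potentially large (up to ~35k tokens in our runs)
    question: str  # authored by the material's authors or generated by a "Teacher"
    answer: str    # produced by a "Student" system using only the context

inst = EvaluationInstance(context="...", question="When was X?", answer="In 1920.")
print(inst.question)  # → When was X?
```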
      </sec>
      <sec id="sec-2-2">
        <title>Output Format</title>
        <p>The required output for the Evaluator system is a JSON object. For each input question-answer pair, the
system must output a score and an explanation. Specifically, as detailed in the task documentation, the
expected output is a JSON object containing:
• score: An integer between 0 and 100, representing the quality of the answer (100 being the best).
• explanation: A brief string justifying the assigned score.</p>
        <p>For official task submissions, the output is a JSON dictionary where keys indicate the input file location,
and values are lists of integer scores for each evaluated answer.</p>
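        <p>A per-answer output object with the two required fields can be sanity-checked as follows; the validation helper itself is our own illustration, only the field names and the 0-100 integer range come from the task specification:</p>

```python
import json

def validate_evaluation(raw: str) -> dict:
    """Parse one Evaluator output object and check the required fields."""
    obj = json.loads(raw)
    score = obj["score"]
    # The task requires an integer score in [0, 100] and a textual justification.
    if not (isinstance(score, int) and 0 <= score <= 100):
        raise ValueError(f"score out of range: {score!r}")
    if not isinstance(obj["explanation"], str):
        raise ValueError("explanation must be a string")
    return obj

example = '{"score": 85, "explanation": "Mostly correct, minor omissions."}'
result = validate_evaluation(example)
print(result["score"])  # → 85
```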
      </sec>
      <sec id="sec-2-3">
        <title>Model Selection and Experimental Setup</title>
        <p>
          Our initial experiments used the llama3.3 [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] model to generate synthetic training data for the Evaluator.
We constructed a dataset from QA pairs, expanding it with incorrect answers (e.g., by shifting the
answer to a previous or next item in the sequence) to create a 1:1 ratio of correct and incorrect responses.
The llama3.3 model was prompted to evaluate these pairs, with the input context composed of the
previous, current (relevant), and next items in the data sequence. The requested output should be a
JSON object containing a score.
        </p>
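        <p>The answer-shifting scheme described above can be sketched as follows; the data layout and field names are simplified assumptions for illustration:</p>

```python
def build_eval_dataset(qa_pairs):
    """Expand QA pairs with mismatched answers for a 1:1 correct/incorrect ratio.

    Each incorrect example keeps the question but pairs it with the answer of
    the next item in the sequence (wrapping around), mirroring our shifting scheme.
    """
    dataset = []
    n = len(qa_pairs)
    for i, (question, answer) in enumerate(qa_pairs):
        dataset.append({"question": question, "answer": answer, "label": "correct"})
        shifted = qa_pairs[(i + 1) % n][1]  # answer taken from the next item
        dataset.append({"question": question, "answer": shifted, "label": "incorrect"})
    return dataset

pairs = [("When was X founded?", "In 1920."), ("Define Y.", "Y is a metric.")]
data = build_eval_dataset(pairs)
print(len(data))  # → 4
```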
        <p>However, these preliminary experiments revealed significant limitations: the model often assigned
zero scores to correct answers and sometimes produced malformed JSON outputs. These findings led us
to reconsider our approach, moving away from custom dataset creation and older models, and instead
focusing on prompt engineering with more advanced models.</p>
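        <p>Because older models sometimes returned malformed JSON, a defensive parsing step helps in practice; the fallback below is our own illustrative workaround, not part of the task pipeline specification:</p>

```python
import json
import re

def extract_json_object(text: str):
    """Try strict JSON parsing first, then fall back to the first {...} span."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fallback: grab the outermost braces, tolerating surrounding chatter.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

noisy = 'Sure! Here is my verdict: {"score": 40, "explanation": "Partially correct."}'
print(extract_json_object(noisy)["score"])  # → 40
```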
        <p>
          Based on these insights, we selected Gemma3 (27b) [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] and Qwen3 (30b) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] as our primary models.
Both support a 128k token context window, making them suitable for handling large source texts.
Gemma3 is a decoder-only Transformer with sliding window attention and a function-calling head for
structured output, while Qwen3 is a mixture of experts (MoE) Transformer featuring
“Thinking/Non-thinking” modes. All models were run using ollama on a cluster equipped with NVIDIA A30 or RTX
A4000 GPUs.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>Prompt Engineering</title>
        <p>Prompt design was a critical factor in achieving reliable and structured outputs. We adopted a two-part
prompt structure:</p>
        <p>Listing 1: System Prompt (passed as the system argument)
You are a fair teacher who grades students’ answers. Evaluate the
quality of the Answer specifically in response to the Question
considering the Context provided. Format your entire response as a
single JSON object containing ’score’ (an integer between 0 and 100,
where 100 is best) and ’explanation’ (a string briefly justifying
the score).</p>
        <p>Listing 2: User Prompt (passed as the prompt argument)
Question: &lt;question&gt;
Answer: &lt;answer&gt;
And given the following context: &lt;text_fragments&gt;</p>
        <p>In the context of ollama.generate method, the system prompt (Listing 1) sets the overall behavior
and role of the model for the session, ensuring that responses are consistent and formatted as required.
The user prompt (Listing 2) provides the specific input for each evaluation instance, supplying the
question, answer, and relevant context to be assessed.</p>
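        <p>Putting the two prompts together, a single evaluation call through the ollama Python client looks roughly like this; the default model name and the helper functions are illustrative assumptions (our runs used Gemma3 27b and Qwen3 30b):</p>

```python
import json

SYSTEM_PROMPT = (
    "You are a fair teacher who grades students' answers. Evaluate the quality "
    "of the Answer specifically in response to the Question considering the "
    "Context provided. Format your entire response as a single JSON object "
    "containing 'score' (an integer between 0 and 100, where 100 is best) and "
    "'explanation' (a string briefly justifying the score)."
)

def build_user_prompt(question: str, answer: str, context: str) -> str:
    """Assemble the per-instance user prompt (Listing 2)."""
    return (f"Question: {question}\n"
            f"Answer: {answer}\n"
            f"And given the following context: {context}")

def evaluate(question: str, answer: str, context: str, model: str = "gemma3:27b"):
    """Score one answer; assumes a local ollama server with the model pulled."""
    import ollama
    response = ollama.generate(
        model=model,
        system=SYSTEM_PROMPT,                              # Listing 1
        prompt=build_user_prompt(question, answer, context),  # Listing 2
        format="json",  # ask ollama to constrain the output to valid JSON
    )
    return json.loads(response["response"])
```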
        <p>The key elements we considered during prompt engineering included:
• Clearly defining the order of information (whether to present the task description or the inputs
first).
• Explicitly stating the output format requirements (JSON with exactly two fields: score and
explanation).
• Emphasizing knowledge boundaries (e.g., instructing the model to rely only on the provided
context).</p>
        <p>This careful prompt design was essential to ensure that the models produced outputs in the required
format and focused their evaluation solely on the provided context.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>We processed subsets of the development set using the selected models, Gemma3 and Qwen3, to
generate scores and explanations for each question-answer pair. The following examples illustrate the
evaluation process and the models’ ability to provide structured, context-aware feedback:</p>
      <p>Listing 3: Input Example: Relevant Answer
Question: What is a result of misplacing punctuation marks in
machine translation?
Answer: What is a result of misplacing punctuation marks in machine
translation is not discussed in the provided text.</p>
      <p>Context: Computation Graphs For our example neural network from
Section ... book published in year 2020.</p>
      <p>Listing 4: Model Output: Relevant Answer
{
"score": 100,
"explanation": "The given answer is correct, the text does not
mention misplacing punctuation marks in machine translation."
}</p>
      <p>Listing 5: Input Example: Unrelated Answer
Question: What was the structure of trade in the Roman Empire?
Answer: In hospitals, dehydration is commonly treated with infusions.</p>
      <p>Context: CHAPTER OUTLINE 7.1 The Daily Life of a Roman Family 7.2
Slavery in the Roman Empire 7.3 The Roman Economy: Trade, Taxes,
and Conquest ... Jewish population during the imperial period.</p>
      <p>Listing 6: Model Output: Unrelated Answer
{
"score": 10,
"explanation": "The answer is entirely unrelated to the question.
It is the full text of a chapter on the Roman Empire. There is
no attempt to answer the question at all. Therefore the score
is the lowest possible."
}</p>
      <p>These examples illustrate the models’ capability to follow the structured output format and provide
reasonable scores and explanations based on the provided context and the task instructions. However, a
key finding was the difficulty of selecting “the right” model when multiple models of similar quality are
available, especially with large and detailed contexts, which is why simple eyeballing is not an option.
Furthermore, fine-tuning models was deemed inefficient compared to prompt-engineering the newest
models, particularly due to the lack of time and suitably styled training datasets.</p>
      <p>Our early efforts to create a custom training dataset by generating synthetic incorrect answers—such
as by shifting answers between unrelated QA pairs—highlighted further limitations. In particular, older
models like llama3.3 often failed to differentiate correct from incorrect answers and frequently returned
malformed or overly simplistic outputs. For that reason, we shifted our attention toward zero-shot
prompting with more advanced models.</p>
    </sec>
    <sec id="sec-4">
      <title>Related works</title>
      <p>
        Question Answering (QA) is a well-established area in Natural Language Processing. Traditional QA
systems often rely on large-scale knowledge bases or web corpora to extract answers [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In contrast,
context-based or reading comprehension QA tasks present models with a specific document and require
them to answer questions using only the provided information [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. These tasks require models to
locate and synthesize information solely from the given document(s), which aligns closely with the
ELOQUENT Sensemaking task’s objective of testing whether LLMs can limit their knowledge to the
provided materials. The Evaluator role, in particular, touches upon aspects of automated assessment
and answer scoring, which has parallels in educational technology and peer review systems. This
framing shares common ground with recent work in automatic answer grading and explanation-based
assessment, such as in the domain of Automatic Short Answer Grading (ASAG) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>Participating in the ELOQUENT Sensemaking task, specifically in the Evaluator role, highlighted several
challenges and provided us with useful insights. Manually evaluating the nuances of answers against
extensive source texts is inherently difficult and time-consuming, underscoring the need for robust
automated evaluation methods. However, creating such automated evaluators is also challenging,
especially when aiming for human-like judgment.</p>
      <p>A significant hurdle is the availability of suitable datasets for task-specific fine-tuning. While
general-purpose LLMs are powerful, adapting them to the precise requirements of a specialized evaluation
task without extensive, tailored training data relies heavily on prompt engineering. Our experiments
showed that newer models with large context windows, combined with careful prompt design, can
achieve promising results in scoring answers based on provided contexts. Nevertheless, the process of
selecting the best model and refining prompts remains an empirical endeavor. The findings suggest
that prompt-engineering with state-of-the-art models is currently a more pragmatic approach than
fine-tuning for such specific, limited-duration tasks, especially when appropriately styled training data
is scarce.</p>
      <p>Our code is available on GitHub: https://github.com/K4TEL/llm-sensemaking</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>Thanks to the Institute of Formal and Applied Linguistics (ÚFAL) at Charles University, Faculty of
Mathematics and Physics (MFF), for providing access to the HPC cluster with GPU nodes, which
was essential for running the experiments with large language models. This work has also received
funding from the Project OP JAK Mezisektorová spolupráce Nr. CZ.02.01.01/00/23_020/0008518 named
“Jazykověda, umělá inteligence a jazykové a řečové technologie: od výzkumu k aplikacím.”</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4 and Gemini 2.5 for grammar and
spelling checks and text style adjustments. After using these tools and services, the authors reviewed and
edited the content as needed and take full responsibility for the content of the publication.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Šindelář</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bojar</surname>
          </string-name>
          ,
          <article-title>Overview of the Sensemaking Task at the ELOQUENT 2025 Lab: LLMs as Teachers, Students and Evaluators</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum</source>
          , CEUR-WS,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Meta AI</surname>
          </string-name>
          ,
          <source>Llama 3.3 70B instruction-tuned model</source>
          ,
          <year>2024</year>
          . URL: https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Gemma</given-names>
            <surname>Team</surname>
          </string-name>
          , Gemma 3,
          <year>2025</year>
          . URL: https://goo.gle/Gemma3Report.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Qwen</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <source>Qwen3 technical report</source>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2505.09388.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fisch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bordes</surname>
          </string-name>
          ,
          <article-title>Reading wikipedia to answer open-domain questions</article-title>
          ,
          <source>ACL</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Rajpurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lopyrev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>SQuAD: 100,000+ questions for machine comprehension of text</article-title>
          ,
          <source>EMNLP</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Burrows</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Automated essay scoring: A survey of the state of the art</article-title>
          ,
          <source>IEEE Transactions on Learning Technologies</source>
          <volume>8</volume>
          (
          <year>2015</year>
          )
          <fpage>107</fpage>
          -
          <lpage>121</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>