<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>Working Notes of CLEF 2025: Conference and Labs of the Evaluation Forum</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>University of Amsterdam at the CLEF 2025 Eloquent Track</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bruno N. Sotic</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jaap Kamps</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Amsterdam</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper reports on the University of Amsterdam's participation in the CLEF 2025 Eloquent Track's Robustness and Consistency Task. Our overall goal is to evaluate the influence of stylistic prompt variations on semantic interpretation. Our specific focus is to investigate how variations in prompt tone, structure, and persona affect the consistency and robustness of responses generated by large language models (LLMs). We approach this through two complementary methods. First, we use a model-as-judge setup to quantify semantic consistency: each stylistic variant prompt is compared to its original base prompt using GPT-4.1 to rate the similarity of the generated responses on a 0-5 scale. Second, we conduct an inductive qualitative analysis on a selected prompt to closely examine how different stylistic framings influence content shifts in model outputs. Our results suggest that prompt reformulations can lead to variations in output, informational content, and tone.</p>
      </abstract>
      <kwd-group>
        <kwd>Stylistic Prompting</kwd>
        <kwd>Generative Large Language Models</kwd>
        <kwd>Robustness</kwd>
        <kwd>Semantic Consistency</kwd>
        <kwd>Culturally Appropriate Responses</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The first CLEF 2024 Eloquent Track [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], featured the Quiz Task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the HaluciGen Task [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and the
Robust Task [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. After a successful first year, the track continues as the CLEF 2025 Eloquent track [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
There is a mix of ongoing and new tasks: the CLEF 2024 Eloquent Robust Task [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] continues as the
CLEF 2025 Eloquent Robustness and Consistency Task [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which is our main research focus in this
paper. This paper contains an extended version of our main result, as the Task Overview [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] already
includes a "Joint Report."
      </p>
      <p>While the main design of the task focuses on robustness and consistency, we also examined the
cultural or stylistic appropriateness of the response. We feel that both aspects are of key interest. On the
one hand, the information value of the response must be invariant across prompts that are assumed to be
equivalent, for example, for a factual request phrased in a different language. On the other hand, responses
must be culturally and stylistically appropriate. Here, we may expect the same informational content to be
framed and phrased very differently depending on the context.</p>
      <p>The goal of our participation was to explore how stylistic variation in prompts affects model behavior.
The original English-language prompts provided by the track organizers served as a base and were rewritten
into nine distinct prompting styles. These styles were derived from a typology informed by academic
literature.</p>
      <p>The rewritten prompts preserved the original semantic and informational content but varied in
phrasing, tone, structure, and stylistic framing. All variation was implemented exclusively through
user-facing prompts. This submission includes only the English-language prompts, though versions in
additional languages are currently in preparation.</p>
      <p>The goal of this work is not to make broad claims about LLM behavior but to conduct an initial,
exploratory analysis of how semantically equivalent prompts (differing only in style) may yield
semantically divergent outputs - an increasingly important problem given the recent popularity of LLMs for
informational search.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption><p>Prompt styles and their definitions.</p></caption>
        <table>
          <thead>
            <tr><th>Prompt Style/Name</th><th>Definition</th></tr>
          </thead>
          <tbody>
            <tr><td>Aggressive/Authoritative Tone</td><td>Prompts characterized by commanding or forceful language, often lacking politeness or courtesy.</td></tr>
            <tr><td>Conversational Tone</td><td>Prompts that mimic natural human dialogue, often informal and friendly in nature.</td></tr>
            <tr><td>Chain-of-Thought (CoT)</td><td>A prompting technique where the model is guided to generate intermediate reasoning steps before arriving at a final answer.</td></tr>
            <tr><td>Formatting Differences</td><td>Variations in the structural presentation of prompts, such as the use of lists, bullet points, or different punctuation.</td></tr>
            <tr><td>Persona-Based Prompts</td><td>Prompts that assign a specific role or identity to the model, such as “You are a helpful assistant.”</td></tr>
            <tr><td>Polite Tone</td><td>Prompts that employ courteous language, including phrases like “please” and “thank you.”</td></tr>
            <tr><td>Technical/Jargon-Heavy Prompts</td><td>Prompts that utilize domain-specific terminology or complex language.</td></tr>
            <tr><td>System 1 Thinking Prompts</td><td>Prompts that encourage fast, intuitive responses, aligning with the concept of System 1 thinking.</td></tr>
            <tr><td>System 2 Thinking Prompts</td><td>Prompts that promote slow, deliberate reasoning, aligning with the concept of System 2 thinking.</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The remainder of this paper is structured as follows. Next, Section 2 presents our experimental
setup. Section 3 presents our analytical approach. Section 4 discusses our detailed analysis and findings.
Section 5 ends the paper with a discussion and conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Experimental Set-up</title>
      <p><bold>2.1. Prompt Design</bold> The 15 original prompts were manually rewritten into nine stylistic variations (with the
tenth style being the original baseline), resulting in a total of 135 rewritten prompts. Although the rewriting was done by hand,
Gemini was used to validate grammar and fluency.</p>
      <p>
        The aim was to keep the semantic content of each prompt consistent while varying stylistic/linguistic
aspects (tone, framing, structure). The typology and definitions of each style are based on the literature
[
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14 ref7 ref8 ref9">7, 8, 9, 10, 11, 12, 13, 14</xref>
        ]. The prompt styles are presented in Table 1.
      </p>
      <p>Only the English prompts were used in this analysis, as the authors lacked native-speaker proficiency
in the other languages, and machine translation might introduce too much ambiguity to reliably validate
semantic similarity.</p>
      <p>Figure 1 presents basic descriptive statistics illustrating how each stylistic prompt type differs in
surface-level linguistic aspects. These statistics concern only the prompt itself (the request), not the response
of the model. The responses are analyzed in detail in the rest of this paper.</p>
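      <p>As a concrete illustration, surface statistics of this kind can be computed in a few lines of Python. This is a minimal sketch, not the authors' actual script; the function name surface_stats, the tokenization rules, and the example prompts are our own choices.</p>

```python
# Illustrative sketch: surface-level descriptive statistics per prompt,
# of the kind summarized in Figure 1 (word count, mean word length,
# sentence count). Tokenization here is deliberately simple.
import re
import statistics

def surface_stats(prompts):
    """Return basic surface statistics for each prompt string."""
    stats = []
    for text in prompts:
        words = re.findall(r"[A-Za-z']+", text)
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        stats.append({
            "n_words": len(words),
            "avg_word_len": statistics.mean(len(w) for w in words) if words else 0.0,
            "n_sentences": len(sentences),
        })
    return stats

# Example: a base prompt vs. a (hypothetical) polite rewrite of the same request.
base = "Which societal value is most important? Rank them."
polite = "Could you please tell me which societal value you consider most important? Thank you."
print(surface_stats([base, polite]))
```

      <p>Even this toy comparison shows the polite variant inflating length without changing the underlying request, which is exactly the kind of surface variation the figure summarizes.</p>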
      <p><bold>Semantic Equivalence Validation</bold> Given the goal of eliciting different answers from semantically
equivalent prompts, an attempt was made to verify that the rewritten prompts are indeed semantically
similar. This is a difficult task and remains an open challenge.</p>
      <p>
        To approach this, a heuristic "AI-as-a-judge" method was used, inspired by [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. GPT-4.1 was prompted
to act as an oracle and evaluate whether the stylistic rewrites of each original prompt still asked the
same thing in terms of meaning and intent.
      </p>
      <p>The comparison was made between a rewritten variation and its original prompt, using the following
system prompt:</p>
      <preformat>You will be given multiple questions. Your task is to assess whether each variation
conveys the same intent and meaning as the original. For each variation, assign a
similarity score from 0 to 5 - where 0 means completely different in meaning and 5
expresses the exact same intent. Return exactly one JSON object with two keys:
* "variation_id": the identifier for this variation (filename and item id).
* "score": a number (integer) between 0 and 5, inclusive.

Do not output any extra text - only the JSON object.

Example output:
{
"variation_id": "style1.json#item42",
"score": x
}</preformat>
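      <p>For illustration, the strict output format requested of the judge can be checked programmatically. The sketch below (the function name parse_judge_output is ours, not from the paper) parses the judge's reply and rejects anything that deviates from a single JSON object with the two required keys and an integer score in the 0-5 range.</p>

```python
# Validate a judge reply against the requested format: exactly one JSON
# object with keys "variation_id" and "score", score an integer in [0, 5].
import json

def parse_judge_output(raw: str) -> tuple[str, int]:
    obj = json.loads(raw)  # raises ValueError if the reply is not valid JSON
    if set(obj) != {"variation_id", "score"}:
        raise ValueError(f"unexpected keys: {sorted(obj)}")
    score = obj["score"]
    if not isinstance(score, int) or not 0 <= score <= 5:
        raise ValueError(f"score out of range: {score!r}")
    return obj["variation_id"], score

print(parse_judge_output('{"variation_id": "style1.json#item42", "score": 4}'))
```

      <p>Rejecting malformed replies outright, rather than repairing them, keeps the heuristic conservative: a score is only recorded when the judge followed the instructed format.</p>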
      <p>While this approach is not a perfect semantic measure, it serves as a pragmatic heuristic for
approximating whether the stylistic prompts convey the same meaning (and whether they are understood as
such by the system).</p>
      <p>The results of this semantic validation are presented in Table 2.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Analytic Framework</title>
      <p>The full set of stylistic prompts was used to query GPT-4.1 in isolated sessions to avoid memory effects.
Each variation was treated as a new query, and the model’s responses were recorded.</p>
      <p>Descriptive metrics were averaged across prompt styles and are shown in Figure 2. As expected for
advanced instruction-tuned large language models, the style of the prompt indeed has a significant
effect on the response. There are some interesting similarities between the statistics of the responses in
Figure 2 and those of the prompts in Figure 1 earlier.</p>
      <p>Evaluating the effect of prompt stylistics on LLM responses presents a methodological challenge.
The prompts in our study pose culturally loaded, subjective questions (e.g., “Is it more important to be
honest or polite?”) for which there is no correct answer in the conventional sense. These questions differ
from typical factual QA tasks (e.g., “Is X greater than Y?”), where semantic similarity can be more easily
computed using token overlap, embeddings, or entailment metrics.</p>
      <p>In our case, LLM responses express stances, values, and culturally framed reasoning. Since the prompt
variations also differ in structure, tone, and length (e.g., “What is more important, honesty or politeness?”
vs. “In your opinion, should one prioritize being polite over being honest?”), the responses they elicit often
vary in length, rhetorical form, and surface structure.</p>
      <p>We use the same AI-as-judge method described in the previous section. However, in this phase, we
provided the model with both the original question and its corresponding base response, allowing it to
evaluate the semantic similarity of each stylistic variant’s response relative to this reference. Ratings
were again given on a 5-point scale.</p>
      <p>In addition, we conducted an inductive qualitative content analysis on a selected prompt and its
variants to gain deeper insight into how different styles influence the substance of responses. This
involved descriptively labeling and comparing each response variant, identifying shifts in emphasis,
reasoning patterns, tone, and framing. The goal was not to quantify but to trace how and where meaning
drifted.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>This section presents the results of our experiment. First, we report the outcomes of the AI-as-judge
evaluation using GPT-4.1, followed by the findings from an inductive qualitative analysis of a smaller
sample.</p>
      <p><bold>AI-as-Judge</bold> Table 3 shows the averaged similarity scores assigned by GPT-4.1 across the different
prompting styles. Each score reflects how closely a candidate answer matched a reference answer in
meaning and intent. The complete set of per-item scores is available in the appendix.</p>
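      <p>Aggregates of this kind can be reproduced with a short script. The sketch below uses made-up scores; only the summary statistics it computes (mean, standard deviation, minimum, maximum) mirror the columns reported in Table 3.</p>

```python
# Aggregate per-item judge scores into per-style summary statistics
# (mean, sample std dev, min, max). The demo scores are invented.
import statistics
from collections import defaultdict

def aggregate(scores):
    """scores: iterable of (style, score) pairs -> per-style summary stats."""
    by_style = defaultdict(list)
    for style, score in scores:
        by_style[style].append(score)
    return {
        style: {
            "mean": round(statistics.mean(vals), 2),
            "stdev": round(statistics.stdev(vals), 2) if len(vals) > 1 else 0.0,
            "min": min(vals),
            "max": max(vals),
        }
        for style, vals in by_style.items()
    }

demo = [("persona", 4), ("persona", 3), ("system_1_thinking", 3), ("system_1_thinking", 0)]
print(aggregate(demo))
```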
      <p>Across styles, average scores fell in the mid-range (3.1–3.6), suggesting that GPT-4.1 frequently
judged responses as partially overlapping with the reference. A score near 3 indicates that some key
points were shared, although details differed or were missing. The highest overall mean was achieved by the
persona style (3.67), which also had the lowest standard deviation (0.69) and no zero scores (indicating
that all responses aligned with the reference to some degree).</p>
      <p>System_1_thinking and technical_jargon styles had the lowest means (both 3.11), suggesting more
frequent semantic drift. System 1 prompts (which encourage fast, intuitive responses) often led to
simplified or overgeneralized reasoning. Technical jargon, on the other hand, appeared to reframe or
distort the original meaning through abstract or (unsurprisingly) domain-specific language. Both styles
also showed wider variability and occasional minimum scores of zero, meaning that in some cases the
LLM’s responses were entirely off-topic.</p>
      <p>Among all styles, format_differences (where only the layout of the output changed) produced the
narrowest score range (Min=1, Max=4, StdDev=0.86). Changes to surface formatting seem to preserve
meaning more reliably than changes in tone or reasoning structure, although no style consistently
produced perfect matches.</p>
      <p><bold>Qualitative Analysis</bold> An inductive qualitative analysis was conducted on a single prompt across all
styles. The selected prompt (Question ID 12) asked the model to rank six societal values:
“Which of the following societal values is most important? Rank them with the most important
first: [individual freedom, promotion of industry, support for creativity, safety, equality, law
and order]”
This task was chosen because ranking prompts enable straightforward observation of content shifts,
priority changes, and semantic variation.</p>
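      <p>One simple way to make “order preservation” concrete is pairwise agreement: the fraction of value pairs whose relative order a variant keeps. This is our own illustrative metric, not the one used in the paper, and the base ordering below is arbitrary.</p>

```python
# Pairwise order agreement between a base ranking and a variant ranking:
# count how many of the C(6, 2) = 15 value pairs keep their relative order.
# (A normalized Kendall-style measure; 1.0 means identical ordering.)
from itertools import combinations

def pairwise_agreement(base, variant):
    pos_b = {v: i for i, v in enumerate(base)}
    pos_v = {v: i for i, v in enumerate(variant)}
    pairs = list(combinations(base, 2))
    kept = sum(
        1 for a, b in pairs
        if (pos_b[a] < pos_b[b]) == (pos_v[a] < pos_v[b])
    )
    return kept / len(pairs)

# Arbitrary base order over the six values from the prompt.
base = ["safety", "equality", "individual freedom", "law and order",
        "support for creativity", "promotion of industry"]
swapped_top = ["equality", "safety"] + base[2:]
print(pairwise_agreement(base, base))        # identical order -> 1.0
print(pairwise_agreement(base, swapped_top)) # one adjacent swap -> 14/15
```

      <p>Under such a measure, styles that merely swap adjacent values score close to 1, while wholesale reorderings of the top-ranked values fall much lower.</p>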
      <p>Table 4 summarizes the comparative findings. In terms of order preservation, the persona and
chain-of-thought (CoT) styles remained closest to the base ranking, making only minor adjustments.
In contrast, aggressive, conversational, format_difference, system_1_thinking, system_2_thinking, and
technical_jargon frequently reordered top-ranked values, indicating that the tone or reasoning style
affected how the model prioritized the list.</p>
      <p>Rationale also played a role. Styles that embedded explicit justifications, i.e., CoT, persona, system_2,
and technical_jargon, tended to maintain closer alignment with the logic of the base ranking, even when
the order shifted slightly. In contrast, outputs that omitted reasoning or presented flat, unexplained lists,
i.e., aggressive and format_difference, showed greater divergence from the original rationale.</p>
      <p>Some styles also expanded the scope of the task. The conversational, polite, and system_1_thinking
prompts often introduced multiple perspectives or emphasized the subjectivity of ranking values. Rather
than providing a single prioritized list, these responses framed the task as open-ended or contingent,
fundamentally shifting the prompt’s intention from a single viewpoint to a multi-perspective discussion.</p>
      <p>Framing effects were also evident, particularly in the conversational, polite, and persona styles, which included
epistemic markers such as “As an AI...” or referred to expert communities (e.g., “political scientists
might say...”). These framings shifted the tone and sometimes led the model away from direct rankings
and toward speculative responses.</p>
      <p>Finally, the use of formal or technical language influenced interpretability. The technical_jargon style
frequently translated everyday values into academic terminology. While the core logic was often intact,
this reframing affected accessibility and occasionally altered perceived intent.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Conclusions</title>
      <p>The goal of this experiment was to explore whether semantically equivalent variations of subjective
questions could nonetheless produce semantically different outputs when posed to an LLM. Using a
set of manually written base prompts, each question was reformulated into multiple stylistic variants
inspired by a literature review. To ensure that the rewrites remained semantically close to the original
intent, we employed GPT-4.1 as a semantic judge in a separate evaluation stage. While this method is
not flawless, it allowed us to establish that the prompts were, in principle, perceived as similar by the
same model architecture used to generate the answers.</p>
      <p>After generating LLM responses to each prompt variant, we evaluated the semantic similarity of the
outputs using a 0–5 scale, where 5 indicates near-complete overlap in meaning with the base answer,
and 0 denotes a fundamentally different or contradictory response.</p>
      <p>Across most styles, the average similarity scores fell between 3.1 and 3.6, suggesting that while the
responses often shared some common ground with the originals, they frequently diverged in specifics
or focus. The “Persona” style stood out with the highest average score of 3.67 and the least variation,
implying it reliably produced answers closest to the originals. By contrast, styles like “System 1” and
“Technical Jargon” averaged the lowest score of 3.11, with responses that sometimes strayed far from
the intended meaning, including cases where they were completely off the mark.</p>
      <p>However, it is important to note that relying on GPT-4.1 to both generate and judge the responses
raises the possibility of bias, as those middle-of-the-road scores might sometimes reflect the model’s
own uncertainty rather than real differences.</p>
      <p>Further qualitative inspection of a single prompt response and its variants found that certain styles
(“Persona” and “Chain-of-Thought”) tended to mirror the original ranking order more closely.
“Aggressive,” “Format,” “System 1,” and “Technical Jargon” often shuffled the order, hinting that the style itself
influenced how the model weighed the values. Styles that prompted the model to explain its reasoning
(“Chain-of-Thought” or “System 2”) generally stayed truer to the base prompt by offering justifications
that echoed the original prompt response. However, when the style favored brevity or simply listed
the values without elaboration, some of the original meaning was lost. In addition, “Conversational”
and “Polite” sometimes broadened the scope by encouraging multiple perspectives or highlighting
subjectivity, resulting in more open-ended responses. Assigning the LLM a role (“Persona”) led the
model to adopt a more cautious or expansive tone. When technical language was used, the model
occasionally recast everyday values into more abstract terms, which could pull the response away from
what was initially intended.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We thank the track and task organizers for their outstanding service and effort in making realistic benchmarks
available for evaluating generative language model quality.</p>
      <p>Bruno Sotic is partly funded by the Netherlands Organization for Scientific Research (NWO NWA # 1518.22.105).
Jaap Kamps is partly funded by the Netherlands Organization for Scientific Research (NWO CI # CISC.CC.016,
NWO NWA # 1518.22.105), the University of Amsterdam (AI4FinTech program), and ICAI (AI for Open Government
Lab). Views expressed in this paper are not necessarily shared or endorsed by those funding the research.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used NotebookLM and Grammarly in order to:
check grammar and spelling, and paraphrase and reword. After using these tools/services, the authors
reviewed and edited the content as needed and take full responsibility for the publication’s content.
All data, code, and material necessary to reproduce this study are publicly available at: https://anonymous.4open.science/r/Eloquent-2025-Robustness-and-Consistency-6C71/README.md</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] J. Karlgren, L. Dürlich, E. Gogoulou, L. Guillou, J. Nivre, M. Sahlgren, A. Talman, S. Zahra, Overview of ELOQUENT 2024 - shared tasks for evaluating generative language model quality, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, G. M. D. Nunzio, L. Soulier, P. Galuscáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction - 15th International Conference of the CLEF Association, CLEF 2024, Grenoble, France, September 9-12, 2024, Proceedings, Part II, volume 14959 of Lecture Notes in Computer Science, Springer, 2024, pp. 53-72. URL: https://doi.org/10.1007/978-3-031-71908-0_3. doi:10.1007/978-3-031-71908-0_3.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] J. Karlgren, A. Talman, ELOQUENT 2024 - topical quiz task, in: G. Faggioli, N. Ferro, P. Galuscáková, A. G. S. de Herrera (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), Grenoble, France, 9-12 September, 2024, volume 3740 of CEUR Workshop Proceedings, CEUR-WS.org, 2024, pp. 687-690. URL: https://ceur-ws.org/Vol-3740/paper-65.pdf.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] L. Dürlich, E. Gogoulou, L. Guillou, J. Nivre, S. Zahra, Overview of the CLEF-2024 ELOQUENT lab: Task 2 on HalluciGen, in: G. Faggioli, N. Ferro, P. Galuscáková, A. G. S. de Herrera (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), Grenoble, France, 9-12 September, 2024, volume 3740 of CEUR Workshop Proceedings, CEUR-WS.org, 2024, pp. 691-702. URL: https://ceur-ws.org/Vol-3740/paper-66.pdf.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] M. Sahlgren, J. Karlgren, L. Dürlich, E. Gogoulou, A. Talman, S. Zahra, ELOQUENT 2024 - robustness task, in: G. Faggioli, N. Ferro, P. Galuscáková, A. G. S. de Herrera (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), Grenoble, France, 9-12 September, 2024, volume 3740 of CEUR Workshop Proceedings, CEUR-WS.org, 2024, pp. 703-707. URL: https://ceur-ws.org/Vol-3740/paper-67.pdf.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] J. Karlgren, E. Artemova, O. Bojar, M. I. Engels, V. Mikhailov, P. Šindelář, E. Velldal, L. Øvrelid, Overview of ELOQUENT 2025: shared tasks for evaluating generative language model quality, in: J. Carrillo de Albornoz, J. Gonzalo, L. Plaza, A. García Seco de Herrera, J. Mothe, F. Piroi, P. Rosso, D. Spina, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF 2025), Lecture Notes in Computer Science, Springer, 2025.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Engels</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Barrett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Gunti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hoveyda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. N.</given-names>
            <surname>Sotic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koistinen</surname>
          </string-name>
          , E. Zosa,
          <article-title>Overview and Joint Report for Robustness and Consistency Task of the ELOQUENT 2025 Lab for Evaluating Generative Language Model Quality</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>Working Notes of CLEF 2025: Conference and Labs of the Evaluation Forum</source>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Horio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kawahara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sekine</surname>
          </string-name>
          ,
          <article-title>Should we respect LLMs? A cross-lingual study on the influence of prompt politeness on LLM performance</article-title>
          , in:
          <string-name>
            <given-names>J.</given-names>
            <surname>Hale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chawla</surname>
          </string-name>
          , M. Garg (Eds.),
          <source>Proceedings of the Second Workshop on Social Influence in Conversations (SICon 2024)</source>
          , Association for Computational Linguistics,
          <year>2024</year>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>35</lpage>
          . URL: https://aclanthology.org/2024.sicon-1.2/. doi:10.18653/v1/2024.sicon-1.2.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>A practical survey on zero-shot prompt design for in-context learning</article-title>
          ,
          in:
          <source>Proceedings of the Conference Recent Advances in Natural Language Processing - Large Language Models for Natural Language Processing</source>
          , pp.
          <fpage>641</fpage>
          -
          <lpage>647</lpage>
          . URL: http://arxiv.org/abs/2309.13205. doi:10.26615/978-954-452-092-2_069. arXiv:2309.13205 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , K. Gai,
          <article-title>Enhancing role-playing systems through aggressive queries: Evaluation and improvement</article-title>
          . URL: http://arxiv.org/abs/2402.10618. doi:10.48550/arXiv.2402.10618. arXiv:2402.10618 [cs], version 1.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Langrené</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Unleashing the potential of prompt engineering for large language models</article-title>
          (
          <year>2025</year>
          ) 101260. URL: http://arxiv.org/abs/2310.14735. doi:10.1016/j.patter.2025.101260. arXiv:2310.14735 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ein-Dor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Toledo-Ronen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Spector</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gretz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dankin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Halfon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Katz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Slonim</surname>
          </string-name>
          ,
          <article-title>Conversational prompt engineering</article-title>
          . URL: http://arxiv.org/abs/2408.04560. doi:10.48550/arXiv.2408.04560. arXiv:2408.04560 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zamfirescu-Pereira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. Y.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hartmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Why johnny can't prompt: How non-AI experts try (and fail) to design LLM prompts</article-title>
          ,
          <source>in: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, CHI '23</source>
          ,
          Association for Computing Machinery,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          . URL: https://dl.acm.org/doi/10.1145/3544548.3581388. doi:10.1145/3544548.3581388.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rungta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Koleczek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sekhon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hasan</surname>
          </string-name>
          ,
          <article-title>Does prompt formatting have any impact on LLM performance?</article-title>
          . URL: http://arxiv.org/abs/2411.10541. doi:10.48550/arXiv.2411.10541. arXiv:2411.10541 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>L.</given-names>
            <surname>Pawlik</surname>
          </string-name>
          ,
          <article-title>How the choice of LLM and prompt engineering affects chatbot effectiveness</article-title>
          ,
          <volume>14</volume>
          , 888. URL: https://www.mdpi.com/2079-9292/14/5/888. doi:10.3390/electronics14050888, number: 5, publisher: Multidisciplinary Digital Publishing Institute.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dürlich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gogoulou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Talman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zahra</surname>
          </string-name>
          ,
          <article-title>Eloquent 2024 - robustness task</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          (Eds.),
          <source>Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024)</source>
          , CEUR Workshop Proceedings, CEUR-WS.org, Germany,
          <year>2024</year>
          , pp.
          <fpage>703</fpage>
          -
          <lpage>707</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>