<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Context in Few-Shot LLM Prompting</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Terry R. Payne</string-name>
          <email>T.R.Payne@liverpool.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Liverpool</institution>
          ,
          <addr-line>Liverpool L69 7ZX</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper investigates the role of structured data, specifically ontology triples, in prompting large language models for the task of retrofitting competency questions from ontologies. Building on RETROFIT-CQ, our previous work that introduces a zero-shot method for generating competency questions from ontology triples, we explore how few-shot prompting can enhance the quality and contextual alignment of the generated questions. We incorporate examples of competency questions, ontology URIs, and textual descriptions into the prompts to evaluate how different combinations of structured and contextual data influence large language model performance. We empirically evaluate this few-shot approach on a selection of benchmark ontologies (Video Game, African Wildlife, and Vicinity Core) which were originally used to evaluate the previous zero-shot prompt. Our experiments demonstrate that few-shot prompting helps in reducing overgeneralisation and in improving semantic alignment.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Competency Questions (CQs) play a central role in ontology engineering. They serve multiple purposes
across the ontology lifecycle, from supporting requirements elicitation in the early stages of
development [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5">1, 2, 3, 4, 5</xref>
        ], to facilitating verification and validation during testing [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ], and even enabling
ontology reuse by recommending suitable candidate ontologies [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9, 10</xref>
        ]. Despite their importance,
authoring CQs remains a non-trivial task for both ontology engineers and domain experts [11, 12, 13],
and is often based on traditional knowledge elicitation approaches, e.g. card sorting or 20 questions [14].
      </p>
      <p>Recent advances in Artificial Intelligence, particularly with the rise of Large Language Models
(LLMs), have created new opportunities to support the authoring of CQs. For example, LLMs have been
used to address challenges such as grammar correction and language variability, which often hinder
CQ formulation [15]. Several recent efforts have leveraged LLMs for CQ generation: AgoCQs [15]
explores generation from a corpus of text describing a domain; RevOnt [16] uses Wikidata as a source
of background knowledge; OntoChat [17] facilitates dialogue-based CQ creation through interactive
agents; and RETROFIT-CQ [18] generates questions for ontologies whose CQs have not been published
with the ontology itself by leveraging ontology triples. A similar approach is also adopted in [19].</p>
      <p>Although LLMs mitigate several traditional limitations in different AI tasks, their use introduces new
research challenges and opportunities, particularly arising from variability across model architectures,
parameter configurations, and prompt engineering strategies. Previous studies have explored the use of
both open-source and proprietary LLMs, and examined parameters such as temperature, which can
influence hallucination and creativity [20, 21]. However, one critical and underexplored aspect remains:
prompting techniques, i.e. the strategies used to instruct LLMs.</p>
      <p>Prompting governs how humans interact with LLMs, typically through instructions or examples [22,
23, 24]. Common strategies include zero-shot prompting, where the model is asked to perform a task
without any examples, few-shot prompting, where a small number of examples are provided, and more
structured approaches like chain-of-thought prompting, which guide reasoning steps. While prior CQ
generation studies have used zero-shot prompts extensively [18, 19, 16, 15, 25], and some have explored
chain-of-thought prompting [17], few-shot prompting remains largely uninvestigated in this context.</p>
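      <p>To make the distinction concrete, the two strategies differ only in whether worked examples precede the task. A minimal sketch in the chat-message format used by most LLM APIs (the triple and the worked example are invented placeholders, not prompts from this study):</p>
      <p>
```python
# Illustrative sketch: the same CQ-generation task issued zero-shot and few-shot.
# The triple and the worked example are invented placeholders.
task = ("Generate a competency question for the RDF triple:\n"
        "Subject: Game  Predicate: hasGenre  Object: Genre")

# Zero-shot: the task alone, with no examples.
zero_shot = [{"role": "user", "content": task}]

# Few-shot: one or more worked examples precede the same task.
few_shot = [
    {"role": "user", "content":
        "Subject: Player  Predicate: playsCharacter  Object: Character\n"
        "Generated question: Which character does a given player play?"},
    {"role": "user", "content": task},
]
```
      </p>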
      <p>In contrast, fields such as education have widely adopted few-shot prompting, demonstrating its
effectiveness in generating high-quality, context-sensitive questions in reading comprehension or
competency-based assessments [26, 27, 28]. These findings motivate our exploration of few-shot
prompting for CQ generation.</p>
      <p>In this study, we build upon our previous method, RETROFIT-CQ, which generates CQs from structured
RDF triples extracted from ontologies using tailored prompts for LLMs. RETROFIT-CQ has shown
encouraging results in generating CQs that reflect the intent and content of manually authored CQs
across a range of existing ontologies [21, 20, 18]. We extend this method by introducing few-shot
prompting and compare its effectiveness with zero-shot prompting.</p>
      <p>Our exploratory results indicate that few-shot prompting yields modest yet meaningful improvements
in the quality of generated CQs, notably by reducing overgeneralisation and producing more concise,
context-sensitive formulations. However, these improvements are not statistically significant at this
stage. We therefore suggest that further investigation is required across a broader range of LLMs and
ontologies to determine whether the benefits of few-shot prompting justify the associated computational
and cost overhead—particularly in domains where such advantages may be more pronounced, such as
education.</p>
      <p>The remainder of this paper is structured as follows: Section 2 discusses related work, while Section 3
details the methodology used for this exploratory study, including the experimental design and
evaluation metrics. We present the results in Section 4 and discuss key findings and limitations in Section 5.
Finally, Section 6 concludes the paper and outlines directions for future work.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related work</title>
      <p>
        Several studies have investigated the generation of CQs, exploring a spectrum of approaches. Traditional
approaches often rely on close interaction with domain experts and ontology engineers to manually craft
CQs (e.g., [
        <xref ref-type="bibr" rid="ref1">1, 14</xref>
]). More recent approaches either continue to involve human input while incorporating
automated techniques (LLMs) [17], or aim to partially replace human involvement by using external
knowledge resources along with LLMs to support the CQ generation process [18, 19, 16, 15, 25].
      </p>
      <p>Interaction with LLMs is typically carried out through prompting techniques. While various
prompting strategies exist [22, 23, 24], most previous approaches for CQ generation have primarily used
zero-shot prompting [18, 19, 16, 15, 25, 17]. To the best of our knowledge, few-shot prompting (i.e.,
providing illustrative examples within the prompt) has not yet been applied in methods for generating
CQs in the context of ontology engineering. However, in related domains such as automatic question
generation for educational applications [29], few-shot prompting has shown promising results,
particularly in improving question relevance and quality. For example, in [26], the authors showed that
few-shot prompting improves model performance in generating higher-order questions for
comprehensive reading assessment compared to zero-shot prompting. Additionally, [27] explored a few-shot
prompting strategy for controllable question generation in narrative comprehension. Their results
demonstrate that questions generated with attribute-specific guidance closely match the corresponding
ground-truth questions.</p>
      <p>In [28], the authors proposed a method for extracting text from PDF documents, chunking it into
500-word segments, and preprocessing it to build a KG that captures contextual information. This is
followed by generating competency questions for educational assessment using both zero-shot and
few-shot prompts. However, the KG’s role in competency question generation is unclear: it is not fully
specified whether the full KG or a subgraph is passed in the prompt. For example, in the few-shot prompt, the
system is given a text describing operational waste management along with five example questions
distilled from the text, but there is no explicit mention of the KG in the prompt. A key takeaway
is that few-shot prompting improves the model’s ability to generate contextually relevant questions
aligned with the source content. This suggests a promising direction for investigating its applicability
in generating CQs aimed at capturing ontology requirements.</p>
      <p>We evaluate the CQs generated using few-shot prompting against the original CQs used in the
ontology construction process for each of the ontologies used in our experiments. There is a lack of
consensus within the ontology engineering community on what constitutes a “good” CQ [13, 12, 11]
and on appropriate methods to evaluate their quality. We therefore compare the results of
this extended approach against the zero-shot prompting baseline in [20, 21]. Through this comparative
evaluation, we aim to identify strengths and limitations of each prompting approach, understand
their impact on the CQs generated, and highlight promising directions for future research focused on
improving competency question generation in ontology engineering.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Retrofit-CQ with Few-shot Prompting</title>
      <p>
        The RETROFIT-CQ approach [18] addresses the absence of published CQs for a given ontology. Ideally,
the statements within an ontology are derived from a set of CQs formulated by ontology engineers as
part of the ontology construction process, following some of the most prominent ontology engineering
methodologies [
        <xref ref-type="bibr" rid="ref1 ref3 ref5">5, 1, 3</xref>
        ]. However, often these CQs are not made available, either as part of the ontology
documentation or in publications about the ontology. The goal of the RETROFIT-CQ pipeline is to reverse
this process, reconstructing the CQs from the ontology’s existing statements, which are represented as
triples. The new pipeline, utilising few-shot prompting, is organised into five main phases as presented
in Figure 1: (i) Triples are extracted from the ontology; (ii) Few-shot prompts are constructed by
combining the CQ definition with various contextual elements (such as exemplar CQs, the ontology URI,
and its description—discussed later in this section); (iii) These triples are embedded into the few-shot
prompt to create input queries for the language model; (iv) The language model is then queried using
these prompts to generate a diverse set of questions; and (v) The generated questions are filtered to
remove duplicates and irrelevant entries, producing the final set of candidate CQs.
      </p>
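      <p>A minimal sketch of the five phases, assuming a toy ontology and a canned stand-in for the LLM call (the function names are illustrative, not the authors’ implementation):</p>
      <p>
```python
# Sketch of the five-phase RETROFIT-CQ few-shot pipeline. The toy ontology,
# function names, and the canned LLM response are illustrative assumptions.

def extract_triples(ontology):
    """(i) Extract (subject, predicate, object) triples from the ontology."""
    return list(ontology)

def build_prompt(triple, exemplars, uri=None, description=None):
    """(ii)+(iii) Combine contextual elements and embed the triple into a few-shot prompt."""
    parts = []
    if uri:
        parts.append(f"Ontology URI: {uri}")
    if description:
        parts.append(f"Ontology description: {description}")
    for (s, p, o), cq in exemplars:
        parts.append(f"Subject: {s}\nPredicate: {p}\nObject: {o}\n"
                     f"Generated question: {cq}")
    s, p, o = triple
    parts.append("Now, based on the RDF triple below, generate one or more relevant CQs:\n"
                 f"Subject: {s}\nPredicate: {p}\nObject: {o}")
    return "\n---\n".join(parts)

def query_llm(prompt):
    """(iv) Stand-in for the LLM call; returns canned questions here."""
    return ["What achievements are modelled?", "What achievements are modelled?"]

def filter_questions(questions):
    """(v) Remove duplicates (the full pipeline also drops irrelevant entries)."""
    return list(dict.fromkeys(questions))

ontology = [("Virtuosity", "subClassOf", "Achievement")]
exemplars = [(("Game", "hasGenre", "Genre"), "What is the genre of a given game?")]
candidates = []
for t in extract_triples(ontology):
    candidates += filter_questions(query_llm(build_prompt(t, exemplars)))
```
      </p>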
      <p>Previous studies present an empirical analysis of the RETROFIT-CQ approach conducted across
various LLMs [18, 20, 21]. The results of these studies support the claim that LLMs, when provided
with structured input (in the form of ontology triples) and carefully designed prompts, can effectively
generate valid CQs with high recall, measured by how well they matched the existing CQs for each
ontology included in the experiment. In particular, [20, 21] assess the impact of both the creativity
parameter (temperature) and the degree of contextual content included in zero-shot prompts used with
both closed-source LLMs (e.g., gpt-3.5-turbo and gpt-4) and open-source models (e.g., Flan-T5, Mistral,
and LLaMA). A key insight from previous studies is that increasing the level of contextual information
in zero-shot prompts can improve the precision of the generated CQs.</p>
      <p>In this study, we therefore explore whether few-shot prompts with varying contextual elements
affect the quality of the generated candidate CQs compared to zero-shot prompts. One of these
contextual elements is the role, and we distinguish between system and user roles, as shown in the
following prompts:
messages = [{"role": "system", "content": ("You are an ontology engineer working
on a project to develop a new ontology in the [DOMAIN]. Your task is to generate
competency questions (CQs) based on RDF triples from the ontology schema. A CQ is
a natural language question that can be answered using the information modelled in
the ontology. CQs help define the scope, purpose, and evaluation criteria for the
ontology.")}]</p>
      <p>The system role establishes that the LLM should act as if it were an ontology engineer tasked with
generating CQs from RDF triples, ensuring responses remain focused on ontology design and CQ
generation.
{"role": "user", "content": (
"Here are some examples of ontology triples and the corresponding CQs:
Subject: [EXAMPLE_SUBJECT_1]
Predicate: [EXAMPLE_PREDICATE_1]
Object: [EXAMPLE_OBJECT_1]
Generated question: [EXAMPLE_CQ_1]
--------------
Subject: [EXAMPLE_SUBJECT_n]
Predicate: [EXAMPLE_PREDICATE_n]
Object: [EXAMPLE_OBJECT_n]
Generated question: [EXAMPLE_CQ_n]
Now, based on the RDF triple below, generate one or more relevant CQs:
Subject: {subject}
Predicate: {predicate}
Object: {object}")}</p>
      <p>The user role feeds the LLM with sample RDF triples with their corresponding CQs, then requests
new CQs for a specific triple. This guides the model by providing the expected input-output format and
framing the task. We also experimented with two variants of the user role, by including in the content
1) the ontology URI and 2) the ontology URI and the ontology description. The former simulates
browsing or understanding of the knowledge model and helps evaluate how external context and
schema patterns influence the quality and relevance of the generated CQs.</p>
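      <p>The two variants can be sketched as a single builder that optionally prepends the URI, or the URI together with the description, to the few-shot content (a sketch with illustrative names, not the exact code used in the experiments):</p>
      <p>
```python
# Sketch: assemble the user-role message for the three few-shot variants.
# Function and variable names are illustrative assumptions.
def user_message(examples_block, triple, uri=None, description=None):
    header = ""
    if uri and description:
        header = (f"Derived from a vocabulary for describing {description}.\n"
                  f"Ontology URI: {uri}\n")
    elif uri:
        header = f"Ontology URI: {uri}\n"
    s, p, o = triple
    task = ("Now, based on the RDF triple below, generate one or more relevant CQs:\n"
            f"Subject: {s}\nPredicate: {p}\nObject: {o}")
    return {"role": "user", "content": header + examples_block + "\n" + task}

examples = ("Subject: Game\nPredicate: hasGenre\nObject: Genre\n"
            "Generated question: What is the genre of a given game?")
msg = user_message(examples, ("Virtuosity", "subClassOf", "Achievement"),
                   uri="http://example.org/videogame#")
```
      </p>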
      <p>The latter includes the ontology description to provide additional context. This enhancement enables
the language model to better understand the purpose, scope, and semantics of the ontology, leading to
more accurate and meaningful CQs.
{"role": "user", "content": (
"Below are some examples of ontology schema-based competency question generation,
derived from a vocabulary for describing [ONTOLOGY_DESCRIPTION]. The Ontology defines
terms for describing [EXAMPLES_OF_CONTENT], including their relationships and properties.
Ontology URI: [ONTOLOGY_URI]
Subject: [EXAMPLE_SUBJECT_1]
Predicate: [EXAMPLE_PREDICATE_1]
Object: [EXAMPLE_OBJECT_1]
Generated question: [EXAMPLE_CQ_1]
--------------
Subject: [EXAMPLE_SUBJECT_n]
Predicate: [EXAMPLE_PREDICATE_n]
Object: [EXAMPLE_OBJECT_n]
Generated question: [EXAMPLE_CQ_n]
Now, based on the RDF triple below, generate one or more relevant CQs:
Subject: {subject}
Predicate: {predicate}
Object: {object}")}</p>
      <p>We evaluate selected LLMs that were used in earlier work: for this exploratory study, we limit
our selection to one closed-source model and one open-source model, i.e. gpt-4 and Flan-T5.<sup>1</sup> The
configurations of these LLMs use default settings to maintain comparability with zero-shot prompting
in previous studies. For the dataset in this experiment, we selected three ontologies previously used
in [18, 20, 21]. Two of these (Video Game and VICINITY Core) were sourced from the CORAL
repository [30], while the third (African Wildlife) was used in [31]. The ontologies were randomly
selected from those that satisfy the following criteria: (i) the ontologies were produced by different
developers (CQ style); (ii) they represent various domains (diversity); and (iii) each had a significant
number of published CQs (significance).</p>
      <p>To validate the candidate CQs generated by our approach, we compare them against the original
baseline CQs for each ontology. This comparison is performed by embedding both the original and
candidate CQs using SBERT [32], and then computing the cosine similarity between their embedding
vectors. We report performance metrics similar to those used in previous studies; in particular, the
precision (Prec) and recall (Rec) metrics are based on determining the number of CQs in the relevant
original baseline CQ dataset, BCQ, and the candidate CQ set, CCQ, which corresponds to the filtered
CQs generated by the LLMs. The metrics used are given below, where TP ⊆ CCQ is the set of candidate
CQs that are assessed as having a similar meaning to those in the baseline set BCQ according to SBERT,
such that the cosine similarity is ≥ 0.7 (i.e. the true positives); and the set of unmatched CQs, FN, is the
relative complement of the set TP with respect to the set of baseline CQs, such that FN = BCQ ∖ TP
(i.e. the CQs in the baseline set that are not in the set of true positives). The similarity threshold, 0.7,
was experimentally determined as the most discriminant, whilst allowing some variance between the
questions.</p>
      <p>Precision (Prec): This is the ratio of the number of True Positives (|TP|) to the sum of both True
Positives and False Positives, i.e. all of the Candidate CQs (|CCQ|):
Prec = |TP| / |CCQ| (1)</p>
      <p>Recall (Rec): Also known as sensitivity, this is the ratio of the number of True Positives (|TP|) to the
sum of True Positives (|TP|) and False Negatives (|FN|), corresponding to the unmatched CQs in
the baseline dataset:
Rec = |TP| / (|TP| + |FN|) (2)</p>
      <p><sup>1</sup> Comprehensive experimentation across a broader range of LLMs is left for future work.</p>
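      <p>The matching and metric computation can be sketched as follows, with toy two-dimensional vectors standing in for SBERT embeddings (function names and vectors are illustrative):</p>
      <p>
```python
# Sketch of the evaluation: a candidate CQ counts as a true positive when its best
# cosine similarity to any baseline CQ reaches the 0.7 threshold; a baseline CQ that
# no candidate matches is a false negative. Toy vectors stand in for SBERT embeddings.
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def prec_rec(candidate_vecs, baseline_vecs, threshold=0.7):
    tp = [c for c in candidate_vecs
          if any(cosine(c, b) >= threshold for b in baseline_vecs)]
    fn = [b for b in baseline_vecs
          if not any(cosine(c, b) >= threshold for c in candidate_vecs)]
    precision = len(tp) / len(candidate_vecs)
    recall = len(tp) / (len(tp) + len(fn))
    return precision, recall

baseline = [[1.0, 0.0], [0.0, 1.0]]                   # embeddings of published CQs
candidates = [[1.0, 0.0], [1.0, 5.0], [-1.0, -1.0]]   # embeddings of generated CQs
prec, rec = prec_rec(candidates, baseline)
```
      </p>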
    </sec>
    <sec id="sec-5">
      <title>4. Results</title>
      <p>This section presents a comparative evaluation of RETROFIT-CQ using both zero-shot and few-shot
prompting strategies with two LLMs: GPT-4 and Flan-T5. In the few-shot setting, we examine three
types of contextual input, each incrementally building on the previous one during prompting: (i) CQ
examples, (ii) CQ examples with an ontology URI, and (iii) CQ examples with an ontology URI and a
description. The evaluation spans three ontologies (Video Game, African Wildlife, and Vicinity Core),
and the generated CQs are compared against a gold-standard set of existing CQs for each ontology,
measuring Precision (Prec) and Recall (Rec).</p>
      <sec id="sec-5-1">
        <title>4.1. Few-Shot Prompting Results</title>
        <p>Video Game Ontology GPT-4 outperforms Flan-T5 across all prompting contexts. Its highest score
comes with CQ examples+URI+description input (Prec = 0.9921, Rec = 0.9766), indicating well-balanced
performance. Flan-T5 performs best with CQ examples (Prec = 0.9825, Rec = 0.9333), followed closely by
CQ examples+URI+description (Prec = 0.9298, Rec = 0.9298). Its weakest performance occurs with the
CQ examples+URI (Prec = 0.8596), highlighting the model’s reliance on richer input. GPT-4, by contrast,
consistently achieves high precision and recall across all contexts, demonstrating greater robustness to
input variation.</p>
        <p>African Wildlife Ontology Both models perform strongly on this ontology, likely due to its semantic
simplicity. GPT-4 achieves perfect Rec (1.0) across all settings and its best Prec (0.9691) with CQ examples.
Differences across contexts are minimal, indicating GPT-4’s ability to generalise with limited structured
input. Flan-T5 also achieves its highest Prec (0.9130) with CQ examples. While all settings yield
perfect recall (1.0), precision varies, suggesting that the model sometimes generates more plausible but
incorrect CQs when aiming for high coverage.</p>
        <p>Vicinity Core Ontology This ontology proves most challenging, likely due to its complexity and
specificity. Flan-T5’s precision drops significantly, from 0.6603 with CQ examples to 0.6055 when
CQs are combined with URIs. This indicates that adding URIs may introduce noise or irrelevant
information, reducing the model’s ability to focus on the core question pattern. GPT-4 remains more
stable, with Prec ranging from 0.8274 to 0.8553 across prompting strategies. Its highest score (Prec =
0.8553) occurs with the CQ examples+URI+description input, suggesting a strong ability to integrate
structured inputs like URIs with accompanying natural language. This reflects GPT-4’s capacity to
generalise from prior exposure to symbolic and descriptive patterns, even without access to external
web data or real-time resolution.</p>
        <p>Table 2 shows example CQs generated for the triple (Virtuosity subClassOf Achievement) in the Video
Game Ontology using GPT-4.<sup>2</sup> The quality of generated CQs improves with the richness of contextual
input. With CQ examples alone, the output tends to be broad and generic. Adding the ontology URI
leads to better structural alignment, while combining the URI with a textual description results in
the most accurate and semantically grounded CQs. These findings emphasise the value of context in
helping LLMs produce high-quality CQs for ontology engineering.</p>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Few-Shot vs. Zero-Shot Performance Comparison</title>
        <p>Flan-T5 benefits substantially from few-shot prompting, particularly for complex ontologies
(e.g., the Vicinity Core ontology). In the zero-shot setting, the model tends to overgenerate, often producing
irrelevant or loosely related CQs. Few-shot examples help anchor the output to relevant patterns. For
instance: (1) Video Game Ontology: Prec improves from 0.86 → 0.98. (2) African Wildlife Ontology:
Prec increases from 0.70 → 0.91. (3) Vicinity Core Ontology: Prec improves from 0.59 → 0.66, despite
the domain’s complexity.</p>
        <p>GPT-4 performs well even without examples, but benefits from a few-shot prompt in terms of reducing
overgeneration. Recall remains high in the zero-shot setting, though precision may decline due to
the inclusion of loosely relevant responses. Supplying a few examples helps the model respond more
precisely. For example: (1) Video Game Ontology: Zero-shot Rec is nearly perfect (0.9976), but Prec is
low (0.7122). With few-shot prompting, Prec rises to 0.9921. (2) Vicinity Core Ontology: GPT-4 generates
over 5,000 candidate CQs in zero-shot mode (CCQ in Table 1) with only 0.6869 Prec. Few-shot guidance
improves Prec to 0.8553.</p>
        <p>While GPT-4 exhibits strong baseline performance in both settings, both models benefit significantly
from few-shot prompting. The improvements are especially pronounced in terms of precision and
semantic relevance, demonstrating that structured examples and contextual enrichment are critical for
generating high-quality CQs in knowledge engineering tasks.</p>
        <p><sup>2</sup> All results are available at https://github.com/Li563313/Question-generation.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Discussion</title>
      <p>This study presents one of the first comparative analyses of few-shot and zero-shot prompting techniques
applied to CQ generation for the construction of ontologies and KGs. While prior CQ generation
efforts have predominantly relied on zero-shot prompting [15, 18, 17, 19, 16, 25], few-shot prompting
has been shown to enhance question quality, particularly in educational and professional training
settings [26, 28, 27]. Motivated by these findings, we investigated whether similar improvements hold
for CQ generation in ontology engineering: specifically, whether the choice of prompting technique
affects the quality of generated CQs, and to what extent.</p>
      <p>Our analysis indicates that few-shot prompting enhances RETROFIT-CQ’s performance compared
to zero-shot by reducing the noise in generated CQs. That is, it results in more concise, relevant, and
context-specific questions rather than overly general ones. This improvement is most clearly reflected
in the increased precision scores shown in Table 3. Notably, while recall remains high (close to 1)
in both prompting setups – indicating that both techniques are capable of capturing the majority of
ground-truth CQs – few-shot prompting mitigates the problem of overgeneration, particularly in large
or complex ontologies like the Vicinity Core ontology. This suggests that few-shot prompting not only
improves precision but also controls verbosity and enhances semantic relevance.</p>
      <p>The effectiveness of few-shot prompting also varies depending on the underlying LLMs and the type of
contextual information provided. For instance, Flan-T5, an open-source model, performs inconsistently
across different contexts. Its precision drops significantly when an ontology URI is provided, such as
in the case of the Video Game ontology (Table 1), where precision is 0.8596. In contrast, performance
improves when an ontology description is included (0.9298), and it is highest when only example CQs
are used (0.9825). This discrepancy likely stems from Flan-T5’s limitations: it is a static model without
internet access, and thus cannot interpret URIs or resolve them to meaningful content.</p>
      <p>A similar pattern is observed with GPT-4. Despite its superior capabilities, we did not enable internet
access (e.g., through search APIs or web retrieval mechanisms).<sup>3</sup> This is because we wanted to maintain
the same settings in order to compare the results against our previous paper [12]. Consequently, GPT-4
also struggles when presented with ontology URIs. This underperformance highlights a broader issue:
without contextual enrichment—such as ontology descriptions or structured example questions—even
advanced LLMs are unable to infer the intended semantics of domain-specific identifiers. Therefore, the
presence of well-crafted contextual information in few-shot prompts plays a critical role in generating
high-quality CQs for knowledge engineering.</p>
      <p>Despite the improvements seen with few-shot prompting, this technique raises one major concern:
we must consider whether the gains in CQ quality justify the additional computational cost and time
associated with few-shot setups, especially when using closed-source LLMs like GPT-4. The need for
structured context (e.g., examples or descriptions) not only increases the complexity of the prompting
pipeline but may also limit scalability in real-world ontology engineering tasks.</p>
      <p>Whether the use of few-shot prompting is warranted for CQ generation, given the trade-offs in time,
resources, and deployment complexity, remains an open question. While high-quality, semantically rich
questions are undoubtedly valuable in domains such as education and assessment, where understanding
and precision are paramount, it remains unclear whether similar characteristics are always
necessary in the broader field of ontology engineering. More empirical studies are needed to assess
whether the added precision translates to tangible benefits in ontology design, validation, or reuse.</p>
      <p><sup>3</sup> https://platform.openai.com/docs/guides/function-calling?api-mode=responses</p>
      <p>Moreover, future work should include a broader range of LLMs (both open-source and closed-source)
and ontologies of varying complexity. This would allow us to assess the generalisability of our
findings and better understand how prompting strategies interact with model architecture, domain
specificity, and the nature of the task.</p>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusions</title>
      <p>This study provides the first comparative analysis of zero-shot and few-shot prompting for CQ generation
in ontology engineering. Our results show that few-shot prompting improves the quality of generated
CQs, particularly in terms of precision and relevance, by providing structured context that helps guide
the model. However, this improvement comes with increased resource demands, raising important
questions about cost-effectiveness in real-world applications. While few-shot prompting is valuable
in domains requiring high precision, such as education or formal assessment, its broader utility in
ontology engineering warrants further investigation. Future work should explore this trade-off across
more models, domains, and ontology types to better understand when and where few-shot prompting
truly adds value.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The author used ChatGPT to improve the manuscript’s readability and subsequently reviewed and
edited the content, taking full responsibility for the final published article.</p>
      <p>[10] S. Azzi, A. Assi, S. Gagnon, Scoring ontologies for reuse: An approach for fitting semantic
requirements, in: Proceedings of the Research Conference on Metadata and Semantic Research,
MTSR 2022, Springer Nature, 2023, pp. 203–208.
[11] C. M. Keet, Z. C. Khan, On the roles of competency questions in ontology engineering, in:
Proceedings of the 24th International Conference on Knowledge Engineering and Knowledge
Management, EKAW2024, Springer Nature Switzerland, Cham, 2025, pp. 123–132.
[12] R. Alharbi, J. de Berardinis, F. Grasso, T. Payne, V. Tamma, Characteristics and desiderata for
competency question benchmarks, in: Proceedings of the 23rd International Semantic Web
Conference, ISWC, Lecture Notes in Computer Science, Springer, 2024. To appear.
[13] R. Alharbi, V. Tamma, F. Grasso, T. R. Payne, A review and comparison of competency question
engineering approaches, in: Proceedings of the 24th International Conference on Knowledge
Engineering and Knowledge Management, EKAW2024, Springer Nature Switzerland, Cham, 2025,
pp. 271–290.
[14] L. Rao, H. Reichgelt, K. Osei-Bryson, Knowledge elicitation techniques for deriving competency
questions for ontologies, in: Proceedings of the Tenth International Conference on Enterprise
Information Systems (ICEIS 2008), volume ISAS-2, Barcelona, Spain, 2008, pp. 105–110.
[15] M.-J. Antia, C. M. Keet, Automating the generation of competency questions for ontologies
with agocqs, in: Knowledge Graphs and Semantic Web, Springer Nature Switzerland, Cham, 2023,
pp. 213–227.
[16] F. Ciroku, J. de Berardinis, J. Kim, A. Meroño-Peñuela, V. Presutti, E. Simperl, Revont: Reverse
engineering of competency questions from knowledge graphs via language models, Journal of
Web Semantics 82 (2024) 100822. doi:https://doi.org/10.1016/j.websem.2024.100822.
[17] B. Zhang, V. A. Carriero, K. Schreiberhuber, S. Tsaneva, L. S. González, J. Kim, J. de Berardinis,
OntoChat: A framework for conversational ontology engineering using language models, in:
Proceedings of the 21st Extended Semantic Web Conference, ESWC, Springer Nature Switzerland,
Cham, 2025, pp. 102–121.
[18] R. Alharbi, V. Tamma, F. Grasso, T. R. Payne, An experiment in retrofitting competency questions for
existing ontologies, in: Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing,
SAC ’24, 2024, pp. 1650–1658. doi:10.1145/3605098.3636053.
[19] Y. Rebboud, L. Tailhardat, P. Lisena, R. Troncy, Can LLMs generate competency questions?, in:
Extended Semantic Web Conference, ESWC2024, Hersonissos, Greece, 2024.
[20] R. Alharbi, V. Tamma, F. Grasso, T. R. Payne, The role of generative AI in competency question
retrofitting, in: Proceedings of the 21st Extended Semantic Web Conference, ESWC2024, Springer
Nature Switzerland, Cham, 2025, pp. 3–13.
[21] R. Alharbi, V. Tamma, F. Grasso, T. R. Payne, Investigating open source LLMs to retrofit
competency questions in ontology engineering, Proceedings of the AAAI Symposium Series 4 (2024)
188–198. URL: https://ojs.aaai.org/index.php/AAAI-SS/article/view/31793. doi:10.1609/aaaiss.v4i1.31793.
[22] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A systematic
survey of prompting methods in natural language processing, ACM Computing Surveys 55 (2023).
doi:10.1145/3560815.
[23] G. Marvin, N. Hellen, D. Jjingo, J. Nakatumba-Nabende, Prompt engineering in large language
models, in: Proceedings of the Data Intelligence and Cognitive Informatics conference, Springer
Nature Singapore, 2024, pp. 387–402.
[24] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,
G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh,
D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark,
C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot
learners, in: Proceedings of the 34th International Conference on Neural Information Processing
Systems, NIPS ’20, Curran Associates Inc., Red Hook, NY, USA, 2020.
[25] X. Pan, J. v. Ossenbruggen, V. de Boer, Z. Huang, A RAG approach for generating competency
questions in ontology engineering, in: Metadata and Semantic Research, Springer Nature Switzerland,
Cham, 2025, pp. 70–81.
[26] Y. Poon, J. S. Y. Lee, Y. Y. Lam, W. L. Suen, E. L. C. Ong, S. K. W. Chu, Few-shot question generation
for reading comprehension, in: Proceedings of the 10th SIGHAN Workshop on Chinese Language
Processing (SIGHAN-10), Association for Computational Linguistics, Bangkok, Thailand, 2024, pp.
21–27. URL: https://aclanthology.org/2024.sighan-1.3/.
[27] B. Leite, H. Cardoso, On few-shot prompting for controllable question-answer generation in
narrative comprehension, in: Proceedings of the 16th International Conference on Computer
Supported Education - Volume 2: CSEDU, INSTICC, SciTePress, 2024, pp. 63–74. doi:10.5220/0012623800003693.
[28] D. Di Nuzzo, E. Vakaj, H. Saadany, E. Grishti, N. Mihindukulasooriya, Automated generation of
competency questions using large language models and knowledge graphs, in: Proceedings of
the 3rd International Workshop on Natural Language Processing for Knowledge Graph Creation,
NLP4KGC 2024, co-located with 20th International Conference on Semantic Systems (SEMANTiCS
2024), 2024, pp. 128–153.
[29] N. Mulla, P. Gharpure, Automatic question generation: a review of methodologies, datasets,
evaluation metrics, and applications, Progress in Artificial Intelligence 12 (2023) 1–32.
[30] A. Fernández-Izquierdo, M. Poveda-Villalón, R. García-Castro, CORAL: A corpus of ontological
requirements annotated with lexico-syntactic patterns, in: Proceedings of the 16th International
Conference on the Semantic Web, ESWC 2019, 2019, pp. 443–458.
[31] D. Wiśniewski, J. Potoniec, A. Ławrynowicz, C. M. Keet, Analysis of ontology competency
questions and their formalizations in SPARQL-OWL, Journal of Web Semantics 59 (2019) 100534.
[32] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks,
in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP),
Association for Computational Linguistics, 2019, pp. 3982–3992.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Noy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. L.</given-names>
            <surname>McGuinness</surname>
          </string-name>
          ,
          <article-title>Ontology development 101: A guide to creating your first ontology</article-title>
          ,
          <source>Technical Report, Stanford knowledge systems laboratory technical report KSL-01-05</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Poveda-Villalón</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fernández-Izquierdo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fernández-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>García-Castro</surname>
          </string-name>
          ,
          <article-title>LOT: An industrial oriented ontology engineering framework</article-title>
          ,
          <source>Engineering Applications of Artificial Intelligence</source>
          <volume>111</volume>
          (
          <year>2022</year>
          )
          104755. doi:10.1016/j.engappai.2022.104755.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V.</given-names>
            <surname>Presutti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Daga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gangemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Blomqvist</surname>
          </string-name>
          ,
          <article-title>Extreme design with content ontology design patterns</article-title>
          ,
          <source>in: Proceedings of the 2009 International Conference on Ontology Patterns</source>
          , volume
          <volume>516</volume>
          <source>of WOP'09</source>
          ,
          CEUR-WS.org,
          <year>2009</year>
          , pp.
          <fpage>83</fpage>
          -
          <lpage>97</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Sequeda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Briggs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Miranker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. P.</given-names>
            <surname>Heideman</surname>
          </string-name>
          ,
          <article-title>A pay-as-you-go methodology to design and build enterprise knowledge graphs from relational databases</article-title>
          ,
          <source>in: Proceedings of the 18th International Semantic Web Conference, ISWC 2019</source>
          , Springer International Publishing,
          <year>2019</year>
          , pp.
          <fpage>526</fpage>
          -
          <lpage>545</lpage>
          . doi:10.1007/978-3-030-30796-7_32.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Suárez-Figueroa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gómez-Pérez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fernández-López</surname>
          </string-name>
          ,
          <article-title>The neon methodology framework: A scenario-based methodology for ontology development</article-title>
          ,
          <source>Applied Ontology</source>
          <volume>10</volume>
          (
          <year>2015</year>
          )
          <fpage>107</fpage>
          -
          <lpage>145</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bezerra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Freitas</surname>
          </string-name>
          ,
          <article-title>Verifying description logic ontologies based on competency questions and unit testing</article-title>
          ,
          <source>in: Proceedings of the IX Seminar on Ontology Research and I Doctoral and Masters Consortium on Ontologies</source>
          , volume
          <volume>1908</volume>
          ,
          <year>2017</year>
          , pp.
          <fpage>159</fpage>
          -
          <lpage>164</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Keet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ławrynowicz</surname>
          </string-name>
          ,
          <article-title>Test-driven development of ontologies</article-title>
          ,
          <source>in: Proceedings of the 13th International Conference on The Semantic Web, ESWC 2016</source>
          , Springer International Publishing,
          <year>2016</year>
          , pp.
          <fpage>642</fpage>
          -
          <lpage>657</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Alharbi</surname>
          </string-name>
          ,
          <article-title>Assessing candidate ontologies for reuse</article-title>
          ,
          <source>in: Proceedings of the Doctoral Consortium at ISWC</source>
          <year>2021</year>
          (ISWC-DC),
          <year>2021</year>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>72</lpage>
          . URL: https://api.semanticscholar.org/CorpusID:244895203.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Alharbi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tamma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Grasso</surname>
          </string-name>
          ,
          <article-title>Requirement-based methodological steps to identify ontologies for reuse</article-title>
          ,
          <source>in: Intelligent Information Systems</source>
          , Springer Nature Switzerland,
          <year>2024</year>
          , pp.
          <fpage>64</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>