<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>R. Alharbi);</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Methods for Competency Question Elicitation from Ontology Requirements</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Reham Alharbi</string-name>
          <email>R.Alharbi@liverpool.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jacopo de Berardinis</string-name>
          <email>Jacopo.De-Berardinis@liverpool.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Terry R. Payne</string-name>
          <email>T.R.Payne@liverpool.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valentina Tamma</string-name>
          <email>V.Tamma@liverpool.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Liverpool</institution>
          ,
          <addr-line>Liverpool</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Competency Questions (CQs) are used to guide ontology development, yet formulating them in such a way as to align them to the stakeholder needs remains challenging. This paper presents a comparative analysis of three CQ elicitation methods: manual authoring by ontology engineers; template-based instantiation; and automated generation using different LLMs (GPT-4.1, Gemini 2.5). Each CQ is evaluated across dimensions of suitability, readability, and complexity. To facilitate this evaluation we introduce AskCQ, a dataset of 204 CQs derived from a shared user story in the cultural heritage domain. Our results show that manually authored CQs are consistently more acceptable, readable, and concise. LLM-generated CQs are more complex and diverse but require refinement. These findings highlight the importance of human expertise and suggest potential hybrid approaches.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>To investigate the impact of different Competency Question (CQ) elicitation strategies on the
characteristics of the resulting questions, we conducted a comparative analysis across three representative
approaches: fully manual authoring; template-based instantiation; and automatic generation via Large
Language Models (LLMs). Each approach was applied independently to the same ontology requirement
source to ensure a fair and controlled comparison.</p>
      <p>
        1. Manual (Human-Authored): Two ontology engineers (HA-1 and HA-2), each with over five
years of professional experience in ontology design and requirement engineering, independently
read and interpreted the same user story. Based solely on their expert understanding of the
personas, goals, and informational needs described therein, each formulated a set of CQs without
constraints on format or style. This condition serves as the expert-driven baseline and reflects
common manual practice in ontology development.
2. Template-Based (Pattern Instantiation): An ontology engineer with similar domain
experience instantiated a curated set of 19 CQ patterns derived from Ren et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. These patterns use
archetypal structures such as “Which [CE1] [OPE] [CE2]?” and “Is the [CE1] [CE2]?” and were
manually populated with entities and relations extracted from the user story. The instantiation
process required the identification of suitable fillers from the story content, and their mapping
to the syntactic slots defined by the patterns. This semi-automated method offers structured
linguistic support but limited flexibility.
3. LLM-Based (Generative AI): Two state-of-the-art LLMs — GPT-4.1 and Gemini 2.5 Pro — were
prompted to generate CQs directly from a markdown-formatted version of the user story. Prompts
were intentionally minimal and neutral: no explicit instructions were given regarding CQ format,
number, or examples, to avoid priming or biasing the output. This open-ended configuration
was intended to test each model’s intrinsic ability to extract ontology-relevant requirements and
phrase them as competency questions.
      </p>
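      <p>The pattern-instantiation step described above can be sketched in a few lines. The two patterns below are those quoted from Ren et al.; the entity and relation fillers are hypothetical illustrations, not CQs from the AskCQ dataset.</p>

```python
# Sketch of CQ pattern instantiation (illustrative only).
# Slots follow the paper's notation: CE = class expression, OPE = object
# property expression. The fillers below are invented examples; the real
# study mapped entities and relations extracted from the user story.

PATTERNS = [
    "Which {CE1} {OPE} {CE2}?",
    "Is the {CE1} {CE2}?",
]

def instantiate(pattern: str, **fillers: str) -> str:
    """Populate a CQ pattern's syntactic slots with story-derived fillers."""
    return pattern.format(**fillers)

cq1 = instantiate(PATTERNS[0], CE1="memorabilia items",
                  OPE="are on loan to", CE2="partner institutions")
cq2 = instantiate(PATTERNS[1], CE1="acquisition record", CE2="complete")
print(cq1)  # Which memorabilia items are on loan to partner institutions?
print(cq2)  # Is the acquisition record complete?
```

      <p>Framing each pattern as a format string makes the limited flexibility of this method visible: every generated CQ inherits the fixed archetypal syntax, and only the slot fillers vary.</p>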
      <sec id="sec-2-1">
        <title>2.1. AskCQ Dataset Construction</title>
        <p>
          All three approaches were applied to the same textual requirement source: a detailed user story developed
for a cultural heritage ontology use case. The story, adapted from the methodology proposed by de
Berardinis et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], is centred on two personas, a music archivist and a curator, and describes their
activities and data needs relating to a museum’s music memorabilia collection, including acquisition,
loan, metadata management, and display. The output comprises five CQ sets: HA-1 (44 CQs), HA-2 (54
CQs), Pattern (38 CQs), GPT-4.1 (26 CQs), and Gemini 2.5 Pro (42 CQs), totalling 204 distinct questions.1
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Evaluation Dimensions and Feature Extraction</title>
        <p>
          To assess the quality and characteristics of the generated CQs, we adopted a multi-dimensional,
mixedmethods evaluation framework encompassing both qualitative expert judgment and quantitative feature
analysis: CQ suitability, structural and semantic properties, and inter-method agreement.
1. Suitability (Expert Evaluation): Each CQ was independently reviewed by three ontology experts,
who rated its acceptability for guiding ontology engineering in the context of the user story. Scores
ranged from -3 (unanimous rejection) to +3 (unanimous acceptance). The experts were not provided
with explicit criteria to preserve their interpretive autonomy, analogous to the elicitation setup. A
Fleiss’ Kappa of  = 0.35 indicated fair inter-expert agreement.
2. Readability: Each CQ was assessed to gauge its ease of understanding. We assess readability in a
similar way to Ciroku, et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], where a suite of established readability indices designed to capture
diferent aspects of textual dificulty were initially computed for each CQ using the textstat Python
1The resulting AskCQ dataset is publicly released under a CC-BY license, and all CQs were anonymized and randomly shufled
prior to evaluation to minimize bias regarding their origin.
library.2 In this paper we report only the Flesch-Kincaid Grade Level (FKGL) and the Dale-Chall
Readability Score (DCR) as representative readability features.
        </p>
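        <p>For illustration, FKGL can be computed directly from its published formula (the study used the textstat implementation; the sketch below uses a deliberately crude vowel-group syllable heuristic, so its scores only approximate textstat's).</p>

```python
import re

def naive_syllables(word: str) -> int:
    # Rough heuristic: count contiguous vowel groups (textstat uses a
    # more careful syllable counter; this is an approximation).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text: str) -> float:
    # Flesch-Kincaid Grade Level (Kincaid et al., 1975):
    # 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(naive_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

print(round(fkgl("Which items are currently on display?"), 2))
```

        <p>As the footnote notes, such scores are calibrated on running prose, so for short interrogative CQs they are meaningful only comparatively, not as absolute grade levels.</p>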
        <p>
          • Flesch-Kincaid Grade Level (FKGL) — Estimates the education level (U.S. grade) required for
comprehension [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
• Dale-Chall Readability Index (DCR) — Penalizes complex vocabulary based on a restricted
list of familiar words [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
3. Relevance: The alignment of each CQ with the user story was assessed by Gemini 2.5
Pro and rated on a 4-point Likert scale using the following criteria: (4) directly stated in the story,
(3) inferable and necessary, (2) tangentially relevant, (1) off-topic. The evaluation prompt was carefully
designed and spot-validated on a selected sample of CQs.
4. Complexity: The following four complementary metrics were defined to quantify different facets
of CQ complexity:
• c0 (Length): The total number of characters, as a coarse indicator of verbosity and potential
elaboration.
• c1 (Requirement Complexity): The number of distinct concepts, properties, relations, and
filters identified in the CQ by Gemini 2.5 Pro.
• c2 (Linguistic Complexity): A count of syntactic and lexical features (noun phrases, verbs,
prepositions, modifiers, etc.) extracted via spaCy.
• c3 (Syntactic Complexity): The structural depth and richness of the dependency parse tree,
including depth, node count, and key dependency types. These metrics were selected from the
linguistic complexity heuristics from the Universal Dependency set [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
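        <p>As a dependency-free illustration of the c3 features, the sketch below computes structural depth and node count on a hard-coded head-index parse of "Which items are on display?" (the study derives such parses from spaCy; the head assignments here are an assumed example).</p>

```python
# Toy dependency parse as token -> head index (the root points to itself).
# Tokens: 0=Which 1=items 2=are 3=on 4=display 5=?
heads = {0: 1, 1: 2, 2: 2, 3: 2, 4: 3, 5: 2}

def depth(tok: int) -> int:
    """Number of arcs from a token up to the root."""
    d = 0
    while heads[tok] != tok:
        tok = heads[tok]
        d += 1
    return d

tree_depth = max(depth(t) for t in heads)  # structural depth of the parse
node_count = len(heads)                    # size of the parse tree
print(tree_depth, node_count)              # 2 6
```

        <p>With spaCy, the same quantities would be read off each token's head attribute; a deeper, wider tree signals a more syntactically elaborate CQ.</p>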
        <p>
          Overall, these four dimensions are expected to provide complementary perspectives on CQ complexity.
A CQ might be semantically complex (e.g., requiring navigation of intricate partonomy or causality
relations) yet linguistically simple (e.g., “What caused this event?”), scoring high on requirement metrics
but low on linguistic/syntactic ones. Conversely, a CQ might lack ontological complexity but be phrased
using complex sentence structures, thereby scoring high on syntactic metrics but low on semantic ones.
5. Semantic Overlap: To analyse the semantic characteristics of CQ sets generated by different
approaches, we conducted a study on their embeddings. This utilised Sentence-BERT embeddings from
the all-MiniLM-L6-v2 model [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], which generates vectors e ∈ ℝ^384 capturing the semantic meaning
of each CQ (this method follows that adopted by several other studies [
          <xref ref-type="bibr" rid="ref3 ref5">3, 5</xref>
          ]). Furthermore, to identify
semantically equivalent CQs, a pre-defined similarity threshold (τ = 0.75) was determined empirically.
This study quantifies the semantic overlap between pairs of CQ sets (e.g., Set A ↔ Set B). For each pair,
we denote n_A = |A| and n_B = |B| as the number of CQs in each set, respectively, and measure:
• Centroid cosine similarity. The cosine similarity between the centroids of Set A and Set B
provides a measure of the overall alignment of their central semantic representation. A score
closer to 1 indicates that the two sets are, on average, focused on similar concepts.
• Coverage analysis. We measured how well one set covers the semantic content of another.
        <p>This was performed in both directions, i.e. for the coverage of Set A by Set B (Set A ← Set B) we
determine:
– Mean Maximum Similarity (MMS). For each CQ embedding e_{A,i} in Set A, its maximum
cosine similarity to any CQ embedding in Set B, s_i^{A→B} = max_j cos(e_{A,i}, e_{B,j}), was identified.
The mean of these s_i^{A→B} scores (and the standard deviation) indicates how well each CQ in
Set A is semantically represented by its closest counterpart in Set B. A higher mean suggests
stronger semantic parallels offered by Set B.
2 Scores were computed using the textstat Python library and interpreted comparatively, given the short, interrogative
nature of CQs.</p>
        <p>The same metrics were computed for the coverage of Set B by Set A.
• Bidirectional coverage: This symmetric metric quantifies the overall mutual semantic overlap.</p>
        <p>It was calculated as (n_cov^{A→B} + n_cov^{B→A}) / (n_A + n_B), where n_cov^{A→B} is the
number of CQs in Set A covered by Set B (i.e., s_i^{A→B} ≥ τ), and n_cov^{B→A} is the number of CQs
in Set B covered by Set A. Hence, a higher percentage indicates greater shared conceptual space between
the two sets.</p>
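        <p>The overlap measures above reduce to a few lines of code. In the sketch below, the toy 3-dimensional vectors are illustrative stand-ins for the 384-dimensional Sentence-BERT embeddings, and τ = 0.75 is the paper's threshold.</p>

```python
from math import sqrt

TAU = 0.75  # empirically determined similarity threshold from the paper

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def mms(set_a, set_b):
    # Mean Maximum Similarity: average best-match similarity of A's CQs in B.
    best = [max(cos(e_a, e_b) for e_b in set_b) for e_a in set_a]
    return sum(best) / len(best)

def bidirectional_coverage(set_a, set_b, tau=TAU):
    # (n_cov A->B + n_cov B->A) / (n_A + n_B)
    cov_a = sum(max(cos(e_a, e_b) for e_b in set_b) >= tau for e_a in set_a)
    cov_b = sum(max(cos(e_b, e_a) for e_a in set_a) >= tau for e_b in set_b)
    return (cov_a + cov_b) / (len(set_a) + len(set_b))

# Toy embeddings standing in for two CQ sets.
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
B = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
print(round(cos(centroid(A), centroid(B)), 3))  # centroid similarity
print(round(bidirectional_coverage(A, B), 2))   # shared conceptual space
```

        <p>This toy pair mirrors the pattern reported in the results: a moderate centroid similarity can coexist with low coverage, since averaging hides how few individual CQs have a close counterpart in the other set.</p>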
        <p>Together, these dimensions provide a comprehensive, multidimensional view of the suitability,
expressiveness, and diversity of CQs produced by different elicitation methods, grounded in both expert
assessment and computational analysis.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Overview of Evaluation Outcomes</title>
      <p>The results for the expert evaluation clearly favoured manually authored CQs. The manual method
(HA-1 and HA-2) achieved a mean suitability score of 2.65, with 94.5% of questions accepted by a
majority of annotators. This indicates that domain experts are highly efective at producing suitable
CQs. The LLM-based methods (GPT-4.1 and Gemini) achieved an average score of 1.24 with 76.0%
acceptance, suggesting moderate reliability. Pattern-based CQs scored lowest, with a mean suitability
of 0.11 and only 50% acceptance. For readability, human-authored CQs had the lowest FKGL and DCR
scores, indicating clearer phrasing. GPT-4.1 generated the most complex and least readable CQs (FKGL
11.64). LLM-generated CQs were also significantly longer (c0), richer in ontological references (c1), and
more syntactically complex (c3) than manual or pattern-based ones. Figure 1 consolidates our findings,
showing distinct feature profiles for the CQs generated by each elicitation approach.3</p>
      <p>The pairwise comparison results (Table 1) show that the cosine similarities between the centroids of
most pairs are relatively high (typically ranging from 0.61 to 0.85). This suggests that, at a high level,
all sets tend to address the same core thematic area defined by the user story. The lowest centroid
similarities were observed in comparisons involving Gemini (e.g., 0.61 with HA-2), indicating its central
theme might be slightly more distinct than the other sets.</p>
      <p>[Figure 1: CQ feature profiles per elicitation method, comparing the requirement (c1), linguistic (c2), and syntactic (c3) complexity metrics.]</p>
      <p>Despite these relatively high centroid similarities, the specific semantic coverage between sets is low,
denoting high degrees of novelty, i.e. a high number of CQs not previously generated. The percentage
of CQs in one set covered by another (i.e., having a CQ in the other set with similarity ≥ 0.75) is
consistently below 21%, and often below 10%. Between the two human annotators (HA-1 ↔ HA-2),
who shared a high centroid similarity (0.82), HA-2 covered 20.5% of HA-1’s CQs, and HA-1 covered 11.1%
of HA-2’s CQs (HA-2 has 10 more CQs than HA-1), yielding a bidirectional coverage proportion of 15.3%.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion and Conclusion</title>
      <p>Our findings suggest that CQs manually crafted by ontology engineers tend to demonstrate the
highest suitability for ontology engineering, as they achieve better readability and lower complexity, and
uniquely capture inferential requirements (implicit functional requirements) essential for robust ontology
design. While LLMs can produce relevant and thematically coherent outputs, the resulting CQs exhibited
higher complexity and lower readability, and their semantic coverage, though broad, showed limited
overlap both with the human-generated CQs and between the two models. These results suggest that
while LLMs can provide reasonable CQs, these are not comparable to expert-authored ones, and that
human expertise remains critical in Ontology Engineering. Crucially, our insights on CQ characteristics
and the limitations of current automated approaches can be leveraged to directly inform and improve
their elicitation methods, aiming to better align their outputs with the desiderata of ontology engineers.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>
        Generative AI was only used in the experiments described in the paper, and no Gen AI tool was used to
compose or edit the text.
3 A full discussion of the results and analysis for this evaluation is available in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G. K. Q.</given-names>
            <surname>Monfardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Salamon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Barcellos</surname>
          </string-name>
          ,
          <article-title>Use of competency questions in ontology engineering: A survey</article-title>
          ,
          <source>in: Proc. of the Conceptual Modeling: 42nd International Conference</source>
          , ER, Springer-Verlag,
          <year>2023</year>
          , p.
          <fpage>45</fpage>
          -
          <lpage>64</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Keet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. C.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <article-title>Discerning and characterising types of competency questions for ontologies</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2412.13688. arXiv:2412.13688.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Alharbi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tamma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Grasso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Payne</surname>
          </string-name>
          ,
          <article-title>A review and comparison of competency question engineering approaches</article-title>
          ,
          <source>in: Proc. 24th International Conference on Knowledge Engineering and Knowledge Management</source>
          , EKAW, Springer Nature,
          <year>2024</year>
          , pp.
          <fpage>271</fpage>
          -
          <lpage>290</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Parvizi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mellish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Z.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <surname>K. van Deemter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Stevens</surname>
          </string-name>
          ,
          <article-title>Towards competency question-driven ontology authoring</article-title>
          ,
          <source>in: Proc. of the 11th Extended Semantic Web Conference</source>
          , ESWC, Springer International Publishing,
          <year>2014</year>
          , pp.
          <fpage>752</fpage>
          -
          <lpage>767</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ciroku</surname>
          </string-name>
          , J. de Berardinis,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Meroño-Peñuela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Presutti</surname>
          </string-name>
          , E. Simperl,
          <article-title>RevOnt: Reverse engineering of competency questions from knowledge graphs via language models</article-title>
          ,
          <source>Journal of Web Semantics</source>
          <volume>82</volume>
          (
          <year>2024</year>
          )
          <fpage>100822</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. A.</given-names>
            <surname>Carriero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Schreiberhuber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tsaneva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. S.</given-names>
            <surname>González</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          , J. de Berardinis,
          <article-title>Ontochat: A framework for conversational ontology engineering using language models</article-title>
          ,
          <source>in: Proc. of the 21st Extended Semantic Web conference, ESWC</source>
          , Springer Nature Switzerland,
          <year>2025</year>
          , pp.
          <fpage>102</fpage>
          -
          <lpage>121</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>J. de Berardinis</string-name>
          ,
          <string-name>
            <given-names>V. A.</given-names>
            <surname>Carriero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lazzari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Meroño-Peñuela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Poltronieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Presutti</surname>
          </string-name>
          ,
          <article-title>The polifonia ontology network: Building a semantic backbone for musical heritage</article-title>
          ,
          <source>in: Proc. of the 22nd International Semantic Web Conference</source>
          , ISWC, Springer,
          <year>2023</year>
          , pp.
          <fpage>302</fpage>
          -
          <lpage>322</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Kincaid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Fishburne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Rogers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. S.</given-names>
            <surname>Chissom</surname>
          </string-name>
          ,
          <source>Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel</source>
          ,
          <source>Technical Report Research Branch Report 8-75</source>
          , Naval Air Station Memphis, Research Branch,
          Millington, TN,
          <year>1975</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Dale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Chall</surname>
          </string-name>
          , A Formula for Predicting Readability: Instructions, Ohio State University Bureau of Educational Research,
          <year>1948</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>M.-C. de Marneffe</string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Dozat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Silveira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Haverinen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ginter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nivre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>Universal Stanford dependencies: A cross-linguistic typology</article-title>
          ,
          <source>in: Proc. of the Ninth International Conference on Language Resources and Evaluation (LREC'14)</source>
          ,
          <source>European Language Resources Association (ELRA)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>4585</fpage>
          -
          <lpage>4592</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-bert: Sentence embeddings using siamese bert-networks</article-title>
          ,
          <source>in: Proc. of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Alharbi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tamma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Payne</surname>
          </string-name>
          , J. de Berardinis,
          <article-title>A comparative study of competency question elicitation methods from ontology requirements</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2507.02989. arXiv:2507.02989.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>