<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Automatic Evaluation of Questions Generated from Ontologies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Samah Alkhuzaey</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Floriana Grasso</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Terry R. Payne</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valentina Tamma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Liverpool</institution>
          ,
          <addr-line>L69 3BX, Liverpool</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Automatic question generation has emerged as an important field in educational technology. It enables the creation of large question banks for various learning environments. Nevertheless, the predominant reliance on human assessments to evaluate these generated questions hampers scalability and efficiency. To address this challenge, this paper presents an automatic framework that utilises ontological metrics to assess the complexity of questions generated from domain ontologies. The proposed approach is evaluated through an expert-based evaluation. The results reveal a consensus between the complexity scores generated by the framework and the opinions of educational experts, demonstrating the effectiveness of our proposed approach. However, the findings also highlight the need for adjustments to account for certain features that could enhance the accuracy of the proposed model's ratings.</p>
      </abstract>
      <kwd-group>
        <kwd>Question generation</kwd>
        <kwd>ontology</kwd>
        <kwd>evaluation</kwd>
        <kwd>complexity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        relationship between entities. This distances the generation and evaluation process from the
external structure of the question, allowing for more focus on its semantics [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Despite the
potential of using Ontology-based AQG systems, the evaluation of such approaches has to date
primarily relied on human judgment, including the use of expert reviewers [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ], students’
performance [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ], or crowd-sourced evaluations [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. While human judgment provides valuable
insights into the quality and relevance of generated questions, it is subjective, time-consuming,
and resource-intensive. Additionally, the scalability of human-based evaluation is limited, which
hinders the comprehensive assessment of AQG systems across diverse domains and datasets.
By automating the evaluation process, researchers can overcome the limitations of
human-based assessment and improve efficiency, making evaluation efforts more scalable. Additionally,
automated evaluation frameworks offer the potential to provide objective, reproducible, and
quantifiable metrics for accurately measuring the performance of Ontology-based AQG systems.
      </p>
      <p>
        Developing automatic measures to evaluate questions generated from ontologies is practical
due to the structured and standardised nature of ontologies and the inherent consistency
in their hierarchical organisation and relationships. Ontologies are formal representations
of knowledge within a specific domain [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], comprising concepts, relationships, and rules
that define their interaction. The structured nature of ontologies, which typically includes a
hierarchical organisation of concepts and well-defined relationships, facilitates a systematic
approach to question generation. Furthermore, the uniformity in structure across various
ontologies facilitates the development of generalised methods for both question generation
and its evaluation. The uniformity in the structure becomes clear when examining that most
approaches that employ ontologies to generate questions tend to share common features, such
as leveraging basic hierarchical relationships or other semantic elements (e.g. object properties).
This consistency further supports the development of consistent evaluation measures.
      </p>
      <p>In this paper, we investigate the possibility of using an automatic evaluation framework as a
proxy for expert user evaluations when assessing the complexity level of questions generated
from ontologies. The different aspects involved in the construction of ontology-based questions
are evaluated to understand their effect on question complexity. Our approach exploits the
hierarchical structure and standardised relationships inherent in ontologies to establish a consistent
evaluation method. In Section 2, we review similar research on evaluation methodologies in
the field of ontology-based automatic question generation, before presenting our proposed
framework in Section 3, where the theoretical foundations and proposed metrics are detailed in
practical terms. The evaluation methodology is presented in Section 4, followed by preliminary
results of our study in Section 5, and the conclusions in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Although Ontology-based AQG provides an automated approach for creating questions, it still
requires significant input from educational experts to evaluate the generated questions. As with
most natural language generation tasks, human evaluation is considered the benchmark against
which the outcome of the generation process is compared. To assess the quality of questions
generated from knowledge sources such as ontologies, various methods, metrics, and techniques
have been employed. When evaluating the questions, quality can be examined across different
dimensions, such as the question’s structure [
        <xref ref-type="bibr" rid="ref10 ref11 ref5">5, 10, 11</xref>
        ], cognitive level [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], difficulty [
        <xref ref-type="bibr" rid="ref10 ref4">4, 10</xref>
        ],
semantic ambiguity [
        <xref ref-type="bibr" rid="ref12 ref8">12, 8</xref>
        ], practical usefulness in an educational context [
        <xref ref-type="bibr" rid="ref12 ref4 ref5">4, 5, 12</xref>
        ] or overall
acceptability by an expert [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. In general, quality assessment in ontology-based AQG can be
broadly categorised as those related to the language of the question, and those associated with
the question’s cognitive level. Human-based evaluation continues to be widely used for assessing
the effectiveness of Ontology-based AQG systems in both language and cognitive evaluations.
Expert reviewers, who possess domain knowledge or expertise in exam construction, offer
valuable insights into the appropriateness of the generated questions. Students can also be
recruited to provide practical evaluations, allowing for feedback from end-users and reflecting
the usability and comprehensibility of the generated questions in real educational contexts.
      </p>
      <p>
        Experts are typically hired to evaluate questions based on linguistic aspects, such as
grammatical correctness, syntactic consistency, and fluency [
        <xref ref-type="bibr" rid="ref10 ref11 ref14 ref5">5, 10, 11, 14</xref>
        ]. Grammatical correctness
involves assessing whether the questions adhere to grammar rules, ensuring that they are
error-free and clear for learners. Another linguistic metric commonly evaluated is the syntactic
consistency of the questions. This involves ensuring that the questions have consistent syntactic
features, such as the Part of Speech (POS) used. This measure ensures that the questions have a
uniform syntactic structure, as syntactic inconsistencies may confuse learners. In this type of
evaluation, experts are typically presented with a set of generated questions and asked to rate
their quality based on specific criteria using a categorical scale.
      </p>
      <p>
        Human-centred evaluations have been conducted to assess cognitive-level metrics, such
as question difficulty, discrimination, and complexity [
        <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8">5, 6, 7, 8</xref>
        ]. Difficulty and discrimination
are usually measured using standard statistical analysis of responses, employing pedagogical
theories such as Item Response Theory (IRT) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. This involves administering a subset of
the generated questions to students in real or mock exams, where the actual difficulty and
discrimination are calculated and compared with predicted values. Difficulty may be estimated
by domain experts that draw on their knowledge and experience in the field [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]; whereas
complexity measures the inherent complexity of a question based on its structure, the cognitive
processes required to answer it, and the depth of understanding it demands [
        <xref ref-type="bibr" rid="ref16 ref8">8, 16</xref>
        ]. Unlike
statistical difficulty and discrimination, which are heavily influenced by learners’ backgrounds
and knowledge levels, complexity provides an intrinsic measure of a question’s potential to
engage and challenge learners.
      </p>
      <p>
        While human evaluation is commonly seen as the most reliable method for assessing the
quality of generated questions, it is not always practical, especially for systems that generate a
large volume of questions, due to the substantial amount of time and effort needed to manually
evaluate each question. Thus it is necessary to employ automated or semi-automated evaluation
methods to ensure efficient and timely assessment. The fact that the average number of expert
evaluators involved in these studies is typically three can further exacerbate these scalability
issues [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Automatic evaluation techniques have thus gained attention as a means of addressing
these limitations of human-centric assessment. Such methods use computational algorithms to
analyse the generated questions, taking into account factors such as language and cognitive
level. Notable examples include metrics that quantify the similarity of generated questions to
those created by humans [
        <xref ref-type="bibr" rid="ref12 ref17">12, 17</xref>
        ]. Alsubait et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], for example, developed specific similarity
measures for ontologies to assess the similarity of distractors in Multiple Choice Questions
(MCQs). Their assumption is that having similar choices increases the cognitive level required
for learners to find the answer.
      </p>
      <p>Table 1 (fragment): question templates paired with their example questions.
Define &lt;X&gt; → Define a Coastal region
Is &lt;x&gt; an &lt;X&gt;? → Is Nice a coastal city?
Is &lt;x&gt; &lt;P&gt; &lt;y&gt;? → Is Paris the capital of France?
Which of these is &lt;X&gt;: → Which of these is a Rural area: A: Lyon B: Auvergne
What &lt;P1&gt; &lt;y&gt; and &lt;P2&gt; &lt;z&gt;? → Which city has a population of over one million and borders the Mediterranean Sea?
How many &lt;Y&gt; does &lt;x&gt; have? → How many departments are part of Overseas France?
Which &lt;X&gt; has the 2nd most &lt;Value&gt;? → Which city has the 2nd highest population?
Which &lt;Z&gt; share the value of &lt;DP&gt;? → Which cities does the Seine River pass through?</p>
      <p>
        Other studies [
        <xref ref-type="bibr" rid="ref4 ref7">4, 7</xref>
        ] explored various ontological measures to
extract semantic features, such as using entity popularity as a determining factor of difficulty,
as questions containing popular entities were believed to be easier to answer. To measure this,
the authors counted the number of object properties linked to the entity from other individuals
within the ontology. They also proposed measuring question specificity by examining the
depth of a certain concept in the concept and role hierarchy of the domain ontology, as deeper
concepts in the hierarchy result in more difficult questions. Other work [
        <xref ref-type="bibr" rid="ref10 ref18">10, 18</xref>
        ] proposed similar
features, but used a knowledge base to extract these features.
      </p>
      <p>
        The literature suggests that research on automated evaluation methods is constantly evolving;
however, certain approaches are limited by factors such as question types [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] or the specific
attributes of the input ontology [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Challenges still exist in developing comprehensive evaluation
frameworks that can accommodate various question types and ontology structures.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. The Proposed Framework</title>
      <sec id="sec-3-1">
        <title>3.1. Characteristics of Questions Generated from Ontologies</title>
        <p>
          The generation of questions from ontologies typically requires a set of templates (textual or
graph-based) that are instantiated with ontological elements based on rules. The basic building
block behind the instantiation is the existence of Resource Description Framework (RDF)
patterns that facilitate the generation of certain question types. Different templates require
the ontology to contain specific RDF patterns as pre-requisites to generate a required question.
Several of the most common question formats are illustrated in Table 1, together with the
corresponding RDF pattern requirements used in other studies [
          <xref ref-type="bibr" rid="ref13 ref19 ref20">13, 19, 20</xref>
          ], which may vary in
terms of the number of triple patterns they require and the types of properties they include.
        </p>
        <p>
          RDF patterns may involve the utilisation of concepts [
          <xref ref-type="bibr" rid="ref5 ref20">5, 20</xref>
          ], various types of properties [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ],
individuals [<xref ref-type="bibr" rid="ref20">20</xref>], quantifiers [<xref ref-type="bibr" rid="ref19">19</xref>] or constraints¹ [
          <xref ref-type="bibr" rid="ref16 ref21">16, 21</xref>
          ]. Additionally, they may include a single
triple pattern or multiple triple patterns. For example, the first template in Table 1 is used to
generate definition questions where the learner is asked to define a certain domain-specific
term. As a pre-requisite, the ontology must contain its corresponding RDF pattern which
necessitates the existence of an entity of type Class that is annotated with a comment using
&lt;rdfs:comment&gt;. Another template requires the incorporation of entities that are connected
through object- or datatype properties to generate questions that ask about the relationship
between two entities. Such questions have a one-to-one mapping between the number of
relevant triples and generated questions which indicate that these templates rely on explicitly
stated facts. For example, if the ontology includes the triple &lt;Paris, capitalOf, France&gt;,
the system may generate a question based on the Property Assertions template (Table 1) “Is Paris
the capital of France?”. This approach depends on utilising the information available in the
ontology directly, without any additional inference.
        </p>
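        <p>The one-to-one mapping between asserted triples and generated questions can be sketched in Python (an illustrative sketch of the idea, not the authors’ implementation; the verbalisation table is a hypothetical stand-in for a proper template engine):</p>

```python
# Each explicitly asserted triple yields exactly one Property Assertion question.
def property_assertion_question(subject, predicate, obj):
    # Naive verbalisation of predicates; unknown predicates fall back to their raw name.
    verbalised = {"capitalOf": "the capital of"}.get(predicate, predicate)
    return f"Is {subject} {verbalised} {obj}?"

# Two asserted triples produce exactly two questions.
triples = [("Paris", "capitalOf", "France"), ("Nice", "locatedIn", "France")]
questions = [property_assertion_question(*t) for t in triples]
print(questions[0])  # Is Paris the capital of France?
```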
        <p>
          Other templates may include functional properties that impose constraints on the question
[
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] to derive new knowledge and generate questions that inquire about facts that are not
explicitly modelled by the ontology. Functional properties allow the system to infer new facts
that are not explicitly mentioned in the ontology. The constraint-based question in template
07 requires the existence of multiple entities that are connected through a specific datatype
property and applies an ordinal constraint (represented by ordinal or relational operators) to
generate the questions. For example, if the input ontology includes information about the cities
in France and their respective populations, the model can generate a question such as “Which
city in France has the second highest population?” (example question for template 07 in Table 1).
This enhances the capability of question generation by producing more complex questions.
        </p>
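        <p>The behaviour of the ordinal constraint in template 07 can be mimicked in plain Python, where sorting, skipping, and slicing play the roles of the SPARQL modifiers ORDER BY DESC, OFFSET 1, and LIMIT 1 (the population figures below are invented purely for illustration):</p>

```python
# Analogue of template 07: rank entities by a datatype property value,
# then select the 2nd highest (ORDER BY DESC + OFFSET 1 + LIMIT 1).
populations = {"Paris": 2_100_000, "Marseille": 870_000, "Lyon": 520_000}

ranked = sorted(populations, key=populations.get, reverse=True)  # ORDER BY DESC
second_highest = ranked[1:2][0]                                  # OFFSET 1, LIMIT 1
print(second_highest)  # Marseille
```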
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Theoretical Foundation</title>
        <p>There are different perspectives in the pedagogical literature on what constitutes a complex
question. Complexity is often seen as a measure of the cognitive demand placed on the learner
by the question. Ahmed and Pollitt [22] explored the cognitive demands of questions, and
suggested that question complexity is a crucial aspect that increases cognitive demand. They
defined complexity as the number of operations or ideas that need to be considered in order
to arrive at a correct answer. Low-complexity questions involve straightforward ideas (i.e.
recognition of knowledge) and operations that do not require linking them together, whereas
higher-complexity questions require the learner to identify and combine various operations
and concepts and connect them, hence promoting recall and evaluation of knowledge. Studies
examining the gradual progression towards competence found that novice and expert learners
possess different characteristics regarding their level of knowledge and ability to reason about
that knowledge [23, 24]. These studies suggest that assessments need to consider this distinction
in order to distinguish between learners of different mastery levels by constructing questions
that require varying amounts of knowledge and different reasoning abilities. This relates to the
previous definition of complexity, which defines complex questions as those that encompass a
greater number of facts and necessitate higher cognitive skills, such as deduction and reasoning.
(¹ A constraint is defined as a triple in which the subject and object are connected through a functional property.)
To illustrate this, in order to correctly answer the example question for template 07 in Table 1,
a learner needs to understand several semantic relations (i.e. inference steps): 1) the answer
must be a city in France; 2) the answer should have a numerical value indicating its population;
and 3) the learner must select the answer that ranks second in terms of population among all
cities. This contrasts with the example question for template 03 that requires the learner to
recall a single piece of information (i.e. “The capital city of France is Paris”). For a more detailed
discussion of this theoretical backing, see [25].</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Question Complexity Evaluation Framework</title>
        <p>The main objective of our framework is to assess the complexity of questions by utilising
ontological metrics that are shared by questions generated from ontologies. This allows us to
distinguish between students with varying levels of mastery. Complexity, in this context, refers
to the level of knowledge required, and the cognitive demands placed on students during the
question-answering process. Instead of heavily relying on educational experts to determine
the complexity level of the generated questions, we suggest utilising the ontological features
that make up the questions to automatically calculate the internal complexity of the generated
questions. By internal, we mean the intrinsic factors of the question that increase or decrease its
complexity level. Thus, we eliminate external sources such as the proficiency level of learners
or previous background knowledge. We evaluate the generated questions based on two metrics
(based on those described in [25]): 1) the volume of knowledge (i.e. number of facts) they test; and
2) the level of reasoning they require. The first metric quantifies the number of relevant triples
used to generate the question; note that this metric does not count the number of entities in the
question itself (as assessment questions are short linguistic constructs which typically contain a
limited number of facts) but rather it establishes a relationship between the number of relevant
triples retrieved from executing the query against the ontology (during instantiation) and the
number of generated questions. The second metric evaluates the specificity and restrictiveness
of the question by quantifying the number of applied constraints. This metric indicates the
desired level of precision in the answer and highlights the question’s role in filtering and refining
the search space. We use this metric to determine the level of reasoning involved in generating
the answer. The proposed metrics are calculated according to the following definition:</p>
        <p>Definition: Let Q be a question, GP be its corresponding Graph Pattern, C(GP) be the
complexity of the Graph Pattern, TP be a triple pattern where each element (subject, predicate,
and object) can be a variable, and F be a constraint. The complexity of a question, C(Q), is equal
to the complexity of its corresponding graph pattern, C(GP), and the complexity of the graph
pattern GP is calculated as the total number of triple patterns TP and constraints F it contains.</p>
        <p>Thus, template 03 has a complexity score of 1, as it only has one relevant triple and no
constraints, whereas template 07 has a higher complexity, due to the three constraints (‘ORDER
BY DESC(Value)’, ‘OFFSET 1’, and ‘LIMIT 1’) and the additional triple involved. These metrics
enhance evaluation methodologies in the field of ontology-based AQG, thus enabling a more
comprehensive and advanced assessment of the generated questions’ complexity level.</p>
        <p>[Figure 1: a fragment of the family-tree ontology, showing individuals such as Sarah, Peter and Anne connected by the hasChild and hasSpouse object properties, class assertions via rdf:type (e.g. Grandparent, In-law), and the datatype properties YearOfBirth, PlaceOfBirth and Occupation.]</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation</title>
      <sec id="sec-4-1">
        <title>4.1. Methodological Approach</title>
        <p>An expert-centred evaluation was conducted where educational experts with knowledge and
experience in the construction of assessment questions in an educational setting were recruited.
They were presented with a set of generated questions with varying characteristics and asked
to rate each question based on its complexity level according to their perception. The questions
were rated on a scale ranging from 1 to 3, where a higher score denotes higher complexity.
The evaluation was conducted through an online questionnaire designed for this purpose. As
the data collection for this evaluation is currently ongoing, the results presented here are
preliminary, and provide an initial assessment of the proposed evaluation framework.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experimental Preparation</title>
        <p>Our study utilised a domain ontology purposely built for this evaluation task. Instead of using an
ontology focused on a specific domain, we opted to model a universally recognised concept: the
family tree relationship. This decision was made to provide a foundation of abstract knowledge
that anyone can understand, regardless of their expertise or familiarity with specific subjects.
By relying on the inherent understanding that people from different cultures and backgrounds
share, we aim to create a neutral basis for assessing question complexity. This approach allows
us to evaluate questions consistently and ensures that their quality is determined solely by their
structure and cognitive attributes, without being influenced by domain-specific complexities. A
small ontology was therefore developed (comprising 130 axioms) representing information about
an imaginary family including the relationship between each individual and some characteristics
including individuals’ year of birth, place of birth, I.Q. and occupation. A fragment of the
input ontology is illustrated in Figure 1. The question generation process was guided by the
question specifications shown in Table 1, which allowed us to generate questions of varying
characteristics, based on those proposed in previous studies. Multiple variants were generated
from each template, and in order to keep the rating process manageable, we randomly selected
and presented 3 variants per template to our experts. Consequently, each expert evaluated a
total of 23 questions.</p>
        <p>[Figure 2: (a) Expert-based assessment of complexity for different generated questions; (b) a comparison of complexity scores generated by our model with expert perceptions, across the question templates (definition, class assertion, property assertion, MCQ, complex MCQ, aggregate, ordinal, condition).]</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Preliminary Results</title>
      <p>The preliminary analysis revealed notable trends regarding the perceived complexity of the
questions. Figure 2a shows the percentage distribution of the ratings on a 3-point scale of
complexity, whereas Figure 2b presents a comparison between the complexity scores generated
by our model and those given by the experts. To obtain the automatic scores, we calculated the
complexity of each variant that was generated from a given template, and from these, determined
the mean rating given for the template. Finally, the complexity levels were normalised to map
values to a range between 1 and 3 for better comparability with experts’ ratings, based on the
minimum and maximum values of complexity in the dataset of questions considered.</p>
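      <p>The normalisation step described above can be sketched as a min-max rescaling onto the experts’ 1-to-3 scale (our own reconstruction of the mapping; the raw scores are illustrative):</p>

```python
# Min-max normalisation of raw complexity scores onto the 1-3 expert scale.
def normalise(score, cmin, cmax):
    if cmax == cmin:  # degenerate case: all questions equally complex
        return 1.0
    return 1.0 + 2.0 * (score - cmin) / (cmax - cmin)

raw_scores = [1, 2, 3, 5]  # per-template complexity values (illustrative)
lo, hi = min(raw_scores), max(raw_scores)
print([normalise(s, lo, hi) for s in raw_scores])  # [1.0, 1.5, 2.0, 3.0]
```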
      <p>Both figures suggest some correlation between the complexity scores provided by experts and
those generated by the model, indicating consistency in evaluating question complexity through
both approaches. The majority of the respondents’ ratings closely matched the assessments
made by our framework, with only a few minor discrepancies. More specifically, questions
that our framework categorised as simple (templates 01 to 04, excluding template 04) were
also perceived as simple by most reviewers. Regarding Definition , Property Assertion, and Class
Assertion questions, our model assigned a complexity rating of 1, implying a level of simplicity.
The experts’ ratings strongly aligned with our system’s ratings, with approximately 98% of
these questions also receiving a complexity rating of 1 from the experts. Questions based on
Property Assertions were perceived by the experts to be the simplest of the questions
analysed, having a very low complexity with a mean very close to 1. This closely aligned
with the score given by our model, given that the answer to these questions is encoded within a
single triple pattern (i.e. &lt;x&gt; &lt;p&gt; &lt;y&gt;) without requiring any additional cognitive tasks.</p>
      <p>The second template, which our model considered as producing simple questions, generated
questions based on class assertion axioms using the template &lt;x&gt; &lt;rdf:type&gt; &lt;X&gt;, with 85%
of experts considering the questions generated from this template to be simple, with a mean
rating close to 1. Additionally, 15% of experts gave these questions a maximum score of 2.</p>
      <p>The third template, which generates definition questions, differed slightly from the previous
templates in terms of simplicity according to our model. While still perceived as low complexity,
this template exhibits more variability in responses. The mean value is higher than 1, indicating
that some experts found it moderately complex. The distribution shows that approximately 60%
of experts rated it as low complexity, while medium and high ratings accounted for the remaining
40%. This observation can be attributed to several factors. Firstly, the answer space for definition
questions is typically broader, allowing for more potential variations in responses. This requires
greater precision from the expert and may involve other skills, such as linguistic proficiency.
When examining the specific textual templates used for this template, we utilised two different
variations. These variations are “Define &lt;X&gt;” and “What is the term used to describe
this &lt;string&gt;?” The question generated from the first textual template was perceived as
more complex than the second, possibly due to the cognitive tasks involved in defining a term,
such as categorisation and abstraction, which are mentally demanding compared to simply
recognising or identifying a term. This supports the notion that the expanded answer space
could influence the perception of complexity. Thus, these factors contribute to the understanding
that definition questions are not as straightforward as other types of questions.</p>
      <p>While our system categorised MCQs as simple, they were perceived as moderately complex,
with a mean below 2. The distribution of complexity ratings is balanced between low and
medium complexity, with very few high ratings.</p>
      <p>Upon closer examination, we discovered that our model only considered the triple patterns present in the stem “Which
of these is &lt;X&gt;:” when calculating the complexity score for MCQs, and excluded those
that appeared in the options. However, MCQs were generated and presented to the experts
with the options included in the stem (e.g., “Which of these is X: 1)y 2)x 3)z”). This
omission caused some triples to be excluded from the complexity calculation, resulting in a
lower score for the number of relevant triples metric and an overall decrease in the complexity
score. This finding has prompted us to re-evaluate our model for the MCQ template in order to
address the issue of missing triples.</p>
      <p>For the template used to create Complex MCQs, the experts’ ratings align with the score given by our
model, which rated this template with a complexity score closer to 1.5, indicating moderate complexity.
This is due to the larger number of triples present in the questions generated from this template.
Experts also gave the questions a mean score of 1.5, indicating moderate overall agreement. The
distribution reveals that the majority of ratings are for low complexity, with some for medium
complexity and no high ratings, suggesting that these questions lean towards simplicity.</p>
      <p>Based on the expert evaluations, templates 06 (Aggregate), 07 (Ordinal), and 08 (Conditional)
were considered the most complex. Of these, our model’s ratings align with this assessment
for templates 06 and 07. However, when it comes to questions generated based on Aggregate
functional properties, our model’s ratings differed the most from those given by experts.</p>
      <p>Only 20% of experts considered aggregate-based questions to be of low complexity. The
majority of experts, 63%, gave these questions a score of 2, whereas 16% gave them a score
of 3. Overall, these questions were perceived to be on the complex side, with a mean closer
to 2. However, our model assigned these questions a score closer to 1, despite the fact that
the question template included multiple triple patterns and utilised constraints such as COUNT
and HAVING, which introduce an additional reasoning step in the question-answering process.
This discrepancy becomes clearer when we compare aggregate questions to Complex MCQs:
our model assigned a higher complexity score to the latter, even though they do not
require any reasoning. This highlights the importance of revising the definition of complexity,
possibly by giving more weight to the second metric, the number of constraints included.
Assigning different weights to different types of constraint may result in a more nuanced
complexity estimate, which could be explored in future work. With this adjustment, our
model would be able to differentiate between questions with multiple triple patterns but no
reasoning, and questions that combine both features.</p>
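      <p>To make the suggested re-weighting concrete, the following sketch (with hypothetical metric names and weight values, not the model’s actual parameters) illustrates how giving more weight to the constraint metric would separate a many-triple question with no reasoning from an aggregate question whose template uses constraints such as COUNT and HAVING:</p>

```python
# Hypothetical sketch: a weighted combination of the two ontological metrics.
# The function name, weights, and score cap are illustrative assumptions,
# not the paper's actual model parameters.

def complexity_score(num_triples: int, num_constraints: int,
                     w_triples: float = 0.5, w_constraints: float = 1.5,
                     max_score: float = 3.0) -> float:
    """Combine triple-pattern count and constraint count into a bounded score."""
    raw = w_triples * num_triples + w_constraints * num_constraints
    return min(raw, max_score)

# A Complex MCQ: several triple patterns, but no reasoning constraints.
mcq = complexity_score(num_triples=3, num_constraints=0)
# An aggregate question: fewer triples, but COUNT and HAVING constraints.
aggregate = complexity_score(num_triples=2, num_constraints=2)

print(mcq, aggregate)  # 1.5 3.0 -- the constraint weight now dominates
```

      <p>Under such a weighting, the aggregate question would outscore the Complex MCQ, reversing the ordering produced by the current model.</p>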
      <p>Compared with the other questions, the templates that received the fewest low ratings
were the Ordinal and Conditional templates. Ordinal questions were found by the experts to be
the most complex, with a mean score of 2.2. Our framework likewise categorises these questions
as the most complex, although its automatic score of 3 is considerably higher than the experts’
judgement. The experts’ complexity ratings for this template were spread fairly evenly between
medium and high, with very few low ratings. Similarly, none of the experts rated questions
generated from the Conditional template as having low complexity. The majority of ratings for
this template were for medium complexity, with some high ratings, which aligns with our
model’s score of close to 2. This suggests that the experts widely agreed that Ordinal and
Conditional questions should not be regarded as simple questions (with respect to complexity).</p>
      <p>Overall, these initial findings support the hypothesis that an automatic evaluation framework
can be effectively used as a proxy for expert user evaluations when assessing the complexity
level of questions generated from ontologies. However, they also identify areas where the model
could be refined, especially in accurately evaluating the complexity of MCQs and considering
the added complexity introduced by constraints. We will continue to analyse and refine our
evaluation approach in future work to better align our system’s ratings with expert assessments.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This paper introduced an automatic framework that leverages ontological metrics to assess
the complexity of questions generated from domain ontologies. The effectiveness of the
proposed framework was evaluated through an expert-based evaluation, which revealed a consensus
between the complexity scores generated by the framework and the opinions of educational
experts. Nonetheless, minor discrepancies arose, particularly in scenarios involving questions
with more triples and those requiring higher levels of reasoning. These findings highlighted the
need for adjustments to address these discrepancies. This study contributes to ongoing research
on automatic evaluation methods for generated questions, which complements traditional
approaches and improves scalability.</p>
      <p>[17] C. Jouault, K. Seta, Y. Hayashi, Content-dependent question generation using LOD for
history learning in open learning space, New Generation Computing 34 (2016) 367–394.
[18] D. Seyler, M. Yahya, K. Berberich, Knowledge questions from knowledge graphs, in:
Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval,
2017, pp. 11–18.
[19] T. Raboanary, S. Wang, C. M. Keet, Generating answerable questions from ontologies
for educational exercises, in: Research Conference on Metadata and Semantics Research,
Springer, 2021, pp. 28–40.
[20] B. Diatta, A. Basse, S. Ouya, Bilingual ontology-based automatic question generation, in:
2019 IEEE Global Engineering Education Conference (EDUCON), IEEE, 2019, pp. 679–684.
[21] J. Bao, N. Duan, Z. Yan, M. Zhou, T. Zhao, Constraint-based question answering with
knowledge graph, in: Proceedings of COLING 2016, the 26th International Conference on
Computational Linguistics: Technical Papers, 2016, pp. 2503–2514.
[22] A. Ahmed, A. Pollitt, Curriculum demands and question difficulty, in: IAEA Conference,
Bled, Slovenia, 1999.
[23] M. T. Chi, P. J. Feltovich, R. Glaser, Categorization and representation of physics problems
by experts and novices, Cognitive Science 5 (1981) 121–152.
[24] M. T. Chi, R. D. Koeske, Network representation of a child’s dinosaur knowledge,
Developmental Psychology 19 (1983) 29.
[25] S. Alkhuzaey, F. Grasso, T. R. Payne, V. Tamma, A framework for assessing the complexity
of auto generated questions from ontologies, in: Proceedings of the European Conference
on e-Learning, volume 22, 2023, pp. 17–24.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Studer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. R.</given-names>
            <surname>Benjamins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fensel</surname>
          </string-name>
          ,
          <article-title>Knowledge engineering: Principles and methods</article-title>
          ,
          <source>Data &amp; knowledge engineering 25</source>
          (
          <year>1998</year>
          )
          <fpage>161</fpage>
          -
          <lpage>197</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>AlKhuzaey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Grasso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Payne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tamma</surname>
          </string-name>
          ,
          <article-title>Text-based question difficulty prediction: A systematic review of automatic approaches</article-title>
          ,
          <source>International Journal of Artificial Intelligence in Education</source>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>53</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Benedetto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Caines</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Buttery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cappelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Giussani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Turrin</surname>
          </string-name>
          ,
          <article-title>A survey on recent approaches to question difficulty estimation from text</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E. V.</given-names>
            <surname>Vinu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>A novel approach to generate MCQs from domain ontology: Considering DL semantics and open-world assumption</article-title>
          ,
          <source>Journal of Web Semantics</source>
          <volume>34</volume>
          (
          <year>2015</year>
          )
          <fpage>40</fpage>
          -
          <lpage>54</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Alsubait</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Parsia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Sattler</surname>
          </string-name>
          ,
          <article-title>Ontology-based multiple choice question generation</article-title>
          ,
          <source>KI-Künstliche Intelligenz</source>
          <volume>30</volume>
          (
          <year>2016</year>
          )
          <fpage>183</fpage>
          -
          <lpage>188</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Leo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kurdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Matentzoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Parsia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Sattler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Forge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Donato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dowling</surname>
          </string-name>
          ,
          <article-title>Ontology-based generation of medical, multi-term MCQs</article-title>
          ,
          <source>International Journal of Artificial Intelligence in Education</source>
          <volume>29</volume>
          (
          <year>2019</year>
          )
          <fpage>145</fpage>
          -
          <lpage>188</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V. E.</given-names>
            <surname>Venugopal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Difficulty-level modeling of ontology-based factual questions</article-title>
          ,
          <source>Semantic Web</source>
          <volume>11</volume>
          (
          <year>2020</year>
          )
          <fpage>1023</fpage>
          -
          <lpage>1036</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , K. VanLehn,
          <article-title>How do machine-generated questions compare to human-generated questions?</article-title>
          ,
          <source>Research and Practice in Technology Enhanced Learning 11</source>
          (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Apeh</surname>
          </string-name>
          ,
          <article-title>Automatically predicting quiz difficulty level using similarity measures</article-title>
          ,
          <source>in: Proceedings of the 8th international conference on knowledge capture</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Faizan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lohmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Modi</surname>
          </string-name>
          ,
          <article-title>Multiple choice question generation for slides</article-title>
          ,
          <source>in: Computer Science Conference for University of Bonn Students</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Faizan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lohmann</surname>
          </string-name>
          ,
          <article-title>Automatic generation of multiple choice questions from slide content using linked data</article-title>
          ,
          <source>in: Proceedings of the 8th international conference on web intelligence, mining and semantics</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Jouault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Seta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hayashi</surname>
          </string-name>
          ,
          <article-title>Quality of lod based semantically generated questions</article-title>
          ,
          <source>in: Artificial Intelligence in Education: 17th International Conference, AIED 2015</source>
          , Madrid, Spain, June 22-26,
          <year>2015</year>
          . Proceedings 17, Springer,
          <year>2015</year>
          , pp.
          <fpage>662</fpage>
          -
          <lpage>665</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K.</given-names>
            <surname>Stasaski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hearst</surname>
          </string-name>
          ,
          <article-title>Multiple choice question generation utilizing an ontology</article-title>
          ,
          <source>in: Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>303</fpage>
          -
          <lpage>312</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>G.</given-names>
            <surname>Kurdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Parsia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Sattler</surname>
          </string-name>
          ,
          <article-title>An experimental evaluation of automatically generated multiple choice questions from ontologies</article-title>
          ,
          <source>in: OWL: Experiences and Directions-Reasoner Evaluation: 13th International Workshop, OWLED 2016, and 5th International Workshop, ORE 2016, Bologna, Italy, November 20, 2016, Revised Selected Papers 13</source>
          , Springer,
          <year>2017</year>
          , pp.
          <fpage>24</fpage>
          -
          <lpage>39</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>F. B.</given-names>
            <surname>Baker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-H.</given-names>
            <surname>Kim</surname>
          </string-name>
          , et al.,
          <source>The basics of item response theory using R</source>
          , volume
          <volume>969</volume>
          , Springer,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Alkhuzaey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Grasso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Payne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tamma</surname>
          </string-name>
          ,
          <article-title>Generating complex questions from ontologies with query graphs</article-title>
          ,
          <year>2024</year>
          .
          <source>Proceedings of the 28th International Conference on Knowledge-Based and Intelligent Information and Engineering Systems (KES)</source>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>