<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Generating Multiple Choice Questions From Ontologies: Lessons Learnt</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tahani Alsubait</string-name>
          <email>alsubait@cs.man.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bijan Parsia</string-name>
          <email>bparsia@cs.man.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Uli Sattler</string-name>
          <email>sattler@cs.man.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computer Science, The University of Manchester</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Ontologies with potential educational value are available in different domains. However, it is still unclear how such ontologies can be exploited to generate useful instructional content, e.g., assessment questions. In this paper, we present an approach to automatically generate multiple-choice questions from OWL ontologies. We describe a psychologically plausible theory to control the difficulty of questions. Our contributions include designing a protocol to evaluate the characteristics of the generated questions, in particular, question difficulty. We report on our experience of generating questions from ontologies and present promising results of evaluating our difficulty-control theory.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
<p>A question in which a set of plausible answers is offered to a student to choose
from is called a Multiple Choice Question (MCQ). Providing a set of plausible
answers might or might not make the question easier for the student. However,
preparing reasonably good alternative answers definitely requires more time and
effort from the question designer. This is why we primarily focus on developing
methods to automatically generate this particular type of question.</p>
      <p>
Developing automatic methods for Question Generation (QG) can alleviate
the burden of both paper-and-pencil and technology-aided assessments. Of
particular interest are large-scale tests such as nation-wide standardised tests and
tests delivered as part of Massive Open Online Courses (MOOCs). Typically,
these tests consist mainly of MCQs [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ]. Economically speaking, it is reported
[
        <xref ref-type="bibr" rid="ref1">1</xref>
] that a large amount of money is increasingly spent on large-scale testing
(e.g., US spending doubled from $165 million in 1996 to $330 million in 2000).
Although there is no evidence that automatic QG can help to reduce this
spending, it is expected that the time spent on test preparation could be utilised in
better ways. In addition, different modes of delivery (e.g., static, adaptive) can
benefit from automatic QG. In fact, one of the promising applications of QG
is the delivery of questions that match the estimated abilities of test takers in
order to reduce the time spent by each test taker on the test [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
<p>Abstractly speaking, a QG system takes, as input, a knowledge source and
some specifications describing the questions to be generated. As output, it
produces a reasonable number of questions which assess someone's understanding of
that knowledge and which, of course, adhere to the given specifications. These
specifications can include, for example, the format of the question and its
difficulty.</p>
      <p>
Generally, questions answered correctly by 30%-90% of students are preferred
to those answered correctly by more than 90% or less than 30% of students [
        <xref ref-type="bibr" rid="ref11">11</xref>
]. But a
well-balanced test needs questions of all levels of difficulty, in suitable
proportions. To construct a balanced test, the test developer initially needs to predict
how certain students will perform on the items of that test. However, this is a
rather difficult process [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and there is evidence that teachers do not usually make
good predictions [
        <xref ref-type="bibr" rid="ref20">20</xref>
]. In addition, automatic estimations of question difficulty
can help to advance research on adaptive assessment systems, which usually rely
on training data to estimate difficulty [
        <xref ref-type="bibr" rid="ref19">19</xref>
]. As a consequence, it is necessary
for automatic QG systems to be able to control the difficulty of the generated
questions, although this is rather challenging. We have addressed this challenge by
developing a novel, similarity-based theory of controlling MCQ difficulty [
        <xref ref-type="bibr" rid="ref4">4</xref>
]. In
this paper, we empirically examine whether varying the similarity between the
correct and wrong answers of each question can vary the difficulty of questions.
      </p>
<p>The main goal of this study is to answer the following questions:
1. Can we control the difficulty of MCQs by varying the similarity between the
key and distractors?
2. Can we generate a reasonable number of educationally useful questions?
3. How costly is ontology-based question generation, including the cost of
developing/enhancing an ontology, computation cost and post-editing cost?</p>
    </sec>
    <sec id="sec-2">
      <title>Ontology-based question generation</title>
      <p>
Two alternative sources are typically used for QG: unstructured text and
ontologies. The QG workshop (2009) [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] has identified raw text as the most preferred
knowledge source according to responses gathered from participants of the
workshop. However, a drawback of text-based QG approaches is that they mostly
generate shallow questions about explicit information, as it is difficult to infer
implicit relations using current NLP techniques [
        <xref ref-type="bibr" rid="ref14 ref18 ref22 ref8">8, 14, 18, 22</xref>
]. Many ontology-based QG approaches have been developed [
<xref ref-type="bibr" rid="ref15 ref26 ref35 ref36 ref9">9, 15, 26, 35, 36</xref>
]. These approaches
take advantage of reasoning services offered by ontology tools to generate
questions about implicit knowledge. However, it remains to put ontology-based QG
approaches into practice.
      </p>
<p>Ontologies with potential educational value are available in different domains
such as Biology, Medicine and Geography, to name a few.1 However, ontologies are
not designed particularly for educational use. Thus, there is a challenge in
generating useful instructional content from them. Of course, in some cases, there is
a need to first build or improve an ontology for a specific subject matter before
utilising it for QG. Thus, there is a trade-off between the effort required to build
and maintain an ontology and the overall advantage of single or multiple uses.</p>
    </sec>
    <sec id="sec-3">
      <title>Multiple-choice questions</title>
      <p>
Assessment questions have been classified in a variety of ways. A common way
is to classify questions as objective (e.g., MCQs or True/False questions) or
subjective (e.g., essays or short answers). The essential features of objective tests
include their ability to assess a broad range of knowledge in an efficient way
(compared to essays, for example). In addition, they are auto-gradable and a
well-established means of testing students' understanding. However, research
suggests that objective questions are hard to prepare and require considerable time
per question [
<xref ref-type="bibr" rid="ref11 ref27 ref31">11, 27, 31</xref>
].
1 For a list of ontology repositories, the reader is referred to: http://owl.cs.manchester.ac.uk/tools/repositories/
      </p>
      <p>Definition 1 An MCQ is a tuple &lt; S, K, D &gt; consisting of the following parts:
1. Stem (S): A statement that introduces a problem to the student.
2. Key (K): The correct answer(s).
3. Distractors (D): A number of incorrect, yet plausible, answers.</p>
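<p>Definition 1 maps directly onto a small data structure. The sketch below (Python, with illustrative field and variable names of our own choosing, not the authors' implementation) mirrors the tuple &lt; S, K, D &gt;:</p>

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MCQ:
    """An MCQ as in Definition 1: a tuple <S, K, D>."""
    stem: str               # S: statement introducing a problem to the student
    key: List[str]          # K: the correct answer(s)
    distractors: List[str]  # D: incorrect, yet plausible, answers

# A hypothetical instance; the content is for illustration only.
q = MCQ(stem="Who is Mark married to?",
        key=["Sara"],
        distractors=["Nancy", "David", "Julia"])
```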
      <p>The structured format of MCQs can help in making them suitable for
computerised generation. The automatic generation of MCQs in particular and
assessment questions in general can help to resolve many issues in students’
assessment. For example, constructing a bank of questions of known properties
can help to eliminate cheating by facilitating the preparation of different tests
with similar properties (e.g., item difficulty, content). Also, instead of using last
years' exams as self-assessments, an instructor can generate exams that resemble
original past exams with reasonable effort.</p>
    </sec>
    <sec id="sec-4">
      <title>Running example</title>
      <p>Consider the following ontology.</p>
<p>Hospital ⊑ HealthCareProvider, GPClinic ⊑ HealthCareProvider,
University ⊑ EducationProvider, School ⊑ EducationProvider,
Registrar ⊑ ∃worksIn.Hospital, GP ⊑ ∃worksIn.GPClinic,
Teacher ⊑ ∃worksIn.School, Instructor ⊑ ∃worksIn.University,
LuckyPatient ⊑ Patient ⊓ ∃marriedTo.(∃worksIn.HealthCareProvider),
Patient(Mark), Instructor(Nancy),
Registrar(David), GP(Sara),
treatedBy(Mark, Sara), marriedTo(Mark, Sara),
treatedBy(Nancy, Sara), marriedTo(Nancy, David)
A reasonable number of questions can be generated from the above ontology.
The generation can involve two steps: (1) generating question candidates and (2)
transforming each candidate into a grammatically well-formed question.
The second step is out of the scope of this research. However, for readability, we
present below some stems that can be generated from the above ontology after
making any necessary grammatical renderings.
1. Give an example of a health care provider.
2. What is a GP clinic?
3. Where does an instructor work?
4. Who is Mark married to?
5. Which one of the following definitions describes a lucky patient?
6. Instructor is to University as ........ is to ........?
7. Nancy is to David as ........ is to ........?</p>
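<p>For illustration, the ABox of the example can be loaded into a toy fact store (plain Python, not the actual QG implementation); stem 4 then reduces to a lookup over role assertions:</p>

```python
# Toy encoding of the running example's role assertions (ABox only).
abox = {
    "treatedBy": [("Mark", "Sara"), ("Nancy", "Sara")],
    "marriedTo": [("Mark", "Sara"), ("Nancy", "David")],
}

def fillers(role, individual):
    """All b such that role(individual, b) is asserted."""
    return [b for (a, b) in abox[role] if a == individual]

# Stem 4: "Who is Mark married to?" -- the key is Sara.
print(fillers("marriedTo", "Mark"))  # ['Sara']
```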
      <p>The above questions range from simple recall questions (e.g., 1-5) to questions
that require some sort of reasoning (e.g., 6-7). For each of the above stems, it
remains to specify a key and some suitable distractors. The challenge is to pick
distractors that look like plausible answers to those students who do not know the
actual answer. For example, for question 4, the answers are expected to be names
of persons. Including other distractors, such as names of institutions, would help
in making the correct answer stand out even for a low mastery student. So we
need a mechanism to filter out obviously wrong answers. Moreover, the utilised
mechanism should be able to select a set of distractors that makes the question
suitable for a certain level of difficulty.</p>
      <p>
After seeing an example of generating questions from what can be considered
a toy ontology, we want to know what the case is for real ontologies, which
can be big and rich. For example, the average number of axioms per ontology in
BioPortal is 20,532 with a standard deviation of 115,163 and a maximum
of 1,484,923 [
        <xref ref-type="bibr" rid="ref17">17</xref>
]. This suggests that a considerably large number of questions, though
not necessarily good-quality questions, can be generated from a single ontology.
We investigate this by generating some questions from selected ontologies from
BioPortal. The detailed results can be found in [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
], suggesting that a massive
number of questions can be generated from each ontology. Some of the questions
that arise here are: Are these questions all good? Can we control their difficulty?
And what does an ontology that is suitable for auto-generation of MCQs look
like?
      </p>
<p>In short, ontology-based QG can be accomplished in different steps. First,
one needs to find an ontology and possibly enhance it, either on the logical level or
by adding annotations. Secondly, the questions can be computed. Finally,
there is a need to filter the generated questions before administering them to
real students.</p>
    </sec>
    <sec id="sec-5">
      <title>Similarity-based MCQ generation</title>
<p>To generate pedagogically sound questions, we need a pedagogically plausible
notion of similarity for selecting good distractors. Ideally, we want students'
performance to correlate with their knowledge mastery (i.e., the amount and quality
of their knowledge). This means that difficult questions are expected to be answered
correctly by high mastery students only, while easy questions are expected to be
answered by both low and high mastery students.</p>
<p>The basic intuition is that offering a set of very similar answers makes it
difficult to identify the correct answer, hence decreasing the probability of spotting the
correct answer just because the distractors are obviously wrong. Thus, to make a
question more difficult, increase the degree of similarity between the key and
distractors. And, to make it less difficult, decrease the degree of similarity.</p>
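<p>This selection rule can be sketched as follows, assuming some similarity function is available; the candidate names and scores below are made up for illustration:</p>

```python
def pick_distractors(key, candidates, sim, lower, upper=1.0, n=3):
    """Keep candidates whose similarity to the key lies in [lower, upper]:
    a high band yields difficult questions, a low band easy ones."""
    band = [c for c in candidates if c != key and lower <= sim(key, c) <= upper]
    return sorted(band, key=lambda c: sim(key, c), reverse=True)[:n]

# Made-up similarity scores between the key "Sara" (a GP) and candidates.
scores = {"Julia": 0.9, "Tom": 0.8, "Anna": 0.75, "Hospital": 0.12, "School": 0.05}
sim = lambda k, c: scores[c]

print(pick_distractors("Sara", list(scores), sim, lower=0.7))             # difficult
print(pick_distractors("Sara", list(scores), sim, lower=0.0, upper=0.5))  # easy
```

With a high similarity band, only person-like candidates survive, so obvious non-answers such as institution names are filtered out automatically.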
<p>For instance, we would expect the difficulty of question 4 above to increase
if we provide a list of GP names as distractors, since the correct answer “Sara”
is also a GP. Someone who knows that Mark is married to a GP would still
need to know the exact name of that GP. This means that a student who knows
more about the subject of the question performs better.</p>
      <p>
The question that remains to be answered is: how can we measure the
similarity between the key and the distractors? The choice of similarity measure has
a direct influence on the overall QG process. However, designing similarity
measures for ontologies is still a big challenge. Looking at existing similarity
measures (e.g., [
        <xref ref-type="bibr" rid="ref10 ref21 ref28 ref29 ref34">10, 21, 28, 29, 34</xref>
]), we found that no existing off-the-shelf method
satisfies our requirements. For example, some measures [
        <xref ref-type="bibr" rid="ref28 ref29 ref34">28, 29, 34</xref>
] are imprecise
by definition as they consider only atomic subsumptions and ignore any complex
subsumptions which can affect similarity computation. Other measures [
        <xref ref-type="bibr" rid="ref10 ref21">10, 21</xref>
]
impose restrictions on the ontologies to which they can be applied (e.g., low
expressivity, no cycles, availability of an ABox or an external corpus of annotated
text). This has motivated us to develop a new family of similarity measures
which can be used with any arbitrary OWL ontology. The basic rationale of the
new measures is that similar concepts have more common and fewer
distinguishing features. Details of the new similarity measures can be found in [
        <xref ref-type="bibr" rid="ref5">5</xref>
]. In what follows, we primarily use the measure SubSim(·) which,
in addition to Class names, considers Subsuming class expressions that occur in
the input ontology.
      </p>
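<p>The rationale “more common and fewer distinguishing features” suggests a set-based comparison of subsumers. The sketch below is a generic Jaccard-style approximation in this spirit, not the exact definition of SubSim(·) from [5]:</p>

```python
def subsumer_similarity(subsumers_c, subsumers_d):
    """Ratio of common to total subsumers, where a subsumer may be a class
    name or a class expression. 1.0 = identical feature sets, 0.0 = disjoint."""
    union = subsumers_c | subsumers_d
    return len(subsumers_c & subsumers_d) / len(union) if union else 1.0

# Subsumer sets loosely based on the running example.
registrar = {"Registrar", "∃worksIn.Hospital", "∃worksIn.HealthCareProvider"}
gp = {"GP", "∃worksIn.GPClinic", "∃worksIn.HealthCareProvider"}
teacher = {"Teacher", "∃worksIn.School", "∃worksIn.EducationProvider"}

print(subsumer_similarity(registrar, gp))       # 0.2 (1 common of 5 total)
print(subsumer_similarity(registrar, teacher))  # 0.0
```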
    </sec>
    <sec id="sec-6">
      <title>Empirical evaluation</title>
      <sec id="sec-6-1">
        <title>Materials and methods</title>
        <p>
Equipment description. The following machine was used for the experiments
in this paper: Intel quad-core i7 2.4 GHz processor, 4 GB 1333 MHz DDR3
RAM, running Mac OS X 10.7.5. In addition, the following software was used: OWL
API v3.4.4 [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] and FaCT++ [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ].
        </p>
<p>Ontology development. The Knowledge Representation and Reasoning course
(COMP34512) is a third-year course unit offered by the School of Computer
Science at The University of Manchester. It covers various Knowledge
Representation (KR) topics, including Knowledge Acquisition (KA) techniques and
KR formalisms. For the purposes of the experiment described in this section,
a KA ontology (which models the KA part of the course) was developed from
scratch.2 This particular part of the course unit was chosen as it contains mainly
declarative knowledge. Other parts of the course can be described as procedural
knowledge, which is not suitable for modelling in an OWL ontology. Assessing
students' understanding of declarative knowledge is an essential part of various
tests.</p>
<p>A total of 9 hours was spent building the first version of the ontology,
excluding the time required to skim through the course contents, since the ontology was
not developed by the instructor who is in charge of the course unit. The Protégé
4 ontology editor was used for building the ontology.</p>
<p>Several refinements were applied to the ontology after discussing it
with an experienced knowledge modeller among the authors of this paper and
getting useful feedback from her. The feedback session took around 2 hours and
the refinements took around 3 hours to apply.</p>
        <p>
The resulting ontology, after applying these refinements, is an SI ontology
consisting of 504 axioms. Among these are 254 logical axioms. Class and object
property counts are 151 and 7 respectively, with one inverse and one transitive
object property.
2 Can be accessed at: http://edutechdeveloper.com/MCQGen/examples/KA.owl
</p>
<p>
Question generation. In [
<xref ref-type="bibr" rid="ref4">4</xref>
], we have described a variety of MCQs targeting
different levels of Bloom's taxonomy (i.e., a classification of educational
objectives) [
<xref ref-type="bibr" rid="ref7">7</xref>
]. It remains here to determine which questions should be generated for the
purpose of the current experiment. We chose to avoid questions which require
complicated cognitive processing (not necessarily difficult ones). Questions at
lower Bloom levels are more suitable for our current purposes as they require
less administration time. A variety of memory-recall questions have been
generated, which we describe below. Questions that require higher cognitive skills
(e.g., reasoning) have been examined before; for more details see Section 6.3.
        </p>
        <p>
A total of 913 questions have been generated from the KA ontology
described above.3 Among these are 633 easy questions and 280 difficult questions;
see details below. Only 535 questions out of the 913 have at least 3
distractors (with 474 easy questions and 82 difficult questions). Out of these,
we randomly selected 50 questions for further evaluation by 3 reviewers. The 50
questions contain 5 easy and 5 difficult questions from 6 different question
categories which are described below, where 2 categories contain only easy questions
and one category contains a total of 5 difficult questions. The optimal number of
distractors for MCQs remains debatable [
<xref ref-type="bibr" rid="ref13">13</xref>
]. We chose to randomly select 3
distractors for each question. A description of each category of questions and
the number of generated questions for each category is presented in Table 1.
        </p>
<p>It makes sense to have only easy questions in two of the above categories.
This is because these questions were designed such that there is a concept S1
such that the key is not subsumed by S1 but the distractors are. And since
similarity depends on the number of subsumers, the similarity between the key and
distractors should be low, especially when S1 is an atomic subsumer. Hence, the
generated distractors only fit the criteria for easy question generation.</p>
      </sec>
      <sec id="sec-6-2">
<title>Specifying easy vs. difficult questions using similarity measures</title>
        <p>
          SubSim(·) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] has been used to generate most of the questions described above
with the exception of using GrammarSim(·) [
          <xref ref-type="bibr" rid="ref5">5</xref>
] to generate “What is X?” questions (with
class expressions as answers). This is justified by the fact that, when the answers
are expressed in detail (e.g., class expressions rather than simply class names),
the similarity measure should be more precise. It remains to specify the upper
and lower bounds of the similarity values which can be used to generate appropriate
distractors for easy and difficult questions. Rather than specifying a random
number, we choose to calculate the average similarity value between all siblings
in the ontology. This average similarity value is then used as the lower bound for
generating a difficult distractor, where 1 is the upper bound. The lower bound to
generate an easy distractor is set to be two thirds of the lower bound for difficult
distractors.
        </p>
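<p>The bound computation described above can be sketched as follows (the pairwise sibling similarities are assumed to be given; the input values below are made up):</p>

```python
def similarity_bounds(sibling_similarities):
    """Lower bound for difficult distractors = average sibling similarity
    (with 1 as the upper bound); lower bound for easy distractors = two
    thirds of the difficult lower bound."""
    difficult_lower = sum(sibling_similarities) / len(sibling_similarities)
    easy_lower = (2 / 3) * difficult_lower
    return easy_lower, difficult_lower

# Made-up sibling similarity values for illustration.
easy_lb, difficult_lb = similarity_bounds([0.2, 0.4, 0.6])
print(easy_lb, difficult_lb)  # ~0.267, 0.4
```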
<p>Question review. Three reviewers involved in leading the course have been
asked to evaluate the 50 randomly selected questions using a web interface.
For each question, the reviewer first attempts to solve the question and then
specifies whether he/she thinks that the question is (0) not useful at all, (1) useful
as a seed for another question, (2) useful but requires major improvements, (3)
useful but requires minor improvements or (4) useful as it is. Then, the reviewer
predicts how the third year students participating in the current experiment
would perform on the question. To distinguish between acceptable and extreme
levels of difficulty, we ask the reviewers to choose one of the following options
for each question: (1) too easy, (2) reasonably easy, (3) reasonably difficult and
(4) too difficult. In what follows, we number the reviewers based on their job
completion time. Hence, the first reviewer refers to the reviewer who finished
the reviewing process first.
3 A web-based interface for the QG tool can be accessed at: http://edutechdeveloper.com/MCQGen/</p>
<p>Question administration. A sample of the questions which have been rated
as useful (or useful with minor improvements) by at least 2
reviewers have been administered to third year students4 who are enrolled in
the course unit for the academic year 2013/14 and who were about to sit the
final exam. Two sets of questions have been administered in two different
rounds to increase the participation rate. In the first round, a total of 6 questions (3
easy, 3 difficult) were administered to 19 students using paper and pencil
during a revision session at the end of the course. The students had 10 minutes
to answer the 6 questions. In the second round, another set of 6 questions (3
easy, 3 difficult) was administered to 7 students via BlackBoard one week
before the final exam, and the students were allowed to answer the questions at
any time during this week.</p>
        <p>
Statistical analysis. Item response theory (IRT) [
          <xref ref-type="bibr" rid="ref24">24</xref>
] has been used for the
statistical analysis of students' results. IRT studies the statistical behaviour of
good/bad questions. In particular, it studies the following properties: (i) item
difficulty, (ii) discrimination between good and poor students and (iii)
guessability.
        </p>
      </sec>
      <sec id="sec-6-3">
        <title>Analogous/prior experiments</title>
        <p>
Mitkov et al. [
          <xref ref-type="bibr" rid="ref25">25</xref>
] have developed an approach to automatically generate MCQs
using NLP methods. They also utilise WordNet [
          <xref ref-type="bibr" rid="ref23">23</xref>
] to generate distractors that
are similar to the key. They do not explicitly link item difficulty to similarity
patterns between keys and distractors. However, they report that the average item
difficulty for the generated questions was above 0.5 (i.e., considered difficult),
which can be explained by our similarity theory [
          <xref ref-type="bibr" rid="ref4">4</xref>
], since they chose to generate
distractors that are similar to the key.
        </p>
        <p>
          In an earlier study [
          <xref ref-type="bibr" rid="ref3">3</xref>
], we have evaluated a large set of multiple-choice
analogy questions5 which have been generated from three different ontologies. The
evaluation was carried out using an automated solver which simulates a student
trying to answer these questions. The use of the automated solver facilitated the
evaluation of the large number of questions. The current experiment, in which
we recruit a group of students in real class settings, confirms the results of the study
carried out earlier using the automated solver.
4 This study has been approved by the ethics committee in the School of Computer Science, The University of Manchester (approval number: CS125).
5 In an analogy question, a pair of concepts of the form “A is to B” is presented to the student, who is asked to identify the most similar pair of concepts out of a set of pairs provided as alternative answers to the question.</p>
      </sec>
      <sec id="sec-6-4">
        <title>Results and discussion</title>
        <p>
          Overall cost. As mentioned earlier, the cost of QG might, in some cases, include
any costs of developing a new ontology or reviewing/editing an existing one. For
the current experiment, we experienced the extreme case of having to build an
ontology from scratch for the purpose of QG. A total of 14 hours were spent to
develop the ontology described above. For computation time, we need to consider
both the time required to compute pairwise similarity for the underlying ontology
and the time required to compute the questions. Computing pairwise similarity
for all sub-concept expressions (including concept names) in the KA ontology
took 22 minutes. This includes time required to compute similarities using both
SubSim(·) and GrammarSim(·) [
          <xref ref-type="bibr" rid="ref5">5</xref>
] for a total of 296 sub-concept expressions.
Computing a total of 913 questions took around 21 minutes; computing “which
is the odd one out?” questions took 17 of these minutes while
computing all other questions took less than 4 minutes.
        </p>
<p>Finally, we also have to consider any time required to review the questions
(possibly including post-editing time). As the reviewers were allowed to review
each item in an unrestricted manner, it is difficult to determine the exact time
that each reviewer spent on each item. For example, a reviewer might start
looking at a question on a given day and then submit the review on the next day
after being interrupted for some reason. We exclude
questions for which the recorded time was more than 60 minutes as this clearly
shows that the reviewer was interrupted in the middle of the reviewing process.
The first reviewer spent between 13 and 837 seconds reviewing each of the 50
questions, including time for providing suggestions to improve the questions. The
second reviewer spent between 12 and 367 seconds, and the third reviewer between
17 and 917 seconds. Note that these times include the time the reviewer required
to attempt to answer the question.</p>
        <p>Usefulness of questions. A question is considered “useful” if it is rated
as either “useful as it is” or “useful but requires minor improvements” by a
reviewer. 46 out of 50 questions were considered useful by at least 1 reviewer. 17
out of the 46 questions were considered useful by at least 2 reviewers. The first
reviewer rated 37 questions as being useful while the second and third reviewer
rated 8 and 33 questions as useful respectively. Note that the third reviewer is
the main instructor of the course unit during the academic year in which the
experiment has been conducted while the second reviewer taught the course unit
in the previous year. The first reviewer has not taught this course unit before
but has general knowledge of the content.</p>
        <p>Usefulness of distractors. A given distractor is considered “useful” if it has
been functional (i.e., picked by at least one student). For the 6 questions which
were administered on paper, at least 2 out of 3 distractors were useful. In 5 out of
the 6 questions, the key answer was picked more frequently than the distractors.
Exceptionally, in 1 question, a particular linguistically unique distractor was
picked more frequently than the key. The course instructor justified this by
pointing out that this question was not covered explicitly in class. For the 6
questions which have been administered on BlackBoard, at least one distractor
was useful except for one question which has been answered correctly by all 7
students.</p>
        <p>
Item discrimination. We used Pearson's coefficient to compute item
discrimination, which shows the correlation between students' performance on a given
question and the overall performance of each student on all questions. The
range of item discrimination is [-1, +1]. A good discrimination value is greater
than 0.4 [
          <xref ref-type="bibr" rid="ref12">12</xref>
]. For the 6 questions administered on paper and 4 out of the 6
questions administered via BlackBoard, item discrimination was greater than 0.4.
For one question administered via BlackBoard, item discrimination could not be
calculated as 100% of the students answered that question correctly. Finally, item
discrimination was poor for only one question. The third reviewer pointed out
that this question is highly guessable.
        </p>
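<p>Item discrimination as used here is a plain Pearson correlation between a 0/1 per-item score vector and the students' total scores; the sketch below is self-contained and uses made-up response data:</p>

```python
def discrimination(item_scores, total_scores):
    """Pearson correlation between per-item scores (1 = correct, 0 = wrong)
    and total scores; range [-1, +1], values above 0.4 considered good."""
    n = len(item_scores)
    mean_i = sum(item_scores) / n
    mean_t = sum(total_scores) / n
    cov = sum((x - mean_i) * (y - mean_t)
              for x, y in zip(item_scores, total_scores))
    var_i = sum((x - mean_i) ** 2 for x in item_scores)
    var_t = sum((y - mean_t) ** 2 for y in total_scores)
    return cov / (var_i * var_t) ** 0.5

# Made-up data: 6 students; the item is answered correctly by the top scorers.
item = [1, 1, 1, 0, 0, 0]
totals = [6, 5, 5, 3, 2, 1]
print(round(discrimination(item, totals), 2))  # 0.93
```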
        <p>
Item difficulty. One of the core functionalities of the presented QG tool is
the ability to control item difficulty. To evaluate this functionality, we examine
tool-reviewer agreement and tool-student agreement. As described above, the
tool generates questions and labels them as easy or as difficult. Each reviewer
can estimate the difficulty of a question by choosing one of the following options:
(1) too easy, (2) reasonably easy, (3) reasonably difficult and (4) too difficult.
A question is too difficult for a particular group of students if it is answered
correctly by less than 30% of the students, and too easy if answered correctly by more
than 90% of the students [
          <xref ref-type="bibr" rid="ref11">11</xref>
]. In both cases, the question needs to be reviewed
and improved. Accordingly, we consider a question to be difficult if answered
correctly by 30-60% of the students and easy if answered correctly by 60-90% of the students.
        </p>
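<p>These bands translate into a simple classifier over the proportion of correct answers (a sketch of the thresholds just stated):</p>

```python
def difficulty_band(p_correct):
    """Classify a question by the proportion of students answering correctly:
    <30% too difficult, 30-60% difficult, 60-90% easy, >90% too easy."""
    if p_correct < 0.30:
        return "too difficult"
    if p_correct < 0.60:
        return "difficult"
    if p_correct <= 0.90:
        return "easy"
    return "too easy"

print(difficulty_band(0.45))  # difficult
print(difficulty_band(0.95))  # too easy
```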
<p>Before discussing tool-reviewer agreement, it is worth noting the agreements
among reviewers. We distinguish between loose agreements and strict
agreements. A loose agreement occurs when two reviewers agree that a question is
easy/difficult but disagree on whether it is too easy/difficult or reasonably
easy/difficult. Table 2 summarises the agreements among reviewers. Each reviewer agrees
with the tool on 31 (not necessarily the same) questions.</p>
<p>With regard to the 6 questions delivered on paper, 2 questions were
reasonably difficult and 2 were reasonably easy for the students. These 4 questions
were in line with the difficulty estimations by the QG tool. 1 out of the 6 questions
was too difficult for the students. Most of the students picked a linguistically
unique distractor rather than the key. Remarkably, the tool and the three
reviewers had rated this item as easy. Finally, 1 question was too easy for the
students although it was rated as difficult by the tool. This is due to a
clue in the stem. Similarly, for the BlackBoard questions, 1 question was reasonably
difficult and 1 question was reasonably easy for the students, just in line with
the tool estimations. 1 out of the 6 questions was too easy for the students (100%
correct answers). This question was rated as easy by the tool. Again, 1 question
was rated as difficult by the tool but was easy for the students due to a
clue in the stem. 2 questions were not in line with the tool estimations but were in
line with the estimations of at least 2 reviewers.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusion and future research directions</title>
      <p>With the growing interest in ontology-based QG approaches, there is a clear
need to put these approaches into practice. We have examined the
pedagogical characteristics of questions generated from a handcrafted ontology. The
results are promising with regard to both the usefulness of questions and difficulty
prediction. For future work, we will experiment with more ontologies, in
particular, existing ontologies rather than handcrafted ones. It also remains to
incorporate NLP methods to account for linguistic/lexical similarities.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Achieve</surname>
          </string-name>
          .
          <article-title>Testing: Setting the record straight</article-title>
          .
          <source>Technical report</source>
          , Washington, DC: Author,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>T.</given-names>
            <surname>Alsubait</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Parsia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>U.</given-names>
            <surname>Sattler</surname>
          </string-name>
          .
          <article-title>Mining ontologies for analogy questions: A similarity-based approach</article-title>
          .
          <source>In OWLED</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>T.</given-names>
            <surname>Alsubait</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Parsia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>U.</given-names>
            <surname>Sattler</surname>
          </string-name>
          .
          <article-title>Next generation of e-assessment: automatic generation of questions</article-title>
          .
          <source>International Journal of Technology Enhanced Learning</source>
          ,
          <volume>4</volume>
          (
          <issue>3</issue>
          /4):
          <fpage>156</fpage>
          -
          <lpage>171</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>T.</given-names>
            <surname>Alsubait</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Parsia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>U.</given-names>
            <surname>Sattler</surname>
          </string-name>
          .
          <article-title>A similarity-based theory of controlling MCQ difficulty</article-title>
          .
          <source>In Second International Conference on e-Learning and e-Technologies in Education (ICEEE)</source>
          , pages
          <fpage>283</fpage>
          -
          <lpage>288</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>T.</given-names>
            <surname>Alsubait</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Parsia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>U.</given-names>
            <surname>Sattler</surname>
          </string-name>
          .
          <article-title>Measuring similarity in ontologies: How bad is a cheap measure?</article-title>
          <source>In 27th International Workshop on Description Logics (DL-2014)</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>I.</given-names>
            <surname>Bejar</surname>
          </string-name>
          .
          <article-title>Subject matter experts' assessment of item statistics</article-title>
          .
          <source>Applied Psychological Measurement</source>
          ,
          <volume>7</volume>
          :
          <fpage>303</fpage>
          -
          <lpage>310</lpage>
          ,
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>B. S.</given-names>
            <surname>Bloom</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Krathwohl</surname>
          </string-name>
          .
          <article-title>Taxonomy of educational objectives: The classification of educational goals by a committee of college and university examiners. Handbook 1. Cognitive domain</article-title>
          . New York: Addison-Wesley,
          <year>1956</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>J.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Frishkoff</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Eskenazi</surname>
          </string-name>
          .
          <article-title>Automatic question generation for vocabulary assessment</article-title>
          .
          <source>In Proceedings of HLT/EMNLP</source>
          , pages
          <fpage>819</fpage>
          -
          <lpage>826</lpage>
          , Vancouver, Canada,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>M.</given-names>
            <surname>Cubric</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Tosic</surname>
          </string-name>
          .
          <article-title>Towards automatic generation of e-assessment using semantic web technologies</article-title>
          .
          <source>In Proceedings of the 2010 International Computer Assisted Assessment Conference</source>
          , University of Southampton,
          <year>July 2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>C.</given-names>
            <surname>d'Amato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Fanizzi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Esposito</surname>
          </string-name>
          .
          <article-title>A dissimilarity measure for ALC concept descriptions</article-title>
          .
          <source>Proceedings of the 21st Annual ACM Symposium of Applied Computing, SAC2006</source>
          , Dijon, France. ACM.,
          <volume>2</volume>
          :
          <fpage>1695</fpage>
          -
          <lpage>1699</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>B. G.</given-names>
            <surname>Davis</surname>
          </string-name>
          .
          <article-title>Tools for Teaching</article-title>
          . San Francisco, CA: Jossey-Bass
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>R.L.</given-names>
            <surname>Ebel</surname>
          </string-name>
          .
          <article-title>Procedures for the analysis of classroom tests</article-title>
          .
          <source>Educational and Psychological Measurement</source>
          ,
          <volume>14</volume>
          :
          <fpage>352</fpage>
          -
          <lpage>364</lpage>
          ,
          <year>1954</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>T.M.</given-names>
            <surname>Haladyna</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.M.</given-names>
            <surname>Downing</surname>
          </string-name>
          .
          <article-title>How many options is enough for a multiple choice test item?</article-title>
          <source>Educational &amp; Psychological Measurement</source>
          ,
          <volume>53</volume>
          (
          <issue>4</issue>
          ):
          <fpage>999</fpage>
          -
          <lpage>1010</lpage>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>M.</given-names>
            <surname>Heilman</surname>
          </string-name>
          .
          <article-title>Automatic Factual Question Generation from Text</article-title>
          .
          <source>PhD thesis</source>
          , Language Technologies Institute, School of Computer Science, Carnegie Mellon University,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>E.</given-names>
            <surname>Holohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Melia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>McMullen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Pahl</surname>
          </string-name>
          .
          <article-title>The generation of e-learning exercise problems from subject ontologies</article-title>
          .
          <source>In Proceedings of the Sixth IEEE International Conference on Advanced Learning Technologies</source>
          , pages
          <fpage>967</fpage>
          -
          <lpage>969</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>M.</given-names>
            <surname>Horridge</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Bechhofer</surname>
          </string-name>
          .
          <article-title>The OWL API: A Java API for working with OWL 2 ontologies</article-title>
          .
          <source>In Proceedings of the 6th International Workshop on OWL: Experiences and Directions (OWLED)</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>M.</given-names>
            <surname>Horridge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Parsia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>U.</given-names>
            <surname>Sattler</surname>
          </string-name>
          .
          <article-title>The state of biomedical ontologies</article-title>
          .
          <source>In BioOntologies 2011 15th-16th July</source>
          , Vienna Austria,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>A.</given-names>
            <surname>Hoshino</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Nakagawa</surname>
          </string-name>
          .
          <article-title>Real-time multiple choice question generation for language testing: a preliminary study</article-title>
          .
          <source>In Proceedings of the Second Workshop on Building Educational Applications using Natural Language Processing</source>
          , pages
          <fpage>17</fpage>
          -
          <lpage>20</lpage>
          , Ann Arbor, US,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>A.</given-names>
            <surname>Hoshino</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Nakagawa</surname>
          </string-name>
          .
          <article-title>Predicting the difficulty of multiple-choice cloze questions for computer-adaptive testing</article-title>
          .
          <source>In Proceedings of the 11th International Conference on Intelligent Text Processing and Computational Linguistics</source>
          , Romania, March 21-27,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>J.</given-names>
            <surname>Impara</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Plake</surname>
          </string-name>
          .
          <article-title>Teachers' ability to estimate item difficulty: A test of the assumptions in the Angoff standard setting method</article-title>
          .
          <source>Journal of Educational Measurement</source>
          ,
          <volume>35</volume>
          (
          <issue>1</issue>
          ):
          <fpage>69</fpage>
          -
          <lpage>81</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>K.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Turhan</surname>
          </string-name>
          .
          <article-title>A framework for semantic-based similarity measures for ELH-concepts</article-title>
          .
          <source>JELIA 2012</source>
          , pages
          <fpage>307</fpage>
          -
          <lpage>319</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          .
          <article-title>Applications of lexical information for algorithmically composing multiple-choice cloze items</article-title>
          .
          <source>In Proceedings of the Second Workshop on Building Educational Applications using Natural Language Processing</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          , Ann Arbor, US,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <given-names>G.</given-names>
            <surname>Miller</surname>
          </string-name>
          .
          <article-title>WordNet: A lexical database for English</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>38</volume>
          (
          <issue>11</issue>
          ),
          <year>November 1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <given-names>M.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Linn</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Gronlund</surname>
          </string-name>
          .
          <article-title>Measurement and Assessment in Teaching, Tenth Edition</article-title>
          . Pearson,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <given-names>R.</given-names>
            <surname>Mitkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>An Ha</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Karamani</surname>
          </string-name>
          .
          <article-title>A computer-aided environment for generating multiple-choice test items</article-title>
          . Cambridge University Press.
          <source>Natural Language Engineering</source>
          ,
          <volume>12</volume>
          (
          <issue>2</issue>
          ):
          <fpage>177</fpage>
          -
          <lpage>194</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <given-names>A.</given-names>
            <surname>Papasalouros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kotis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Kanaris</surname>
          </string-name>
          .
          <article-title>Automatic generation of multiple-choice questions from domain ontologies</article-title>
          .
          <source>In IADIS e-Learning 2008 conference, Amsterdam</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <given-names>M.</given-names>
            <surname>Paxton</surname>
          </string-name>
          .
          <article-title>A linguistic perspective on multiple choice questioning</article-title>
          .
          <source>Assessment &amp; Evaluation in Higher Education</source>
          ,
          <volume>25</volume>
          (
          <issue>2</issue>
          ):
          <fpage>109</fpage>
          -
          <lpage>119</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <given-names>R.</given-names>
            <surname>Rada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bicknell</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Blettner</surname>
          </string-name>
          .
          <article-title>Development and application of a metric on semantic nets</article-title>
          .
          <source>In IEEE Transaction on Systems, Man, and Cybernetics</source>
          , volume
          <volume>19</volume>
          , pages
          <fpage>17</fpage>
          -
          <lpage>30</lpage>
          ,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <given-names>P.</given-names>
            <surname>Resnik</surname>
          </string-name>
          .
          <article-title>Using information content to evaluate semantic similarity in a taxonomy</article-title>
          .
          <source>In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95)</source>
          , volume
          <volume>1</volume>
          , pages
          <fpage>448</fpage>
          -
          <lpage>453</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <given-names>V.</given-names>
            <surname>Rus</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Graesser</surname>
          </string-name>
          . Workshop report:
          <article-title>The question generation task and evaluation challenge</article-title>
          . Institute for Intelligent Systems, Memphis, TN,
          <source>ISBN: 978-0-615-27428-7</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Sidick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Barrett</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Doverspike</surname>
          </string-name>
          .
          <article-title>Three-alternative multiple-choice tests: An attractive option</article-title>
          .
          <source>Personnel Psychology</source>
          ,
          <volume>47</volume>
          :
          <fpage>829</fpage>
          -
          <lpage>835</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <given-names>M.</given-names>
            <surname>Simon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ercikan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Rousseau</surname>
          </string-name>
          , editors.
          <source>Improving Large Scale Education Assessment: Theory, Issues, and Practice</source>
          . Routledge, New York,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <given-names>D.</given-names>
            <surname>Tsarkov</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Horrocks</surname>
          </string-name>
          .
          <article-title>FaCT++ description logic reasoner: System description</article-title>
          .
          <source>In Proceedings of the 3rd International Joint Conference on Automated Reasoning (IJCAR)</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Palmer</surname>
          </string-name>
          .
          <article-title>Verb semantics and lexical selection</article-title>
          .
          <source>In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL 1994)</source>
          , pages
          <fpage>133</fpage>
          -
          <lpage>138</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          35.
          <string-name>
            <given-names>B.</given-names>
            <surname>Zitko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Stankov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rosic</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Grubisic</surname>
          </string-name>
          .
          <article-title>Dynamic test generation over ontology-based knowledge representation in authoring shell</article-title>
          .
          <source>Expert Systems with Applications: An International Journal</source>
          ,
          <volume>36</volume>
          (
          <issue>4</issue>
          ):
          <fpage>8185</fpage>
          -
          <lpage>8196</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          36.
          <string-name>
            <given-names>K.</given-names>
            <surname>Zoumpatianos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Papasalouros</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Kotis</surname>
          </string-name>
          .
          <article-title>Automated transformation of SWRL rules into multiple-choice questions</article-title>
          .
          <source>In FLAIRS Conference'11</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>