<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Mining Ontologies for Analogy Questions: A Similarity-based Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tahani Alsubait</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bijan Parsia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Uli Sattler</string-name>
          <email>sattlerg@cs.man.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computer Science, The University of Manchester</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper, we propose a new approach to generate analogy questions of the form "A is to B as ... is to ?" from ontologies. Analogy questions are widely used in multiple-choice tests such as SATs and GREs and are used to assess students' higher cognitive abilities. The design, implementation and evaluation of the new approach are presented in this paper. The results show that mining ontologies for such questions is fruitful.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
<p>Learning may be seen as its own reward; however, assessment is usually required
to provide various types of reward and recognition. This notion of assessment is
usually referred to as summative assessment, in contrast to formative assessment,
which mainly provides the feedback necessary to support students in the
learning process.</p>
<p>Assessment items (i.e. questions) can be classified into two widely used
formats: (i) objective (e.g. Multiple Choice Questions (MCQs) or True/False
questions) and (ii) subjective (e.g. essays or short answers). Each family of questions
has its own advantages and disadvantages w.r.t. the different phases of testing (i.e.
setting, taking and marking). On the one hand, objective tests can be used to
assess a broad range of knowledge and yet require less administration time. In
addition, they are scored easily, quickly and objectively, either manually or
automatically, and can be used to provide instant feedback to test takers. On the
other hand, objective questions are hard to prepare and require considerable
time per question [26]. For example, Davis [8] and Lowman [19] pointed out
that even professional test developers cannot prepare more than 3-4 items per
day. In addition to the considerable preparation time, manual construction of
MCQs does not necessarily imply that they are well-constructed. See, for
example, the study carried out by Paxton [23], who analysed a large number of MCQs
and reported that they are often not well-constructed.</p>
      <p>A major challenge in preparing MCQs is the need for good distractors that
should appear as plausible answers to the question for those students who have
not achieved the objective being assessed. At the same time, distractors should
appear as implausible answers for those students who have achieved the objective
[3]. Moreover, a well-written MCQ does not confuse students and yields
scores that can be used to determine the extent to which students
have achieved educational objectives [3, 17].</p>
<p>Many guidelines have been proposed to ensure the effectiveness of distractors;
however, several major issues are still debated, such as the appropriate number
of distractors [10, 23].</p>
<p>Before the effectiveness of MCQs is discussed further, the difficulty of
evaluating such questions should be mentioned. The difficulty of evaluation lies, among
other things, in the need to administer those questions to real students in
normal settings and to analyse their grades according to well-defined procedures.
For example, one can follow the procedures described in Item Response
Theory (IRT) [16, 21, 20], a theory that explains the statistical behaviour of
good/bad questions. According to IRT, good test questions have the following
three characteristics: (i) they prevent students from guessing the correct answer, (ii)
they discriminate properly between good and poor students and (iii)
different questions in the test have different difficulties.</p>
<p>In addition to the above-mentioned characteristics of good questions, the need
for questions that assess different levels of learning objectives should also be
mentioned. For example, a test developer might be interested in knowing
the level to which a student has achieved a learning objective (ranging from the
ability to recall information to the ability to analyse and judge the provided
information) [2]. Note that questions that address a specific level of learning
objectives can be of different difficulties for a specific set of students. Note also
that questions that assess lower-level objectives are not necessarily questions of
lower quality, as long as they meet the determined learning objectives [11].</p>
<p>Given the considerable time and effort required to develop MCQs, we
propose to automate the generation of these questions by using an ontology-based
approach. Our motivation to use OWL ontologies in particular is their precise
semantics, available reasoning services and the considerable efforts put into their
development. One of the promises of representing knowledge in such ontologies
is that it can be used for different applications. In this paper, we investigate the
potential benefit of ontology-based question generation.</p>
<p>Recently, a handful of studies [12, 13, 22, 29, 30, 1] explored the generation of
MCQs over ontologies. A brief overview of these approaches is provided in Section
5. A general critique of these approaches is their lack of backing by pedagogic theory,
which we try to overcome in this paper. Moreover, most of these approaches
generate questions of the type "What is X?" or "Which of the following is an
example of X?" based on class-subclass and/or class-individual relationships.
Questions of this type can only assess lower levels of learning objectives [2].
Therefore, it is crucial to design approaches capable of generating questions of
other types.</p>
<p>In this paper, we present a new approach for generating multiple-choice
analogy questions from ontologies. Such questions aim to assess the analogical
reasoning ability of students (i.e. the ability to determine the underlying relation
between a pair of concepts and to identify a similar pair that has the same
underlying relation). We also describe the notion of relational similarity and how
to use it to control the difficulty of the generated questions. In addition, we
report on some experiments carried out to evaluate the new approach using a
large corpus.</p>
    </sec>
    <sec id="sec-2">
      <title>Preliminaries</title>
<p>To understand the procedure required to generate MCQs, we first present a
simple, yet general, definition of MCQs in what follows.</p>
<p>Definition 1. A multiple-choice question MCQ is a tool that can be used to
evaluate whether (or not) our students have achieved a certain learning objective.
It consists of the following parts:
- A statement S that introduces a problem to the student (i.e. stem).
- A number of functional options A = {Ai | 2 ≤ i ≤ max} that can be further
divided into two sets:
1. A number of correct options K = {Km | 1 ≤ m ≤ i} (i.e. key)
2. A number of incorrect options D = {Dn | n := i − m} (i.e. distractors).</p>
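      <p>For concreteness, the following minimal sketch (in Python) encodes this structure; the class and field names are ours and purely illustrative:</p>
      <preformat>
from dataclasses import dataclass
from typing import List

@dataclass
class MCQ:
    """A multiple-choice question following Definition 1 (illustrative)."""
    stem: str                # statement S introducing the problem
    keys: List[str]          # correct options K
    distractors: List[str]   # incorrect options D

    def options(self) -> List[str]:
        """All functional options A, i.e. keys followed by distractors."""
        return self.keys + self.distractors
      </preformat>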
<p>To generate good MCQs we need a psychologically plausible theory that
guides us in the generation. In this paper, we propose to use the notion of
similarity to control the characteristics of the generated questions. For example,
consider a question that has a stem that is similar to the key and different from
the distractors. We would expect that the students will find this question to
be an easy one (assuming that they notice the clues provided with the correct
answer). Similarly, we would expect the question to be difficult if the stem is
very similar to one (or all) of the distractors and different from the key.</p>
<p>There are at least two major types of similarity. In addition, similarity is
different from the general notion of relatedness. For example, we say that cars and
fuel are closely related, whereas cars and bicycles are closely similar.
This notion of similarity is usually referred to as semantic similarity. A number
of measures have been proposed to measure semantic similarity between
concepts; see for example [24, 25, 18, 15] for general similarity measures and [7]
for a semantic similarity measure that was designed for DL ontologies. Another
important type of similarity is relational similarity [28, 27], which corresponds to
similarities in the underlying relations. For example, food is to body as fuel is to
car. When two pairs of concepts have a strong relational similarity, we say that
they are analogous. In analogical reasoning, we compare two different types of
objects and identify points of resemblance.</p>
<p>Different types of similarity can play different roles in question generation.
For example, semantic similarity can be used to generate plausible distractors for
simple recall questions of the form "What is X?". Also, controlling the degree of
similarity between the stem, key and distractors allows us to generate questions
of different difficulties. Along similar lines, relational similarity plays a major
role in generating questions that assess higher cognitive abilities. As an example
of such questions that can be generated using our proposed similarity-based
approach, we consider analogy questions of the form "X is to Y as:". The
alternative answers to such questions take the form "Xi is to Yi", where the key is
the pair (Xi, Yi) that has the same underlying relation as the pair (X, Y) in the
stem. See Table 1 for a sample multiple-choice analogy question taken from the
GRE exam. For our purposes, we define analogy questions as follows (a detailed
explanation of the Relatedness and Analogy functions will be provided later):
Definition 2. Let Q be a multiple-choice analogy question with stem S = (X, Y),
key K = (V, W) and a set of distractors D = {Di = (Ai, Bi) | 1 &lt; i ≤ max}. We
assume that Q satisfies the following conditions:</p>
      <p>1. The stem S, the key K and each distractor Di are good (i.e. Relatedness(X, Y) ≥ R, Relatedness(V, W) ≥ R and Relatedness(Ai, Bi) ≥ R).
2. The key K is significantly more analogous to S than the distractors (i.e. Analogy(S, K) ≥ Analogy(S, Di) + τ1).
3. The key K is sufficiently analogous to S (i.e. Analogy(S, K) ≥ τ2).
4. The distractors should be analogous to S to an extent (i.e. Analogy(S, Di) ≥ τ3).
5. Each distractor Di is unique (i.e. Analogy(S, Di) ≠ Analogy(S, Dj) for i ≠ j).</p>
<p>We would like to be able to control the difficulty of the generated questions.
According to Definition 2 and Propositions 1.a, 1.b and 1.c, we can control the
difficulty of Q by increasing or decreasing τ1, τ2 and τ3.</p>
<p>Proposition 1. a. Increasing τ1 decreases the difficulty of Q.
b. Increasing τ2 decreases the difficulty of Q.
c. Decreasing τ3 decreases the difficulty of Q.</p>
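      <p>To make the role of these conditions and thresholds concrete, the following sketch (in Python) checks a candidate question against Definition 2, assuming the Relatedness and Analogy functions are given as callables; all names are illustrative:</p>
      <preformat>
def satisfies_definition_2(stem, key, distractors,
                           relatedness, analogy,
                           R, tau1, tau2, tau3):
    """Check conditions 1-5 of Definition 2 for a candidate question.
    stem, key and each distractor are pairs of concepts; tau1, tau2 and
    tau3 are the difficulty-control thresholds of Proposition 1."""
    pairs = [stem, key] + distractors
    # Condition 1: all pairs are sufficiently related.
    if not all(relatedness(x, y) >= R for (x, y) in pairs):
        return False
    key_score = analogy(stem, key)
    scores = [analogy(stem, d) for d in distractors]
    # Condition 2: the key is significantly more analogous to the stem.
    if not all(key_score >= s + tau1 for s in scores):
        return False
    # Condition 3: the key is sufficiently analogous to the stem.
    if tau2 > key_score:
        return False
    # Condition 4: each distractor is analogous to the stem to an extent.
    if not all(s >= tau3 for s in scores):
        return False
    # Condition 5: distractor analogy scores are pairwise distinct.
    return len(set(scores)) == len(scores)
      </preformat>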
<p>To generate analogy questions, we need to define two functions, Relatedness
and Analogy. A very basic example of the Relatedness function is to
consider concepts that are both referenced in one (or more) of the ontological
axioms in the source ontology as sufficiently related concepts (e.g. X ⊑ ∃r.Y →
Relatedness(X, Y) &gt; 0). However, such a syntax-based notion of relatedness is
sensitive to tautologies and therefore cannot be adopted without further
considerations. For simplicity, we currently adopt a simple relatedness notion that
considers a pair of named classes to be sufficiently related if they have one of the
structures in Figure 1. These structures have at most one change of direction in
the path connecting the two nodes and at most two steps in each direction. Other
structures were discarded to avoid overly difficult (and probably confusing)
questions. While in the most general case one should consider pairs of arbitrarily
related classes (e.g. by considering user-defined relations), for current purposes
we only consider class-subclass relations. This simplifies the problem
considerably in several dimensions while still generating a reasonable number of candidate
pairs (as we will see later). In addition, we need to define the function Analogy,
which is the core function for generating multiple-choice analogy questions. This
function is defined as follows:
Definition 3. Let Analogy(x, y) be the function that takes two pairs of concepts
and returns a numerical score for their analogy value according to the equation:</p>
      <p>Analogy(x, y) = (SharedSteps(x, y) / TotalSteps(x, y)) × (SharedDirections(x, y) / TotalDirections(x, y))    (1)</p>
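      <p>A possible reading of equation (1) is sketched below (in Python), assuming each pair is represented by the path connecting its two classes, encoded as a sequence of 'up'/'down' steps; this encoding, and the positional counting of shared steps and directions, are our assumptions rather than the reference implementation:</p>
      <preformat>
def runs(path):
    """Collapse a step sequence into its runs of directions,
    e.g. ['up', 'up', 'down'] becomes ['up', 'down']."""
    out = []
    for step in path:
        if not out or out[-1] != step:
            out.append(step)
    return out

def analogy(path_x, path_y):
    """Equation (1): the fraction of shared steps times the fraction of
    shared directions between the paths of the two pairs."""
    total_steps = len(path_x) + len(path_y)
    shared_steps = 2 * sum(a == b for a, b in zip(path_x, path_y))
    dirs_x, dirs_y = runs(path_x), runs(path_y)
    total_dirs = len(dirs_x) + len(dirs_y)
    shared_dirs = 2 * sum(a == b for a, b in zip(dirs_x, dirs_y))
    if total_steps == 0 or total_dirs == 0:
        return 0.0
    return (shared_steps / total_steps) * (shared_dirs / total_dirs)

# Two pairs whose classes are connected by identical paths score 1.0:
assert analogy(['up', 'down'], ['up', 'down']) == 1.0
      </preformat>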
    </sec>
    <sec id="sec-3">
      <title>Extracting Analogy Questions from Ontologies</title>
<p>One of the questions that arises here is: how many MCQs can be generated from
a given ontology? To answer this question, we need to first determine what parts
of the ontology (i.e. classes, individuals, properties, and annotations) will be
considered in the generation process. Secondly, we need to determine whether
(or not) a filtering mechanism is used to differentiate between good and bad
questions and to generate only those questions that are expected to be good.
As an example, the following equation (2) can be used to count the number
of possible multiple-choice questions of the form "What is [class name]?" with
one key and three distractors (all are class names), assuming that no filtering
mechanism is used (n is the number of classes in the given ontology and Ti is
the number of correct answers (i.e. super-classes) for class i):
∑_{i=1}^{n} C(Ti, 1) × C(n − 1 − Ti, 3)    (2)</p>
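      <p>Equation (2) is easy to sanity-check programmatically; the sketch below (in Python) computes the count for a toy ontology (math.comb is the binomial coefficient):</p>
      <preformat>
from math import comb

def mcq_count(superclass_counts):
    """Equation (2): the number of possible 'What is X?' questions with one
    key and three distractors. superclass_counts[i] is Ti, the number of
    super-classes (correct answers) of class i; n is the number of classes."""
    n = len(superclass_counts)
    return sum(comb(t, 1) * comb(n - 1 - t, 3) for t in superclass_counts)

# e.g. 10 classes with 2 super-classes each: 10 * 2 * C(7, 3) = 700 questions
assert mcq_count([2] * 10) == 700
      </preformat>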
      <p>Needless to say, the number of questions increases rapidly as n grows (see Figure
2 for some examples). Also, it reaches its maximum value when Ti equals (n − 1)/4
(i.e. the ratio of correct answers to wrong answers is 1:3). The number of possible
questions that can be generated from a given ontology can further be increased
if we consider other parts of the ontology (e.g. individuals, properties). Having
said this, we should also mention that generating a large number of questions
is not desirable unless the generated questions are all expected to be good. A
similar analysis of the number of possible analogy questions is part of future
work. In what follows, we provide an algorithm (see Algorithm 1) that can be
used to generate multiple choice analogy questions from a given ontology O.
The algorithm is founded on the premise that varying the relational similarity
(i.e. the analogy degree) between the stem, the key and distractors allows us
to control the difficulty of the generated questions. This can be achieved by
setting the parameters τ1, τ2 and τ3 to different values. The proposed approach
consists of two phases: (i) extraction of interesting pairs of concepts, which can be
determined using the proposed Relatedness function; those pairs can be used as
stems, keys or distractors; and (ii) generation of multiple-choice questions based
on the similarity between pairs, which can be derived from the proposed Analogy
function (a sketch of both phases is given below). Note that this approach can be
generalized to generate other types of questions, such as finding antonyms or the odd one out.</p>
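      <p>The sketch below (in Python) outlines the two phases; the exhaustive enumeration is illustrative of the idea rather than of how Algorithm 1 is actually stated, and relatedness and analogy are assumed to accept whatever representation of pairs is used (e.g. the path encoding sketched above):</p>
      <preformat>
from itertools import combinations

def generate_questions(pairs, relatedness, analogy,
                       R, tau1, tau2, tau3, n_distractors=3):
    """Two-phase generation of analogy questions from candidate class pairs."""
    # Phase (i): keep only sufficiently related pairs; these serve as
    # candidate stems, keys and distractors.
    good = [p for p in pairs if relatedness(*p) >= R]
    questions = []
    # Phase (ii): assemble questions whose analogy scores satisfy Definition 2.
    for stem in good:
        for key in good:
            if key == stem or tau2 > analogy(stem, key):
                continue  # condition 3: key sufficiently analogous to stem
            candidates = [d for d in good
                          if d not in (stem, key)
                          and analogy(stem, d) >= tau3              # condition 4
                          and analogy(stem, key) >= analogy(stem, d) + tau1]  # condition 2
            for combo in combinations(candidates, n_distractors):
                scores = [analogy(stem, d) for d in combo]
                if len(set(scores)) == len(scores):  # condition 5: unique distractors
                    questions.append((stem, key, list(combo)))
    return questions
      </preformat>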
    </sec>
    <sec id="sec-4">
      <title>Empirical Evaluation</title>
<p>To evaluate the proposed approach, we implemented a question generation
engine that utilizes Algorithm 1 and used the implemented engine to generate
analogy questions from three ontologies (one specialized ontology and two
tutorial-based ontologies). The three ontologies are presented in Table 2 below with
some basic ontology statistics. The first ontology is the Gene Ontology, which is
a structured vocabulary for the annotation of gene products. It has three main
parts: (i) molecular function, (ii) cellular component and (iii) biological process. The
two other ontologies are the People &amp; Pets Ontology and the Pizza Ontology,
which are very simple ontologies that are usually used in ontology development
tutorials. The table shows the number of classes in each ontology and the
number of sample questions generated by the engine (this is only a representative
sample of all the possible questions). The table also shows the percentage of
questions that our proposed solver agent can solve correctly. The details of the
approach used to simulate question solving are explained in what follows. Other
ontologies can be used as input for our implemented question generation engine;
however, we tried to avoid ontologies that use difficult-to-read labels (e.g. labels
that have no spaces between words).</p>
<p>Table 2. Ontologies used to generate analogy questions along with basic statistics</p>
      <p>Ontology        No. of classes  No. of questions  % correct
Gene Ontology   36146           25                8%
People &amp; Pets   58              15                67%
Pizza Ontology  97              16                88%</p>
<p>In order to evaluate the proposed similarity-based approach defined in
Algorithm 1, we need at least to simulate students solving the generated
questions and check whether (or not) the proposed approach can be used to
successfully control the difficulty of the questions. To do this, we follow the method
described by Turney &amp; Littman [28, 27] for evaluating analogies using a large
corpus.</p>
<p>In their study, Turney &amp; Littman reported that their method can solve about
47% of multiple-choice analogy questions (compared to an average of 57%
correct answers achieved by high school students). The solver takes a pair of words
representing the stem of the question and 5 other pairs representing the answers
presented to students. Their proposed method is inspired by the Vector Space
Model (VSM) of information retrieval. For each provided answer, the solver
creates two vectors representing the stem (R1) and the given answer (R2). The
solver returns a numerical value for the degree of analogy between the stem and
the given answer. Then, the answers are ranked according to their analogy value
and the answer with the highest rank is considered the correct answer. To create
the vectors, they proposed a table of 64 joining terms that can be used to join
the two words in each pair (stem or answer). The two words are joined by these
joining terms in two different ways (e.g. "X is Y" and "Y is X") to create a
vector of 128 features. The actual values stored in each vector are calculated
by counting the frequencies of the constructed terms in a large corpus (e.g.
web resources indexed by a search engine). To improve the accuracy of their
proposed method, they suggested using the logarithm of the frequency instead
of the frequency itself.</p>
<p>In this paper, we follow a similar procedure to evaluate the difficulty of our
generated MCQs. First, we constructed a table of joining terms relevant to the
relations considered in our approach (e.g. "is a", "type", "and", "or"). Based on
these joining terms, we create vectors of 10 features for the stem, the key and each
distractor. The constructed terms are sent as a query to a search engine (Yahoo!)
and the logarithm of the hit count is stored in the corresponding element of the
vector. The hit count is always incremented by one to avoid getting undefined
values. Following this procedure, our proposed solver agent solved 8% of the
questions generated from the Gene Ontology, 67% of the questions generated
from the People &amp; Pets Ontology and 88% of the questions generated from
the Pizza Ontology. We argue that the low score on the Gene Ontology is caused
by its specific terminology and the lack of web resources carrying information
about it compared to the other two ontologies.</p>
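      <p>The solving procedure can be sketched as follows (in Python); hit_count is a stub standing in for the search-engine query, the joining terms shown are the examples listed above, and the use of cosine similarity for the comparison step is our assumption:</p>
      <preformat>
from math import log, sqrt

JOINING_TERMS = ["is a", "type", "and", "or"]  # examples; the full table yields 10 features

def hit_count(phrase):
    """Stub: a real implementation would send `phrase` as a query to a web
    search engine (we used Yahoo!) and return the number of hits."""
    raise NotImplementedError

def features(pair):
    """log(hits + 1) for each joining term, with the two words joined in
    both orders ("X t Y" and "Y t X")."""
    x, y = pair
    phrases = [f'"{x} {t} {y}"' for t in JOINING_TERMS]
    phrases += [f'"{y} {t} {x}"' for t in JOINING_TERMS]
    return [log(hit_count(p) + 1) for p in phrases]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def solve(stem, answers):
    """Rank the answers by the similarity of their vectors to the stem's
    vector and return the highest-ranked one."""
    r1 = features(stem)
    return max(answers, key=lambda answer: cosine(r1, features(answer)))
      </preformat>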
<p>Examples of the questions that were generated using our proposed approach
are presented in Tables 3 &amp; 4. Those questions were generated from the
People &amp; Pets Ontology and the Pizza Ontology respectively. Moreover, we varied the
difficulty-control parameters to generate different sets of questions (i.e. questions
of different difficulties) from the two tutorial ontologies. The results (see Table
5) show that the proposed parameters sufficiently controlled the difficulty of the
generated questions.</p>
    </sec>
    <sec id="sec-5">
      <title>Related Work</title>
<p>Chung, Niemi, and Bewley (2003) [4] developed the Assessment Design and
Delivery System (ADDS). The purpose of ADDS is to assist non-expert physics
teachers in designing appropriate assessments by constraining the design process
with structure-based and cognitive-based rules derived from an ontology that was
specifically designed for the system. In addition, ADDS's domain ontology has
links to a set of reusable assessment tasks or components of tasks (i.e. text,
graphics, multimedia) along with information to guide teachers' practice.</p>
<p>Holohan et al. (2005) [12] described the OntAWare system, which is an
ontology-based authoring environment for learning content. It employs an ontology graph
traversal algorithm that generates MCQs of the form "Which of the following
items is (or is not) an example of the concept X?". The alternative answers are
generated randomly, and the question as a whole can be exported to external
systems that conform to the IMS/QTI [14] standard. One of the central problems
in OntAWare, other than the highly constrained forms of questions, is that the
ontology graph transformations employed in the system are hardcoded (in Java)
to incorporate implicit instructional strategies; therefore, their approach is
not readily generalizable for adoption in other systems. They extended their
work in Holohan et al. (2006) [13] by focusing on the generation of SQL exercise
problems for database students using domain-dependent algorithms.</p>
<p>Stankov and Zitko (2008) [29] proposed templates and algorithms for the
automatic generation of objective questions (i.e. MCQs, T/F) over ontologies. The
focus of their work was to extend the functionality of a previously implemented
tutoring system (Tex-Sys) by concentrating on the assessment component. The
proposed methodology generates a set of random alternative answers for each
MCQ without an attempt to filter them according to their pedagogical
appropriateness.</p>
<p>Papasalouros et al. (2008) [22] presented various ontology-based strategies
for the automatic generation of MCQs. These strategies are used for selecting the
correct and wrong (distracting) answers of the questions. The answers are later
transformed into English sentences using simple natural language generation
techniques. The evaluation of the produced questions by domain experts shows
that the questions are satisfactory for assessment but not all of them are
syntactically correct. The major problem with this approach is the use of highly
constrained rules with no theoretical backing that would motivate the selection of
these rules. For example, the distractors in each MCQ are mainly picked from
the set of siblings of the correct answer, while there might be other plausible
distractors.</p>
<p>Cubric and Tosic (2009) [5] reported their experience in implementing a
Protege plugin for question generation based on the strategies proposed by
Papasalouros et al. (2008) [22]. More recently, Cubric and Tosic (2010) [6] extended
their previous work by considering new ontology elements (i.e. annotations). In
addition, they suggested employing question templates to avoid syntactic
problems in the automatically generated questions. This also enables the generation
of questions at different levels of Bloom's taxonomy [2].</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and Future Work</title>
<p>A handful of studies have already proposed approaches to generate MCQs
over ontologies; however, little has been done on the theoretical and evaluation
aspects. In this paper, we propose a new approach to generate multiple-choice
analogy questions from ontologies. The paper describes the foundations of the
proposed approach from a psychological point of view. In addition, the paper
reports on some experiments carried out to evaluate the proposed approach. The
results show that mining ontologies for analogy questions in particular, and for
assessment questions in general, is fruitful. Moreover, the results show that the
proposed approach can be used to control the difficulty of the generated
questions.</p>
<p>For future work, we aim to generalize the proposed approach for generating
analogies and consider arbitrary relations found in existing ontologies (i.e.
user-defined relations instead of only class-superclass relations). To evaluate such
analogies, we suggest using Latent Relational Analysis (LRA) [27], which has
the ability to learn relations instead of using predefined joining terms.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>1. M. Al-Yahya. OntoQue: A question generation engine for educational assessment based on domain ontologies. In 11th IEEE International Conference on Advanced Learning Technologies, 2011.
2. B. S. Bloom and D. R. Krathwohl. Taxonomy of educational objectives: The classification of educational goals by a committee of college and university examiners. Handbook 1: Cognitive domain. New York: Addison-Wesley, 1956.
3. S. Burton, R. Sudweeks, P. Merrill, and B. Wood. How to prepare better multiple-choice test items: Guidelines for university faculty. Brigham Young University Testing Services and the Department of Instructional Science. Retrieved November 22, 2011, from http://testing.byu.edu/info/handbooks/betteritems.pdf, 1991.
4. G. Chung, D. Niemi, and W. L. Bewley. Assessment applications of ontologies. Paper presented at the Annual Meeting of the American Educational Research Association, 2003.
5. M. Cubric and M. Tosic. SeMCQ: Protege plugin for automatic ontology-driven multiple choice question tests generation. In 11th Intl. Protege Conference, Poster and Demo Session, 2009.
6. M. Cubric and M. Tosic. Towards automatic generation of e-assessment using semantic web technologies. In Proceedings of the 2010 International Computer Assisted Assessment Conference, University of Southampton, July 2010.
7. C. d'Amato, S. Staab, and N. Fanizzi. On the influence of description logics ontologies on conceptual similarity. In EKAW '08: Proceedings of the 16th International Conference on Knowledge Engineering: Practice and Patterns, 2008.
8. B. B. Davis. Tools for Teaching. San Francisco, CA: Jossey-Bass, 2001.
9. GRE Sample Questions. Best sample questions. Retrieved March 10, 2012, from http://www.bestsamplequestions.com/gre-questions/analogies/.
10. T. M. Haladyna and S. M. Downing. How many options is enough for a multiple-choice test item? Educational &amp; Psychological Measurement, 53(4):999-1010, 1993.
11. M. Höfler, M. Al-Smadi, and C. Gütl. Investigating content quality of automatically and manually generated questions to support self-directed learning. In Whitelock, D., Warburton, W., Wills, G., and Gilbert, L. (Eds.), CAA 2011 International Computer Assisted Assessment Conference, University of Southampton, 2011.
12. E. Holohan et al. Adaptive e-learning content generation based on semantic web technology. In Proceedings of the Workshop on Applications of Semantic Web Technologies for E-Learning, pages 29-36, Amsterdam, The Netherlands, 2005.
13. E. Holohan et al. The generation of e-learning exercise problems from subject ontologies. In Proceedings of the Sixth IEEE International Conference on Advanced Learning Technologies, pages 967-969, 2006.
14. IMS. IMS Question &amp; Test Interoperability: ASI best practice &amp; implementation guide. Final specification version 1.2. IMS Global Learning Consortium Inc., June 2002.
15. J. Jiang and D. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the 10th International Conference on Research on Computational Linguistics, Taiwan, 1997.
16. J. Kehoe. Basic item analysis for multiple-choice tests. Practical Assessment, Research &amp; Evaluation, 4(10), 1995.
17. K. King, D. Gardner, S. Zucker, and M. Jorgensen. The distractor rationale taxonomy: Enhancing multiple-choice items in reading and mathematics. Assessment Report. Pearson, July 2004.
18. D. Lin. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, pages 296-304, San Francisco, CA, 1998. Morgan Kaufmann.
19. J. Lowman. Mastering the Techniques of Teaching (2nd ed.). San Francisco: Jossey-Bass, 1995.
20. M. Miller, R. Linn, and N. Gronlund. Measurement and Assessment in Teaching, Tenth Edition. Pearson, 2008.
21. R. Mitkov, L. An Ha, and N. Karamanis. A computer-aided environment for generating multiple-choice test items. Natural Language Engineering, 12(2):177-194. Cambridge University Press, 2006.
22. A. Papasalouros, K. Kotis, and K. Kanaris. Automatic generation of multiple-choice questions from domain ontologies. In IADIS e-Learning 2008 Conference, Amsterdam, 2008.
23. M. Paxton. A linguistic perspective on multiple choice questioning. Assessment &amp; Evaluation in Higher Education, 25(2):109-119, 2001.
24. R. Rada, H. Mili, E. Bicknell, and M. Blettner. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics, 19:17-30, 1989.
25. P. Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI'95), volume 1, pages 448-453, 1995.
26. J. T. Sidick, G. V. Barrett, and D. Doverspike. Three-alternative multiple-choice tests: An attractive option. Personnel Psychology, 47:829-835, 1994.
27. P. Turney. Measuring semantic similarity by latent relational analysis. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2005.
28. P. Turney and M. Littman. Corpus-based learning of analogies and semantic relations. Machine Learning, 60(1-3):251-278, 2005.
29. B. Zitko, S. Stankov, M. Rosic, and A. Grubisic. Dynamic test generation over ontology-based knowledge representation in authoring shell. Expert Systems with Applications, 36(4):8185-8196, 2008.
30. K. Zoumpatianos, A. Papasalouros, and K. Kotis. Automated transformation of SWRL rules into multiple-choice questions. In FLAIRS Conference, 2011.</p>
    </sec>
  </body>
</article>