<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>ACM</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Generating "Who Wants to Be a Millionaire?" Question Sets Automatically from Wikidata</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Markus Wohlan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yannik Schroder</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frank Hoppner</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ostfalia University of Applied Sciences Dept. of Computer Science</institution>
          ,
          <addr-line>D-38302 Wolfenbuttel</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>57</volume>
      <issue>10</issue>
      <abstract>
        <p>Quiz shows and apps have enjoyed great popularity in recent years, which increases the demand for fresh question sets. We investigate how to derive such sets automatically from the Wikidata knowledge graph. Utilizing the graph structure, node connectivity and diversity, we propose measures to identify appealing wrong answers and to rate the difficulty of a question, which is a prerequisite to compose a full question set. First results align quite well with human perception of the questions.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        While Wikidata has been used on various occasions to answer questions (e.g.
[6,2]), little effort is documented regarding the creation of questions. We
have found only two theses [
        <xref ref-type="bibr" rid="ref1">1,3</xref>
        ], which consider the construction of single
questions for a given topic. Compared to them, our proposal generates questions
faster, combines multiple questions into a diverse question set with a varying degree
of difficulty, and puts greater effort into more plausible wrong answers.
The Wikidata Toolkit [4] was used to access the JSON dump (gzip, 50 GB). To
preserve a realistic chance of answering the generated questions, we removed
items of certain types (such as named proteins, specific events in a series,
scientific articles, items without labels, etc.) as well as properties that were too
specific (such as internal Wikipedia categories or global coordinates), as in [
        <xref ref-type="bibr" rid="ref1">1,3</xref>
        ].
The filtering process was a trade-off between a small memory footprint
(eliminating as much unnecessary information as possible to keep the filtered graph
in main memory using a modified graph library [5]) and preserving the graph's
connectivity (node connections will be used for the evaluation of questions). We
finally settled on a hand-crafted selection of about 250 properties (out of approx.
1100 supported Wikidata item properties).
      </p>
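      <p>The filtering step described above can be sketched as follows. This is a minimal illustration of the idea only: the excluded types, the property whitelist, and the JSON layout are simplified assumptions, not the paper's actual 250-property selection or the exact Wikidata dump format.</p>

```python
# Sketch of the dump-filtering trade-off: drop unanswerable item types
# and keep only a whitelist of properties so the graph fits in memory.
# The concrete ID sets below are illustrative assumptions.
EXCLUDED_TYPES = {"Q8054", "Q13442814"}          # e.g. protein, scholarly article
ALLOWED_PROPERTIES = {"P31", "P19", "P36", "P50"}  # instance of, birthplace, capital, author

def keep_item(item: dict) -> bool:
    """Drop items without an (English) label or of an excluded type."""
    if "en" not in item.get("labels", {}):
        return False
    types = {c["value"] for c in item.get("claims", {}).get("P31", [])}
    return not (types & EXCLUDED_TYPES)

def filter_claims(item: dict) -> dict:
    """Keep only whitelisted properties to shrink the memory footprint."""
    claims = item.get("claims", {})
    return {p: v for p, v in claims.items() if p in ALLOWED_PROPERTIES}
```

Overly aggressive whitelisting would disconnect the graph, so in practice the selection has to balance footprint against the node connectivity needed later for question evaluation.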
    </sec>
    <sec id="sec-2">
      <title>Question Set Generation</title>
      <p>Question Template. The underlying idea is to generate questions from
templates T = (q, p), where q is a query (e.g. a SPARQL query) and p a phrase
with placeholders completed by the query result. The query in Fig. 1 may be
used together with the phrase "What is ⟨predicate⟩ of ⟨subject⟩? (a) ⟨object⟩ (b)
⟨candidate⟩ ...", where placeholders such as ⟨category⟩ have to be replaced by
the corresponding variable of the query (?category). For instance: "What is the
place of birth of Angela Merkel? (a) Hamburg (b) Dresden (c) ...". The example
query, however, delivers only a single candidate for wrong answers (we usually
need three) and does not deliver additional information about the connectivity
of the resulting nodes. We therefore did not use a SPARQL engine, but a (filtered)
graph in main memory (as described in the previous section) to speed up the
querying for multiple answers as well as the collection of additional connectivity
information. Nevertheless, the SPARQL query language illustrates the idea quite well
and demonstrates that other patterns may be employed easily. To avoid trivial
questions, we accept a question only if the subject does not already contain the
correct answer (reject "Who organizes the FIFA World Cup?"). For the same
reason, candidate answers equal to the subject are eliminated, too ("Germany
shares border with Germany").</p>
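      <p>The template mechanism can be sketched as follows. The class and function names are illustrative assumptions; only the phrase pattern and the triviality check come from the description above.</p>

```python
from dataclasses import dataclass

# Sketch of a template T = (q, p): a query plus a phrase whose
# placeholders are filled from one query result row.
@dataclass
class Template:
    query: str   # e.g. the SPARQL pattern of Fig. 1
    phrase: str  # phrase with <...> placeholders

def is_trivial(row: dict) -> bool:
    # Reject giveaways like "Who organizes the FIFA World Cup?",
    # where the subject already contains the correct answer.
    return row["object"].lower() in row["subject"].lower()

def instantiate(t: Template, row: dict, wrong: list) -> str:
    text = t.phrase
    for var, value in row.items():
        text = text.replace(f"<{var}>", value)
    # Correct answer first here for clarity; a real quiz would shuffle.
    options = [row["object"]] + wrong
    return text + " " + " ".join(f"({chr(97 + i)}) {o}"
                                 for i, o in enumerate(options))

t = Template(query="...", phrase="What is the <predicate> of <subject>?")
q = instantiate(t, {"subject": "Angela Merkel",
                    "predicate": "place of birth",
                    "object": "Hamburg"},
                ["Dresden", "Leipzig", "Bonn"])
# q: "What is the place of birth of Angela Merkel? (a) Hamburg (b) Dresden (c) Leipzig (d) Bonn"
```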
      <p>Selecting Appealing Wrong Answers. For any graph node n, let I(n) be the set
of incoming edges (statement triples with n being the object) and O(n) the set
of outgoing edges (statement triples with n being the subject). By π_i we denote
the projection of a set of triples to their i-th component (1 = subject, 2 = predicate,
3 = object), and by σ_val we denote the selection of those triples that take a certain
value val as predicate. Suppose the statement (s, p, o) serves as starting point for a
question and c is a (wrong) candidate answer. In order to reduce the number of
candidates being evaluated, a candidate answer c is only taken into consideration
if c shares at least one rdf:type with o.</p>
      <p>[Fig. 1: example query and the corresponding graph pattern (subject &#8211;predicate&#8594; object &#8211;type&#8594; category &#8592;type&#8211; candidate):]</p>
      <preformat>SELECT ?subject ?predicate ?object ?candidate WHERE {
  ?subject ?predicate ?object .
  ?object rdf:type ?category .
  ?candidate rdf:type ?category .
  FILTER NOT EXISTS { ?subject ?predicate ?candidate . }
}</preformat>
      <p>To identify good candidate answers, we
evaluate the similarity of c to the correct answer o and select the three candidates
with the highest similarity. We define the similarity between the correct answer o
and a candidate answer c as sim(o, c) := s(O(o), O(c)), where s(A, B) is a weighted
sum of three overlap fractions.
Thus A = O(o) is the set of triples (statements) in which the correct answer is the
subject (e.g. (Hamburg, instance-of, Hanseatic City)) and likewise in B = O(c)
the candidate answer is the subject. The first term of s(A, B) identifies the
fraction of properties shared with the correct answer (Hamburg and Dresden
share, e.g., the population property), the second term considers the fraction of
shared properties including their value (Hamburg and Dresden are located in
time zone UTC+01:00), and the third term the fraction of shared types only
(e.g. Hamburg and Dresden are both instances of Big City). The second term
is the most specific (reflected by a higher coefficient); the first and third are more
general and become useful with less-connected nodes.</p>
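      <p>The three overlap terms can be sketched as Jaccard-style fractions over projections of the outgoing triples. The weight 2 on the middle term is an assumption motivated by the "higher coefficient" remark, not the paper's exact formula, and the "instance of" label stands in for rdf:type.</p>

```python
# Sketch of the answer-similarity measure sim(o, c) = s(O(o), O(c)):
# shared properties + 2 * shared (property, value) pairs + shared types.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def sim(O_o: set, O_c: set) -> float:
    """O_o, O_c: outgoing triples (subject, predicate, object) of the
    correct answer o and of the candidate c."""
    props = jaccard({p for _, p, _ in O_o}, {p for _, p, _ in O_c})
    pairs = jaccard({(p, v) for _, p, v in O_o},
                    {(p, v) for _, p, v in O_c})
    types = jaccard({v for _, p, v in O_o if p == "instance of"},
                    {v for _, p, v in O_c if p == "instance of"})
    return props + 2 * pairs + types  # assumed weights 1, 2, 1
```

Only candidates sharing at least one rdf:type with the correct answer would be scored this way; the three highest-scoring ones become the wrong answers.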
      <p>Difficulty Ranking. Our idea for a difficulty ranking is based on the assumption
that the difficulty of a question decreases the more familiar the contestant is with
the question's components (s, p, o). To put this idea into practice, we introduce
two different measures for nodes and one for properties. We measure the
connectivity of a node n by C(n) = |O(n)| + 2|I(n)|, putting a higher emphasis on
incoming edges, as they are a better indicator for a well-known node within the
graph. Secondly, we define a measure of homogeneity H(n) = |O(n)| / |π_2(O(n))|, which
becomes larger the more outgoing edges share the same label. The rationale is
that, say, athletes who won the same trophy several times receive higher
attention in the news and are thus better known than those who have won a trophy
only once. All nodes may now be ranked according to C(n) and H(n), and we
intend to use the ranks r(C(n)) and r(H(n)) as ingredients for the final difficulty
score. (The highest C(n) value receives a rank of r(C(n)) = 1, the lowest value
a rank of N, with N being the number of nodes.) However, the distributions are
very skewed; the first 40% of nodes share the same small connectivity value. As
there will be millions of nodes that are barely known to the average contestant,
we focus on the top γ% of ranks and assign a connectivity index CI(n) from the
rank r(C(n)), restricted to this top share (and a homogeneity index HI(n)
accordingly). We combine both indices to a node difficulty DI(n) = α·CI(n) + β·HI(n).
Finally, the popularity of a property p may be evaluated by
its popularity index PI(p), which is simply the relative frequency of property
p among all edges in the graph. The more frequently a property is used, the
more likely the contestant is familiar with it. The difficulty of a question that
was derived from a triple (s, p, o) is then assessed by combining the difficulties
of both nodes s and o as well as the property p:</p>
      <p>D( (s, p, o) ) = ( 2·DI(s) + PI(p) + DI(o) ) / 4
The subject receives a higher weight as it is most specific for the question.
Different choices for the parameters α, β and γ will be evaluated in section 5.
Diversified Question Set. The algorithm generates a large number of questions
by picking a random node as starting point and then creating a question as
described above. The degree of difficulty is evaluated, and three lists of easy,
medium and difficult questions are maintained. A final set of n diverse questions
is generated by picking n/3 questions from each list, while taking care that a newly
selected question uses a different property and subject.</p>
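      <p>The scoring and set-assembly steps can be sketched as follows. The connectivity and difficulty formulas follow the definitions above; the rank-to-index mapping (saturating beyond the top γ share), the bucket layout, and the example tuples are illustrative assumptions.</p>

```python
# Sketch of the difficulty ingredients and diversified set assembly.
def connectivity(I_n: list, O_n: list) -> int:
    # C(n) = |O(n)| + 2|I(n)|: incoming edges weigh double
    return len(O_n) + 2 * len(I_n)

def homogeneity(O_n: list) -> float:
    # H(n) = |O(n)| / |pi_2(O(n))|: grows when many outgoing
    # edges share the same label (predicate)
    labels = {p for _, p, _ in O_n}
    return len(O_n) / len(labels) if labels else 0.0

def index_from_rank(rank: int, N: int, gamma: float) -> float:
    # Assumed mapping: scale ranks within the top gamma share to [0, 1],
    # saturating at 1 for all less-known nodes beyond it.
    return min(rank / (gamma * N), 1.0)

def difficulty(DI_s: float, PI_p: float, DI_o: float) -> float:
    # D((s, p, o)) = (2*DI(s) + PI(p) + DI(o)) / 4
    return (2 * DI_s + PI_p + DI_o) / 4

def diverse_set(buckets: dict, n: int) -> list:
    """Pick n/3 questions from each of the easy/medium/hard lists,
    never reusing a subject or a property."""
    chosen, used_subjects, used_props = [], set(), set()
    for level in ("easy", "medium", "hard"):
        count = 0
        for s, p, o, d in buckets[level]:
            if count == n // 3:
                break
            if s in used_subjects or p in used_props:
                continue  # enforce diversity across the whole set
            chosen.append((s, p, o, d))
            used_subjects.add(s)
            used_props.add(p)
            count += 1
    return chosen
```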
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>Performance. The final graph consisted of 11M nodes and 57M edges. Its
serialization to disk occupied 1.2 GB; restoring the graph from disk takes 100 seconds
(Intel i7-8700K, 32 GB). The program may generate thousands of questions per
minute, but the time to generate a single question depends on the number of
candidate answers: for every candidate answer we apply the similarity measure,
which is cumbersome for persons, as there are 3.6M persons in the graph
(calculations then take up to 15 s). There are ways to concentrate quickly on the most
relevant nodes and prune the search, but they have not yet been elaborated.
Candidate Answers. Table 1 shows a few questions that were automatically
generated by the system. For instance, question (c) asks for a title from the
Lord of the Rings trilogy. If we were using the instance-of relationship only,
any book title would do as a candidate answer. But the selected wrong answers
consist of other book titles by the same author (Tolkien), although the author
is not part of the question. Similarly, question (p) asks for the inventor of the
special theory of relativity, and the candidate answers include other well-known
German physicists of that time.</p>
      <p>Degree of Difficulty. To test our proposals for the degree of difficulty, we selected
18 automatically generated questions and asked 139 people (of whom roughly
60% were computer science students) to assess the degree of difficulty on three
levels (1: easy, 2: medium, 3: hard). Fig. 1 shows an excerpt together with the
difficulty ranking obtained from averaging the responses.</p>
      <p>[Figure: excerpt of the questions together with the empirically assessed difficulty rank (AR); questions are ranked from 1 (easiest) to 18 (hardest) using γ = 0.09, resulting in a correlation value of 0.48.]</p>
      <p>We experimented with
parameter values α ∈ [1, 4], β ∈ [0, 3] and γ ∈ [0.02, 0.2] (224 configurations)
and compared the ranking obtained from our difficulty measure with the empirical
ranking via Spearman correlation. Regarding γ, the best results were obtained for
values close to 0.1. For α = 3, β = 1 we obtained correlations of 0.5 (maximum)
and 0.347 (on average). A closer inspection revealed that this score is mainly due
to two questions (j, p), where especially the computer science students found the
questions much easier than the other participants. When excluding only these two
questions, the Spearman correlation coefficient rises to 0.9 (resp. 0.72 on average).</p>
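      <p>The rank comparison reported above is standard Spearman correlation; for tie-free rankings of the 18 questions it reduces to the following closed form (a generic sketch, not the authors' code).</p>

```python
# Spearman rank correlation for two tie-free rankings of the same items:
# rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), with d_i the rank differences.
def spearman(predicted: list, empirical: list) -> float:
    n = len(predicted)
    d2 = sum((a - b) ** 2 for a, b in zip(predicted, empirical))
    return 1 - 6 * d2 / (n * (n * n - 1))
```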
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>We have presented an approach to automatically generate question sets for
quizzes. The selected wrong answers mimic hand-crafted options very well. First
experiments on evaluating the difficulty score show promising results for many
questions, but also occasional misjudgements with respect to the empirically
assessed difficulty. One has to keep in mind, however, that the whole approach
assumes a correlation of the Wikidata content with the general knowledge of the
contestants (which will not hold for an arbitrary audience).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. <string-name><given-names>F.</given-names> <surname>Bissig</surname></string-name>. <source>Drawing questions from Wikidata</source>. Bachelor's Thesis, ETH Zurich, <year>2015</year>.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. <string-name><given-names>D.</given-names> <surname>Diefenbach</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Singh</surname></string-name>, and <string-name><given-names>P.</given-names> <surname>Maret</surname></string-name>. WDAqua-core0: <article-title>A question answering component for the research community</article-title>. In <source>Semantic Web Evaluation Challenge</source>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>