<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>ACM</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Generating "Who Wants to Be a Millionaire?" Question Sets Automatically from Wikidata</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Markus Wohlan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yannik Schroder</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frank Hoppner</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ostfalia University of Applied Sciences Dept. of Computer Science</institution>
          ,
          <addr-line>D-38302 Wolfenbuttel</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>57</volume>
      <issue>10</issue>
      <abstract>
        <p>Quiz shows and apps have enjoyed great popularity in recent years, which increases the demand for fresh question sets. We investigate how to derive such sets automatically from the Wikidata knowledge graph. Utilizing the graph structure, node connectivity and diversity, we propose measures to identify appealing wrong answers and to rate the difficulty of a question, which is a prerequisite to compose a full question set. First results align quite well with human perception of the questions.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        While Wikidata has been used on various occasions to answer questions (e.g.
[6,2]), little effort is documented regarding the creation of questions. We
have found only two theses [
        <xref ref-type="bibr" rid="ref1">1,3</xref>
        ], which consider the construction of single
questions for a given topic. Compared to them, our proposal generates questions
faster, combines multiple questions into a diverse question set with a varying degree
of difficulty, and puts greater effort into more plausible wrong answers.
The Wikidata Toolkit [4] was used to access the JSON dump (gzip, 50 GB). To
preserve a realistic chance of answering the generated questions, we removed
items of certain types (such as named proteins, specific events in a series,
scientific articles, items without labels, etc.) as well as properties that were too
specific (such as internal Wikipedia categories or global coordinates), as in [
        <xref ref-type="bibr" rid="ref1">1,3</xref>
        ].
The filtering process was a trade-off between a small memory footprint
(eliminating as much unnecessary information as possible to keep the filtered graph
in main memory using a modified graph library [5]) and preserving the graph's
connectivity (node connections will be used for the evaluation of questions). We
finally settled on a hand-crafted selection of about 250 properties (out of approx.
1100 supported Wikidata item properties).
      </p>
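      <p>The filtering step described above can be sketched as follows. This is a minimal illustration of the idea only: the excluded types, the property whitelist, and the JSON layout are simplified assumptions, not the paper's actual 250-property selection or the exact Wikidata dump format.</p>

```python
# Sketch of the dump-filtering trade-off: drop unanswerable item types
# and keep only a whitelist of properties so the graph fits in memory.
# The concrete ID sets below are illustrative assumptions.
EXCLUDED_TYPES = {"Q8054", "Q13442814"}          # e.g. protein, scholarly article
ALLOWED_PROPERTIES = {"P31", "P19", "P36", "P50"}  # instance of, birthplace, capital, author

def keep_item(item: dict) -> bool:
    """Drop items without an (English) label or of an excluded type."""
    if "en" not in item.get("labels", {}):
        return False
    types = {c["value"] for c in item.get("claims", {}).get("P31", [])}
    return not (types & EXCLUDED_TYPES)

def filter_claims(item: dict) -> dict:
    """Keep only whitelisted properties to shrink the memory footprint."""
    claims = item.get("claims", {})
    return {p: v for p, v in claims.items() if p in ALLOWED_PROPERTIES}
```

Overly aggressive whitelisting would disconnect the graph, so in practice the selection has to balance footprint against the node connectivity needed later for question evaluation.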
    </sec>
    <sec id="sec-2">
      <title>Question Set Generation</title>
      <p>Question Template. The underlying idea is to generate questions from
templates T = (q, p), where q is a query (e.g. a SPARQL query) and p a phrase
with placeholders completed by the query result. The query in Fig. 1 may be
used together with the phrase "What is ⟨predicate⟩ of ⟨subject⟩? (a) ⟨object⟩ (b)
⟨candidate⟩ ...", where placeholders such as ⟨category⟩ have to be replaced by
the corresponding variable of the query (?category). For instance: "What is the
place of birth of Angela Merkel? (a) Hamburg (b) Dresden (c) ...". The example
query, however, delivers only a single candidate for wrong answers (we usually
need three) and does not deliver additional information about the connectivity
of the resulting nodes. We therefore did not use a SPARQL engine, but a (filtered)
graph in main memory (as described in the previous section) to speed up the
querying for multiple answers as well as the collection of additional connectivity
information. Nevertheless, the SPARQL query language illustrates the idea quite well
and demonstrates that other patterns may be employed easily. To avoid trivial
questions, we accept a question only if the subject does not already contain the
correct answer (reject "Who organizes the FIFA World Cup?"). For the same
reason, candidate answers equal to the subject are eliminated, too ("Germany
shares border with Germany").</p>
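      <p>The template mechanism can be sketched as follows. The class and function names are illustrative assumptions; only the phrase pattern and the triviality check come from the description above.</p>

```python
from dataclasses import dataclass

# Sketch of a template T = (q, p): a query plus a phrase whose
# placeholders are filled from one query result row.
@dataclass
class Template:
    query: str   # e.g. the SPARQL pattern of Fig. 1
    phrase: str  # phrase with <...> placeholders

def is_trivial(row: dict) -> bool:
    # Reject giveaways like "Who organizes the FIFA World Cup?",
    # where the subject already contains the correct answer.
    return row["object"].lower() in row["subject"].lower()

def instantiate(t: Template, row: dict, wrong: list) -> str:
    text = t.phrase
    for var, value in row.items():
        text = text.replace(f"<{var}>", value)
    # Correct answer first here for clarity; a real quiz would shuffle.
    options = [row["object"]] + wrong
    return text + " " + " ".join(f"({chr(97 + i)}) {o}"
                                 for i, o in enumerate(options))

t = Template(query="...", phrase="What is the <predicate> of <subject>?")
q = instantiate(t, {"subject": "Angela Merkel",
                    "predicate": "place of birth",
                    "object": "Hamburg"},
                ["Dresden", "Leipzig", "Bonn"])
# q: "What is the place of birth of Angela Merkel? (a) Hamburg (b) Dresden (c) Leipzig (d) Bonn"
```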
      <p>Selecting Appealing Wrong Answers. For any graph node n, let I(n) be the set
of incoming edges (statement triples with n being the object) and O(n) the set
of outgoing edges (statement triples with n being the subject). By π_i we denote
the projection of a set of triples to their i-th component (1 = subject, 2 = predicate,
3 = object), and by σ_val we denote the selection of those triples that take a certain
value val as predicate. Suppose the statement (s, p, o) serves as starting point for a
question and c is a (wrong) candidate answer. In order to reduce the number of
candidates being evaluated, a candidate answer c is only taken into consideration
if c shares at least one rdf:type with o.</p>
      <p>[Fig. 1: example query and the corresponding graph pattern (subject &#8211;predicate&#8594; object &#8211;type&#8594; category &#8592;type&#8211; candidate):]</p>
      <preformat>SELECT ?subject ?predicate ?object ?candidate WHERE {
  ?subject ?predicate ?object .
  ?object rdf:type ?category .
  ?candidate rdf:type ?category .
  FILTER NOT EXISTS { ?subject ?predicate ?candidate . }
}</preformat>
      <p>To identify good candidate answers, we
evaluate the similarity of c to the correct answer o and select the three candidates
with the highest similarity. We define the similarity between the correct answer o
and a candidate answer c as sim(o, c) := s(O(o), O(c)), where s(A, B) is a weighted
sum of three overlap fractions.
Thus A = O(o) is the set of triples (statements) in which the correct answer is the
subject (e.g. (Hamburg, instance-of, Hanseatic City)) and likewise in B = O(c)
the candidate answer is the subject. The first term of s(A, B) identifies the
fraction of properties shared with the correct answer (Hamburg and Dresden
share, e.g., the population property), the second term considers the fraction of
shared properties including their value (Hamburg and Dresden are located in
time zone UTC+01:00), and the third term the fraction of shared types only
(e.g. Hamburg and Dresden are both instances of Big City). The second term
is the most specific (reflected by a higher coefficient); the first and third are more
general and become useful with less-connected nodes.</p>
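      <p>The three overlap terms can be sketched as Jaccard-style fractions over projections of the outgoing triples. The weight 2 on the middle term is an assumption motivated by the "higher coefficient" remark, not the paper's exact formula, and the "instance of" label stands in for rdf:type.</p>

```python
# Sketch of the answer-similarity measure sim(o, c) = s(O(o), O(c)):
# shared properties + 2 * shared (property, value) pairs + shared types.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def sim(O_o: set, O_c: set) -> float:
    """O_o, O_c: outgoing triples (subject, predicate, object) of the
    correct answer o and of the candidate c."""
    props = jaccard({p for _, p, _ in O_o}, {p for _, p, _ in O_c})
    pairs = jaccard({(p, v) for _, p, v in O_o},
                    {(p, v) for _, p, v in O_c})
    types = jaccard({v for _, p, v in O_o if p == "instance of"},
                    {v for _, p, v in O_c if p == "instance of"})
    return props + 2 * pairs + types  # assumed weights 1, 2, 1
```

Only candidates sharing at least one rdf:type with the correct answer would be scored this way; the three highest-scoring ones become the wrong answers.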
      <p>Difficulty Ranking. Our idea for a difficulty ranking is based on the assumption
that the difficulty of a question decreases the more familiar the contestant is with
the question's components (s, p, o). To put this idea into practice, we introduce
two different measures for nodes and one for properties. We measure the
connectivity of a node n by C(n) = |O(n)| + 2|I(n)|, putting a higher emphasis on
incoming edges, as they are a better indicator for a well-known node within the
graph. Secondly, we define a measure of homogeneity H(n) = |O(n)| / |π_2(O(n))|, which
becomes larger the more outgoing edges share the same label. The rationale is
that, say, athletes who won the same trophy several times receive higher
attention in the news and are thus better known than those who have won a trophy
only once. All nodes may now be ranked according to C(n) and H(n), and we
intend to use the ranks r(C(n)) and r(H(n)) as ingredients for the final difficulty
score. (The highest C(n) value receives a rank of r(C(n)) = 1, the lowest value
a rank of N, with N being the number of nodes.) However, the distributions are
very skewed; the first 40% of nodes share the same small connectivity value. As
there will be millions of nodes that are barely known to the average contestant,
we focus on the top γ% of ranks and assign a connectivity index CI(n) from the
rank r(C(n)), restricted to this top share (and a homogeneity index HI(n)
accordingly). We combine both indices to a node difficulty DI(n) = α·CI(n) + β·HI(n).
Finally, the popularity of a property p may be evaluated by
its popularity index PI(p), which is simply the relative frequency of property
p among all edges in the graph. The more frequently a property is used, the
more likely the contestant is familiar with it. The difficulty of a question that
was derived from a triple (s, p, o) is then assessed by combining the difficulties
of both nodes s and o as well as the property p:</p>
      <p>D( (s, p, o) ) = ( 2·DI(s) + PI(p) + DI(o) ) / 4
The subject receives a higher weight as it is most specific for the question.
Different choices for the parameters α, β and γ will be evaluated in section 5.
Diversified Question Set. The algorithm generates a large number of questions
by picking a random node as starting point and then creating a question as
described above. The degree of difficulty is evaluated, and three lists of easy,
medium and difficult questions are maintained. A final set of n diverse questions
is generated by picking n/3 questions from each list, while taking care that a newly
selected question uses a different property and subject.</p>
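      <p>The scoring and set-assembly steps can be sketched as follows. The connectivity and difficulty formulas follow the definitions above; the rank-to-index mapping (saturating beyond the top γ share), the bucket layout, and the example tuples are illustrative assumptions.</p>

```python
# Sketch of the difficulty ingredients and diversified set assembly.
def connectivity(I_n: list, O_n: list) -> int:
    # C(n) = |O(n)| + 2|I(n)|: incoming edges weigh double
    return len(O_n) + 2 * len(I_n)

def homogeneity(O_n: list) -> float:
    # H(n) = |O(n)| / |pi_2(O(n))|: grows when many outgoing
    # edges share the same label (predicate)
    labels = {p for _, p, _ in O_n}
    return len(O_n) / len(labels) if labels else 0.0

def index_from_rank(rank: int, N: int, gamma: float) -> float:
    # Assumed mapping: scale ranks within the top gamma share to [0, 1],
    # saturating at 1 for all less-known nodes beyond it.
    return min(rank / (gamma * N), 1.0)

def difficulty(DI_s: float, PI_p: float, DI_o: float) -> float:
    # D((s, p, o)) = (2*DI(s) + PI(p) + DI(o)) / 4
    return (2 * DI_s + PI_p + DI_o) / 4

def diverse_set(buckets: dict, n: int) -> list:
    """Pick n/3 questions from each of the easy/medium/hard lists,
    never reusing a subject or a property."""
    chosen, used_subjects, used_props = [], set(), set()
    for level in ("easy", "medium", "hard"):
        count = 0
        for s, p, o, d in buckets[level]:
            if count == n // 3:
                break
            if s in used_subjects or p in used_props:
                continue  # enforce diversity across the whole set
            chosen.append((s, p, o, d))
            used_subjects.add(s)
            used_props.add(p)
            count += 1
    return chosen
```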
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>Performance. The final graph consisted of 11M nodes and 57M edges. Its
serialization to disk occupied 1.2 GB; restoring the graph from disk takes 100 seconds
(Intel i7-8700K, 32 GB). The program may generate thousands of questions per
minute, but the time to generate a single question depends on the number of
candidate answers: for every candidate answer we apply the similarity measure,
which is cumbersome for persons, as there are 3.6M persons in the graph
(calculations then take up to 15 s). There are ways to concentrate quickly on the most
relevant nodes and prune the search, but they have not yet been elaborated.
Candidate Answers. Table 1 shows a few questions that were automatically
generated by the system. For instance, question (c) asks for a title from the
Lord of the Rings trilogy. If we were using the instance-of relationship only,
any book title would do as a candidate answer. But the selected wrong answers
consist of other book titles by the same author (Tolkien), although the author
is not part of the question. Similarly, question (p) asks for the inventor of the
special theory of relativity, and the candidate answers include other well-known
German physicists of that time.</p>
      <p>Degree of Difficulty. To test our proposals for the degree of difficulty, we selected
18 automatically generated questions and asked 139 people (of whom roughly
60% were computer science students) to assess the degree of difficulty on three
levels (1: easy, 2: medium, 3: hard). Fig. 1 shows an excerpt together with the
difficulty ranking obtained from averaging the responses.</p>
      <p>[Figure: excerpt of the questions together with the empirically assessed difficulty rank (AR); questions are ranked from 1 (easiest) to 18 (hardest) using γ = 0.09, resulting in a correlation value of 0.48.]</p>
      <p>We experimented with
parameter values α ∈ [1, 4], β ∈ [0, 3] and γ ∈ [0.02, 0.2] (224 configurations)
and compared the ranking obtained from our difficulty measure with the empirical
ranking via Spearman correlation. Regarding γ, the best results were obtained for
values close to 0.1. For α = 3, β = 1 we obtained correlations of 0.5 (maximum)
and 0.347 (on average). A closer inspection revealed that this score is mainly due
to two questions (j, p), where especially the computer science students found the
questions much easier than the other participants. When excluding only these two
questions, the Spearman correlation coefficient rises to 0.9 (resp. 0.72 on average).</p>
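      <p>The rank comparison reported above is standard Spearman correlation; for tie-free rankings of the 18 questions it reduces to the following closed form (a generic sketch, not the authors' code).</p>

```python
# Spearman rank correlation for two tie-free rankings of the same items:
# rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), with d_i the rank differences.
def spearman(predicted: list, empirical: list) -> float:
    n = len(predicted)
    d2 = sum((a - b) ** 2 for a, b in zip(predicted, empirical))
    return 1 - 6 * d2 / (n * (n * n - 1))
```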
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>We have presented an approach to automatically generate question sets for
quizzes. The selected wrong answers mimic hand-crafted options very well. First
experiments on evaluating the difficulty score show promising results for many
questions, but also occasional misjudgements with respect to the empirically
assessed difficulty. One has to keep in mind, however, that the whole approach
assumes a correlation of the Wikidata content with the general knowledge of the
contestants (which will not hold for an arbitrary audience).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. <string-name><given-names>F.</given-names> <surname>Bissig</surname></string-name>. <source>Drawing questions from Wikidata</source>. Bachelor's Thesis, ETH Zurich, <year>2015</year>.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. <string-name><given-names>D.</given-names> <surname>Diefenbach</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Singh</surname></string-name>, and <string-name><given-names>P.</given-names> <surname>Maret</surname></string-name>. WDAqua-core0: <article-title>A question answering component for the research community</article-title>. In <source>Semantic Web Evaluation Challenge</source>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>