=Paper=
{{Paper
|id=Vol-3192/paper02
|storemode=property
|title=Automatic Question Generation for Evidence-based Online Courseware Engineering
|pdfUrl=https://ceur-ws.org/Vol-3192/itb22_p2_short1338.pdf
|volume=Vol-3192
|authors=Machi Shimmei,Noboru Matsuda
|dblpUrl=https://dblp.org/rec/conf/aied/ShimmeiM22
}}
==Automatic Question Generation for Evidence-based Online Courseware Engineering==
Automatic Question Generation for Evidence-based Online Courseware Engineering*

Machi Shimmei1[0000-0002-4790-8863] and Noboru Matsuda1[0000-0003-2344-1485]

1 North Carolina State University, Raleigh, NC, USA
@ncsu.edu

* Copyright © 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Abstract. The goal of the current study is to develop an algorithm for generating pedagogically valuable questions. We focus on verbatim questions, whose answer, by definition, can be literally identified in a source text. We assume that an important keyphrase relative to a specific learning objective can be identified in a given source text. We further hypothesize that a pedagogically valuable verbatim question can be generated by converting the source text into a question for which the keyphrase becomes the answer. We therefore propose a model that identifies a keyphrase in a given source text with a linked learning objective. The tagged source text is then converted into a question using an existing question generation model, QG-Net. An evaluation study was conducted with existing authentic online course materials. Corresponding course instructors judged that 66% of the predicted keyphrases were suitable for the given learning objective. The results also showed that 82% of the questions generated by the pre-trained QG-Net were judged pedagogically valuable.
Keywords: Question Generation, Deep Neural Network, Natural Language
Processing, Learning Engineering, MOOC.
1 Introduction
Questions play important roles in learning and teaching. In Massive Open Online Courses (MOOCs), formative questions are an essential component for making the courseware effective. Research has demonstrated, for example, that students learn better when they practice skills by answering questions than by only watching videos or reading text [1]. In a broader context, the benefit of answering questions for learning has been shown in many studies, also known as test-enhanced learning [2, 3]. However, creating questions that effectively help students learn requires experience and extensive effort.
Although there are several studies on the automation of question generation in the field of AI in education [4, 5], little has been discussed about the pedagogical value of the generated questions. To fill this gap, we propose a method for generating questions that ask about the key concepts students need to learn to attain the learning objectives. As far as the authors are aware, little research has been conducted
to generate questions that align with learning objectives. We propose to develop a technique called QUADL (QUestion generation with an Application of Deep Learning) that generates verbatim questions from a pair of a learning objective and a sentence. A verbatim question is a question whose answer can be literally identified in a related instructional text (i.e., the source text).
Our central hypothesis is that pedagogically valuable verbatim questions can be generated if source texts are tagged with keyphrases relative to a given learning objective. Once a source text is tagged, existing seq2seq technologies for question conversion can be used (e.g., [6-8]). The technological contribution of the current research is therefore to develop a deep neural-network model that identifies a keyphrase given a pair of a source text and a learning objective.
Accordingly, QUADL consists of the Answer Prediction model and the Question
Conversion model. The Answer Prediction model identifies a keyphrase in a given
source text. The Question Conversion model generates a question by converting the
source text into a question for which the keyphrase becomes the answer.
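To make the two-stage design concrete, the following is a minimal sketch of how the two models could be composed. The function names and the character-level span convention are illustrative placeholders, not the authors' implementation.

    from typing import Optional, Tuple

    def predict_answer_span(lo: str, s: str) -> Tuple[int, int]:
        """Placeholder for the Answer Prediction model (BERT-based in QUADL).
        Returns the target token index <Is, Ie> within s, or (0, 0) when s is
        not suitable for the learning objective lo."""
        raise NotImplementedError

    def convert_to_question(s: str, answer: str) -> str:
        """Placeholder for the Question Conversion model (QG-Net in QUADL)."""
        raise NotImplementedError

    def quadl(lo: str, s: str) -> Optional[Tuple[str, str]]:
        """Generate a verbatim question (and its answer) for a <LO, S> pair."""
        i_s, i_e = predict_answer_span(lo, s)
        if (i_s, i_e) == (0, 0):            # non-target source text
            return None
        answer = s[i_s:i_e]                 # the target token (keyphrase)
        return convert_to_question(s, answer), answer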
2 Related Work
Research on automatic question generation has been growing rapidly among AIED researchers. Most of the early studies of question generation adopted rule-based models that relied on templates constructed by experts [9-11]. Scalability is, however, a concern for rule-based models: they often do not work for complex sentences, and the linguistic diversity of the resulting questions is therefore limited.
More recent work on question generation takes a data-driven approach using neural networks. Many variants of RNN-based models have been proposed and have shown considerable advances in the question generation task [12-17]. For general-purpose question generation, large datasets collected from articles in Wikipedia or news media, such as SQuAD [18], NewsQA [19], and MSMARCO [20], have enabled the construction of neural-network-based models. Wang et al. [21] demonstrated that an LSTM-based model, called QG-Net, trained on a general question generation dataset (SQuAD), can be used for generating questions on educational content. For evaluation, questions were generated from textbooks on Biology, Sociology, and History, and the model showed the highest BLEU score among the state-of-the-art techniques. Yet, the pedagogical value of the generated questions has not been reported.
Techniques for keyphrase extraction have been studied to suggest an answer candidate from a given paragraph of text (e.g., [22]). Since our model aims to select target tokens that are aligned with a given learning objective, the proposed Answer Prediction model is essentially different from those existing keyphrase extraction models.
3 Methods
Figure 1 shows an overview of QUADL. Given a pair of a learning objective LO and a source text S, <LO, S>, QUADL generates a question Q that is suitable for achieving the learning objective LO. The question Q is a verbatim question whose answer can be literally found in the source text S.

Figure 1: An overview of the QUADL model. The Answer Prediction model takes a sentence (S) and a learning objective (LO) and outputs S with a tagged target token, or <0, 0> if S is not suitable for LO; the Question Conversion model then converts the tagged S into a verbatim question (Q).

The following is an example of <LO, S> and Q:
Learning objective (LO): Describe metabolic pathways as stepwise chemical transformations either requiring or releasing energy; and recognize conserved themes in these pathways.
Source Text (S): Among the main pathways of the cell are photosynthesis and cellular
respiration, although there are a variety of alternative pathways such as fermentation.
Question (Q): Along with photosynthesis, what are the main pathways of the cell?
Answer: cellular respiration
Notice that the answer ("cellular respiration" in the example above) is tagged in the source text S. We call the tagged answer in a given source text S a target token hereafter. A target token may contain multiple words, as in the example above.
The Answer Prediction model identifies the target token index <Is, Ie>, where Is and Ie denote the start and end indices of a target token within a given source text S relative to the learning objective LO. For the Answer Prediction model, we adopted BERT, Bidirectional Encoder Representations from Transformers [23]. In our application, the learning objective (LO) and the source text (S) were combined as a single input to the model. The vector representation computed by the BERT model is given to two different classification models: one for predicting the start index (Is) and the other for the end index (Ie) of the target token. The models may output <0, 0>, indicating that the given source text is not suitable for generating a question for the given learning objective. For the rest of the paper, we call source texts that have non-zero indices (i.e., Is ≠ 0 and Ie ≠ 0) the target source texts, whereas the others are referred to as the non-target source texts (i.e., those with the zero token index <0, 0>). The Answer Prediction model was trained using training data that we created from existing online courses at the Open Learning Initiative† (OLI).
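A minimal sketch of such a span-prediction architecture, written with the Hugging Face transformers library, is shown below. Encoding LO and S as a standard BERT sentence pair, using token-level start/end classifiers, and reserving position 0 (the [CLS] token) for the <0, 0> prediction are our assumptions for illustration; the paper does not report these implementation details.

    import torch
    from torch import nn
    from transformers import BertModel, BertTokenizerFast

    class AnswerPredictionModel(nn.Module):
        """Sketch of a BERT-based span predictor: LO and S are encoded as a
        sentence pair, and two linear heads score every token position as the
        start or end of the target token. Position 0 ([CLS]) stands in for the
        <0, 0> "not suitable" prediction (an assumed convention)."""

        def __init__(self, model_name: str = "bert-base-uncased"):
            super().__init__()
            self.bert = BertModel.from_pretrained(model_name)
            hidden = self.bert.config.hidden_size
            self.start_head = nn.Linear(hidden, 1)   # scores for the start index Is
            self.end_head = nn.Linear(hidden, 1)     # scores for the end index Ie

        def forward(self, input_ids, attention_mask, token_type_ids):
            out = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids).last_hidden_state
            start_logits = self.start_head(out).squeeze(-1)   # (batch, seq_len)
            end_logits = self.end_head(out).squeeze(-1)       # (batch, seq_len)
            return start_logits, end_logits

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = AnswerPredictionModel()   # untrained heads; shown for structure only

    lo = "Describe metabolic pathways as stepwise chemical transformations ..."
    s = ("Among the main pathways of the cell are photosynthesis and cellular "
         "respiration, although there are a variety of alternative pathways.")

    enc = tokenizer(lo, s, return_tensors="pt", truncation=True)
    with torch.no_grad():
        start_logits, end_logits = model(**enc)
    i_s, i_e = int(start_logits.argmax(-1)), int(end_logits.argmax(-1))
    # (i_s, i_e) == (0, 0) would mean S is not a target source text for this LO.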
The Question Conversion model generates a question for which the target token becomes the answer, given a source text with a non-zero target token index. We use QG-Net, a bidirectional-LSTM seq2seq model with attention and copy mechanisms [21]. We used an existing, pre-trained QG-Net model that was trained on the SQuAD dataset‡. We could have trained QG-Net using the OLI course data mentioned above. However, the OLI courses we used for the current study do not contain a sufficient number of verbatim questions; many of the questions are fill-in-the-blank or multiple-choice questions and hence not suitable for generating training data for QG-Net.
† https://oli.cmu.edu
‡ https://rajpurkar.github.io/SQuAD-explorer/
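QG-Net has its own input representation and training pipeline, which we do not reproduce here. As a hedged illustration of the conversion step only, the sketch below marks the predicted target token in the source text and passes it to an off-the-shelf answer-aware question generation model from the Hugging Face hub as a stand-in for QG-Net. The model name, the "generate question:" prefix, and the <hl> highlighting convention are assumptions about that substitute model, not a description of QG-Net's interface.

    from transformers import pipeline

    # Stand-in for QG-Net: an off-the-shelf answer-aware question generator.
    # The model name and its input conventions are assumptions about this
    # substitute, not the interface of QG-Net itself.
    qg = pipeline("text2text-generation", model="valhalla/t5-base-qg-hl")

    source_text = ("Among the main pathways of the cell are photosynthesis and "
                   "cellular respiration, although there are a variety of "
                   "alternative pathways such as fermentation.")
    target_token = "cellular respiration"

    # Highlight the target token so the generator asks about that span.
    tagged = source_text.replace(target_token, f"<hl> {target_token} <hl>", 1)

    result = qg("generate question: " + tagged)
    print(result[0]["generated_text"])   # a question whose answer should be the target token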
Table 1: Examples of triplets <<LO, S>, Q> used in the AMT survey. The target token is shown in bold and underlined in S.
(a) A participant judged that the target token was suitable, but the question was not.
LO: Explain how the cellular organization of fused skeletal muscle cells allows
muscle tissue to contract properly.
S: Myofibrils are connected to each other by intermediate, or desmin, filaments
that attach to the Z disc.
Q: What is connected to each other?
(b) A participant judged that both the target token and the question were suitable.
LO: Identify and discuss the functions of the large intestine and its structures.
S: The first part of the large intestine is the cecum, a small sac-like region that
is suspended inferior to the ileocecal valve.
Q: What is the first part of the large intestine?
4 Evaluation Study
We investigated the following research questions: RQ1: How well does the Answer
Prediction model identify target tokens (including zero token indices) in a given source
text relative to a given learning objective? RQ2: How well does the pre-trained QG-
Net generate questions for a given source text tagged with the target tokens?
To answer these research questions, we conducted a survey on Amazon Mechanical Turk (AMT). In AMT, the participants were shown triplets <<LO, S>, Q>. For each of the triplets, the participants were asked whether they agreed or disagreed with the following two statements: (1) To create a question that helps attain the learning objective LO, it is adequate to convert the sentence S into a question whose answer is the highlighted token. (2) The question Q is suitable for attaining the learning objective LO. Each statement corresponds to one of the research questions. Examples of triplets are shown in Table 1.
Majority votes were used to consolidate the evaluations from the participants (a sketch of this consolidation is shown after this paragraph). Table 2 summarizes the results for RQ1. The data showed that 49% (166/342) of the total predictions of the target token index by the Answer Prediction model were accepted by the participants. For the predictions with a non-zero target index, 88% (155/178) of the predictions were accepted when ties are included. As for the non-target source text predictions (i.e., the Answer Prediction model output the zero <0,0> index), only 41% (68/164) were accepted. The participants considered 55% (90/164) of the predicted non-target source texts to be target source texts. These results show that the Answer Prediction model is rather conservative: when it outputs "positive" predictions (i.e., treats a given source text as a target source text), 70% of those predictions are appropriate, but a large number of source texts that should have been predicted as target source texts were missed. We argue that, for educational purposes, these results are acceptable and pragmatic.
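The paper does not spell out the exact aggregation rule; the following is a minimal sketch of majority-vote consolidation under our own assumptions, using the category names from Tables 2 and 3 (the "Nonsensical" category, used for unusable items, is handled separately in the paper and is not reconstructed here).

    from collections import Counter

    def consolidate(votes):
        """Consolidate per-item participant votes ("agree" / "disagree") into
        the categories used in Tables 2 and 3 (assumed aggregation rule)."""
        counts = Counter(votes)
        if counts["agree"] > counts["disagree"]:
            return "Accepted"
        if counts["agree"] < counts["disagree"]:
            return "Not accepted"
        return "Tie"

    print(consolidate(["agree", "agree", "disagree"]))   # Accepted
    print(consolidate(["agree", "disagree"]))            # Tie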
Table 2. Evaluation of the target tokens predicted by the Answer Prediction model. The model predicted target tokens for 178 source texts and predicted non-target (the zero index) for 164 source texts. The table shows how many of these predictions were accepted or not accepted by the majority vote of Amazon Mechanical Turk participants.

                 non-zero target index   zero index    Total
  Accepted       123 (70%)               43 (26%)      166 (49%)
  Tie             32 (18%)               25 (15%)       57 (17%)
  Not accepted    22 (12%)               90 (55%)      112 (33%)
  Nonsensical      1                      6 (4%)         7 (2%)
  Total          178 (100%)              164 (100%)    342 (100%)
Table 3. Evaluation of the questions generated by QG-Net, based on the majority votes of the Amazon Mechanical Turk participants. (a) All source texts for which the Answer Prediction model predicted a non-zero target token index (N=178) were converted into questions. (b) Only the target source texts accepted by the participants in Table 2 (N=123) were converted into questions.

  (a) Questions from all predicted target source texts (N=178)
  Accepted        80 (45%)
  Tie             50 (28%)
  Not accepted    43 (24%)
  Nonsensical      5 (3%)
  Total          178 (100%)

  (b) Questions from target source texts accepted by participants (N=123)
  Accepted        76 (62%)
  Tie             24 (20%)
  Not accepted    21 (17%)
  Nonsensical      2 (2%)
  Total          123 (100%)
Table 3-a shows the results for RQ2. The table shows that participants considered 73% (130/178) of the questions generated by QG-Net appropriate for achieving the associated learning objective.
Notice that the result shown above is influenced by the performance of the Answer Prediction model. To investigate the capability of QG-Net separately from the performance of the Answer Prediction model, we analyzed the performance of QG-Net given only the "appropriate" inputs (according to the survey participants). Table 3-b shows the evaluation of the questions when QG-Net was given only those source texts for which the Answer Prediction model output an "appropriate" target token index according to the survey participants. In total, 123 source texts satisfied this condition, and 82% (100/123) of the questions generated from these "appropriate" source texts were considered suitable for achieving the associated learning objective. This indicates that the pre-trained QG-Net can generate a fair number of suitable questions for domains other than the one it was originally trained on. Using QG-Net as a building block for QUADL is therefore an acceptable design option.
5 Conclusion
We proposed QUADL for generating questions that are aligned with a given learning objective. As far as we are aware, there have been no prior studies that aim to generate questions suitable for attaining learning objectives. The current study showed that when the Answer Prediction model output a non-zero index for the target token, 88% of such predictions were accepted as good predictions by the study participants. Though we admit that the performance should be improved, this is an encouraging result showing the potential of the proposed model. The data also showed that the majority of participants considered 55% of the source texts that the Answer Prediction model identified as not useful for the learning objective (i.e., the non-target source texts) to be, in fact, useful for creating questions. Lowering the number of such "false negative" predictions is certainly a crucial next step.
One of the challenges of the current study was the cost of creating the training data. To train the Answer Prediction model, each target source text paired with a learning objective has to be annotated to indicate the target token. For the current study, we used existing courseware contents taken from OLI. When the training data were created, target source texts were tagged using answers (extracted from assessment questions) by exact match; i.e., a non-zero token index was assigned only when the target answer appeared literally in the source text (a sketch of this tagging is shown at the end of this section). Those source texts that included only a part of the answer, or that contained synonymous words equally plausible as the original answer, were not tagged with appropriate token indices. The current study also relied on a survey on Amazon Mechanical Turk; evaluating the effectiveness of the generated questions with real students in an authentic context is an important next step.
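The exact-match tagging described above can be sketched as follows. The character-level indices and the <0, 0> convention for non-target source texts follow the description in this section; the function and variable names are ours.

    def tag_target_token(source_text: str, answer: str):
        """Assign a target token index <Is, Ie> by exact string match.
        Returns (0, 0) when the answer does not appear literally in the
        source text, i.e., the text is treated as a non-target source text."""
        pos = source_text.find(answer)
        if pos == -1:
            return (0, 0)                        # non-target source text
        return (pos, pos + len(answer))          # character-level <Is, Ie>

    s = ("Among the main pathways of the cell are photosynthesis and cellular "
         "respiration, although there are a variety of alternative pathways.")
    print(tag_target_token(s, "cellular respiration"))   # non-zero <Is, Ie>
    print(tag_target_token(s, "glycolysis"))             # (0, 0)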
Acknowledgement
The research reported here was supported by the National Science Foundation Grant
No. 2016966 to North Carolina State University.
References
1. Koedinger, K.R., et al. Learning is not a spectator sport: Doing is better than
watching for learning from a MOOC. In Proceedings of the Second (2015) ACM
Conference on Learning @ Scale. 2015.
2. Rivers, M.L., Metacognition about practice testing: A review of learners’ beliefs,
monitoring, and control of test-enhanced learning. Educational Psychology
Review, 2020.
3. Pan, S.C. and T.C. Rickard, Transfer of test-enhanced learning: Meta-analytic
review and synthesis. Psychological Bulletin, 2018. 144(7): p. 710-756.
4. Kurdi, G., et al., A Systematic Review of Automatic Question Generation for
Educational Purposes. International Journal of Artificial Intelligence in Education,
2020. 30(1): p. 121-204.
5. Pan, L., et al., Recent advances in neural question generation. arXiv preprint
arXiv:1905.08949, 2019.
6. Kim, Y., et al. Improving neural question generation using answer separation. in
Proceedings of the AAAI Conference on Artificial Intelligence. 2019.
7. Nema, P., et al., Let's Ask Again: Refine Network for Automatic Question
Generation. arXiv preprint arXiv:1909.05355, 2019.
8. Yuan, X., et al., Machine comprehension by text-to-text neural question generation.
arXiv preprint arXiv:1705.02012, 2017.
9. Mazidi, K. and P. Tarau, Automatic question generation: from NLU to NLG. In
International Conference on Intelligent Tutoring Systems, 2016: pp. 23-33.
10. Mitkov, R. Computer-aided generation of multiple-choice tests. in Proceedings of
the HLT-NAACL 03 workshop on Building educational applications using natural
language processing. 2003.
11. Heilman, M. and N.A. Smith, Question generation via overgenerating
transformations and ranking. 2009, Carnegie Mellon University, Language
Technologies Institute, Pittsburgh, PA.
12. Zhao, Y., et al., Paragraph-level neural question generation with maxout pointer
and gated self-attention networks, in Proceedings of the 2018 Conference on
Empirical Methods in Natural Language Processing, E. Blanco and W. Lu, Editors.
2018. p. 3901-3910.
13. Wang, S., et al., A multi-agent communication framework for question-worthy
phrase extraction and question generation, in Proceedings of the AAAI Conference
on Artificial Intelligence, P. Stone, P.V. Hentenryck, and Z.-H. Zhou, Editors. 2019.
p. 7168-7175.
14. Song, L., et al., Leveraging context information for natural question generation, in
Proceedings of the Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, F. Liu and T.
Solorio, Editors. 2018. p. 569-574.
15. Ma, X., et al., Improving question generation with sentence-level semantic matching
and answer position inferring, in Proceedings of the AAAI Conference on Artificial
Intelligence, F. Rossi, V. Conitzer, and F. Sha, Editors. 2020. p. 8464-8471.
16. Kim, Y., et al., Improving neural question generation using answer separation, in
Proceedings of the AAAI Conference on Artificial Intelligence, P. Stone, P.V.
Hentenryck, and Z.-H. Zhou, Editors. 2019. p. 6602-6609.
17. Tang, D., et al., Learning to collaborate for question answering and asking, in
Proceedings of the Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, T.S. Fei Liu,
Editor. 2018. p. 1564-1574.
18. Rajpurkar, P., R. Jia, and P. Liang, Know what you don't know: Unanswerable
questions for SQuAD. arXiv preprint arXiv:1806.03822, 2018.
19. Trischler, A., et al., NewsQA: A machine comprehension dataset. arXiv preprint
arXiv:1611.09830, 2016.
20. Bajaj, P., et al., MS MARCO: A human generated machine reading comprehension
dataset. arXiv preprint arXiv:1611.09268, 2016.
21. Wang, Z., et al. QG-net: a data-driven question generation model for educational
content. in Proceedings of the Fifth Annual ACM Conference on Learning at Scale.
2018.
22. Willis, A., et al. Key phrase extraction for generating educational question-answer
pairs. In Proceedings of the Sixth (2019) ACM Conference on Learning @ Scale.
2019.
23. Devlin, J., et al., BERT: Pre-training of deep bidirectional transformers for language
understanding. arXiv preprint arXiv:1810.04805, 2018.