Sentence Selection for Cloze Item Creation: A Standardized Task and Preliminary Results

Andrew M. Olney
University of Memphis
365 Innovation Drive, Suite 303
Memphis, Tennessee 38152
aolney@memphis.edu

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Cloze items are commonly used for both assessing learning and as a learning activity. This paper investigates the selection of sentences for cloze item creation by comparing methods ranging from simple heuristics to deep learning summarization models. An evaluation using human-generated cloze items from three different science texts indicates that simple heuristics substantially outperform summarization models, including state-of-the-art deep learning models. These results suggest that sentence selection for cloze item generation should be considered a distinct task from summarization and that continued advances on this task will require large datasets of human-generated cloze items.

Keywords
cloze item, assessment, learning, extractive summarization

1. INTRODUCTION
Cloze items, also known as fill-in-the-blank questions, are common in educational practice, with applications both for assessing learning and for promoting learning [16]. Because cloze items may be created directly from text simply by deleting a word or phrase, automated methods for creating cloze items have been considered since their inception. Indeed, the work widely viewed as introducing the cloze item also proposed creating them by randomly deleting words or deleting every nth word [24], and these methods became a common practice in the following decades [2]. For learning applications, however, such text-insensitive automated methods offer no control over content, and for assessment applications, research suggests that text-insensitive methods are better aligned with local properties of the text (e.g., grammar and vocabulary) than with non-local properties associated with text comprehension [2, 3, 4].
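To make the classic text-insensitive approach concrete, the minimal sketch below deletes every nth word in the spirit of Taylor [24]; the function name, blank token, and whitespace tokenization are our own illustrative choices, not part of any cited system.

def nth_word_cloze(sentence, n=5, blank="_____"):
    """Delete every nth word to create a text-insensitive cloze item."""
    words = sentence.split()
    return " ".join(blank if (i + 1) % n == 0 else word
                    for i, word in enumerate(words))

print(nth_word_cloze("The heart pumps blood through the arteries and veins of the body."))
# -> The heart pumps blood _____ the arteries and veins _____ the body.

Note that the deleted words fall wherever the count lands, which is exactly why such methods offer no control over content.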
Advances in natural language processing (NLP) since 1990 have enabled text-sensitive approaches to cloze item creation for both learning and assessment applications. Research in this area has broadly organized around two different goals: creating cloze items for language learning (native or foreign language) and for text comprehension (i.e., learning from text). These two goals have led to different approaches for creating text-sensitive cloze items. Research on cloze items for language learning tends to be keyword-first [5, 8, 9, 22], meaning that sentences in the text are selected for cloze items depending on the presence of relevant keywords. These keywords are then deleted to make cloze items. Similar to text-insensitive methods, a keyword-first approach emphasizes local properties of the text and so aligns with common language-learning concerns like grammar and vocabulary, while allowing for more control over content. In contrast, research on cloze items for text comprehension tends to be sentence-first [1, 15, 19], meaning that important sentences in the text are selected first, followed by procedures for deleting words to make cloze items. A common approach to selecting important sentences for cloze items is to use extractive summarization techniques [1, 15]. Extractive summarization systems attempt to create a coherent summary of a text by filtering out unimportant sentences in a text (conversely, selecting important sentences) [18] and so intuitively appear relevant for this task. Because sentence-first approaches focus on the non-local properties of the text, they are aligned with text comprehension concerns.

Research on automated cloze item creation has predominantly been theory-driven rather than data-driven, likely because large datasets of human-created cloze items have not been available until recently, and only then for language-learning goals [26]. Given the absence of data with which to train and evaluate models, researchers have used rule-based and statistical techniques that are fundamentally heuristic, and they have evaluated their systems largely using rubric-based human evaluation of the cloze items created, rather than by comparing them to human-generated cloze items. One notable exception is Olney et al. [19], who compare their method with human-generated items and randomly generated items on learning outcomes. However, that work does not present a detailed comparison of automatic- and human-generated cloze items.

Research on automated cloze item creation could benefit from adopting common practices in other areas of NLP, such as common datasets, standard evaluation metrics, and the comparisons these allow with previous work. To this end, the present paper proposes sentence selection as a standardized task associated with cloze item creation. The sentence selection task is ideal for standard evaluation metrics because automated selections can be directly compared to human selections. The remainder of this paper compares multiple existing methods and their performance on the sentence selection task, including Olney et al. [19], a recent updated version of that model [20] with several variants, and three extractive summarizers.

2. SENTENCE SELECTION MODELS

2.1 Olney et al. (2017)
Olney et al. [19] used a coreference resolution system [12] for selecting sentences. A coreference chain is a sequence of repeated mentions of the same entity across a text. A common example of a coreference chain is between a noun and corresponding pronouns (e.g., "Jill" and "her"), but mentions can be less obviously connected (e.g., "Queen of England" and "Elizabeth"). Intuitively, a long chain represents an entity that is important to the discourse, and a sentence containing multiple such chains is important because it involves multiple such entities. Olney et al. operationalized this intuition with the heuristic that important sentences should contain at least three coreference chains (i.e., should contain mentions in these chains) and that the chains themselves should have a length of at least two mentions. These sentences were then filtered using criteria from a discourse parser [23], specifically the nuclearity of elementary discourse units [11]. Under the theory implemented by the parser, clauses that carry little or no meaning are called satellites and are contrasted with nuclei that carry substantial meaning. Thus, selected sentences were deselected if they consisted of only satellite discourse units. This two-step heuristic was developed by inspecting a single text on the circulatory system and selecting criteria such that the number of selected sentences exactly matched the number of human-selected cloze sentences; the sentences themselves were not observed in the development of the heuristic. In later unpublished work, the above method was extended by ranking the sentences on the above criteria as well as the summed length of all coreference chains in a sentence. This extension makes it straightforward to return the top n sentences that meet the original two-step heuristic criteria while also relaxing these criteria when more sentences are requested.
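As a concrete illustration, the following minimal sketch implements the first step of this heuristic, assuming coreference chains have already been extracted by an external system such as Stanford CoreNLP [12]. Representing a chain simply as the list of sentence indices containing its mentions is our own simplification, and the discourse-parser filtering step is omitted.

def select_sentences(chains, num_sentences, min_chain_length=2, min_chains=3):
    """Select sentences containing mentions from at least min_chains
    coreference chains, each with at least min_chain_length mentions.
    A chain is represented as the list of sentence indices of its mentions."""
    long_chains = [c for c in chains if len(c) >= min_chain_length]
    counts = [0] * num_sentences
    for chain in long_chains:
        for sent_idx in set(chain):  # count each chain at most once per sentence
            counts[sent_idx] += 1
    return [i for i, c in enumerate(counts) if c >= min_chains]

# Toy input: chain 0 has mentions in sentences 0, 1, and 2, and so on.
chains = [[0, 1, 2], [1, 2], [2, 3], [3]]
print(select_sentences(chains, num_sentences=4))  # -> [2]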
2.2 Pavlik et al. (2020)
Pavlik et al. [20] describe a reimplementation of Olney et al. [19]. The reimplementation differs in several respects, including using a new coreference system based on deep learning [7] and doing away with the discourse parser constraint of nuclearity. It preserves the first step of the heuristic, prioritizing sentences having at least three coreference chains of at least length two, and similarly ranks sentences using those criteria as well as the summed length of all coreference chains in a sentence. No comparison with Olney et al. [19] was reported.
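The sketch below illustrates our reading of this ranking scheme: sentences meeting the three-chains-of-length-two criterion rank first, with ties broken by the summed length of all chains touching the sentence. It reuses the simplified chain representation above and is not drawn from the actual implementation.

def rank_sentences(chains, num_sentences, n):
    """Return the top n sentences, ranked first by whether they meet the
    three-chains-of-length-two criterion and then by the summed length of
    all coreference chains with a mention in the sentence."""
    counts = [0] * num_sentences   # qualifying chains per sentence
    summed = [0] * num_sentences   # summed chain length per sentence
    for chain in chains:
        for sent_idx in set(chain):
            summed[sent_idx] += len(chain)
            if len(chain) >= 2:
                counts[sent_idx] += 1
    order = sorted(range(num_sentences),
                   key=lambda i: (counts[i] >= 3, summed[i]), reverse=True)
    return order[:n]

chains = [[0, 1, 2], [1, 2], [2, 3], [3]]
print(rank_sentences(chains, num_sentences=4, n=2))  # -> [2, 1]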
2.3 MEAD summarizer
The MEAD summarizer [21] is a widely used, publicly available summarizer applicable to multiple documents and multiple languages. Although MEAD has an orientation to extractive summarization of multiple documents on the same topic (e.g., a news story), it can also be used to summarize a single document. MEAD uses a variety of features to select sentences for summarization, including sentence length, position in the document, cosine similarity with other sentences, keyword match, and LexPageRank, a measure of sentence centrality with respect to words in the document. By default, MEAD uses a linear combination of these features to identify important sentences and can be used to return the specified top n such sentences, skipping sentences that are too similar to already included sentences.
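The sketch below shows the general shape of this approach: score each sentence with a linear combination of features, then greedily take the top n while skipping redundant sentences. The specific features, weights, and similarity threshold are illustrative placeholders, not MEAD's actual defaults.

import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

def mead_like_select(sentences, n, weights=(1.0, 0.5, 0.5), sim_threshold=0.7):
    """Score sentences by a linear combination of centroid similarity,
    position, and length, then greedily take the top n, skipping any
    sentence too similar to one already selected."""
    bags = [Counter(s.lower().split()) for s in sentences]
    centroid = sum(bags, Counter())  # word counts for the whole document
    scores = []
    for i, bag in enumerate(bags):
        features = (cosine(bag, centroid),            # centroid similarity
                    1.0 - i / len(sentences),         # earlier is better
                    min(sum(bag.values()), 20) / 20)  # capped length
        scores.append(sum(w * f for w, f in zip(weights, features)))
    selected = []
    for i in sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True):
        if all(cosine(bags[i], bags[j]) < sim_threshold for j in selected):
            selected.append(i)
        if len(selected) == n:
            break
    return sorted(selected)

docs = ["The heart pumps blood.", "Blood carries oxygen.", "The heart is a muscle."]
print(mead_like_select(docs, n=2))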
2.4 SMRZR summarizer
The SMRZR summarizer focuses on summarizing lectures using deep learning, is open source, and is freely available at https://smrzr.io/ [13]. The summarizer uses BERT [6] to project the sentences in the document to an s × w × e matrix, where s is the number of requested summary sentences, w is the number of words, and e is the embedding dimension. This matrix is then reduced to an s × e matrix by averaging over words, and each of the s sentence vectors in this reduced matrix is submitted to K-means clustering using k = n, the number of requested sentences. The sentences returned by the summarizer are those closest to the centroid of each of the clusters. SMRZR was not trained on a corpus but rather used a pre-trained BERT model. The layer from which the s × w × e matrix is extracted was manually selected based on experimentation with a small set of test cases.
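The clustering step can be sketched as follows. We substitute a pretrained sentence encoder (here the sentence-transformers package, an assumption on our part) for SMRZR's manually selected BERT layer and word averaging, so this approximates rather than reproduces the original pipeline; if two centroids share a nearest sentence, fewer than n indices are returned.

import numpy as np
from sentence_transformers import SentenceTransformer  # assumed dependency
from sklearn.cluster import KMeans

def smrzr_like_select(sentences, n):
    """Embed sentences, cluster with K-means (k = n), and return the index
    of the sentence closest to each cluster centroid."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for SMRZR's BERT layer
    embeddings = encoder.encode(sentences)             # shape (s, e) after pooling
    kmeans = KMeans(n_clusters=n, n_init=10).fit(embeddings)
    selected = set()
    for centroid in kmeans.cluster_centers_:
        distances = np.linalg.norm(embeddings - centroid, axis=1)
        selected.add(int(distances.argmin()))
    return sorted(selected)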
2.5 BERTSumExt summarizer
The BERTSumExt summarizer is a document-level BERT encoder that stacks inter-sentence Transformer [25] layers on top of BERT and is open source and freely available [10]. In this BERT variant, input sentences are separated by [CLS] tokens to learn sentence representations encoded in the corresponding token vectors at the output layer. These sentence representation vectors are then input to inter-sentence Transformer layers with position embeddings to capture sentence position, and these lead to a sigmoid classifier output layer that indicates the importance of the sentence. The top n such sentences can be returned to create an extractive summary. Unlike SMRZR and MEAD, BERTSumExt is directly trained on news corpora. BERTSumExt was state of the art on extractive summarization for the CNN/Daily Mail dataset [14] and was only recently surpassed by a system with less than a one-point improvement in recall [27].

3. EVALUATION

3.1 Procedure
Evaluation data were obtained by asking expert judges to create cloze items for three texts on science topics: the circulatory system, the nitrogen cycle, and photosynthesis. The text and cloze items for the circulatory system were taken from Olney et al. [19]. The other texts were created by a graduate student blind to the purpose of the study to match the length and difficulty of the circulatory system text. As shown in Table 1, the texts matched closely in number of words but somewhat less so in difficulty, with both the nitrogen cycle and photosynthesis texts being approximately two Flesch-Kincaid grade levels higher in difficulty than the circulatory system text.

Table 1: Text characteristics

Text            FK Grade  Words  Sents  Selected
Circulatory     6.2       987    73     21
Nitrogen cycle  8.2       976    94     26
Photosynthesis  8.2       977    75     24

Cloze items for the circulatory text were created by a graduate student who operationalized the task as selecting sentences conveying the main ideas. Cloze items for the other two texts were created by a high school biology teacher who was blind to the purpose of the study. Both human judges selected similar numbers of sentences across texts.

Each of the three texts was input into the models described in Section 2 along with the parameter n, the number of sentences selected by a human judge for that text. The primary evaluation metric was the number of sentences returned that were selected by human judges (i.e., the overlap), divided by n. This metric is equivalent to recall for extractive summarization, which some have argued is more appropriate than precision given the variability in human sentence selection [17].
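Concretely, the metric reduces to a one-liner; the selections in the example below are hypothetical.

def selection_recall(model_selected, human_selected):
    """Overlap between model and human selections divided by n, the number
    of human-selected sentences (recall, since the model also returns n)."""
    return len(set(model_selected) & set(human_selected)) / len(human_selected)

print(selection_recall({0, 2, 5, 7}, {2, 5, 7, 9}))  # -> 0.75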
Additionally, we evaluated several variants of the Pavlik et al. model that varied according to the primary heuristic of having at least three coreference chains of at least length two. The variants included having at least two coreference chains of at least length two, replacing this restriction by ranking by the total number of chains in the sentence, and removing this restriction entirely. Each variant ranks the sentences, post-constraint, by the summed length of all coreference chains in a sentence, just as in the original.

3.2 Results
Results are presented in Table 2, which shows the recall of each model per text, with the best score per text marked with an asterisk and the final column showing the average recall across texts. The initial rows of Table 2 correspond to the models in Section 2, followed by a random baseline (i.e., random selection of n sentences), followed by the variants of the Pavlik et al. model.

Table 2: Recall of Sentence Selection

Model           Circ. Sys.  Nit. Cyc.  Photosyn.  M
Olney et al.    .57*        .19        .33        .37
Pavlik et al.   .57*        .35        .46*       .46*
MEAD            .29         .42*       .33        .35
SMRZR           .33         .19        .38        .30
BERTSumExt     .10         .27        .38        .25
Random          .29         .28        .32        .29
Two chains      .48         .27        .38        .37
# chains        .52         .27        .38        .39
No restriction  .29         .35        .42        .35

The best performing model is that of Pavlik et al. [20], which has the best average score as well as the top score (or tied for the top score) for every text with the exception of the nitrogen cycle, for which MEAD achieves the highest score. The increased performance of the Pavlik et al. model relative to the original Olney et al. [19] model suggests that the discourse parser constraint of nuclearity is not contributing heavily to performance and that these contributions are easily overwhelmed by using a higher-performing coreference resolution system. However, it is notable that although the two systems achieve the same score on the circulatory system text, they do not make identical predictions: 25% of the correct predictions differ between the two models.

It is remarkable both how badly the summarization models perform on this task and how their performance seems to improve as their simplicity increases. The most sophisticated model, BERTSumExt, which is near state of the art on extractive summarization, performs below chance on two of the three texts as well as below chance on average. SMRZR, another deep learning model, is similarly below chance on one of the three texts and only one point above chance on average. MEAD, the simplest and oldest model, is approximately at chance on two of the three texts, though its average score is elevated by its top performance on the nitrogen cycle text. Overall, these results suggest that the intuition that summarization models are suitable for the sentence selection task of cloze item creation is incorrect. Indeed, it appears that models trained on newswire text, like BERTSumExt, may be particularly poorly suited for this task.

Finally, the variant results indicate that the current heuristics used by Pavlik et al. are not overfitted to the original circulatory system text: no variant achieves a higher score on any single text or overall. However, the variant results suggest that heuristics involving the number of chains in a sentence are particularly important for improving the score on the circulatory system text.

4. DISCUSSION
We have proposed sentence selection as a standardized task associated with automated cloze item creation. Unlike previous work that has used rubrics to evaluate cloze items, sentence selection allows automated selections to be directly compared to human selections using standard evaluation metrics like recall. Because our results show that simple heuristics outperform extractive summarization models, including a state-of-the-art deep learning model, we argue that sentence selection for cloze item generation should be considered a distinct task from extractive summarization, particularly extractive summarization in the context of newswire text, where it has historically focused. Previous researchers have raised concerns with the type of direct evaluation we propose, based in part on the variability of the sentences human judges will select for extraction [17]. We believe that these concerns are more valid for newswire text as opposed to academic text, which by definition is designed for learning. While experts may not agree on what parts of a current news story are most important in a summary, we suspect that experts on photosynthesis generally agree on key ideas, and thus key sentences, in a text. However, we have not presented evidence confirming this suspicion in this paper, nor are we aware of research that has investigated this question. This suggests a new direction in automated cloze item creation: the creation of large datasets of cloze items on diverse texts, where each text has been annotated by a large enough sample of human judges that we can estimate human agreement reliably enough to calculate whether an automated method agrees as much (or more) with humans as humans do with each other. Without common datasets, standard evaluation metrics, and the comparisons these allow with previous work, we fear that researchers will continue to create novel systems and evaluate them in isolation, which will ultimately contribute little to progress on automated cloze item creation.
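As a sketch of how such a dataset could be used, the following compares a model's mean agreement with a panel of judges to the judges' mean pairwise agreement; all selections shown are hypothetical.

from itertools import combinations

def mean_recall(selection, judges):
    """Mean recall of one selection against each judge's selection."""
    return sum(len(set(selection) & j) / len(j) for j in judges) / len(judges)

judges = [{1, 4, 7, 9}, {1, 4, 8, 9}, {1, 5, 7, 9}]  # hypothetical judges
model = {1, 4, 7, 8}                                 # hypothetical model output
human_human = [len(a & b) / len(b) for a, b in combinations(judges, 2)]
print(mean_recall(model, judges))            # model-human agreement, approx. 0.67
print(sum(human_human) / len(human_human))   # human-human agreement, approx. 0.67

In this toy case the model agrees with the judges as much as they agree with each other, which is the criterion the proposed datasets would make testable.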
5. ACKNOWLEDGMENTS
This material is based upon work supported by the National Science Foundation under Grants 1918751 and 1934745 and by the Institute of Education Sciences under Grant R305A190448. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or the Institute of Education Sciences.

6. REFERENCES
[1] M. Agarwal and P. Mannem. Automatic gap-fill question generation from text books. In Proceedings of the Sixth Workshop on Innovative Use of NLP for Building Educational Applications, pages 56–64, Portland, Oregon, June 2011. Association for Computational Linguistics.
[2] J. C. Alderson. The cloze procedure and proficiency in English as a foreign language. TESOL Quarterly, 13(2):219–227, 1979.
[3] L. F. Bachman. The trait structure of cloze test scores. TESOL Quarterly, 16(1):61–70, 1982.
[4] L. F. Bachman. Performance on cloze tests with fixed-ratio and rational deletions. TESOL Quarterly, 19(3):535–556, 1985.
[5] D. Coniam. From text to test, automatically - an evaluation of a computer cloze-test generator. Hong Kong Journal of Applied Linguistics, 3(1):41–60, 1998.
[6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
[7] M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. Peters, M. Schmitz, and L. Zettlemoyer. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6, Melbourne, Australia, July 2018. Association for Computational Linguistics.
[8] A. Kurtasov. A system for generating cloze test items from Russian-language text. In Proceedings of the Student Research Workshop associated with RANLP 2013, pages 107–112, Hissar, Bulgaria, Sept. 2013.
[9] C.-L. Liu, C.-H. Wang, Z.-M. Gao, and S.-M. Huang. Applications of lexical information for algorithmically composing multiple-choice cloze items. In Proceedings of the Second Workshop on Building Educational Applications Using NLP, pages 1–8, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics.
[10] Y. Liu and M. Lapata. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3730–3740, Hong Kong, China, Nov. 2019. Association for Computational Linguistics.
[11] W. C. Mann and S. A. Thompson. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):243–281, 1988.
[12] C. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. Bethard, and D. McClosky. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, Baltimore, Maryland, June 2014. Association for Computational Linguistics.
[13] D. Miller. Leveraging BERT for extractive text summarization on lectures. CoRR, abs/1906.04165, 2019.
[14] R. Nallapati, B. Zhou, C. dos Santos, C. Gulcehre, and B. Xiang. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290. Association for Computational Linguistics, 2016.
[15] A. Narendra, M. Agarwal, and R. Shah. Automatic cloze-questions generation. In Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, pages 511–515, Hissar, Bulgaria, Sept. 2013.
[16] National Institute of Child Health and Human Development. Report of the National Reading Panel. Teaching children to read: An evidence-based assessment of the scientific research literature on reading and its implications for reading instruction. NIH Publication No. 00-4769. U.S. Government Printing Office, Washington, DC, 2000.
[17] A. Nenkova. Summarization evaluation for text and speech: issues and approaches. In INTERSPEECH 2006 - ICSLP, Ninth International Conference on Spoken Language Processing, Pittsburgh, PA, USA, September 17-21, 2006. ISCA, 2006.
[18] A. Nenkova and K. McKeown. Automatic summarization. Foundations and Trends in Information Retrieval, 5(2–3):103–233, 2011.
[19] A. M. Olney, P. J. Pavlik Jr., and J. K. Maass. Improving reading comprehension with automatically generated cloze item practice. In E. André, R. Baker, X. Hu, M. M. T. Rodrigo, and B. du Boulay, editors, Artificial Intelligence in Education, Lecture Notes in Computer Science, pages 262–273. Springer, 2017.
[20] P. I. Pavlik Jr., A. M. Olney, A. Banker, L. Eglington, and J. Yarbro. The mobile fact and concept textbook system (MoFaCTS). In S. Sosnovsky, P. Brusilovsky, R. Baraniuk, and A. Lan, editors, Proceedings of the Second International Workshop on Intelligent Textbooks 2020 co-located with 21st International Conference on Artificial Intelligence in Education (AIED 2020), pages 35–49, 2020.
[21] D. Radev, T. Allison, S. Blair-Goldensohn, J. Blitzer, A. Çelebi, S. Dimitrov, E. Drabek, A. Hakim, W. Lam, D. Liu, J. Otterbacher, H. Qi, H. Saggion, S. Teufel, M. Topper, A. Winkel, and Z. Zhang. MEAD - a platform for multidocument multilingual text summarization. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04), Lisbon, Portugal, May 2004. European Language Resources Association (ELRA).
[22] A. Skory and M. Eskenazi. Predicting cloze task quality for vocabulary training. In Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications, pages 49–56, Los Angeles, California, June 2010. Association for Computational Linguistics.
[23] M. Surdeanu, T. Hicks, and M. A. Valenzuela-Escarcega. Two practical rhetorical structure theory parsers. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 1–5, Denver, Colorado, June 2015. Association for Computational Linguistics.
[24] W. L. Taylor. "Cloze procedure": A new tool for measuring readability. Journalism Quarterly, 30(4):415–433, 1953.
[25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, editors, Proceedings of the Thirty-first Annual Conference on Neural Information Processing Systems, pages 5998–6008, 2017.
[26] Q. Xie, G. Lai, Z. Dai, and E. Hovy. Large-scale cloze test dataset created by teachers. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2344–2356, Brussels, Belgium, Oct.-Nov. 2018. Association for Computational Linguistics.
[27] M. Zhong, P. Liu, Y. Chen, D. Wang, X. Qiu, and X. Huang. Extractive summarization as text matching. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6197–6208, Online, July 2020. Association for Computational Linguistics.