Sentence Selection for Cloze Item Creation: A Standardized Task and Preliminary Results

Andrew M. Olney
University of Memphis
365 Innovation Drive, Suite 303
Memphis, Tennessee 38152
aolney@memphis.edu

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Cloze items are commonly used for both assessing learning and as a learning activity. This paper investigates the selection of sentences for cloze item creation by comparing methods ranging from simple heuristics to deep learning summarization models. An evaluation using human-generated cloze items from three different science texts indicates that simple heuristics substantially outperform summarization models, including state-of-the-art deep learning models. These results suggest that sentence selection for cloze item generation should be considered a distinct task from summarization and that continued advances on this task will require large datasets of human-generated cloze items.

Keywords
cloze item, assessment, learning, extractive summarization

1. INTRODUCTION
Cloze items, also known as fill-in-the-blank questions, are common in educational practice, with applications both for assessing learning and for promoting learning [16]. Because cloze items may be created directly from text simply by deleting a word or phrase, automated methods for creating cloze items have been considered since their inception. Indeed, the work widely viewed as introducing the cloze item also proposed creating them by randomly deleting words or deleting every nth word [24], and these methods became a common practice in the following decades [2]. For learning applications, however, such text-insensitive automated methods offer no control over content, and for assessment applications, research suggests that text-insensitive methods are better aligned with local properties of the text (e.g., grammar and vocabulary) than with non-local properties associated with text comprehension [2, 3, 4].
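To make the classic text-insensitive approach concrete, the minimal sketch below deletes every nth word in the spirit of Taylor [24]; the function name, blank token, and whitespace tokenization are our own illustrative choices, not part of any cited system.

def nth_word_cloze(sentence, n=5, blank="_____"):
    """Delete every nth word to create a text-insensitive cloze item."""
    words = sentence.split()
    return " ".join(blank if (i + 1) % n == 0 else word
                    for i, word in enumerate(words))

print(nth_word_cloze("The heart pumps blood through the arteries and veins of the body."))
# -> The heart pumps blood _____ the arteries and veins _____ the body.

Note that the deleted words fall wherever the count lands, which is exactly why such methods offer no control over content.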
Advances in natural language processing (NLP) since 1990 have enabled text-sensitive approaches to cloze item creation for both learning and assessment applications. Research in this area has broadly organized around two different goals: creating cloze items for language learning (native or foreign language) and for text comprehension (i.e., learning from text). These two goals have led to different approaches for creating text-sensitive cloze items. Research on cloze items for language learning tends to be keyword-first [5, 8, 9, 22], meaning that sentences in the text are selected for cloze items depending on the presence of relevant keywords. These keywords are then deleted to make cloze items. Similar to text-insensitive methods, a keyword-first approach emphasizes local properties of the text and so aligns with common language-learning concerns like grammar and vocabulary, while allowing for more control over content. In contrast, research on cloze items for text comprehension tends to be sentence-first [1, 15, 19], meaning that important sentences in the text are selected first, followed by procedures for deleting words to make cloze items. A common approach to selecting important sentences for cloze items is to use extractive summarization techniques [1, 15]. Extractive summarization systems attempt to create a coherent summary of a text by filtering out unimportant sentences in a text (conversely, selecting important sentences) [18] and so intuitively appear relevant for this task. Because sentence-first approaches focus on the non-local properties of the text, they are aligned with text comprehension concerns.

Research on automated cloze item creation has predominantly been theory-driven rather than data-driven, likely because large datasets of human-created cloze items have not been available until recently, and only then for language-learning goals [26]. Given the absence of data with which to train and evaluate models, researchers have used rule-based and statistical techniques that are fundamentally heuristic, and they have evaluated their systems largely using rubric-based human evaluation of the cloze items created, rather than by comparing them to human-generated cloze items. One notable exception is Olney et al. [19], who compare their method with human-generated items and randomly generated items on learning outcomes. However, that work does not present a detailed comparison of automatic- and human-generated cloze items.

Research on automated cloze item creation could benefit from adopting common practices in other areas of NLP, such as common datasets, standard evaluation metrics, and the comparisons these allow with previous work. To this end, the present paper proposes sentence selection as a standardized task associated with cloze item creation. The sentence selection task is ideal for standard evaluation metrics because automated selections can be directly compared to human selections. The remainder of this paper compares multiple existing methods and their performance on the sentence selection task, including Olney et al. [19], a recent updated version of that model [20] with several variants, and three extractive summarizers.

2. SENTENCE SELECTION MODELS

2.1 Olney et al. (2017)
Olney et al. [19] used a coreference resolution system [12] for selecting sentences. A coreference chain is a sequence of repeated mentions of the same entity across a text. A common example of a coreference chain is between a noun and corresponding pronouns (e.g., "Jill" and "her"), but mentions can be less obviously connected (e.g., "Queen of England" and "Elizabeth"). Intuitively, a long chain represents an entity that is important to the discourse, and a sentence containing multiple such chains is important because it involves multiple such entities. Olney et al. operationalized this intuition with the heuristic that important sentences should contain at least three coreference chains (i.e., should contain mentions in these chains) and that the chains themselves should have a length of at least two mentions. These sentences were then filtered using criteria from a discourse parser [23], specifically the nuclearity of elementary discourse units [11]. Under the theory implemented by the parser, clauses that carry little or no meaning are called satellites and are contrasted with nuclei that carry substantial meaning. Thus, selected sentences were deselected if they consisted of only satellite discourse units. This two-step heuristic was developed by inspecting a single text on the circulatory system and selecting criteria such that the number of selected sentences exactly matched the number of human-selected cloze sentences; the sentences themselves were not observed in the development of the heuristic. In later unpublished work, the above method was extended by ranking the sentences on the above criteria as well as the summed length of all coreference chains in a sentence. This extension makes it straightforward to return the top n sentences that meet the original two-step heuristic criteria while also relaxing these criteria when more sentences are requested.
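As a concrete illustration, the following minimal sketch implements the first step of this heuristic, assuming coreference chains have already been extracted by an external system such as Stanford CoreNLP [12]. Representing a chain simply as the list of sentence indices containing its mentions is our own simplification, and the discourse-parser filtering step is omitted.

def select_sentences(chains, num_sentences, min_chain_length=2, min_chains=3):
    """Select sentences containing mentions from at least min_chains
    coreference chains, each with at least min_chain_length mentions.
    A chain is represented as the list of sentence indices of its mentions."""
    long_chains = [c for c in chains if len(c) >= min_chain_length]
    counts = [0] * num_sentences
    for chain in long_chains:
        for sent_idx in set(chain):  # count each chain at most once per sentence
            counts[sent_idx] += 1
    return [i for i, c in enumerate(counts) if c >= min_chains]

# Toy input: chain 0 has mentions in sentences 0, 1, and 2, and so on.
chains = [[0, 1, 2], [1, 2], [2, 3], [3]]
print(select_sentences(chains, num_sentences=4))  # -> [2]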
2.2 Pavlik et al. (2020)
Pavlik et al. [20] describe a reimplementation of Olney et al. [19]. The reimplementation differs in several respects, including using a new coreference system based on deep learning [7] and doing away with the discourse parser constraint of nuclearity. It preserves the first step of the heuristic, prioritizing sentences having at least three coreference chains of at least length two, and similarly ranks sentences using those criteria as well as the summed length of all coreference chains in a sentence. No comparison with Olney et al. [19] was reported.
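The sketch below illustrates our reading of this ranking scheme: sentences meeting the three-chains-of-length-two criterion rank first, with ties broken by the summed length of all chains touching the sentence. It reuses the simplified chain representation above and is not drawn from the actual implementation.

def rank_sentences(chains, num_sentences, n):
    """Return the top n sentences, ranked first by whether they meet the
    three-chains-of-length-two criterion and then by the summed length of
    all coreference chains with a mention in the sentence."""
    counts = [0] * num_sentences   # qualifying chains per sentence
    summed = [0] * num_sentences   # summed chain length per sentence
    for chain in chains:
        for sent_idx in set(chain):
            summed[sent_idx] += len(chain)
            if len(chain) >= 2:
                counts[sent_idx] += 1
    order = sorted(range(num_sentences),
                   key=lambda i: (counts[i] >= 3, summed[i]), reverse=True)
    return order[:n]

chains = [[0, 1, 2], [1, 2], [2, 3], [3]]
print(rank_sentences(chains, num_sentences=4, n=2))  # -> [2, 1]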
2.3 MEAD summarizer
The MEAD summarizer [21] is a widely used, publicly available summarizer applicable to multiple documents and multiple languages. Although MEAD has an orientation to extractive summarization of multiple documents on the same topic (e.g., a news story), it can also be used to summarize a single document. MEAD uses a variety of features to select sentences for summarization, including sentence length, position in the document, cosine similarity with other sentences, keyword match, and LexPageRank, a measure of sentence centrality with respect to words in the document. By default, MEAD uses a linear combination of these features to identify important sentences and can be used to return the specified top n such sentences, skipping sentences that are too similar to already included sentences.
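The sketch below shows the general shape of this approach: score each sentence with a linear combination of features, then greedily take the top n while skipping redundant sentences. The specific features, weights, and similarity threshold are illustrative placeholders, not MEAD's actual defaults.

import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

def mead_like_select(sentences, n, weights=(1.0, 0.5, 0.5), sim_threshold=0.7):
    """Score sentences by a linear combination of centroid similarity,
    position, and length, then greedily take the top n, skipping any
    sentence too similar to one already selected."""
    bags = [Counter(s.lower().split()) for s in sentences]
    centroid = sum(bags, Counter())  # word counts for the whole document
    scores = []
    for i, bag in enumerate(bags):
        features = (cosine(bag, centroid),            # centroid similarity
                    1.0 - i / len(sentences),         # earlier is better
                    min(sum(bag.values()), 20) / 20)  # capped length
        scores.append(sum(w * f for w, f in zip(weights, features)))
    selected = []
    for i in sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True):
        if all(cosine(bags[i], bags[j]) < sim_threshold for j in selected):
            selected.append(i)
        if len(selected) == n:
            break
    return sorted(selected)

docs = ["The heart pumps blood.", "Blood carries oxygen.", "The heart is a muscle."]
print(mead_like_select(docs, n=2))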
2.4 SMRZR summarizer
The SMRZR summarizer focuses on summarizing lectures using deep learning, is open source, and is freely available at https://smrzr.io/ [13]. The summarizer uses BERT [6] to project the sentences in the document to an s × w × e matrix, where s is the number of requested summary sentences, w is the number of words, and e is the embedding dimension. This matrix is then reduced to an s × e matrix by averaging over words, and each of the s sentence vectors in this reduced matrix is submitted to K-means clustering using k = n, the number of requested sentences. The sentences returned by the summarizer are those closest to the centroid of each of the clusters. SMRZR was not trained on a corpus but rather used a pre-trained BERT model. The layer from which the s × w × e matrix is extracted was manually selected based on experimentation with a small set of test cases.
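The clustering step can be sketched as follows. We substitute a pretrained sentence encoder (here the sentence-transformers package, an assumption on our part) for SMRZR's manually selected BERT layer and word averaging, so this approximates rather than reproduces the original pipeline; if two centroids share a nearest sentence, fewer than n indices are returned.

import numpy as np
from sentence_transformers import SentenceTransformer  # assumed dependency
from sklearn.cluster import KMeans

def smrzr_like_select(sentences, n):
    """Embed sentences, cluster with K-means (k = n), and return the index
    of the sentence closest to each cluster centroid."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for SMRZR's BERT layer
    embeddings = encoder.encode(sentences)             # shape (s, e) after pooling
    kmeans = KMeans(n_clusters=n, n_init=10).fit(embeddings)
    selected = set()
    for centroid in kmeans.cluster_centers_:
        distances = np.linalg.norm(embeddings - centroid, axis=1)
        selected.add(int(distances.argmin()))
    return sorted(selected)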
2.5 BERTSumExt summarizer
The BERTSumExt summarizer is a document-level BERT encoder that stacks inter-sentence Transformer [25] layers on top of BERT and is open source and freely available [10]. In this BERT variant, input sentences are separated by [CLS] tokens to learn sentence representations encoded in the corresponding token vectors at the output layer. These sentence representation vectors are then input to inter-sentence Transformer layers with position embeddings to capture sentence position, and these lead to a sigmoid classifier output layer that indicates the importance of the sentence. The top n such sentences can be returned to create an extractive summary. Unlike SMRZR and MEAD, BERTSumExt is directly trained on news corpora. BERTSumExt was state of the art on extractive summarization for the CNN/Daily Mail dataset [14] and was only recently surpassed by a system with less than a one-point improvement in recall [27].

3. EVALUATION

3.1 Procedure
Evaluation data were obtained by asking expert judges to create cloze items for three texts on science topics: the circulatory system, the nitrogen cycle, and photosynthesis. The text and cloze items for the circulatory system were taken from Olney et al. [19]. The other texts were created by a graduate student blind to the purpose of the study to match the length and difficulty of the circulatory system text. As shown in Table 1, the texts matched closely in number of words but somewhat less so in difficulty, with both the nitrogen cycle and photosynthesis texts being approximately two Flesch-Kincaid grade levels higher in difficulty than the circulatory system text.

Table 1: Text characteristics

Text            FK Grade  Words  Sents  Selected
Circulatory     6.2       987    73     21
Nitrogen cycle  8.2       976    94     26
Photosynthesis  8.2       977    75     24

Cloze items for the circulatory text were created by a graduate student who operationalized the task as selecting sentences conveying the main ideas. Cloze items for the other two texts were created by a high school biology teacher who was blind to the purpose of the study. Both human judges selected similar numbers of sentences across texts.

Each of the three texts was input into the models described in Section 2 along with the parameter n, the number of sentences selected by a human judge for that text. The primary evaluation metric was the number of sentences returned that were selected by human judges (i.e., the overlap), divided by n. This metric is equivalent to recall for extractive summarization, which some have argued is more appropriate than precision given the variability in human sentence selection [17].
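Concretely, the metric reduces to a one-liner; the selections in the example below are hypothetical.

def selection_recall(model_selected, human_selected):
    """Overlap between model and human selections divided by n, the number
    of human-selected sentences (recall, since the model also returns n)."""
    return len(set(model_selected) & set(human_selected)) / len(human_selected)

print(selection_recall({0, 2, 5, 7}, {2, 5, 7, 9}))  # -> 0.75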
Additionally, we evaluated several variants of the Pavlik et al. model that varied according to the primary heuristic of having at least three coreference chains of at least length two. The variants included having at least two coreference chains of at least length two, replacing this restriction by ranking by the total number of chains in the sentence, and removing this restriction entirely. Each variant ranks the sentences, post-constraint, by the summed length of all coreference chains in a sentence, just as in the original.

3.2 Results
Results are presented in Table 2, which shows the recall of each model per text, with the best score per text marked with an asterisk and the final column showing the average recall across texts. The initial rows of Table 2 correspond to the models in Section 2, followed by a random baseline (i.e., random selection of n sentences), followed by the variants of the Pavlik et al. model.

Table 2: Recall of Sentence Selection

Model           Circ. Sys.  Nit. Cyc.  Photosyn.  M
Olney et al.    .57*        .19        .33        .37
Pavlik et al.   .57*        .35        .46*       .46*
MEAD            .29         .42*       .33        .35
SMRZR           .33         .19        .38        .30
BERTSumExt     .10         .27        .38        .25
Random          .29         .28        .32        .29
Two chains      .48         .27        .38        .37
# chains        .52         .27        .38        .39
No restriction  .29         .35        .42        .35

The best performing model is that of Pavlik et al. [20], which has the best average score as well as the top score (or tied for the top score) for every text with the exception of the nitrogen cycle, for which MEAD achieves the highest score. The increased performance of the Pavlik et al. model relative to the original Olney et al. [19] model suggests that the discourse parser constraint of nuclearity is not contributing heavily to performance and that these contributions are easily overwhelmed by using a higher-performing coreference resolution system. However, it is notable that although the two systems achieve the same score on the circulatory system text, they do not make identical predictions: 25% of the correct predictions differ between the two models.

It is remarkable both how badly the summarization models perform on this task and how their performance seems to improve as their simplicity increases. The most sophisticated model, BERTSumExt, which is near state of the art on extractive summarization, performs below chance on two of the three texts as well as below chance on average. SMRZR, another deep learning model, is similarly below chance on one of the three texts and only one point above chance on average. MEAD, the simplest and oldest model, is approximately at chance on two of the three texts, though its average score is elevated by its top performance on the nitrogen cycle text. Overall, these results suggest that the intuition that summarization models are suitable for the sentence selection task of cloze item creation is incorrect. Indeed, it appears that models trained on newswire text, like BERTSumExt, may be particularly poorly suited for this task.

Finally, the variant results indicate that the current heuristics used by Pavlik et al. are not overfitted to the original circulatory system text: no variant achieves a higher score on any single text or overall. However, the variant results suggest that heuristics involving the number of chains in a sentence are particularly important for improving the score on the circulatory system text.

4. DISCUSSION
We have proposed sentence selection as a standardized task associated with automated cloze item creation. Unlike previous work that has used rubrics to evaluate cloze items, sentence selection allows automated selections to be directly compared to human selections using standard evaluation metrics like recall. Because our results show that simple heuristics outperform extractive summarization models, including a state-of-the-art deep learning model, we argue that sentence selection for cloze item generation should be considered a distinct task from extractive summarization, particularly extractive summarization in the context of newswire text, where it has historically focused. Previous researchers have raised concerns with the type of direct evaluation we propose, based in part on the variability of the sentences human judges will select for extraction [17]. We believe that these concerns are more valid for newswire text as opposed to academic text, which by definition is designed for learning. While experts may not agree on what parts of a current news story are most important in a summary, we suspect that experts on photosynthesis generally agree on key ideas, and thus key sentences, in a text. However, we have not presented evidence confirming this suspicion in this paper, nor are we aware of research that has investigated this question. This suggests a new direction in automated cloze item creation: the creation of large datasets of cloze items on diverse texts, where each text has been annotated by a large enough sample of human judges that we can estimate human agreement reliably enough to calculate whether an automated method agrees as much (or more) with humans as humans do with each other. Without common datasets, standard evaluation metrics, and the comparisons these allow with previous work, we fear that researchers will continue to create novel systems and evaluate them in isolation, which will ultimately contribute little to progress on automated cloze item creation.
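As a sketch of how such a dataset could be used, the following compares a model's mean agreement with a panel of judges to the judges' mean pairwise agreement; all selections shown are hypothetical.

from itertools import combinations

def mean_recall(selection, judges):
    """Mean recall of one selection against each judge's selection."""
    return sum(len(set(selection) & j) / len(j) for j in judges) / len(judges)

judges = [{1, 4, 7, 9}, {1, 4, 8, 9}, {1, 5, 7, 9}]  # hypothetical judges
model = {1, 4, 7, 8}                                 # hypothetical model output
human_human = [len(a & b) / len(b) for a, b in combinations(judges, 2)]
print(mean_recall(model, judges))            # model-human agreement, approx. 0.67
print(sum(human_human) / len(human_human))   # human-human agreement, approx. 0.67

In this toy case the model agrees with the judges as much as they agree with each other, which is the criterion the proposed datasets would make testable.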
5. ACKNOWLEDGMENTS
This material is based upon work supported by the National Science Foundation under Grants 1918751 and 1934745 and by the Institute of Education Sciences under Grant R305A190448. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or the Institute of Education Sciences.

6. REFERENCES
[1] M. Agarwal and P. Mannem. Automatic gap-fill question generation from text books. In Proceedings of the Sixth Workshop on Innovative Use of NLP for Building Educational Applications, pages 56–64, Portland, Oregon, June 2011. Association for Computational Linguistics.
[2] J. C. Alderson. The cloze procedure and proficiency in English as a foreign language. TESOL Quarterly, 13(2):219–227, 1979.
[3] L. F. Bachman. The trait structure of cloze test scores. TESOL Quarterly, 16(1):61–70, 1982.
[4] L. F. Bachman. Performance on cloze tests with fixed-ratio and rational deletions. TESOL Quarterly, 19(3):535–556, 1985.
[5] D. Coniam. From text to test, automatically - an evaluation of a computer cloze-test generator. Hong Kong Journal of Applied Linguistics, 3(1):41–60, 1998.
[6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
[7] M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. Peters, M. Schmitz, and L. Zettlemoyer. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6, Melbourne, Australia, July 2018. Association for Computational Linguistics.
[8] A. Kurtasov. A system for generating cloze test items from Russian-language text. In Proceedings of the Student Research Workshop associated with RANLP 2013, pages 107–112, Hissar, Bulgaria, Sept. 2013.
[9] C.-L. Liu, C.-H. Wang, Z.-M. Gao, and S.-M. Huang. Applications of lexical information for algorithmically composing multiple-choice cloze items. In Proceedings of the Second Workshop on Building Educational Applications Using NLP, pages 1–8, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics.
[10] Y. Liu and M. Lapata. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3730–3740, Hong Kong, China, Nov. 2019. Association for Computational Linguistics.
[11] W. C. Mann and S. A. Thompson. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):243–281, 1988.
[12] C. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. Bethard, and D. McClosky. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, Baltimore, Maryland, June 2014. Association for Computational Linguistics.
[13] D. Miller. Leveraging BERT for extractive text summarization on lectures. CoRR, abs/1906.04165, 2019.
[14] R. Nallapati, B. Zhou, C. dos Santos, C. Gulcehre, and B. Xiang. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290. Association for Computational Linguistics, 2016.
[15] A. Narendra, M. Agarwal, and R. Shah. Automatic cloze-questions generation. In Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, pages 511–515, Hissar, Bulgaria, Sept. 2013.
[16] National Institute of Child Health and Human Development. Report of the National Reading Panel. Teaching children to read: An evidence-based assessment of the scientific research literature on reading and its implications for reading instruction. NIH Publication No. 00-4769. U.S. Government Printing Office, Washington, DC, 2000.
[17] A. Nenkova. Summarization evaluation for text and speech: issues and approaches. In INTERSPEECH 2006 - ICSLP, Ninth International Conference on Spoken Language Processing, Pittsburgh, PA, USA, September 17-21, 2006. ISCA, 2006.
[18] A. Nenkova and K. McKeown. Automatic summarization. Foundations and Trends in Information Retrieval, 5(2–3):103–233, 2011.
[19] A. M. Olney, P. J. Pavlik Jr., and J. K. Maass. Improving reading comprehension with automatically generated cloze item practice. In E. André, R. Baker, X. Hu, M. M. T. Rodrigo, and B. du Boulay, editors, Artificial Intelligence in Education, Lecture Notes in Computer Science, pages 262–273. Springer, 2017.
[20] P. I. Pavlik Jr., A. M. Olney, A. Banker, L. Eglington, and J. Yarbro. The mobile fact and concept textbook system (MoFaCTS). In S. Sosnovsky, P. Brusilovsky, R. Baraniuk, and A. Lan, editors, Proceedings of the Second International Workshop on Intelligent Textbooks 2020 co-located with 21st International Conference on Artificial Intelligence in Education (AIED 2020), pages 35–49, 2020.
[21] D. Radev, T. Allison, S. Blair-Goldensohn, J. Blitzer, A. Çelebi, S. Dimitrov, E. Drabek, A. Hakim, W. Lam, D. Liu, J. Otterbacher, H. Qi, H. Saggion, S. Teufel, M. Topper, A. Winkel, and Z. Zhang. MEAD - a platform for multidocument multilingual text summarization. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04), Lisbon, Portugal, May 2004. European Language Resources Association (ELRA).
[22] A. Skory and M. Eskenazi. Predicting cloze task quality for vocabulary training. In Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications, pages 49–56, Los Angeles, California, June 2010. Association for Computational Linguistics.
[23] M. Surdeanu, T. Hicks, and M. A. Valenzuela-Escarcega. Two practical rhetorical structure theory parsers. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 1–5, Denver, Colorado, June 2015. Association for Computational Linguistics.
[24] W. L. Taylor. "Cloze procedure": A new tool for measuring readability. Journalism Quarterly, 30(4):415–433, 1953.
[25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, editors, Proceedings of the Thirty-first Annual Conference on Neural Information Processing Systems, pages 5998–6008, 2017.
[26] Q. Xie, G. Lai, Z. Dai, and E. Hovy. Large-scale cloze test dataset created by teachers. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2344–2356, Brussels, Belgium, Oct.-Nov. 2018. Association for Computational Linguistics.
[27] M. Zhong, P. Liu, Y. Chen, D. Wang, X. Qiu, and X. Huang. Extractive summarization as text matching. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6197–6208, Online, July 2020. Association for Computational Linguistics.