<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Sentence Selection for Cloze Item Creation: A Standardized Task and Preliminary Results</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrew M. Olney</string-name>
          <email>aolney@memphis.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Memphis 365 Innovation Drive</institution>
          ,
          <addr-line>Suite 303 Memphis, Tennessee 38152</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Cloze items are commonly used for both assessing learning and as a learning activity. This paper investigates the selection of sentences for cloze item creation by comparing methods ranging from simple heuristics to deep learning summarization models. An evaluation using human-generated cloze items from three different science texts indicates that simple heuristics substantially outperform summarization models, including state-of-the-art deep learning models. These results suggest that sentence selection for cloze item generation should be considered a distinct task from summarization and that continued advances on this task will require large datasets of human-generated cloze items.</p>
      </abstract>
      <kwd-group>
        <kwd>cloze item</kwd>
        <kwd>assessment</kwd>
        <kwd>learning</kwd>
        <kwd>extractive summarization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Cloze items, also known as fill-in-the-blank questions, are
common in educational practice, with applications both for
assessing learning and for promoting learning [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Because
cloze items may be created directly from text simply by
deleting a word or phrase, automated methods for creating
cloze items have been considered since their inception.
Indeed, the work widely viewed as introducing the cloze item
also proposed creating them by randomly deleting words or
deleting every nth word [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], and these methods became a
common practice in the following decades [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. For
learning applications, however, such text-insensitive automated
methods offer no control over content, and for assessment
applications, research suggests that text-insensitive
methods are better aligned with local properties of the text (e.g.,
grammar and vocabulary) than with non-local properties
associated with text comprehension [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ].
      </p>
      <p>Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        Advances in natural language processing (NLP) since 1990
have enabled text-sensitive approaches to cloze item creation
for both learning and assessment applications. Research in
this area has been broadly organized around two different goals:
creating cloze items for language learning (native or foreign
language) and for text comprehension (i.e., learning from
text). These two goals have led to different approaches
for creating text-sensitive cloze items. Research on cloze
items for language learning tends to be keyword-first [
        <xref ref-type="bibr" rid="ref5 ref8 ref9 ref22">5, 8, 9, 22</xref>
        ], meaning that sentences in the text are selected for
cloze items depending on the presence of relevant keywords.
These keywords are then deleted to make cloze items.
Similar to text-insensitive methods, a keyword-first approach
emphasizes local properties of the text and so aligns with
common language-learning concerns like grammar and
vocabulary, while allowing for more control over content. In
contrast, research on cloze items for text comprehension
tends to be sentence-first [
        <xref ref-type="bibr" rid="ref1 ref15 ref19">1, 15, 19</xref>
        ], meaning that
important sentences in the text are selected first, followed by
procedures for deleting words to make cloze items. A common
approach to selecting important sentences for cloze items is
to use extractive summarization techniques [
        <xref ref-type="bibr" rid="ref1 ref15">1, 15</xref>
        ].
Extractive summarization systems attempt to create a coherent
summary of a text by filtering out unimportant sentences in
a text (conversely selecting important sentences) [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and so
intuitively appear relevant for this task. Because
sentence-first approaches focus on the non-local properties of the text,
they are aligned with text comprehension concerns.
      </p>
      <p>
        Research on automated cloze item creation has
predominantly been theory-driven rather than data-driven, likely
because large datasets of human-created cloze items have
not been available until recently and only then for
language-learning goals [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. Given the absence of data with which to
train and evaluate models, researchers have used rule-based
and statistical techniques that are fundamentally heuristic,
and they have evaluated their systems largely using
rubric-based human evaluation of the cloze items created, rather
than by comparing them to human-generated cloze items.
One notable exception is Olney et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], who compare
their method with human-generated items and randomly
generated items on learning outcomes. However, that work
does not present a detailed comparison of automatic- and
human-generated cloze items.
      </p>
      <p>
        Research on automated cloze item creation could benefit
from adopting common practices in other areas of NLP, such
as common datasets, standard evaluation metrics, and the
comparisons these allow with previous work. To this end,
the present paper proposes sentence selection as a
standardized task associated with cloze item creation. The sentence
selection task is ideal for standard evaluation metrics
because automated selections can be directly compared to
human selections. The remainder of this paper compares
multiple existing methods and their performance on the sentence
selection task, including Olney et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], a recent updated
version of that model [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] with several variants, and three
extractive summarizers.
      </p>
    </sec>
    <sec id="sec-2-heading">
      <title>2. SENTENCE SELECTION MODELS</title>
    </sec>
    <sec id="sec-2">
      <title>2.1 Olney et al. (2017)</title>
      <p>
        Olney et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] used a coreference resolution system [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] for
selecting sentences. A coreference chain is a sequence of
repeated mentions of the same entity across a text. A common
example of a coreference chain is between a noun and
corresponding pronouns (e.g., "Jill" and "her"), but mentions can
be less obviously connected (e.g., "Queen of England" and
"Elizabeth"). Intuitively, a long chain represents an entity
that is important to the discourse, and a sentence containing
multiple such chains is important because it involves
multiple such entities. Olney et al. operationalized this intuition
with the heuristic that important sentences should contain
at least three coreference chains (i.e., should contain
mentions in these chains) and that the chains themselves should
have a length of at least two mentions. These sentences
were then filtered using criteria from a discourse parser [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ],
specifically nuclearity of elementary discourse units [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
Under the theory implemented by the parser, clauses that carry
little or no meaning are called satellites and are contrasted
with nuclei that carry substantial meaning. Thus, selected
sentences were deselected if they consisted of only satellite
discourse units. This two-step heuristic was developed by
inspecting a single text on the circulatory system and
selecting criteria such that the number of selected sentences
exactly matched the number of human-selected cloze
sentences; the sentences themselves were not observed in the
development of the heuristic. In later unpublished work,
the above method was extended by ranking the sentences
on the above criteria as well as the summed length of all
coreference chains in a sentence. This extension makes it
straightforward to return the top n sentences that meet the
original two-step heuristic criteria while also relaxing these
criteria when more sentences are requested.
      </p>
    </sec>
    <sec id="sec-2-2">
      <title>2.2 Pavlik et al. (2020)</title>
      <p>
        Pavlik et al. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] describe a reimplementation of Olney et
al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. The reimplementation differs in several respects,
including using a new coreference system based on deep
learning [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and doing away with the discourse parser constraint
of nuclearity. It preserves the first step of the heuristic,
prioritizing sentences having at least three coreference chains
of at least length two, and similarly ranks sentences using
that criterion as well as the summed length of all coreference
chains in a sentence. No comparison with Olney et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]
was reported.
      </p>
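      <p>The chain-based ranking described above can be illustrated with a minimal sketch. It is not the code of either system: the input format (each coreference chain given as a list of sentence indices, one per mention) and the function names <monospace>rank_sentences</monospace> and <monospace>select_top_n</monospace> are assumptions made for illustration, and the coreference resolver itself is left out. Sentences that meet the criterion of at least three chains of length two or more are ranked first, and ties within each group are broken by the summed length of all chains mentioned in the sentence.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import List

# Hypothetical input format: one chain = list of sentence indices, one per mention.
Chain = List[int]


def rank_sentences(num_sentences: int, chains: List[Chain],
                   min_chains: int = 3, min_length: int = 2) -> List[int]:
    """Return sentence indices, best first.

    Sentences with at least `min_chains` chains of length >= `min_length`
    come first; within each group, a larger summed chain length ranks higher.
    """
    qualifying = [0] * num_sentences  # chains of sufficient length per sentence
    summed_len = [0] * num_sentences  # summed length of all chains per sentence
    for chain in chains:
        for s in set(chain):          # count each chain once per sentence
            summed_len[s] += len(chain)
            if len(chain) >= min_length:
                qualifying[s] += 1

    def key(s: int):
        meets = qualifying[s] >= min_chains
        # False sorts before True, so sentences meeting the criterion lead.
        return (not meets, -summed_len[s], s)

    return sorted(range(num_sentences), key=key)


def select_top_n(num_sentences: int, chains: List[Chain], n: int) -> List[int]:
    """Top n sentences by rank, returned in document order."""
    return sorted(rank_sentences(num_sentences, chains)[:n])


chains = [[0, 2, 2], [0, 2], [2, 3], [1, 1, 1, 1, 1, 1]]
print(rank_sentences(4, chains))   # [2, 1, 0, 3]
print(select_top_n(4, chains, 2))  # [1, 2]
```
]]></preformat>
      <p>Because the criterion only reorders sentences rather than excluding them, asking for more sentences than meet it simply continues down the ranking, which is one way to read the relaxation described above.</p>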
    </sec>
    <sec id="sec-3">
      <title>2.3 MEAD summarizer</title>
      <p>
        The MEAD summarizer [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] is a widely used, publicly
available summarizer applicable to multiple documents and
multiple languages. Although MEAD has an orientation to
extractive summarization of multiple documents on the same
topic (e.g., a news story), it can also be used to summarize a
single document. MEAD uses a variety of features to select
sentences for summarization, including sentence length,
position in the document, cosine with other sentences, keyword
match, and LexPageRank, a measure of sentence centrality
with respect to words in the document. By default, MEAD
uses a linear combination of these features to identify
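important sentences and can be used to return the specified top
n such sentences, skipping sentences that are too similar to
already included sentences.</p>
      <table-wrap>
        <table>
          <thead>
            <tr><th>Text</th><th>Selected</th></tr>
          </thead>
          <tbody>
            <tr><td>Circulatory</td><td>21</td></tr>
            <tr><td>Nitrogen cycle</td><td>26</td></tr>
            <tr><td>Photosynthesis</td><td>24</td></tr>
          </tbody>
        </table>
      </table-wrap>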
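      <p>The general scheme of scoring sentences with a linear combination of features and then skipping near-duplicates can be sketched as follows. This is not MEAD's code or interface: the feature names, weights, bag-of-words cosine, and the similarity threshold <monospace>max_sim</monospace> are all assumptions chosen for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from typing import Dict, List


def bag_of_words(sentence: str) -> Dict[str, float]:
    """Lower-cased whitespace tokens with their counts."""
    counts: Dict[str, float] = {}
    for tok in sentence.lower().split():
        counts[tok] = counts.get(tok, 0.0) + 1.0
    return counts


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def select(sentences: List[str], features: List[Dict[str, float]],
           weights: Dict[str, float], n: int, max_sim: float = 0.7) -> List[int]:
    """Greedy top-n selection by weighted feature sum, skipping any sentence
    whose cosine with an already chosen sentence exceeds `max_sim`."""
    scores = [sum(weights[name] * f[name] for name in weights) for f in features]
    order = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))
    vectors = [bag_of_words(s) for s in sentences]
    chosen: List[int] = []
    for i in order:
        if len(chosen) == n:
            break
        if all(cosine(vectors[i], vectors[j]) <= max_sim for j in chosen):
            chosen.append(i)
    return chosen


sentences = ["the heart pumps blood", "the heart pumps blood",
             "plants fix nitrogen", "leaves absorb light"]
features = [{"position": 1.0, "centrality": 0.9},
            {"position": 0.9, "centrality": 0.9},
            {"position": 0.5, "centrality": 0.4},
            {"position": 0.2, "centrality": 0.1}]
weights = {"position": 1.0, "centrality": 1.0}
print(select(sentences, features, weights, n=2))  # [0, 2]
```
]]></preformat>
      <p>In the example, the second sentence scores highly but is skipped because it duplicates the first, so the third sentence is returned instead.</p>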
    </sec>
    <sec id="sec-4">
      <title>2.4 SMRZR summarizer</title>
      <p>
        The SMRZR summarizer focuses on summarizing lectures
using deep learning, is open source, and is freely available at
https://smrzr.io/ [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The summarizer uses BERT [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to
project the sentences in the document to an s × w × e matrix,
where s is the number of sentences in the document, w is
the number of words, and e is the embedding dimension. This matrix
is then reduced to an s × e matrix by averaging over words,
and each of the s sentence vectors in this reduced matrix
is submitted to K-means clustering using k = n, the
number of requested sentences. The sentences returned by the
summarizer are those closest to the centroid of each of the
clusters. SMRZR was not trained on a corpus but rather
used a pre-trained BERT model. The layer from which the
s × w × e matrix is extracted was manually selected based on
experimentation with a small set of test cases.
      </p>
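      <p>The pooling and clustering steps can be sketched without BERT by taking per-word vectors as given. The sketch below is not SMRZR's code: the pure-Python K-means, its evenly spaced deterministic initialization, and the function names are assumptions for illustration, and the toy two-dimensional vectors stand in for real embeddings.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import List

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a list of equal-length vectors (words -> sentence vector)."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], k: int, iterations: int = 20) -> List[Vector]:
    """Plain K-means; initial centroids are evenly spaced points (an assumption)."""
    centroids = [list(points[(i * len(points)) // k]) for i in range(k)]
    for _ in range(iterations):
        clusters: List[List[Vector]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = mean_pool(members)
    return centroids


def summarize(doc_word_vectors: List[List[Vector]], n: int) -> List[int]:
    """Indices of the sentences closest to each of the n cluster centroids."""
    sentence_vectors = [mean_pool(words) for words in doc_word_vectors]
    centroids = kmeans(sentence_vectors, n)
    picked = {
        min(range(len(sentence_vectors)),
            key=lambda i: sq_dist(sentence_vectors[i], c))
        for c in centroids
    }
    return sorted(picked)


# Six toy sentences; each inner list holds that sentence's word vectors.
doc = [
    [[0, 0]], [[0, 0], [2, 0]], [[2, 0]],
    [[10, 10]], [[10, 10], [10, 12]], [[10, 12]],
]
print(summarize(doc, n=2))  # [1, 4]
```
]]></preformat>
      <p>If two centroids share the same nearest sentence, this sketch returns fewer than n sentences; how a real system resolves that case is not covered here.</p>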
    </sec>
    <sec id="sec-5">
      <title>2.5 BERTSumExt summarizer</title>
      <p>
        The BERTSumExt summarizer is a document-level BERT
encoder that stacks inter-sentence Transformer [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] layers
on top of BERT and is open source and freely available
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In this BERT variant, input sentences are separated
by [cls] tokens to learn sentence representations encoded in
corresponding token vectors at the output layer. These
sentence representation vectors are then input to inter-sentence
Transformer layers with position embeddings to capture
sentence position, and these lead to a sigmoid classifier output
layer that indicates the importance of the sentence. The
top n such sentences can be returned to create an extractive
summary. Unlike SMRZR and MEAD, BERTSumExt is
directly trained on news corpora. BERTSumExt was state
of the art on extractive summarization for the CNN/Daily
Mail dataset [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and was only recently surpassed by a
system with less than a 1 point improvement in recall [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ].
      </p>
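      <p>The final selection step, turning per-sentence classifier outputs into a top-n extract, is simple enough to sketch in a few lines. The logits below are made-up numbers, not BERTSumExt outputs, and <monospace>top_n_sentences</monospace> is a hypothetical helper name.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def top_n_sentences(logits: List[float], n: int) -> List[int]:
    """Score each sentence with a sigmoid and keep the n highest,
    returned in document order."""
    scores = [sigmoid(z) for z in logits]
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:n])


print(top_n_sentences([2.0, -1.0, 0.5, 3.0], n=2))  # [0, 3]
```
]]></preformat>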
    </sec>
    <sec id="sec-6">
      <title>3. EVALUATION</title>
    </sec>
    <sec id="sec-7">
      <title>3.1 Procedure</title>
      <p>
        Evaluation data were obtained by asking expert judges to
create cloze items for three texts on science topics,
including the circulatory system, the nitrogen cycle, and
photosynthesis. The text and cloze items for the circulatory system
were taken from Olney et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. The other texts were
created by a graduate student blind to the purpose of the
study to match the length and difficulty of the circulatory
system text. As shown in Table 1, texts matched closely in
number of words but somewhat less so in terms of difficulty,
with both nitrogen cycle and photosynthesis texts being
approximately two Flesch-Kincaid grade level units higher in
difficulty than the circulatory system text.
      </p>
      <p>
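        Cloze items for the circulatory text were created by a
graduate student who operationalized the task as selecting
sentences conveying the main ideas. Cloze items for the other
two texts were created by a high school biology teacher who
was blind to the purpose of the study. Both human judges
selected similar numbers of sentences across texts.
      </p>
      <table-wrap>
        <table>
          <thead>
            <tr><th>M</th></tr>
          </thead>
          <tbody>
            <tr><td>.37</td></tr>
            <tr><td>.46</td></tr>
            <tr><td>.35</td></tr>
            <tr><td>.30</td></tr>
            <tr><td>.25</td></tr>
            <tr><td>.29</td></tr>
            <tr><td>.37</td></tr>
            <tr><td>.39</td></tr>
            <tr><td>.35</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        Each of the three texts was input into the models described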
in Section 2 along with the parameter n, the number of
sentences selected by a human judge for that text. The primary
evaluation metric was the number of sentences returned that
were selected by human judges (i.e., overlap), divided by n.
This metric is equivalent to recall for extractive
summarization, which some have argued is more appropriate than
precision given the variability in human sentence selection
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
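      <p>As a concrete reading of this metric, the following minimal sketch computes the overlap between model-selected and human-selected sentence indices, divided by n. The function name <monospace>recall_at_n</monospace> and the index-based representation of sentences are assumptions for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Iterable


def recall_at_n(selected: Iterable[int], gold: Iterable[int]) -> float:
    """Overlap between model and human selections, divided by n,
    where n is the number of human-selected sentences."""
    gold_set = set(gold)
    if not gold_set:
        raise ValueError("gold selection must not be empty")
    return len(set(selected) & gold_set) / len(gold_set)


# The judge chose 4 sentences; the model returned 4, of which 2 match.
print(recall_at_n([0, 2, 5, 7], [0, 1, 2, 3]))  # 0.5
```
]]></preformat>
      <p>Because each model returns exactly n sentences, precision and recall coincide under this setup.</p>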
      <p>Additionally, we evaluated several variants of the Pavlik et
al. model that varied according to the primary heuristic of
having at least three coreference chains of at least length
two. The variants included having at least two
coreference chains of at least length two, replacing this restriction
by ranking by the total number of chains in the sentence,
and removing this restriction entirely. Each variant ranks
the sentences, post-constraint, by the summed length of all
coreference chains in a sentence, just as in the original.</p>
    </sec>
    <sec id="sec-8">
      <title>3.2 Results</title>
      <p>Results are presented in Table 2, which shows the best model
recall score per text in bold font, with the final column
showing the average recall across texts. The initial rows of Table 2
correspond to the models in Section 2, followed by a random
baseline (i.e., random selection of n sentences), followed by
the variants of the Pavlik et al. model.</p>
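      <p>For orientation on what a random baseline implies, the expected recall of drawing n sentences uniformly at random from a text with N sentences is n/N. The sketch below states that value and confirms it by exhaustive enumeration on a toy case; the text size and gold set are made-up numbers, not values from this study, and the function names are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from fractions import Fraction
from itertools import combinations
from typing import Set


def expected_random_recall(total: int, n: int) -> Fraction:
    """Expected recall when n of `total` sentences are drawn uniformly at
    random and n sentences are gold (mean of a hypergeometric, over n)."""
    return Fraction(n, total)


def recall_by_enumeration(total: int, gold: Set[int]) -> Fraction:
    """Average recall over every possible selection of len(gold) sentences."""
    n = len(gold)
    recalls = [Fraction(len(set(c) & gold), n)
               for c in combinations(range(total), n)]
    return sum(recalls, Fraction(0)) / len(recalls)


print(expected_random_recall(6, 2))      # 1/3
print(recall_by_enumeration(6, {0, 1}))  # 1/3
```
]]></preformat>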
      <p>
        The best performing model is Pavlik et al. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], which has
the best average score as well as the top score (or tied) for
every text with the exception of the nitrogen cycle, for which
MEAD achieves the highest score. The increased
performance of the Pavlik et al. model relative to the original Olney
et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] suggests that the discourse parser constraint of
nuclearity is not contributing heavily to performance and
that these contributions are easily overwhelmed by using a
higher-performing coreference resolution system. However,
it is notable that although the systems achieve the same
score on the circulatory system, they do not make identical
predictions: 25% of the correct predictions differ between
the two models.
      </p>
      <p>It is remarkable both how badly the summarization models
perform on this task and how their performance seems
to improve as their simplicity increases. The most
sophisticated model, BERTSumExt, which is near state of the art
on extractive summarization, performs below chance on 2/3
of the texts as well as below chance on average. SMRZR,
another deep learning model, is similarly below chance on 1/3
of the texts and only 1% above chance on average. MEAD,
the simplest and oldest model, is approximately at chance
on 2/3 texts, though its average score is elevated by its top
performance on the nitrogen cycle text. Overall, these
results suggest that the intuition that summarization models
are suitable for the sentence selection task of cloze item
creation is incorrect. Indeed, it appears that models trained
on newswire text, like BERTSumExt, may be particularly
poorly suited for this task.</p>
      <p>Finally, the variant results indicate that the current
heuristics used by Pavlik et al. are not overfitted to the original
circulatory system text. No variant achieves a higher score
on any single text or overall. However, the variant results
suggest that heuristics involving the number of chains in a
sentence are particularly significant for improving the score
of the circulatory system text.</p>
    </sec>
    <sec id="sec-9">
      <title>4. DISCUSSION</title>
      <p>
        We have proposed sentence selection as a standardized task
associated with automated cloze item creation. Unlike
previous work that has used rubrics to evaluate cloze items,
sentence selection allows automated selections to be directly
compared to human selections using standard evaluation
metrics like recall. Because our results show that simple
heuristics outperform extractive summarization models,
including a state-of-the-art deep learning model, we argue that
sentence selection for cloze item generation should be
considered a distinct task from extractive summarization,
particularly extractive summarization in the context of newswire
text, where it has historically focused. Previous researchers
have raised concerns with the type of direct evaluation we
propose, based in part on the variability of sentences human
judges will select for extraction [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. We believe that these
concerns are more valid for newswire text as opposed to
academic text, which by definition is designed for learning.
While experts may not agree on what parts of a current
news story are most important in a summary, we suspect
that experts on photosynthesis generally agree on key ideas,
and thus key sentences in a text. However, we have not
presented evidence confirming this suspicion in this paper, nor
are we aware of research that has investigated this question.
This suggests a new direction in automated cloze item
creation: the creation of large datasets of cloze items on diverse
texts, where each text has been annotated by a large enough
sample of human judges that we can estimate human
agreement reliably enough to calculate whether an automated
method agrees as much (or more) with humans as humans
do with each other. Without common datasets, standard
evaluation metrics, and the comparisons these allow with
previous work, we fear that researchers will continue to
create novel systems and evaluate them in isolation, which will
ultimately contribute little to progress on automated cloze
item creation.
      </p>
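      <p>One way such a comparison could be set up, given several judges per text, is sketched below. The choice of Jaccard overlap as the agreement statistic is an assumption made for illustration, since no specific statistic is prescribed here, and the judge selections are made-up data.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from itertools import combinations
from typing import Iterable, List, Set


def jaccard(a: Iterable[int], b: Iterable[int]) -> float:
    """Size of the intersection over size of the union of two selections."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def human_agreement(judges: List[Set[int]]) -> float:
    """Mean pairwise agreement among human judges."""
    pairs = list(combinations(judges, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def system_agreement(system: Set[int], judges: List[Set[int]]) -> float:
    """Mean agreement between one automated selection and each judge."""
    return sum(jaccard(system, j) for j in judges) / len(judges)


judges = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3}]
system = {1, 2, 3}
# The system agrees with humans at least as much as humans agree with
# each other when the second value is >= the first.
print(human_agreement(judges), system_agreement(system, judges))
```
]]></preformat>
      <p>With only a few judges these means are noisy, which is why a large enough sample of judges per text matters for the comparison to be reliable.</p>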
    </sec>
    <sec id="sec-10">
      <title>5. ACKNOWLEDGMENTS</title>
      <p>This material is based upon work supported by the National
Science Foundation under Grants 1918751 and 1934745 and by
the Institute of Education Sciences under Grant R305A190448.
Any opinions, findings, and conclusions or recommendations
expressed in this material are those of the author(s) and
do not necessarily reflect the views of the National Science
Foundation or the Institute of Education Sciences.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Mannem</surname>
          </string-name>
          .
          <article-title>Automatic gap-fill question generation from text books</article-title>
          .
          <source>In Proceedings of the Sixth Workshop on Innovative Use of NLP for Building Educational Applications</source>
          , pages
          <fpage>56</fpage>
          –
          <lpage>64</lpage>
          , Portland, Oregon, June
          <year>2011</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Alderson</surname>
          </string-name>
          .
          <article-title>The cloze procedure and proficiency in English as a foreign language</article-title>
          .
          <source>TESOL Quarterly</source>
          ,
          <volume>13</volume>
          (
          <issue>2</issue>
          ):
          <fpage>219</fpage>
          –
          <lpage>227</lpage>
          ,
          <year>1979</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L. F.</given-names>
            <surname>Bachman</surname>
          </string-name>
          .
          <article-title>The trait structure of cloze test scores</article-title>
          .
          <source>TESOL Quarterly</source>
          ,
          <volume>16</volume>
          (
          <issue>1</issue>
          ):
          <fpage>61</fpage>
          –
          <lpage>70</lpage>
          ,
          <year>1982</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L. F.</given-names>
            <surname>Bachman</surname>
          </string-name>
          .
          <article-title>Performance on cloze tests with fixed-ratio and rational deletions</article-title>
          .
          <source>TESOL Quarterly</source>
          ,
          <volume>19</volume>
          (
          <issue>3</issue>
          ):
          <fpage>535</fpage>
          –
          <lpage>556</lpage>
          ,
          <year>1985</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Coniam</surname>
          </string-name>
          .
          <article-title>From text to test, automatically - an evaluation of a computer cloze-test generator</article-title>
          .
          <source>Hong Kong Journal of Applied Linguistics</source>
          ,
          <volume>3</volume>
          (
          <issue>1</issue>
          ):
          <fpage>41</fpage>
          –
          <lpage>60</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers), pages
          <fpage>4171</fpage>
          –
          <lpage>4186</lpage>
          , Minneapolis, Minnesota, June
          <year>2019</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gardner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Grus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Tafjord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dasigi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schmitz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          .
          <article-title>AllenNLP: A deep semantic natural language processing platform</article-title>
          .
          <source>In Proceedings of Workshop for NLP Open Source Software (NLP-OSS)</source>
          , pages
          <fpage>1</fpage>
          –
          <lpage>6</lpage>
          , Melbourne, Australia, July
          <year>2018</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kurtasov</surname>
          </string-name>
          .
          <article-title>A system for generating cloze test items from Russian-language text</article-title>
          .
          <source>In Proceedings of the Student Research Workshop associated with RANLP 2013</source>
          , pages
          <fpage>107</fpage>
          –
          <lpage>112</lpage>
          , Hissar, Bulgaria, Sept.
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.-L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.-M.</given-names>
            <surname>Gao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.-M.</given-names>
            <surname>Huang</surname>
          </string-name>
          .
          <article-title>Applications of lexical information for algorithmically composing multiple-choice cloze items</article-title>
          .
          <source>In Proceedings of the Second Workshop on Building Educational Applications Using NLP</source>
          , pages
          <fpage>1</fpage>
          –
          <lpage>8</lpage>
          , Ann Arbor, Michigan, June
          <year>2005</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Lapata</surname>
          </string-name>
          .
          <article-title>Text summarization with pretrained encoders</article-title>
          .
          <source>In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          , pages
          <fpage>3730</fpage>
          –
          <lpage>3740</lpage>
          , Hong Kong, China, Nov.
          <year>2019</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>W. C.</given-names>
            <surname>Mann</surname>
          </string-name>
          and
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Thompson</surname>
          </string-name>
          .
          <article-title>Rhetorical structure theory: Toward a functional theory of text organization</article-title>
          .
          <source>Text</source>
          ,
          <volume>8</volume>
          (
          <issue>3</issue>
          ):
          <fpage>243</fpage>
          –
          <lpage>281</lpage>
          ,
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Surdeanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Finkel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bethard</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>McClosky</surname>
          </string-name>
          .
          <article-title>The Stanford CoreNLP natural language processing toolkit</article-title>
          .
          <source>In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations</source>
          , pages
          <fpage>55</fpage>
          –
          <lpage>60</lpage>
          ,
          Baltimore, Maryland, June
          <year>2014</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Miller</surname>
          </string-name>
          .
          <article-title>Leveraging BERT for extractive text summarization on lectures</article-title>
          .
          <source>CoRR</source>
          , abs/1906.04165,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nallapati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>dos Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gulcehre</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Xiang</surname>
          </string-name>
          .
          <article-title>Abstractive text summarization using sequence-to-sequence RNNs and beyond</article-title>
          .
          <source>In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning</source>
          , pages
          <fpage>280</fpage>
          –
          <lpage>290</lpage>
          . Association for Computational Linguistics,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Narendra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Shah</surname>
          </string-name>
          .
          <article-title>Automatic cloze-questions generation</article-title>
          .
          <source>In Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013</source>
          , pages
          <fpage>511</fpage>
          –
          <lpage>515</lpage>
          , Hissar, Bulgaria, Sept.
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] National Institute of Child Health and Human Development.
          <source>Report of the National Reading Panel</source>
          .
          <article-title>Teaching children to read: An evidence-based assessment of the scientific research literature on reading and its implications for reading instruction</article-title>
          . NIH Publication No. 00-4769. U.S. Government Printing Office, Washington, DC,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nenkova</surname>
          </string-name>
          .
          <article-title>Summarization evaluation for text and speech: issues and approaches</article-title>
          .
          <source>In INTERSPEECH 2006 - ICSLP, Ninth International Conference on Spoken Language Processing</source>
          , Pittsburgh, PA, USA, September 17–21, 2006
          . ISCA,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nenkova</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>McKeown</surname>
          </string-name>
          .
          <article-title>Automatic summarization</article-title>
          .
          <source>Foundations and Trends in Information Retrieval</source>
          ,
          <volume>5</volume>
          (
          <issue>2–3</issue>
          ):
          <fpage>103</fpage>
          –
          <lpage>233</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Olney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. I.</given-names>
            <surname>Pavlik</surname>
            <suffix>Jr.</suffix>
          </string-name>
          , and
          <string-name>
            <given-names>J. K.</given-names>
            <surname>Maass</surname>
          </string-name>
          .
          <article-title>Improving reading comprehension with automatically generated cloze item practice</article-title>
          . In E. André,
          <string-name>
            <given-names>R.</given-names>
            <surname>Baker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M. T.</given-names>
            <surname>Rodrigo</surname>
          </string-name>
          , and B. du Boulay, editors,
          <source>Artificial Intelligence in Education, Lecture Notes in Computer Science</source>
          , pages
          <fpage>262</fpage>
          –
          <lpage>273</lpage>
          . Springer,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>P. I.</given-names>
            <surname>Pavlik</surname>
            <suffix>Jr.</suffix>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Olney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Banker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Eglington</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Yarbro</surname>
          </string-name>
          .
          <article-title>The mobile fact and concept textbook system (mofacts)</article-title>
          . In S. Sosnovsky,
          <string-name>
            <given-names>P.</given-names>
            <surname>Brusilovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Baraniuk</surname>
          </string-name>
          ,
          and A. Lan, editors,
          <source>Proceedings of the Second International Workshop on Intelligent Textbooks 2020 co-located with 21st International Conference on Artificial Intelligence in Education (AIED 2020)</source>
          , pages
          <fpage>35</fpage>
          –
          <lpage>49</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>D.</given-names>
            <surname>Radev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Allison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Blair-Goldensohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Blitzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Celebi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dimitrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Drabek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hakim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Otterbacher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Saggion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Teufel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Topper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Winkel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>MEAD - a platform for multidocument multilingual text summarization</article-title>
          .
          <source>In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04)</source>
          , Lisbon, Portugal, May
          <year>2004</year>
          .
          <publisher-name>European Language Resources Association (ELRA)</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Skory</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Eskenazi</surname>
          </string-name>
          .
          <article-title>Predicting cloze task quality for vocabulary training</article-title>
          .
          <source>In Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications</source>
          , pages
          <fpage>49</fpage>
          –
          <lpage>56</lpage>
          , Los Angeles, California, June
          <year>2010</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M.</given-names>
            <surname>Surdeanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hicks</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Valenzuela-Escarcega</surname>
          </string-name>
          .
          <article-title>Two practical rhetorical structure theory parsers</article-title>
          .
          <source>In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations</source>
          , pages
          <fpage>1</fpage>
          –
          <lpage>5</lpage>
          , Denver, Colorado, June
          <year>2015</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>W. L.</given-names>
            <surname>Taylor</surname>
          </string-name>
          .
          <article-title>"Cloze procedure": A new tool for measuring readability</article-title>
          .
          <source>Journalism Quarterly</source>
          ,
          <volume>30</volume>
          (
          <issue>4</issue>
          ):
          <fpage>415</fpage>
          –
          <lpage>433</lpage>
          ,
          <year>1953</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <article-title>Attention is all you need</article-title>
          . In I. Guyon, U. von Luxburg, S. Bengio,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. V. N.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          , and R. Garnett, editors,
          <source>Proceedings of the Thirty-first Annual Conference on Neural Information Processing Systems</source>
          , pages
          <fpage>5998</fpage>
          –
          <lpage>6008</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Hovy</surname>
          </string-name>
          .
          <article-title>Large-scale cloze test dataset created by teachers</article-title>
          .
          <source>In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>2344</fpage>
          –
          <lpage>2356</lpage>
          , Brussels, Belgium, Oct.–Nov.
          <year>2018</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qiu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          .
          <article-title>Extractive summarization as text matching</article-title>
          .
          <source>In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          , pages
          <fpage>6197</fpage>
          –
          <lpage>6208</lpage>
          ,
          Online, July
          <year>2020</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>