<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automated Generation of Assessment Test Items from Text: Some Quality Aspects</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrey Kurtasov</string-name>
          <email>akurtasov@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Vologda State University</institution>
          ,
          <addr-line>Vologda</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>91</fpage>
      <lpage>95</lpage>
      <abstract>
        <p>This paper gives an overview of the problem of automated generation of assessment test items from natural-language text. In a previously published article, an experimental system aimed at generating fill-in-the-blank test items from Russian text was described. In this paper, some aspects of the system's quality are analyzed. The main directions for future work are defined, including evaluation of the system and development of methods for filtering text fragments and selecting words to blank out.</p>
      </abstract>
      <kwd-group>
        <kwd>educational assessment</kwd>
        <kwd>natural language processing</kwd>
        <kwd>Russian language</kwd>
        <kwd>test item generation</kwd>
        <kwd>question generation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Teaching today relies widely on electronic text resources that were
not originally intended for use as teaching aids. This is especially true for
subjects that deal with rapidly developing domains such as information technology.
Teaching these subjects may benefit from the use of various articles and technical
papers, which, unlike textbooks, do not contain test questions or exercises.
Developing such exercises is a complex task that may require a significant
amount of a teacher's time. A promising way to facilitate this task is
automated generation of test items from text with the help of Natural Language
Processing (NLP).</p>
      <p>
        The general idea is to extract fragments from the source text document and to
transform them into questions or test items. This idea has been studied by several
researchers, and is commonly considered difficult to implement. For instance,
Heilman [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] identifies numerous challenges in question generation from
text. These include linguistic challenges (lexical, syntactic, and discourse-related) as
well as challenges related to the application of question generation tools
in classrooms (usability and human-computer interaction issues).
      </p>
      <p>
        Previously, we described an experimental system for generating
fill-in-the-blank test items from Russian-language text, which is designed for use
with the e-learning platform Moodle<sup>1</sup> [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We showed that automated
generation of test items is not easily accomplished but can yield some useful
results. In this paper, we review the quality aspects of the approach
being studied and consider ways to improve it.
      </p>
      <p><sup>1</sup> https://moodle.org/</p>
    </sec>
    <sec id="sec-2">
      <title>Generation: Approach and Quality Aspects</title>
      <p>
        We present the approach as the sequential application of text processors that
perform the following operations on the document:
        <list list-type="order">
          <list-item><p>Text preprocessing, to convert a raw text file into a well-defined sequence of linguistically meaningful units (as defined in [<xref ref-type="bibr" rid="ref3">3</xref>]), or segments.</p></list-item>
          <list-item><p>Segment filtering, to filter the set of segments so that it contains the most salient segments.</p></list-item>
          <list-item><p>Test item generation, to transform the text segments into test items.</p></list-item>
        </list>
      </p>
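      <p>The sequential application of processors described above can be illustrated with a minimal Python sketch. It is not the implementation of our system: the three processor functions are deliberately naive placeholders, and all names (run_pipeline, preprocess, filter_segments, generate_items) are assumptions introduced only for this illustration.</p>
      <preformat>
```python
from typing import Callable, List

# A processor maps a list of text units to a new list of text units.
Processor = Callable[[List[str]], List[str]]


def preprocess(units: List[str]) -> List[str]:
    """Placeholder preprocessing: naively split raw text on full stops."""
    segments: List[str] = []
    for raw in units:
        for part in raw.split("."):
            part = part.strip()
            if part:
                segments.append(part + ".")
    return segments


def filter_segments(segments: List[str]) -> List[str]:
    """Placeholder filtering: keep segments with at least five words."""
    return [s for s in segments if len(s.split()) >= 5]


def generate_items(segments: List[str]) -> List[str]:
    """Placeholder generation: blank out the longest word of each segment."""
    items: List[str] = []
    for s in segments:
        longest = max(s.rstrip(".").split(), key=len)
        items.append(s.replace(longest, "_____", 1))
    return items


def run_pipeline(raw_text: str, processors: List[Processor]) -> List[str]:
    """Apply the processors in order, feeding each one the previous output."""
    units = [raw_text]
    for processor in processors:
        units = processor(units)
    return units
```
      </preformat>
      <p>A call such as run_pipeline(text, [preprocess, filter_segments, generate_items]) chains the three operations; each placeholder could be replaced independently by a more capable processor.</p>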
      <p>Let us consider each of the operations from the quality perspective.</p>
      <sec id="sec-2-1">
        <title>Text Preprocessing</title>
        <p>This operation consists of two stages: document triage and text segmentation.
Document triage is the process of converting a digital file into a well-defined
text document. It involves such actions as character encoding identification and
text sectioning (identifying the actual content within a file while discarding
headers, links, and formatting features). This stage is purely technical and easy
to accomplish with available software tools. However, it can crucially affect the
results (e.g., improper encoding detection would make the Russian text
unreadable) and should therefore be a significant concern for software developers.</p>
        <p>
          Text segmentation is performed to acquire segments from which to produce
test items. Previously we referred to this stage as sentence splitting, because we
use sentences as basic segments, while considering a sentence to be a semantically
complete portion of text. At first sight, a sentence is a sequence of characters
that ends with “.”, “!”, or “?”, but in practice one should keep in mind that these
characters can also be used inside one sentence [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Today’s NLP tools perform
sentence splitting with fairly high precision. In preliminary experiments, we used
a tokenization module provided by the AOT toolkit<sup>2</sup>, which recognizes common
Russian abbreviations with periods, such as “г.” (year), “гг.” (years), “и т. д.”
(etc.), “т. е.” (i.e.), “т. н.” (so-called), as well as special text features such as
bulleted lists, sentences enclosed in quotation marks or parentheses, and URLs.
The experiments showed that this step does not introduce a significant
number of errors into the resulting test items.
        </p>
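        <p>To illustrate why abbreviations matter for sentence splitting, the following Python sketch splits text on sentence-final punctuation while skipping periods that belong to known abbreviations. It is a simplified stand-in and does not reproduce the AOT tokenization module; the abbreviation set and the function name split_sentences are assumptions made for this example.</p>
        <preformat>
```python
from typing import List

# Assumed, illustrative abbreviation parts ("г.", "гг.", "т. д.", "т. е.", "т. н.").
ABBREVIATIONS = {"г.", "гг.", "т.", "д.", "е.", "н."}
SENTENCE_END = (".", "!", "?")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences, ignoring periods inside known abbreviations."""
    sentences: List[str] = []
    current: List[str] = []
    for token in text.split():
        current.append(token)
        is_abbreviation = token.lower() in ABBREVIATIONS
        if token.endswith(SENTENCE_END) and not is_abbreviation:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences
```
        </preformat>
        <p>This sketch never ends a sentence on an abbreviation, so a sentence that actually ends with “и т. д.” would be merged with the next one; it also ignores quotation marks, lists, and URLs, which a production tokenizer has to handle.</p>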
        <p>In some cases, it may be reasonable to include more than one sentence in a
segment (when multiple sentences are used to express one significant thought).
While automatic detection of such sentence groups is a complex
semantics-related task, we assume that the user should be given the ability to see the
context of the processed sentence at the test item generation step. This ability
would allow the user to expand the segment if needed, and should be considered
for implementation during the user interface design of the generating software.</p>
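        <p>The in-context display could rest on a helper as simple as the one below, which returns the neighbouring sentences of a processed sentence. This is only a sketch under our own assumptions; the function name with_context and its parameters are hypothetical and not part of the described system.</p>
        <preformat>
```python
from typing import List, Tuple


def with_context(
    sentences: List[str], index: int, before: int = 1, after: int = 1
) -> Tuple[List[str], str, List[str]]:
    """Return (preceding sentences, processed sentence, following sentences)."""
    start = max(0, index - before)
    preceding = sentences[start:index]
    following = sentences[index + 1 : index + 1 + after]
    return preceding, sentences[index], following
```
        </preformat>
        <p>A user interface could show the preceding and following sentences next to the processed one and let the user merge them into the segment when a thought spans several sentences.</p>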
        <p><sup>2</sup> http://www.aot.ru/</p>
      </sec>
      <sec id="sec-2-2">
        <title>Segment Filtering</title>
        <p>
          Not every sentence of a text is appropriate for test item generation.
We assume that proper filtering of the acquired sentences could have a substantial
impact on the quality of the resulting set of test items, and we propose using
extractive text summarization to filter out the unnecessary text portions. In
NLP, different methods for scoring sentences by importance are applied (usually
in combination) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]: sentence length cut-off (short sentences are excluded), use of
cue phrases (inclusion of sentences with phrases such as “in conclusion”), sentence
position in a document or paragraph, occurrence of frequent terms (based on
TF-IDF term weighting), and occurrence of title words.
        </p>
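        <p>The scoring methods listed above can be combined roughly as in the following Python sketch. It is a toy illustration, not the method of any particular summarization toolkit: plain term frequency within the document stands in for TF-IDF weighting, and the cue phrase list, the length threshold, and the weights are all assumed values.</p>
        <preformat>
```python
import re
from collections import Counter
from typing import List, Tuple

CUE_PHRASES = ("in conclusion", "in summary")  # assumed, illustrative list
MIN_WORDS = 5  # assumed length cut-off


def words(text: str) -> List[str]:
    """Lowercase word tokens of a text."""
    return re.findall(r"\w+", text.lower())


def score_sentences(title: str, sentences: List[str]) -> List[Tuple[float, str]]:
    """Return (score, sentence) pairs, highest score first."""
    term_counts = Counter(w for s in sentences for w in words(s))
    title_words = set(words(title))
    scored: List[Tuple[float, str]] = []
    for position, sentence in enumerate(sentences):
        tokens = words(sentence)
        if MIN_WORDS > len(tokens):  # length cut-off: drop short sentences
            continue
        # Frequent terms: average in-document frequency of the sentence's words.
        score = sum(term_counts[t] for t in tokens) / len(tokens)
        # Title words: one point per distinct title word in the sentence.
        score += len(title_words.intersection(tokens))
        # Cue phrases: fixed bonus.
        if any(cue in sentence.lower() for cue in CUE_PHRASES):
            score += 2.0
        # Position: small bonus for the first sentence of the document.
        if position == 0:
            score += 1.0
        scored.append((score, sentence))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```
        </preformat>
        <p>Keeping only the top-ranked sentences yields the filtered segment set; how the individual scores should be weighted for test item generation is exactly what would need experimental tuning.</p>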
        <p>We are planning to leverage an existing summarization toolkit and attempt
to adapt it to our task. For example, MEAD<sup>3</sup> is claimed to be modifiable to
support languages other than English. As with text segmentation, it
would be reasonable to show the highest-scoring sentences inline within the source text, so that the
user could see the discarded portions and use them if they appear to be useful.</p>
        <p>The performance of this step is to be evaluated experimentally. We are
planning to compare the summarization output with the selection made by human
experts and calculate such metrics as precision and recall (commonly used in
information retrieval).</p>
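        <p>For the planned evaluation, precision and recall can be computed from two sets of sentence indices, as in the sketch below. The function name precision_recall and the representation of selections as index sets are our assumptions for illustration; no experimental data are implied.</p>
        <preformat>
```python
from typing import Iterable, Tuple


def precision_recall(
    selected: Iterable[int], relevant: Iterable[int]
) -> Tuple[float, float]:
    """Compare system-selected sentence indices with expert-selected ones."""
    selected_set, relevant_set = set(selected), set(relevant)
    true_positives = len(selected_set.intersection(relevant_set))
    # Precision: share of system-selected sentences that experts also chose.
    precision = true_positives / len(selected_set) if selected_set else 0.0
    # Recall: share of expert-chosen sentences that the system also selected.
    recall = true_positives / len(relevant_set) if relevant_set else 0.0
    return precision, recall
```
        </preformat>
        <p>With several human experts, the same function could be applied per expert and the results averaged, although the aggregation scheme remains to be decided.</p>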
      </sec>
      <sec id="sec-2-3">
        <title>Test Item Generation</title>
        <p>As a starting point of the research, we generate fill-in-the-blank test items (“cloze
questions”). To produce a cloze question, we take a sentence and replace some of
the words in the sentence with blanks. For additional clarity, we add a hint to
the question, explaining what kind of answer is expected. Below is an example:</p>
        <p>Source: В отличие от перцептронов рефлекторный алгоритм
напрямую рассчитывает адекватную входным воздействиям реакцию
интеллектуальной системы.</p>
        <p>Result: В отличие от перцептронов ......... (какой?) алгоритм
напрямую рассчитывает адекватную входным воздействиям реакцию
интеллектуальной системы.</p>
        <p>Or, in English:</p>
        <p>Source: In contrast to perceptrons, the reflective algorithm directly calculates
the reaction of the intelligent system with respect to input actions.</p>
        <p>Result: In contrast to perceptrons, the ......... (what?) algorithm directly
calculates the reaction of the intelligent system with respect to input actions.</p>
        <p>
          The system recognized an adjective (“рефлекторный”, “reflective”),
replaced it with a blank, and inserted a hint in parentheses: “какой?” (“what?”).
The current system is also able to add appropriate hints for acronyms, numbers,
definitions, sentence subjects, and adverbials (more examples were shown in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]).
        </p>
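        <p>The transformation shown in the example can be sketched in Python as follows. The sketch assumes that the sentence has already been tokenized and tagged by a morphological analyzer; the tag names, the HINTS table, and the function make_cloze are hypothetical, and only the adjective hint “какой?” is taken from the example above.</p>
        <preformat>
```python
from typing import List, Optional, Tuple

# Hint per part-of-speech tag. The adjective hint comes from the example;
# the tag names themselves are assumptions of this sketch.
HINTS = {"ADJ": "какой?"}
BLANK = "........."


def make_cloze(tagged: List[Tuple[str, str]]) -> Optional[str]:
    """Blank out the first word whose tag has a hint; return None if none does."""
    out: List[str] = []
    done = False
    for word, tag in tagged:
        if not done and tag in HINTS:
            out.append(f"{BLANK} ({HINTS[tag]})")
            done = True
        else:
            out.append(word)
    return " ".join(out) if done else None
```
        </preformat>
        <p>The sketch uses a fixed hint per tag; hints for other word classes would need further entries in the table.</p>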
        <sec id="sec-2-3-1">
          <p><sup>3</sup> http://www.summarization.com/mead/</p>
          <p>The main problem here is to determine which words should be blanked out
to produce a useful question. A good approach could be to find the sentence’s
focus (in the sense of information structure), which is difficult to do with
state-of-the-art NLP tools. Another idea is based on the assumption that it is
more useful to blank out special terms than common words. We could match
the words of the sentence against either a pre-existing domain-specific bag of words
or a bag of words acquired through terminology extraction from the processed
text, and blank out the matches.</p>
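          <p>The term-matching idea could look roughly like the Python sketch below. It matches exact word forms against a given term set, which is an assumption that would not hold for inflected Russian text without lemmatization; the function name blank_terms and the max_blanks parameter are hypothetical.</p>
          <preformat>
```python
import re
from typing import Set

BLANK = "........."


def blank_terms(sentence: str, terms: Set[str], max_blanks: int = 1) -> str:
    """Replace up to max_blanks words found in the term set with blanks."""
    lowered = {t.lower() for t in terms}
    remaining = max_blanks

    def substitute(match):
        # Called by re.sub for every word; decides whether to blank it out.
        nonlocal remaining
        word = match.group(0)
        if remaining > 0 and word.lower() in lowered:
            remaining -= 1
            return BLANK
        return word

    return re.sub(r"\w+", substitute, sentence)
```
          </preformat>
          <p>The term set may come from a domain glossary or from terminology extraction; in either case, a sentence without any matching term is returned unchanged and could be skipped by the generator.</p>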
          <p>Another issue that arises at this step is that the processed sentences may
contain anaphora. Without an implementation of automatic anaphora resolution,
the user could resolve the anaphora manually (e.g., by replacing pronouns with
the corresponding nouns) using the in-context display of the processed sentence.</p>
          <p>
            While cloze items are fairly easy to produce from sentences, fill-in-the-blank
is a trivial style of test. This concern could be addressed by considering two
ideas: generation of interrogative sentences (which would require text simplification
and word reordering [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ]) and generation of distracting answers for multiple-choice
tests (a possible solution is described in [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]).
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusion and Future Work</title>
      <p>Based on the preceding research, we have considered several quality aspects of the
automated generation of assessment test items from natural-language text. We
have identified the following directions for quality improvement of our system:
        <list list-type="order">
          <list-item><p>The user interface should display the context of the text excerpt being processed in a user-friendly way for efficient human-computer interaction.</p></list-item>
          <list-item><p>We will leverage a summarization toolkit for segment filtering and evaluate it experimentally.</p></list-item>
          <list-item><p>Other directions include anaphora resolution, interrogative sentence generation, and distractor generation for multiple-choice tests.</p></list-item>
        </list>
</p>
      <p>Abstract. This paper gives an overview of the problem of automated
generation of assessment test items from natural-language text. A previously
published article described an experimental system for generating
fill-in-the-blank items from Russian-language text. This paper analyzes some
aspects of the quality of the system's operation. The main directions for
further work are defined, including evaluation of the system and development of
methods for filtering text fragments and selecting words to replace with blanks.</p>
      <p>Keywords: educational assessment, automatic text processing,
test item generation, question generation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Heilman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Automatic Factual Question Generation from Text</article-title>
          .
          <source>Ph.D. Dissertation</source>
          . Carnegie Mellon University, Pittsburgh, USA,
          <year>2011</year>
          . 195 p.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Kurtasov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>A System for Generating Cloze Test Items from Russian-Language Text</article-title>
          .
          <source>In Proceedings of the Student Research Workshop associated with The 9th International Conference RANLP 2013</source>
          . P.
          <fpage>107</fpage>
          -
          <lpage>112</lpage>
          . Hissar, Bulgaria,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Indurkhya</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Damerau</surname>
            ,
            <given-names>F. J.</given-names>
          </string-name>
          (eds).
          <source>Handbook of Natural Language Processing (Second Edition)</source>
          .
          <publisher-name>Chapman and Hall/CRC</publisher-name>
          ,
          <year>2010</year>
          . 704 p.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Grefenstette</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Tapanainen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <article-title>What is a Word, what is a Sentence? Problems of Tokenisation</article-title>
          .
          <source>In Proceedings of The 3rd Conference on Computational Lexicography and Text Research</source>
          . Budapest, Hungary,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hynek</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ježek</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <article-title>Practical approach to automatic text summarization</article-title>
          .
          <source>In Proceedings of the ELPUB 2003 conference</source>
          . Guimaraes, Portugal,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Mitkov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ha</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karamanis</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <article-title>A computer-aided environment for generating multiple-choice test items</article-title>
          .
          <source>Natural Language Engineering</source>
          .
          <year>2006</year>
          .
          <volume>12</volume>
          (
          <issue>2</issue>
          ). P.
          <fpage>1</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>