<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of QA4MRE 2013 Entrance Exams Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anselmo Peñas</string-name>
          <email>anselmo@lsi.uned.es</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yusuke Miyao</string-name>
          <email>yusuke@nii.ac.jp</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eduard Hovy</string-name>
          <email>hovy@cmu.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pamela Forner</string-name>
          <email>forner@celct.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Noriko Kando</string-name>
          <email>kando@nii.ac.jp</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CELCT</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Carnegie Mellon University</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>NLP&amp;IR group</institution>
          ,
          <addr-line>UNED</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>National Institute of Informatics</institution>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the Question Answering for Machine Reading (QA4MRE) Entrance Exams at the 2013 Cross Language Evaluation Forum. The data set of this task is extracted from actual university entrance examinations as-is, and therefore includes a variety of topics in daily life. Another unique feature of the Entrance Exams task is that questions are designed originally for testing human examinees, rather than evaluating computer systems. Therefore, the data set is expected to have a natural distribution of human ability for reading and understanding texts.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>The Entrance Exams task at CLEF 2013 QA4MRE is focused on
solving Reading Comprehension tests of English examinations. Reading
Comprehension tests are routinely used to assess the degree to which
people comprehend what they read, so we work with the hypothesis
that it is reasonable to use these tests to assess the degree to which a
machine “comprehends” what it is reading.</p>
      <p>In QA4MRE, tests are usually constructed artificially by the
organizers in order to properly test system performance on a controlled
set of question types and a defined level of inference.</p>
      <p>In such scenarios, the question arises of how the performance of
systems on artificial tests compares to their performance when
confronted with real human tests. We believe that a real benchmark
able to track system performance over time is of great value for
assessing real progress in the field over the coming years.</p>
      <p>With this goal in mind, CLEF and NTCIR started a collaboration
around the idea of testing systems against University Entrance Exams,
the same exams humans have to pass to enter university. The data set
was prepared and distributed by NTCIR, while the other organizational
efforts, including announcements and the collection and evaluation of
submissions, were managed by CLEF. This division of labor reduced the
workload on each side, since the NTCIR side is already familiar with
the contents of the data and its copyright issues, while the CLEF side
has already established other organizational processes such as submission
management and evaluation. The success of this coordination also
owes much to the standard data format and evaluation methodology, which
were adopted for this pilot task. The next round of this task is
expected to be organized in a similar manner.</p>
    </sec>
    <sec id="sec-2">
      <title>TASK DESCRIPTION</title>
      <p>The form of the task is essentially the same as the QA4MRE Main
Task. Participant systems are asked to read a given document and
answer questions. Questions are given in multiple-choice format, with
several options from which a single answer must be selected.</p>
      <p>A crucial difference from the other QA4MRE tasks is that
background text collections are not provided. Systems have to answer
questions by referring to the "common sense knowledge" that high school
students who aim to enter university are expected to have. Another
important difference is that we do not intend to restrict question types.
Any type of reading comprehension question found in real entrance exams
will be included in the test data.</p>
    </sec>
    <sec id="sec-3">
      <title>DATA</title>
      <p>Japanese University Entrance Exams include questions formulated at
various levels of complexity and test a wide range of capabilities. The
"Entrance Exams" challenge aims at evaluating systems under the
same conditions under which humans are evaluated to enter university. In
this first campaign we restricted the challenge to the Reading
Comprehension exercises contained in the English exams.</p>
      <p>The data set is extracted from standardized English
examinations for university admission in Japan. Exams are created by the
Japanese National Center for University Admissions Tests.</p>
      <p>The original examinations include various styles of questions, such
as word filling, grammatical error recognition, sentence filling, etc.</p>
      <p>One such style is reading comprehension: a test provides a
text that describes some daily life situation, and questions about the text
are asked. Since this type of question is suitable for the QA4MRE lab,
we automatically extracted questions of this type from the XML files of the
examination data and converted the XML annotations to fit the
standard format of QA4MRE.</p>
      <p>For each examination, one text is given, and five questions on the
given text are asked. Each question has four choices. For this year's
campaign, we selected 10 examinations, one of which was delivered as
development data while the others were provided as final test data. That
is, we provided 9 test documents, 46 questions<sup>1</sup> and 184 choices.</p>
    </sec>
    <sec id="sec-4">
      <title>EVALUATION</title>
      <p>Scoring of the output produced by participant systems was performed
automatically by comparing the answers of systems against the gold
standard collection with annotations made by humans. No manual
assessment was performed.</p>
      <p>
        Each test receives an evaluation score between 0 and 1 using c@1
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This measure, used in previous CLEF QA Tracks, encourages
systems to reduce the number of incorrect answers while maintaining the
number of correct ones by leaving some questions unanswered.
Systems received evaluation scores from two different perspectives:
      </p>
    </sec>
    <sec id="sec-5">
      <list list-type="order">
        <list-item>
          <p>At the question-answering level: correct answers are counted individually, without grouping them.</p>
        </list-item>
        <list-item>
          <p>At the reading-test level: figures are given for each reading test as a whole.</p>
        </list-item>
      </list>
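      <p>The two scoring levels can be illustrated with a minimal sketch. It is not the official evaluation script: it assumes the c@1 definition from [<xref ref-type="bibr" rid="ref1">1</xref>], c@1 = (n_R + n_U * n_R / n) / n, where n_R is the number of correct answers, n_U the number of unanswered questions and n the total number of questions. The function name <monospace>c_at_1</monospace> and the per-test counts below are hypothetical and chosen only for illustration.</p>
      <preformat>
```python
def c_at_1(n_correct, n_unanswered, n_total):
    """c@1 = (n_R + n_U * n_R / n) / n, as defined by Peñas and Rodrigo (2011)."""
    if not n_total > 0:
        raise ValueError("n_total must be positive")
    return (n_correct + n_unanswered * n_correct / n_total) / n_total


# Hypothetical per-test outcomes: (correct, unanswered, total questions).
tests = {
    "test-1": (3, 1, 5),
    "test-2": (1, 0, 5),
    "test-3": (2, 2, 6),
}

# Question-answering level: pool all questions, ignoring test boundaries.
pooled = [sum(column) for column in zip(*tests.values())]
overall = c_at_1(*pooled)

# Reading-test level: one c@1 score per reading test.
per_test = {name: c_at_1(*counts) for name, counts in tests.items()}

print(round(overall, 4))
print({name: round(score, 4) for name, score in per_test.items()})
```
      </preformat>
      <p>In this sketch, an unanswered question earns credit in proportion to the accuracy observed over all questions at the same level, so a system gains by abstaining only if it is also correct on some of the questions it does answer.</p>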
    </sec>
    <sec id="sec-6">
      <title>RESULTS</title>
      <p>During registration, 27 different groups showed interest in the task. Of
these, 10 groups completed the data agreements and, in the end, only 5
teams submitted runs. Despite their interest in the task, some groups
stated that the difficulty of the tests exceeded the current state of
the art in the field and decided not to participate. Table 1 lists
the participating groups and their reference papers in the CLEF 2013
Working Notes.</p>
      <p><sup>1</sup> One test document was exceptionally accompanied by 6 questions.</p>
      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Group ID</th><th>Group</th></tr>
          </thead>
          <tbody>
            <tr><td>NIIJ</td><td>National Institute of Informatics, Japan</td></tr>
            <tr><td>JUCS</td><td>Jadavpur University, India</td></tr>
            <tr><td>NARA</td><td>Nara Institute of Science and Technology, Japan</td></tr>
            <tr><td>CMU</td><td>Carnegie Mellon University, United States</td></tr>
            <tr><td>LIMSCNRS</td><td>ILES – LIMSI, France</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        Results are summarized in Tables 2 and 3 for the QA and Reading
perspectives, respectively. According to Table 2, the system with the highest
score (jucs [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) is the one that answered the fewest questions incorrectly. It is
also the only system that answered more questions correctly than
incorrectly, striking a better balance by leaving some questions
unanswered. This indicates that its modules for detecting whether there is
enough evidence for the correctness of an answer work well.
      </p>
      <p>Table 3 shows the results from the reading perspective. The first column
gives the system run id, the second the overall c@1
obtained, and the third the number of tests that the system
passed when a threshold of 0.5 is applied; the remaining columns
give the c@1 value for each individual test.</p>
      <p>[Results table; rows: jucs, NIIJ-3, NIIJ-5, AVERAGE, NIIJ-4, RANDOM, NIIJ-2, lims-cnrs-1, MEDIAN, NIIJ-1, nara, lims-cnrs-2, cmu]</p>
      <p>
        JUCS [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] report the best results, using a system based on Textual
Entailment and answer ranking. One particularity of this system is that it
answered only 23 of the 46 questions. Of these, 13 were right and
10 wrong. This strategy is rewarded by c@1, which gives partial
credit when a question is left unanswered rather than answered incorrectly. It is worth
noting the difference in scores across tests. In particular, the
authors report that the difference depends on the type of questions in tests
1 and 7.
      </p>
      <p>
        The NIIJ system [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] also performed above both the average and the random
baseline. It is also based on Textual Entailment, applied after combining
relevant sentences, questions, and answers. Its best run
answered all questions, with 16 correct answers and 30 incorrect ones.
      </p>
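      <p>As a worked example of how c@1 treats the two strategies, the sketch below plugs the counts reported in the text into the c@1 definition from [<xref ref-type="bibr" rid="ref1">1</xref>]. The helper name <monospace>c_at_1</monospace> is hypothetical, and the random-guess figure is only the expected accuracy of uniform guessing over four choices, which is an assumption and not necessarily how the RANDOM baseline was produced.</p>
      <preformat>
```python
def c_at_1(n_correct, n_unanswered, n_total):
    """c@1 = (n_R + n_U * n_R / n) / n."""
    return (n_correct + n_unanswered * n_correct / n_total) / n_total


N_QUESTIONS = 46
N_CHOICES = 4

# Counts reported in the text.
jucs = c_at_1(13, N_QUESTIONS - 23, N_QUESTIONS)  # 23 answered: 13 right, 10 wrong
niij_best = c_at_1(16, 0, N_QUESTIONS)            # all answered: 16 right, 30 wrong

# Plain accuracy gives no credit for abstaining.
jucs_accuracy = 13 / N_QUESTIONS

# Assumption: expected score of uniform random guessing over four choices.
random_guess = 1 / N_CHOICES

print(f"jucs c@1      = {jucs:.4f}")
print(f"NIIJ best c@1 = {niij_best:.4f}")
print(f"jucs accuracy = {jucs_accuracy:.4f}")
print(f"random guess  = {random_guess:.4f}")
```
      </preformat>
      <p>Under these assumptions the sketch gives about 0.42 for jucs and 0.35 for the best NIIJ run, whereas plain accuracy would give jucs only about 0.28. NIIJ answered more questions correctly, but jucs scores higher under c@1 because it left 23 questions unanswered instead of risking incorrect answers.</p>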
      <p>Results also show that systems based only on the statistical analysis
of words cannot perform the kind of inference required to solve
the tests.</p>
    </sec>
    <sec id="sec-7">
      <title>CONCLUSIONS</title>
      <p>The dataset, together with the results, suggests something very interesting:
the need to develop strategies for rejecting answers more than strategies for
accepting them. On the one hand, the dataset shows that in some cases the
way to select the correct answer is by discarding the other candidates.
On the other hand, most systems still select more incorrect answers than
correct ones, while a measure of progress in system development is
precisely the reduction in the number of wrong answers selected.</p>
      <p>The Entrance Exams task shows that Question Answering is
far from being solved. This is true even for the simplified scenario
in which only one text is given and a set of options is provided as
candidate answers to the question.</p>
      <p>Results also show that systems based only on the statistical analysis
of words cannot perform the kind of inference required to solve
the tests. In other words, systems based only on textual similarity
cannot address the challenge.</p>
      <p>Finally, we think that Entrance Exams provides a real benchmark
able to assess real progress in the field over the coming years.</p>
    </sec>
    <sec id="sec-8">
      <title>ACKNOWLEDGEMENTS</title>
      <p>The collaboration has been developed in the framework of the Todai Robot
Project in Japan and the CHIST-ERA Readers project in Europe. The
Todai Robot Project is a grand challenge headed by NII that aims to
develop an end-to-end AI system able to solve real university entrance
examinations in Japan by integrating heterogeneous AI
technologies, such as natural language processing, situation understanding, math
formula processing and vision processing.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Anselmo</given-names>
            <surname>Peñas</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alvaro</given-names>
            <surname>Rodrigo</surname>
          </string-name>
          .
          <article-title>A Simple Measure to Assess Non-response</article-title>
          .
          <source>In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics - Human Language Technologies (ACL-HLT 2011)</source>
          , Portland, Oregon, USA,
          <year>2011</year>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Xinjian</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tian</given-names>
            <surname>Ran</surname>
          </string-name>
          , Ngan L.T. Nguyen, Yusuke Miyao and
          <string-name>
            <given-names>Akiko</given-names>
            <surname>Aizawa</surname>
          </string-name>
          .
          <article-title>Question Answering System for Entrance Exams in QA4MRE</article-title>
          .
          <source>CLEF 2013 Working Notes</source>
          ,
          <year>2013</year>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Somnath</given-names>
            <surname>Banerjee</surname>
          </string-name>
          , Pinaki Bhaskar, Partha Pakray, Sivaji Bandyopadhyay and
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          .
          <article-title>Multiple Choice Question (MCQ) Answering System for Entrance Examination</article-title>
          .
          <source>CLEF 2013 Working Notes</source>
          ,
          <year>2013</year>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Philip</given-names>
            <surname>Arthur</surname>
          </string-name>
          , Graham Neubig, Sakriani Sakti, Tomoki Toda and Satoshi Nakamura.
          <article-title>NAIST at the CLEF 2013 QA4MRE Pilot Task</article-title>
          .
          <source>CLEF 2013 Working Notes</source>
          ,
          <year>2013</year>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>