<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Solving Open-Domain Multiple Choice Questions with Textual Entailment and Text Similarity Measures</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Neil Dhruva</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oliver Ferschke</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iryna Gurevych</string-name>
        </contrib>
      </contrib-group>
      <fpage>1375</fpage>
      <lpage>1385</lpage>
      <abstract>
        <p>In this paper, we present a system for automatically answering open-domain, multiple choice reading comprehension questions about short English narrative texts. The system is based on state-of-the-art text similarity measures, textual entailment metrics and coreference resolution and does not make use of any additional domain-specific background knowledge. Each answer option is scored with a combination of all evaluation metrics and ranked according to its overall score in order to determine the most likely correct answer. Our best configuration achieved the second highest score across all competing systems in the entrance exam grading challenge with a c@1 score of 0.375.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Question Answering (QA) systems have always been a key focus of Information
Retrieval and Natural Language Processing research. Such systems aim to automatically
answer questions posed in natural language. The objective of the Entrance Exams task
in the CLEF Question Answering Track is the creation of a system for determining
the correct answers to a set of multiple choice reading comprehension questions about
English narrative texts.</p>
      <p>
        The task demands a deep understanding of a short passage of text and is closely
related to the QA4MRE main task [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], which aims to focus on the concept of
answer validation. However, the main difference between the main task and the Entrance
Exams task is the fact that no additional knowledge (background material) is provided
for the latter. The aim is thus to evaluate automatic answering systems under the same
conditions as humans are evaluated, considering that humans have certain background
knowledge from real-life experiences. A detailed task definition is provided in
Section 2.
      </p>
      <p>Based on conclusions drawn from a state-of-the-art analysis, which is discussed in
Section 3, our system uses three main components: coreference resolution, text
similarity and textual entailment. Text-Hypothesis (T-H) pairs are generated for each entrance
exam using combinations of sentences from the text, questions about the text and the
corresponding answer options. These T-H pairs form the basic processing units for
similarity and entailment analyses, which are employed to determine the correct answer to
a question. We describe our approach and the architecture of our system in Section 4.</p>
      <p>Finally, we present the results obtained with different system configurations in
Section 5 and provide a detailed error analysis in Section 6. We close with a summarization
of our findings and a discussion of future research directions in Section 7.
</p>
    </sec>
    <sec id="sec-2">
      <title>2 Task Definition</title>
      <p>The objective of the task tackled in this paper is to identify the correct answer option for
a multiple choice question in a reading comprehension test for a given English narrative
text. Since the questions are not restricted to a particular topic, they require a wide
range of inference. Additionally, no background information is provided and systems
are expected to determine answers based on common-sense knowledge that high school
students are assumed to possess.</p>
      <p>
        The dataset provided in this task is composed of reading comprehension tests taken
from Japanese university entrance exams and is available in English, Russian, French,
Spanish and Italian. The labeled training data consists of 12 documents with 60
questions and an average of 5 questions for each document. The test data comprises 12
documents with a total of 56 questions. We consider a test item to be a combination of
the narrative text, a single question about this text and all answer options for this
question. An exam is then the set of all test items with the same text. Each exam is evaluated
with a score between 0 and 1 using the c@1 measure [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], which is defined as:
      </p>
      <p>c@1 = (1/n) * (n_c + n_u * (n_c / n)),
where n_c is the number of correctly answered questions, n_u the number of unanswered
questions, and n is the total number of questions. The main idea of this metric is to
encourage systems to reduce the number of incorrect answers while maintaining the
number of correct ones by leaving some questions unanswered.</p>
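      <p>The c@1 computation can be sketched as a small function (a minimal illustration; the function and variable names are ours, not part of the task specification):</p>

```python
def c_at_1(n_correct, n_unanswered, n_total):
    """c@1 measure: unanswered questions earn partial credit
    proportional to the overall accuracy n_correct / n_total."""
    return (n_correct + n_unanswered * (n_correct / n_total)) / n_total
```

      <p>For example, answering 5 of 10 questions correctly and leaving the other 5 unanswered yields c@1 = (5 + 5 * 0.5) / 10 = 0.75, whereas answering the other 5 incorrectly yields only 0.5.</p>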
    </sec>
    <sec id="sec-3">
      <title>3 Related Work</title>
      <p>The question answering task on university entrance exams was first introduced in this
form at the CLEF QA4MRE Lab in 2013. In this section, we review the previous work
from the pilot task. Related work on enabling technologies such as textual entailment is
introduced later in the paper.</p>
      <p>
        The best performing system in the 2013 entrance exam task was presented by
Banerjee et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Their answer determination module comprises named entity
recognition, syntactic similarity measures, and textual entailment. They select the correct
answer using a maximum weighted score algorithm, but more importantly, they have
successfully developed a technique for avoiding questions for which the system shows
low confidence rather than answering them wrongly. Overall, the system achieved a
c@1 score of 0.42 by answering 13 of the 23 attempted questions correctly.
      </p>
      <p>
        The runner up system in the performance ranking was presented by Li et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
with an overall c@1 score of 0.35. The system heavily relies on coreference resolution
in the narrative text. Moreover, it includes a sentence extractor to identify the most
relevant sentences to a particular question while the final answer is determined with a
textual entailment classifier. However, the authors conclude that their sentence extractor
was the main bottleneck and that the textual entailment component used in their system
did not perform up to the mark.
      </p>
      <p>
        Finally, Arthur et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] employ the same system [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] they proposed for the QA4MRE
2013 main task [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. It makes no use of textual entailment and fails to clear the
baseline (c@1 = 0.25) with a c@1 score of 0.22. The authors conclude that the reliance
on statistical methods alone and the absence of logical inference hinder the answer
determination ability of their system for this task.
      </p>
      <p>
        As discussed by Peñas et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], using purely statistical analyses of words to
determine the correct answer to a reading comprehension question does not work well
for this task. Using textual entailment, on the other hand, provides a means to carry out
necessary inferences to determine the correct answer. Banerjee et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], who created
the best performing system in the pilot task, also highlight the use of text similarity
measures in addition to textual entailment. With these conclusions at hand, we present
our system based on textual entailment, text similarity and coreference resolution.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4 System Architecture</title>
      <p>We now present the architecture of our question answering system which is based on
text similarity measures, textual entailment and coreference resolution. The general idea
of the system is to retrieve sentences from the text that are relevant for a given question
and to identify the correct answer option among the multiple choices based on textual
entailment and similarity between each option and the retrieved sentences.</p>
      <p>
        Our system builds upon the Darmstadt Knowledge Processing Software Repository
(DKPro) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which is based on the Apache UIMA Framework [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In particular, it uses
NLP components from DKPro Core2 and textual similarity measures from the DKPro
Similarity3 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] framework.
      </p>
      <p>The system is divided into six main modules: corpus reader, preprocessing, sentence
retrieval, answer similarity analysis, entailment analysis, and answer selection. Figure 1
gives a schematic overview of the system architecture. The individual components are
described in the remainder of this section.
</p>
      <sec id="sec-4-1">
        <title>4.1 Corpus Reader</title>
        <p>The corpus reader parses the provided XML document with the entrance exam data and
initializes a separate UIMA Common Analysis Structure (CAS) for each test item. The
text of each CAS consists of the narrative text appended with the question and the
corresponding answer options. Both the question and the answer options are then marked up
with UIMA annotations which identify the exact span of text and hold additional meta
information contained in the XML source document. The final CAS is then passed on
to the next module in the pipeline for further processing.</p>
        <p>[Figure 1: Schematic overview of the system architecture, showing the pipeline of six modules: Corpus Reader, Preprocessing, Sentence Retrieval, Answer Similarity, Textual Entailment, Answer Selection.]</p>
        <p>2 http://dkpro-core-asl.googlecode.com; 3 http://dkpro-similarity-asl.googlecode.com</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Preprocessing</title>
        <p>
          The preprocessing module consists of several NLP components that annotate the text
before the similarity and entailment analysis is carried out. All of these components
are based on the Stanford CoreNLP4 package and employed in DKPro Core wrappers.
We use the Stanford segmenter and lemmatizer for identifying tokens, sentences and
lemmas in the text. The Stanford parser [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] is furthermore used to annotate parts of
speech (PoS) and the constituent structure. Moreover, we observed that most of the
narrative texts feature recurring characters (as also noted by Li et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]). Hence, the
Stanford Named Entity Recognizer [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] is used to identify entities of the Person type
in the text. Finally, the Stanford coreference resolver [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] links the different named
entities with other occurrences in the text. Based on the resulting coreference chains,
an answer option is then modified, or resolved, such that pronouns in the answer option
are replaced with the corresponding proper nouns obtained from the question text. For
example:
        </p>
        <p>Question: What did John want?
Answer Option: He wanted a bike.</p>
        <p>Resolved Answer: John wanted a bike.</p>
        <p>To further improve the sentence retrieval, we also resolve all pronoun entities in the
main text. This resolution facilitates the identification of relevant sentences in the text
for each question and answer option, since it reduces the lexical variation introduced by
referring expressions.</p>
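        <p>The pronoun substitution described above can be illustrated with a toy function (a deliberately simplified sketch with hypothetical names; the actual system relies on the Stanford coreference resolver's mention chains rather than a flat lookup table):</p>

```python
def resolve(text, chains):
    # chains maps a referring expression to its representative mention,
    # e.g. {"He": "John"}; punctuation is preserved around each token
    out = []
    for tok in text.split():
        stripped = tok.strip(".,")
        rep = chains.get(stripped, stripped)
        out.append(tok.replace(stripped, rep))
    return " ".join(out)
```

        <p>Applied to the example above, <code>resolve("He wanted a bike.", {"He": "John"})</code> yields "John wanted a bike.".</p>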
        <p>4 http://nlp.stanford.edu/software/corenlp.shtml</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3 Sentence Retrieval</title>
        <p>
          We found that the downstream textual entailment module is sensitive to irrelevant
information. That is, the longer the text from which a given hypothesis is to be inferred,
the more likely the system is to report a false positive inference. It is therefore
necessary to first identify sentences in the text that are most relevant to a particular question
before proceeding with further analyses of the answer options. We use three different
text similarity measures to identify sentences from the document that are most relevant
to a particular question:
1. Lexical similarity: The Word n-gram Jaccard measure from the DKPro Similarity
package implements a generalization of the Ferret measure [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] to support
similarity calculations for token n-grams of arbitrary length using the Jaccard coefficient.
        </p>
        <p>
          We employ the Jaccard measure based on token unigrams.
2. ESA-based similarity: Text similarity based on explicit semantic analysis [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] using
a Wikipedia based index. For the similarity calculation, cosine is used as an inner
product along with L2 vector normalization in the vector comparator provided by
DKPro Similarity.
3. PoS similarity: A part of speech based measure for calculating the overlap between
two bags of PoS. We only consider nouns, verbs and adjectives for the similarity
calculations.
        </p>
        <p>For all similarity measures, the input texts are stopword-filtered using a subset of the
English stopword list from the DKPro WSD5 package. Each similarity measure
provides a score between 0 and 1. Using a linear combination of these scores, we select the
top k sentences for each question. In our experiments, we empirically found k = 5 to
be an optimal value for our setup. We use only these sentences in the following modules
to determine the correct answer among all possible options.</p>
        <p>As a basic processing unit for the following analyses, we define Text-Hypothesis
(T-H) pairs. We use each retrieved sentence from the narrative text as T and all answer
options along with all possible combinations of the question with each answer option
as H. Both for T and H, we use resolved and unresolved versions of the text, depending
on the system configuration.</p>
        <p>We noticed that, for many questions, the correct answer lies in the sentence before
or after the selected sentence. Hence, instead of restricting the set of T entities entirely
to the selected sentences, we added an option to include the previous and next sentences
as well. These are added as separate T entities to the set. This setting is referred to as
S later on (see Table 1).</p>
        <p>All downstream analyses are then performed on all possible T-H pairs. Our approach
to T-H pair creation is prone to overgeneration, i.e. it will contain irrelevant pairs that
can even degrade the system performance. We therefore implemented several selection
techniques that only consider particular T-H pairs. An overview of the different
configurations will be described in Section 5.</p>
        <p>5 http://dkpro-wsd.googlecode.com</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4 Textual Entailment</title>
        <p>
          In order to compute textual entailment on the T-H pairs, we use the EXCITEMENT
Open Platform (EOP) [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], a UIMA-based framework with implementations of
state-of-the-art textual entailment algorithms along with pre-trained models. Each T-H pair
is processed using an EOP annotator, which first carries out additional preprocessing
on the T-H pair (depending on the model used) and then uses an Entailment Decision
Algorithm (EDA) to perform the inference between T and H.
        </p>
        <p>
          We use the MaxEntClassificationEDA configuration provided in EOP, which
uses a maximum entropy model based on the OpenNLP MaxEnt [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] package for
learning an entailment classifier. Scoring techniques, such as bag of words and bag of lemmas
scoring, are used along with an option to employ additional resources including
VerbOcean and WordNet. This enables us to incorporate real-world knowledge to aid the
entailment procedure. The classifier was trained on the RTE-3 dataset [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. The entailment
decisions are binary but provide a confidence score. We store both the decision and the
confidence score in the T-H markup of the CAS.
        </p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5 Answer Similarity</title>
        <p>While the idea of T-H pairs originally stems from textual entailment theory, we
generalize the idea to similarity analysis as well. As mentioned previously, each sentence
(or resolved sentence) is labeled as text (T), while an answer option (or resolved answer
option, or a question combined with an answer option) is labeled as the hypothesis.
Consequently, similarity measures can be applied to these T-H pairs and corresponding
scores can be used to determine the correct answer.</p>
        <p>We explained earlier that we do not only include the answer options (and their
resolved variants) in the set of hypotheses, but also a combination of the answer option
with the question. The reason for this can be seen in the following example:
Question: People are normally regarded as old when
Answer Option: they are judged to be old by the society
Combined: People are normally regarded as old when they are judged to be old
by the society.</p>
        <p>Relevant Sentence: But in general, people are old when society considers them
to be old, that is, when they retire from work at around the age of sixty or
sixty-five.</p>
        <p>It is easy to see that the combination of question and answer makes the inference of the
correct answer easier as compared to using the answer option alone.</p>
        <p>We use the same similarity measures that we described earlier in Section 4.3 to
measure the similarity between T and H in all T-H pairs. A linear combination of the
scores generated by each measure is stored in the markup of each T-H pair and used
together with the entailment score for the final answer selection.
</p>
      </sec>
      <sec id="sec-4-6">
        <title>4.6 Answer Selection</title>
        <p>The entailment confidence score and the answer similarity scores obtained for each T-H
pair are finally used to select the correct answer option. We calculate a correctness score
as a weighted linear combination of the two scores, correctness = x * entailment + y * similarity.</p>
        <p>[Table 1: Overview of the seven submitted configurations, listing for each run whether original or resolved sentences were used, the similarity measures employed (Jaccard, ESA, PoS), and the T and H variants for the answer similarity and entailment analyses (S = selected sentence plus its neighbors, reS = resolved sentences, A = answer option, reA = resolved answer option, Q+A = question combined with answer option). In one configuration, sentences are selected based on the top 10 sentences according to their similarity to each answer option rather than the sentence selection procedure described in the text.]</p>
        <p>
We experimented with different linear combinations of the two scores and found that
the best performance is achieved when the entailment score is given a higher weight
compared to the similarity score. More specifically, we achieved good results with
{x, y} = {2, 1} and {x, y} = {3, 2}. In cases where the entailment confidence
exceeded a threshold of 0.90, we set the weights to {x, y} = {1, 0}, thus eliminating the
similarity score from the decision process. We finally select the answer option
corresponding to the T-H pair with the highest correctness score as the correct answer for a
given question.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Evaluation</title>
      <p>For the Entrance Exams 2014 task, we submitted a total of 7 configurations. In this section,
we discuss the individual configurations, the scores achieved with each setup and finally
provide a detailed error analysis.</p>
      <p>Each configuration consists of three main components, the sentence selection
strategy, the answer similarity analysis and the entailment analysis. The exact setup for each
configuration is provided in Table 1. Overall, configuration 7 achieved the best results
with a c@1 score of 0.375. Moreover, configurations 3, 5, and 6 performed above or
equal to the random baseline of 0.25, while 1, 2, and 4 performed poorly with c@1
scores below the baseline. The reason configuration 1 failed was mainly due to the fact
that answer selection was merely performed on the sentences pertaining to the
questions, without considering those surrounding the selected sentences. Hence the idea of
using the sentence before and after a selected sentence, was not implemented for this
configuration.</p>
      <p>Configuration 2, on the other hand, used the previous and next sentences, but failed
to clear the baseline. One of the main reasons for this was the fact that sentences chosen
as H for the entailment analysis were chosen based on high answer similarity scores
rather than the sentence selection procedure based on the question. This resulted in the
selection of certain sentences that were related to incorrect answer options, and not
directly related to the question. For configuration 4, the linear combination of answer
similarity and entailment scores used to determine the final answer was ineffective, and
led to incorrect answers. Configurations 3 and 5 did well because the weight assigned
to the entailment score was increased, and a lower weight (0 in the case of configuration
3) was assigned to the answer similarity score. Nonetheless, answer similarity helped
determine a few answers where entailment analysis alone had failed, for instance, in
configurations 5 and 7.</p>
      <p>Additionally, in all but the last run, we used only the original sentences as T.
However, for 7, we introduced the idea of using resolved sentences as T. This improved our
scores dramatically, giving a c@1 score of 0.375. As for configuration 6, without the
use of resolved answers, it failed to perform as well as 3, 5, or 7.</p>
      <p>This leads us to conclude that coreference resolution in the sentences and answer
options is highly beneficial in determining the correct answer. Additionally, using the
previous as well as the next sentence in addition to the selected sentence as T is better
than using only the selected sentence. The results confirm that many answers can be
identified from a sentence close to the selected one using the question text. Finally,
when using a linear combination of the answer similarity and entailment scores, a higher
weight should be assigned to the entailment score.
</p>
    </sec>
    <sec id="sec-6">
      <title>6 Error Analysis</title>
      <p>In order to identify systematic errors in the decision making process of our system,
we conducted an error analysis on the output of each configuration described in the
previous section.</p>
      <p>The implementation of coreference resolution replaces pronouns and phrases
describing a person entity with the proper noun. This generates some issues when the
phrase describing a person is vital to answer determination. The coreference resolution
in the preprocessing module can further be improved by replacing only certain terms,
mainly pronouns, in sentences with the corresponding named entities.</p>
      <p>While the sentence extraction works well in most cases, there are two problems
that can be encountered. The first problem is that the correct sentence pertaining to a
question is not the one with the highest score. Consequently, we select k (=5) sentences
rather than just one. Nonetheless, an improvement in identifying the most relevant
sentence, and reducing k to 1 or 2 sentences will help to narrow down possible mistakes
with answer determination. The second problem is when the question is not sufficient
to identify the correct sentence in the text. For example, a question like the following
requires sentence selection based on the answer options instead of the question text:</p>
      <p>Question: The main point the author wishes to make is that</p>
      <p>While the answer similarity module was able to assist in determining certain answers, it
was generally outperformed by the other measures employed in our system, in particular
by the entailment module. It particularly underperformed in the case of similarly phrased
answer options for a given question, which caused false predictions. Consequently, a
lower weight was assigned to the similarity score. With an improved linear combination
of the various similarity measures integrated in this module, the performance can further
be improved.</p>
      <p>While the entailment module based on EOP provides good results, especially in the
case of resolved sentences and answer options, it does not perform optimally in certain
scenarios. For instance, the module does not perform well when multiple answer
options are very similar. This is illustrated by the following example:</p>
      <p>Excerpt: “I’m from Georgia,” he said in a loud voice,“and proud of it.” “Sorry
I made the mistake,” I told him, though I just couldn’t see what difference it
made whether he came from Georgia or Alabama.</p>
      <p>Question: What mistake did the writer make about the man with glasses?
Incorrect Answer: He thought the man came from Georgia</p>
      <p>Correct Answer: He thought the man’s home state was Alabama</p>
      <p>Moreover, certain questions require advanced inference based on world knowledge
rather than relying on the document surface text alone. In these cases, the correct
answers are difficult to determine with the entailment module. For example:</p>
      <p>Excerpt: He was holding onto the arms of his seat so tightly that the blood had
left his fingers. He was what is known as a “white-knuckle flier.” He would
not look out the window, and he was sweating so much that the stewardesses
had to keep bringing him towels.</p>
      <p>Question: What is meant by a “white-knuckle flier?”
Answer: It is a person who is extremely nervous on an airplane.</p>
      <p>We also encountered errors when a question and answer pair spanned multiple
sentences. Since our system considers at most 3 sentences at a time, i.e. the selected,
previous and next sentence, it sometimes fails to obtain the desired inference for the correct
answer option. With a more generalized retrieval module that supports larger text
windows, we might be able to circumvent this problem, but we have to take into account the
added noise that is introduced by larger spans of text.</p>
      <p>Thus, to summarize, a refined approach to coreference resolution along with an
improved answer similarity module will help to improve scores to a great extent. In
addition, the entailment module can be trained with different datasets to improve the
classifier, while a tighter sentence selection module will help to narrow down possible
sentences relevant to a particular question.
</p>
    </sec>
    <sec id="sec-7">
      <title>7 Conclusions and Future Work</title>
      <p>Automatically answering multiple choice reading comprehension questions is a
challenging task. In this paper, we presented a system designed to solve such questions
without any domain-specific background knowledge. Our strategy for determining
answers to reading comprehension questions is based on three main components: text
similarity, textual entailment and coreference resolution.</p>
      <p>As already proposed in related work, textual entailment is the key technology for
tackling this task. However, it is sensitive to irrelevant information both in the text
and the hypothesis, so a strong filter or retrieval component based on text similarity
measures drastically helps to improve the quality of the entailment results. Due to the
nature of this task, which focused on reading comprehension questions about narrative
texts, coreference resolution furthermore helped to improve the discriminative power of
the entailment and similarity scores for the final answer selection.</p>
      <p>With our best configuration, we achieved a c@1 score of 0.375, which puts our
system in second place in the overall performance ranking. Moreover, 3 other
configurations performed better than or equal to the random baseline of c@1 = 0.25. In our
error analysis, we finally identified the most prominent error types our system
encountered.</p>
      <p>As part of future work, our priorities are to refine the coreference resolution module,
and to work on improving the performance of the answer similarity module. In addition,
the entailment module can be trained on datasets more suited to the task, in order to
improve the classifier, while a tighter sentence selection module will help narrow down
possible sentences relevant to a question.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work has been supported by the Volkswagen Foundation as part of the
Lichtenberg Professorship Program under grant No. I/82806.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Arthur</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neubig</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sakti</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toda</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nakamura</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Inter-Sentence Features and Thresholded Minimum Error Rate Training: NAIST at CLEF 2013 QA4MRE</article-title>
          . In
          <source>: Proceedings of CLEF 2013 Evaluation Labs and Workshop</source>
          . Online Working Notes. pp.
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Arthur</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neubig</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sakti</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toda</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nakamura</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>NAIST at the CLEF 2013 QA4MRE Pilot Task</article-title>
          .
          <source>In: Proceedings of CLEF 2013 Evaluation Labs and Workshop</source>
          . Online Working Notes. pp.
          <fpage>2</fpage>
          -
          <lpage>5</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Banerjee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhaskar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Multiple Choice Question (MCQ) Answering System for Entrance Examination</article-title>
          .
          <source>In: Proceedings of CLEF 2013 Evaluation Labs and Workshop</source>
          . Online Working Notes (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bär</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zesch</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurevych</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>DKPro Similarity: An Open Source Framework for Text Similarity</article-title>
          .
          <source>In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations</source>
          . pp.
          <fpage>121</fpage>
          -
          <lpage>126</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Berger</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pietra</surname>
            ,
            <given-names>V.J.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pietra</surname>
            ,
            <given-names>S.A.D.</given-names>
          </string-name>
          :
          <article-title>A maximum entropy approach to natural language processing</article-title>
          .
          <source>Computational Linguistics</source>
          <volume>22</volume>
          (
          <issue>1</issue>
          ),
          <fpage>39</fpage>
          -
          <lpage>71</lpage>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ferrucci</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lally</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>UIMA: an architectural approach to unstructured information processing in the corporate research environment</article-title>
          .
          <source>Natural Language Engineering</source>
          <volume>10</volume>
          (
          <issue>3-4</issue>
          ),
          <fpage>327</fpage>
          -
          <lpage>348</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Finkel</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grenager</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Incorporating non-local information into information extraction systems by Gibbs sampling</article-title>
          .
          <source>In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics</source>
          . pp.
          <fpage>363</fpage>
          -
          <lpage>370</lpage>
          . Morristown, NJ, USA (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Gabrilovich</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Markovitch</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Computing semantic relatedness using wikipedia-based explicit semantic analysis</article-title>
          .
          <source>In: Proceedings of The Twentieth International Joint Conference for Artificial Intelligence</source>
          . pp.
          <fpage>1606</fpage>
          -
          <lpage>1611</lpage>
          . Hyderabad, India
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Giampiccolo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Magnini</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dagan</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dolan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>The Third PASCAL Recognizing Textual Entailment Challenge</article-title>
          .
          <source>In: Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          . RTE '07
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Gurevych</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mühlhäuser</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müller</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steimle</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weimer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zesch</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Darmstadt Knowledge Processing Repository based on UIMA</article-title>
          .
          <source>In: Proceedings of the First Workshop on Unstructured Information Management Architecture at Biannual Conference of the Society for Computational Linguistics and Language Technology</source>
          . Tübingen, Germany
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          :
          <article-title>Accurate unlexicalized parsing</article-title>
          .
          <source>In: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics</source>
          . vol.
          <volume>1</volume>
          , pp.
          <fpage>423</fpage>
          -
          <lpage>430</lpage>
          . Morristown, NJ, USA (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ran</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Question Answering System for Entrance Exams in QA4MRE</article-title>
          .
          <source>In: Proceedings of CLEF 2013 Evaluation Labs and Workshop</source>
          . Online Working Notes (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Lyon</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrett</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malcolm</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>A theoretical basis to the automated detection of copying between texts, and its practical implementation in the ferret plagiarism and collusion detector</article-title>
          .
          <source>Plagiarism: Prevention, Practice and Policies</source>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Padó</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Noh</surname>
            ,
            <given-names>T.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stern</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zanoli</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Design and realization of a modular architecture for textual entailment</article-title>
          .
          <source>Natural Language Engineering</source>
          pp.
          <fpage>1</fpage>
          -
          <lpage>34</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Peñas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miyao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hovy</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Overview of QA4MRE 2013 Entrance Exams Task</article-title>
          .
          <source>In: Proceedings of CLEF 2013 Evaluation Labs and Workshop</source>
          . Online Working Notes. pp.
          <fpage>2</fpage>
          -
          <lpage>7</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Peñas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodrigo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A simple measure to assess non-response</article-title>
          .
          <source>In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies</source>
          . pp.
          <fpage>1415</fpage>
          -
          <lpage>1424</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Raghunathan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangarajan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chambers</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Surdeanu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A multi-pass sieve for coreference resolution</article-title>
          .
          <source>In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing</source>
          . pp.
          <fpage>492</fpage>
          -
          <lpage>501</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Sutcliffe</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peñas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hovy</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Forner</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Overview of QA4MRE Main Task at CLEF 2013</article-title>
          .
          <source>In: Proceedings of CLEF 2013 Evaluation Labs and Workshop</source>
          . Online Working Notes (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>