<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Approach to the Main Task of QA4MRE-2013</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marília Santos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jose Saias</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paulo Quaresma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Departamento de Informatica, ECT Universidade de Evora</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This article describes the participation of a group from the University of Evora in the CLEF2013 QA4MRE main task. Our system takes an approach based on superficial text analysis. The methodology starts with the preprocessing of the background collection documents, whose texts are lemmatized and then indexed. Named entities and numerical expressions are sought in questions and their candidate answers. Then the lemmatizer is applied and stop words are removed. An answer pattern is formed for each question+answer pair, together with a search query for document retrieval. The original search terms are expanded with synonyms and hypernyms. Finally, the texts retrieved for each candidate answer are segmented and scored for answer selection. Considering only the main questions, the system's best result was obtained in the third run, which answered 206 questions, with 51 correct answers and a c@1 of 0.24. When main and auxiliary questions are evaluated together, the final run again gave our best results: 245 questions answered, 64 of them correctly, and a c@1 of 0.26. The use of hypernyms proved to be an improvement factor in the third run, whose results show a 12% increase in correct answers and a 0.02 gain in c@1.</p>
      </abstract>
      <kwd-group>
        <kwd>MRE</kwd>
        <kwd>QA</kwd>
        <kwd>NLP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        This article describes the participation of a group from the University of Evora in
the Question Answering for Machine Reading Evaluation (QA4MRE) challenge
of the 2013 edition of the Cross Language Evaluation Forum (CLEF, http://clef2013.org/). Although
some authors of this paper have previous work in other QA4MRE editions [
        <xref ref-type="bibr" rid="ref4 ref5">4,5</xref>
        ],
this work is based on a new system for the QA4MRE Main Task, associated with
the first author's master's thesis work, and focused on the English language.
The objective of this task is the automatic understanding of one or more texts,
and the subsequent identification of the answers to several questions about
information that is stated or implied in those texts. While answering the questions,
systems must process single documents, and may use Background Collections (BC) of
documents as auxiliary information sources [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
This year's QA4MRE Main Task comprised 4 topics, namely "Aids",
"Climate Change", "Music and Society" and "Alzheimer", each with its own
background collection of documents. Each topic had 4 reading tests with
15 to 20 questions each, and each question had 5 candidate answers [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The test
comprised 240 main questions and 44 auxiliary questions. The latter are
duplicates of main questions, but without the inference originally required,
which makes it possible to measure the ability of systems to use inference and its impact on
question treatment.
      </p>
      <p>The next section presents our system architecture. Section 3 describes the
methodology we used to process the questions, answers and background information.
The evaluation of the results obtained is detailed in Section 4, while the last
two sections are devoted to an analysis of those results, some conclusions and an
assessment of our participation.</p>
    </sec>
    <sec id="sec-2">
      <title>Architecture</title>
      <p>The system architecture is shown in Figure 1 and has the following components:</p>
      <p>XML Parser - Extracts texts, questions and answers from the input and
stores them on the system;</p>
      <p>Indexing Component - Documents from the BC pass through the lemmatizer
(C&amp;C tools / Boxer<sup>2</sup>) and are then indexed with Lucene<sup>3</sup>;</p>
      <p>Consult Index Component - Responsible for processing the question and
answers and performing document retrieval. With keywords from the question and
answers, this component uses Lucene to search for relevant documents in the BC. The
analysis and search query creation are based on:
- Lemmatizer - Words from the question and answers are converted to the corresponding
lemma form;
- Named Entity Recognition (NER) - Through regular expressions, the system
tries to identify entity names or mentions;
- WordNet module from the Natural Language Toolkit<sup>4</sup> - The system uses
synonyms, derivationally related forms and hypernyms;
- Numerical expressions - Through regular expressions, the system tries to identify
numerical expressions;
- Stop word removal.</p>
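      <p>The regular-expression and stop-word steps listed above can be illustrated with a minimal sketch. The paper does not give its patterns or its stop-word list, so ENTITY_RE, NUMBER_RE and STOP_WORDS below are hypothetical stand-ins chosen only to reproduce the running example; lemmatization and WordNet expansion are left out.</p>
      <preformat><![CDATA[
```python
import re

# Hypothetical patterns and stop-word list (assumptions, not the system's own).
ENTITY_RE = re.compile(r"\b[A-Z][\w']*(?:\s+[A-Z][\w']*)*")
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)*%?")
STOP_WORDS = {"how", "can", "the", "of", "through", "with", "by", "in",
              "a", "an", "is", "what"}


def find_entities(text):
    """Return runs of capitalised words, dropping matches that are stop words."""
    matches = ENTITY_RE.findall(text)
    return [m for m in matches if m.lower() not in STOP_WORDS]


def find_numbers(text):
    """Return numerical expressions such as 2013, 5,000 or 12%."""
    return NUMBER_RE.findall(text)


def remove_stop_words(text):
    """Tokenise on word characters and apostrophes, then drop stop words."""
    tokens = re.findall(r"[\w']+", text)
    return [t for t in tokens if t.lower() not in STOP_WORDS]


question = "How can Alzheimer's patient regain the sense of smell?"
print(find_entities(question))
print(remove_stop_words(question))
```
]]></preformat>
      <p>With this simple pattern the entity found is only "Alzheimer's", whereas the system described here reports "Alzheimer's patients", so its actual expressions are evidently richer than this sketch.</p>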
      <p>Filter Component - Responsible for selecting relevant text segments and assigning a
score to each segment and to each candidate answer. This component applies a
set of criteria to choose the most plausible answer.
2 http://svn.ask.it.usyd.edu.au/trac/candc/wiki/boxer
3 Apache Lucene is an open source information retrieval software library.</p>
      <p>http://lucene.apache.org/
4 http://nltk.org
The system is based on a simple approach without deep linguistic processing. In
this edition of QA4MRE, our system generated 3 runs, with minor differences
in configuration, as explained below. The processing performed on the BC texts,
the reading tests and the questions comprises the following steps:
1. Indexing Component (this component is used only once)
(a) Lemmatization is applied to the text of all documents in the BC;
(b) BC documents are indexed, considering the lemmatizer outcome;
2. XML Parser</p>
      <p>(a) The information from the input is extracted and stored in the system;
3. Consult Index Component - Each question is processed with the following
steps, illustrated by the examples:
(a) Entities and numerical expressions from the question and candidate answers
are stored in the system. The filter uses them to score answers and text
segments;
How can Alzheimer's patients regain the sense of smell?
1 through chemotherapy
2 through clinical trials
3 through treatment with bexarotene
4 by lying in the sun
5 None of the above</p>
      <p>Entities: Alzheimer's patients
(b) Question and candidate answers pass through the lemmatizer;
How can Alzheimer's patient regain the sense of smell?
1 through chemotherapy
2 through clinical trial
3 through treatment with bexarotene
4 by lie in the sun
5 None of the above
(c) Stop words are removed from question and candidate answers;
Alzheimer's patient regain sense smell
1 chemotherapy
2 clinical trial
3 treatment bexarotene
4 lie sun
5 none
(d) For each pair (question, candidate answer), the system tries to form an Answer
Pattern. The Answer Pattern is composed of: keywords from the question;
keywords from the answer; and synonyms, derivationally related forms and
hypernyms (used only in the third run) of each keyword;
- Alzheimer's
synonyms: Alzheimer's disease | hypernyms: dementia
- patient
hypernyms: case
- regain
synonyms: recover | related forms: recoverer | hypernyms: get
- sense
hypernyms: awareness
- smell
hypernyms: sensation
- chemotherapy
related forms: chemotherapeutical | hypernyms: therapy
- clinical
related forms: clinic
- trial
synonyms: test | hypernyms: attempt
- treatment
related forms: treat | hypernyms: care
(e) Document retrieval: Lucene is queried over the indexed BC with the
generated Answer Patterns to get the relevant documents;
Query:
((Alzheimer's OR dementia OR Alzheimer's_disease) OR (case
OR patient) OR (regain OR recoverer OR recover OR get) OR
(awareness OR sense) OR (smell OR sensation)) OR
((chemotherapy OR chemotherapeutical OR therapy)) OR
((clinic OR clinical) OR (test OR trial OR attempt)) OR
((care OR treatment OR treat) OR (bexarotene)) OR
((lie) OR (sun))
4. Filter Component - For each question:
(a) Each document is validated for each Answer Pattern:
- If it doesn't contain 50% of the keywords from the question and 50% of the keywords
from the answer, it is discarded;
- If the answer has a numerical expression which does not occur in the
document, it is discarded;
- If the answer or the question has entities and the document does
not contain 30% of them, it is discarded;
(b) When a document is valid:
- Each Answer Pattern that validates the current document receives
a score equal to the sum of:</p>
      <p>Number of entities in the text;
Number of numerical expressions in the text;
Number of times that each keyword from the current Answer
Pattern occurs in the text;
- The document score is the sum of the scores of its Answer Patterns;
(c) Thereafter, a second analysis is performed, only on the top 5 documents
returned by the filter (this step is used only in the first and
third runs):
- Documents are split into text segments;
- The current Answer Pattern's score is incremented if 80% of the Answer
Pattern's words are present in the current text segment and the
distance between them is less than or equal to 5;
(d) Answer Selection:
- If the filter returns no relevant documents, the system selects
the answer "5 - None of the above";
- The system leaves the question unanswered when more than one answer has the
maximum score, or when there is only a small difference between the
maximum and another answer's score;
- If none of the above applies, the system returns the answer with the
maximum score.</p>
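      <p>The query construction of step 3(e), the document validation of step 4(a) and the answer selection of step 4(d) can be sketched as follows. This is an illustration under stated assumptions, not the system's code: all function names are hypothetical, documents are represented as sets of lemmatized tokens, entities are treated as single tokens, the "50% and 50%" rule is read as requiring both thresholds, and the margin that counts as "a small difference" is an invented parameter, since the paper does not state its value.</p>
      <preformat><![CDATA[
```python
def build_query(question_terms, answer_terms, expansions):
    """OR each keyword with its expansions, then OR the question and answer groups."""
    def group(term):
        return "(" + " OR ".join([term] + expansions.get(term, [])) + ")"

    question_part = " OR ".join(group(t) for t in question_terms)
    answer_part = " OR ".join(group(t) for t in answer_terms)
    return "(" + question_part + ") OR (" + answer_part + ")"


def share_present(terms, doc_tokens):
    """Fraction of terms found in the document; an empty list counts as fully present."""
    if not terms:
        return 1.0
    return sum(term in doc_tokens for term in terms) / len(terms)


def is_valid(doc_tokens, question_terms, answer_terms, entities, numbers):
    """Apply the three discard rules of step 4(a) to one document."""
    if share_present(question_terms, doc_tokens) < 0.5:
        return False
    if share_present(answer_terms, doc_tokens) < 0.5:
        return False
    if any(number not in doc_tokens for number in numbers):
        return False
    if share_present(entities, doc_tokens) < 0.3:
        return False
    return True


def select_answer(scores, margin=1):
    """Pick an answer id from {answer_id: score}; None means 'unanswered'.

    margin is a hypothetical threshold for 'a small difference'.
    """
    if not scores:
        return 5  # no relevant documents: "None of the above"
    ranked = sorted(scores.values(), reverse=True)
    if len(ranked) > 1 and ranked[0] - ranked[1] <= margin:
        return None  # tie or near-tie between the best two answers
    return max(scores, key=scores.get)


print(build_query(["smell"], ["bexarotene"], {"smell": ["sensation"]}))
```
]]></preformat>
      <p>Returning None for near-ties mirrors the way the c@1 measure rewards leaving a question unanswered over answering it wrongly.</p>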
      <p>The difference between the runs is reflected in the number of answers given and
in the system's accuracy. This difference can be observed in the following examples:
Example 1: How can Alzheimer's patients regain the sense of smell?
Unanswered in the first and second runs;
Answered correctly in the third run.</p>
      <p>Example 2: How can apolipoprotein E help people with Alzheimer's?
Answered wrongly in the first and second runs;
Answered correctly in the third run.
Example 3: What is U.S. AIDS policy dominated by?
Unanswered in the second run;
Answered correctly in the first and third runs.</p>
      <p>Examples 1 and 2 are cases where the use of hypernyms brings a small
improvement in the Filter Component. Example 3 shows the importance of applying
methodology step 4.c when the information is not dispersed.</p>
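      <p>The second analysis of step 4.c can be sketched as below. The paper does not say how documents are split into segments or how the distance between words is measured, so both are assumptions here: segments are sentences split on end punctuation, and the distance is the gap in token positions between consecutive occurrences of pattern words. The names split_segments and segment_supports are hypothetical.</p>
      <preformat><![CDATA[
```python
import re


def split_segments(text):
    """Split a document into sentence-like segments (assumed segmentation)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def segment_supports(pattern_words, segment_tokens, min_share=0.8, max_gap=5):
    """True if enough pattern words occur in the segment and they lie close together."""
    words = set(pattern_words)
    if not words:
        return False
    # Positions in the segment where any pattern word occurs.
    positions = [i for i, tok in enumerate(segment_tokens) if tok in words]
    found = {segment_tokens[i] for i in positions}
    if len(found) / len(words) < min_share:
        return False
    # Every consecutive pair of occurrences must be at most max_gap tokens apart.
    return all(b - a <= max_gap for a, b in zip(positions, positions[1:]))


pattern = ["treatment", "bexarotene", "smell"]
segment = "treatment with bexarotene restore the sense of smell".split()
print(segment_supports(pattern, segment))
```
]]></preformat>
      <p>A segment passing this check would increment the score of the corresponding Answer Pattern, which is why the step helps most when the supporting information is concentrated in one place.</p>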
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>
        In QA4MRE, the evaluation of all runs submitted is based on the c@1 measure,
discussed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]:
      </p>
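      <p>
        <disp-formula id="eq1"><label>(1)</label><tex-math><![CDATA[c@1 = \frac{1}{n}\left(n_R + n_U \, \frac{n_R}{n}\right)]]></tex-math></disp-formula>
where, in Equation (1):
n_R - number of correctly answered questions;
n_U - number of unanswered questions;
n - total number of questions.</p>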
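      <p>As a check on the measure, the following short sketch computes c@1 from the three counts defined above. The function name c_at_1 is our own; the example figures are the third-run counts reported in this paper (206 of 240 main questions answered with 51 correct, and 245 of 284 questions answered with 64 correct).</p>
      <preformat><![CDATA[
```python
def c_at_1(n_correct, n_unanswered, n_total):
    """c@1: unanswered questions earn credit in proportion to overall accuracy."""
    return (n_correct + n_unanswered * n_correct / n_total) / n_total


# Third run, main questions only: 240 questions, 206 answered, 51 correct.
print(round(c_at_1(51, 240 - 206, 240), 2))
# Third run, main and auxiliary questions: 284 questions, 245 answered, 64 correct.
print(round(c_at_1(64, 284 - 245, 284), 2))
```
]]></preformat>
      <p>Both calls reproduce the reported values of 0.24 and 0.26 after rounding to two decimals.</p>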
      <sec id="sec-3-1">
        <title>Evaluation on the main questions</title>
        <p>In the first run the system answered 188 of the 240 questions, of which only
45 were correct, resulting in a c@1 of 0.23. In the second run, 185 questions were
answered, with a c@1 of 0.18. In the last run we answered 206 questions,
with 51 correct answers and a c@1 of 0.24. Table 1 shows the details of the system
result assessment, by topic and by run.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Evaluation on all questions</title>
        <p>In the first run, the system answered 224 of the 284 questions. Of those,
57 were correctly answered, and the c@1 was 0.24. In the second run, 219 questions
were answered and the c@1 was 0.19. In the final run, our system answered 245
questions, finding 64 right answers and obtaining a c@1 of 0.26. Table 2 shows
these results in greater detail.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>One of the main causes of this system's failures is the lack of an entity
disambiguation module, because entities are quite often referred to by different expressions.
Other identified causes are:</p>
      <p>1. Yes/no questions;
2. Answers supported by adverbs of frequency (rarely, always, never,
sometimes, ...);
3. Words with high frequency have a negative impact on our system due to the way
the scoring algorithm works. This is especially noticeable when it causes the
selection of non-relevant documents and incorrect answers and thereby
rules out the possibility of answering "5 - None of the above". These
failures were observed mainly for the Aids topic.</p>
      <p>We have also observed that using a second analysis in the Filter Component (step
4.c in the methodology section) is only effective when the information about the
correct answer is not dispersed over several documents. Nevertheless, this
approach brought an improvement of 5-8% relative to the base option (run
2), except for the topic "Music and Society", where it had no
impact. The use of hypernyms didn't bring any improvement in the Aids topic,
but in the "Alzheimer" and "Climate Change" topics it brought an improvement
of 10% relative to the base option, and in the "Music and Society" topic an
improvement of 5%.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We described our experience in the QA4MRE challenge, using a simple system with
an approach based on superficial text analysis. This system clearly needs further
development, aimed at improving the analysis of the questions and answers.
Namely, we intend to work on the disambiguation of entities, on establishing
relations between acronyms and entities, and on handling the failure causes
described in the previous sections. One of the critical aspects is to change the way
our system evaluates answer patterns composed of high-frequency words;
we need to add a new component to improve the answer selection process and,
in particular, to take into account the question and answer types. We have also
found that incorporating an anaphora resolution module would allow the
system to answer more questions and improve its performance.
On a more abstract level, we intend to assess the strengths of the system used
by Evora's team last year and combine its strategies with some of the new ideas tested in
this year's work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <source>QA4MRE</source>
          <year>2013</year>
          , http://celct.fbk.eu/QA4MRE
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. QA4MRE@CLEF2013. Track Guidelines, http://celct.fbk.eu/QA4MRE
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Peñas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodrigo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A simple measure to assess non-response</article-title>
          .
          <source>In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1</source>
          . pp.
          <fpage>1415</fpage>
          -
          <lpage>1424</lpage>
          . HLT '11,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Saias</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quaresma</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>The di@ue's participation in qa4mre: from qa to multiple choice challenge</article-title>
          . In:
          <string-name>
            <surname>Petras</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Forner</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clough</surname>
            ,
            <given-names>P.D.</given-names>
          </string-name>
          (eds.)
          <source>CLEF 2011 Labs and Workshop: Notebook Papers</source>
          . Amsterdam, The Netherlands (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Saias</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quaresma</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Di@ue in clef2012: question answering approach to the multiple choice qa4mre challenge</article-title>
          .
          <source>In: Proceedings of CLEF 2012 Evaluation Labs and Workshop - Working Notes Papers</source>
          . Rome, Italy (September
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>