<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Text Mining Approach as Baseline for QA4MRE'12</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mathias Verbeke</string-name>
          <email>mathias.verbeke@cs.kuleuven.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jesse Davis</string-name>
          <email>jesse.davis@cs.kuleuven.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Katholieke Universiteit Leuven</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the participation of the KU Leuven DTAI team in the pilot task on machine reading of biomedical texts about the Alzheimer disease, which is part of the 2012 Question Answering for Machine Reading Evaluation campaign (QA4MRE'12). The main objective of our research was to develop a text mining system as a strong baseline for the task. Based on the outcome of this system, we want to investigate which types of questions can be answered based solely on the input text and the question string, and for which ones we need more advanced techniques that also consider the previously acquired background knowledge from the reference document collection. Furthermore this should give us some insights into the system behavior for speci c question types and background information for the development of a tailored inference algorithm.</p>
      </abstract>
      <kwd-group>
        <kwd>Question Answering</kwd>
        <kwd>Machine Reading</kwd>
        <kwd>Text Mining</kwd>
        <kwd>Baseline</kwd>
        <kwd>Biomedical Natural Language Processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Machine Reading (MR) or Natural Language Understanding [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is one of the core
objectives since the emergence of Arti cial Intelligence and Natural Language
Processing. Its main goal is to automatically extract knowledge from
unstructured text and to use this knowledge to make decisions or answer questions. The
latter is also the focus of the Question Answering for Machine Reading
Evaluation (QA4MRE [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) campaign, where the task is to read single documents and
identify the answers to a set of multiple-choice questions about information that
appears explicitly of implicitly in the text.
      </p>
      <p>The pilot task on machine reading of biomedical texts aims at exploring the
ability of systems to answer questions about a speci c scienti c topic, namely the
Alzheimer disease. Question answering in this domain poses additional challenges
for natural language processing which mainly arise because domain knowledge is
essential for achieving deep understanding in this setting. Consider the following
example for the QA4MRE pilot task on MR for biomedical texts:
Text: [...] Additionally, no estrogen could be detected in the APP23/Ar / mice
(data not shown), suggesting that aromatase gene knock-out prevented the
conversion of endogenous testosterone into estrogen. [...]
Question: What experimental approach is useful to create an in vivo system
where conversion of testosterone into estrogen is blocked?
Candidate Answers:
1. ELISA analysis
2. hole-board memory task
3. NEP activity assay
4. Western blot
5. knock-out of the aromatase gene
In this case, the machine reading system should identify `knock-out of the
aromatase gene" as the correct answer.</p>
      <p>In order to evaluate the system behavior for speci c question types and
investigate the necessity of using and reasoning about the background knowledge
contained in the reference document collection, our approach is based on basic
text mining techniques and does not make use of any existing resources. It is
developed to provide some insights towards a tailored inference algorithm and
can be seen as a strong baseline for the task.</p>
      <p>The paper is organized as follows. The methodology of the system built
for QA4MRE'12 is presented in section 2. Section 3 discusses the results and
analyses them by means of a detailed error analysis. Finally, section 4, concludes
and presents ideas for future work.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <p>
        The main goal of our approach is to investigate the importance of background
knowledge for MR of biomedical texts. Therefore our system is solely based
on basic text mining techniques, and does not rely on the reference document
collection or any other external resources. Since the outcome should be able to
serve as a baseline for the task, preprocessing is kept to a minimum. Sentence
splitting is performed, for which we relied on the splits from the preprocessed
les provided by the task organizers. For stopword removal, the English stop
word corpus provided by NLTK [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] was used.
      </p>
      <p>We propose two di erent strategies, that mainly di er in the order in which
the text mining techniques are applied. Both approaches only use the input text,
the question string, and the multiple choice answers.
2.1</p>
      <sec id="sec-2-1">
        <title>Approach 1: Question Similarity</title>
        <p>The rst approach computes the similarity between each question in the reading
test and every sentence in the input document and selects the top k most similar
sentences. Next, each sentence votes on an answer by checking to see if it contains
(part of) an answer. If it does, the respective answer's vote is incremented by
either 1 or the normalized similarity value. The answer with the highest number
of votes, i.e., the highest weight, is selected. The pseudocode (Algorithm 1)
describes the question similarity approach.
Algorithm 1 Pseudocode of the question similarity algorithm
1: for all question q in reading test do
2: q tok = wordTokenize(q)
3: for all sentence s in input document do
4: s tok = wordTokenize(s)
5: similarity(q tok, s tok)
6: end for
7: for all answer a in choice list do
8: a tok = wordTokenize(a)
9: for all top k sentence ts do
10: if element of a tok in ts then
11: incrementVote(a)
12: end if
13: end for
14: return top voted a
15: end for
16: end for</p>
        <p>
          For the similarity calculation in line 5, we employed two di erent measures,
namely the Jaccard similarity [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] (Equation 1) and the MASI distance [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]
(Equation 2). Both measures operate on sets, so both the sentence from the input
document as well as the question need to be tokenized to transform them into a set
of words. Suppose S is the set of words from a sentence in the input document
and Q is the set of words from the question. Then the two measures are de ned
as follows:
and,
        </p>
        <p>J accard(S; Q) = jS \ Qj</p>
        <p>jS [ Qj
M ASI(S; Q) = 1</p>
        <p>jS [ Qj
max(jSj; jQj)
(1)
(2)</p>
        <p>To convert the MASI distance into a similarity measure, we subtract it from
1. As can be seen from the formula, the MASI distance is a weighted version of
the Jaccard similarity, that takes into account partial agreement between sets,
by downweighting the Jaccard score when partial overlap between the sets exists.</p>
        <p>For tokens that refer to numbers, we also check if the written form of numbers
is contained in the sentence. In future work we also plan to integrate the reverse,
i.e. if the numerical form of a written token appears in the sentence.</p>
        <p>We also explored the following variations of algorithm 1:
Answer concatenation Instead of rst calculating the similarity between the
question and the sentence, followed by checking if the answer appears in the
sentence, the question and the answer can also be concatenated before the
similarity calculation. Subsequently the concatenation of the question and every
answer is compared with every sentence in the input document. Lastly, the
average similarity of the k most similar sentences is calculated from which the best
one is chosen.</p>
        <p>Weighting by similarity Instead of giving the same weight to each answer when it
appears in a top k sentence, the answers can also be weighted with the calculated
similarity value. This sum of similarity values is then normalized by the number
of times the answer was voted for, before it is used to select an answer based on
majority voting.</p>
        <p>Overall similarity If we drop the top k requirement when voting for the answer
in the last step of the algorithm, all sentences that are not completely di erent
from the question (i.e. have a MASI and Jaccard similarity di erent from 0) are
taken into account in the voting process. The answer that is contained in the
greatest number of the sentences is chosen.
2.2</p>
      </sec>
      <sec id="sec-2-2">
        <title>Approach 2: Answer Containment</title>
        <p>The second approach reverses the procedure of the question similarity approach,
and rst checks if the answer appears in the sentence. If it does, the similarity
is calculated and the sentence with the highest similarity is selected, which can
be seen from the pseudocode in Algorithm 2.</p>
        <p>Algorithm 2 Pseudocode of the answer containment algorithm
1: for all question q in reading test do
2: q tok = wordTokenize(q)
3: containingSentences = [ ]
4: for all answer a in choice list do
5: a tok = wordTokenize(a)
6: for all sentence s in input document do
7: s tok = wordTokenize(s)
8: if element of a tok in s tok then
9: containingSentences.append(s tok)
10: end if
11: end for
12: for all sentence cs in containingSentences do
13: cs tok = wordTokenize(cs)
14: similarity(q tok, cs tok)
15: end for
16: for all top k sentence ts do
17: incrementVote(a)
18: end for
19: return top voted a
20: end for
21: end for</p>
        <p>For the similarity calculation, the weighting by similarity approach as
described for Algorithm 1 is used, with MASI as the similarity measure.
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results and Discussion</title>
      <p>
        These two approaches and the described variations were tested on the 40
questions from the four reading tests. The scoring of the output for each reading test
is done using c@1 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], an evaluation measure between 0 and 1 that rewards
systems that, while maintaining the number of correct answers, are able to reduce
the number of incorrect ones by leaving some questions unanswered. It can be
formulated as follows:
n1 (nR + nU nnR )
(3)
where
nR: the number of correctly answered questions
nU : the number of unanswered questions
n: the number of questions
Furthermore, accuracy was measured.
      </p>
      <p>In Table 1, the question answering evaluation for each run is listed. It contains
the number of answered and unanswered questions, together with the accuracy
and c@1 measure over all readings tests. Furthermore, it shows the used
algorithm, its parameter setting for k (i.e. the number of most similar sentences
taken into account in the voting process), the employed similarity measures and
the variations to the basic algorithm.
{ For the standard version of the rst algorithm, a higher k value has a positive
in uence on the results. This is caused by the fact that when k is set to 5,
for some questions, none of the answers appear in the selected sentences,
meaning they do not get answered (respectively 7 and 8 questions, in run
01 and 02), whereas with a higher k, the correct answer can be selected in
some of the cases. During development, higher numbers of k were tried on
the sample reading test, but this did not yield better results.
{ In algorithm 2, it is checked if the answer tokens appear in sentences of the
input document at the start of the algorithm. Since for every question, one
or more of the answers could be found verbatim in the input document, this
gave no option for leaving the question unanswered, which seems to have a
negative e ect on the results.
{ The variation of algorithm 1, where the answers are weighted by similarity,
gives on average the best results, with a score of 0.30 for both accuracy and
c@1. Furthermore, this seems to be independent of the similarity measure
used, with equal scores for MASI and Jaccard.
{ Using the answer concatenation and overall similarity variants of algorithm
1 resulted in small improvements when compared to the standard version,
where MASI was used as the similarity measure.
{ On average over all runs, reading test 2 gave the best results (average c@1 of
0.317), followed by test 4 (0.259), test 3 (0.24) and test 1 (0.121). Although
test 2 is also the one with the highest standard deviation, whereas task 4
has the lowest.
{ The best scoring runs, namely 06 and 10, which both use the weighted by
similarity variation of algorithm 1, also have the lowest standard deviation,
which could be an indication that this algorithm is more robust to di erent
types of questions.
{ The highest individual reading test score of 0.50 is achieved by run 09 for
test 2. The overall similarity approach used for this run highly resembles
the weighting by similarity variant, except for the top k requirement. If the
answer to some of the questions in reading test 2 was only mentioned in
sentences that were more dissimilar than the k most similar ones, this could
explain this result.</p>
      <p>Also at the individual question level, we analyzed our results, with the
following main observations:
{ The following questions could be correctly answered in at least half of the
runs (the number before the dot indicates the reading test, the number
behind it the question, and the one between brackets the number of runs in
which it was answered correctly): 1.2 (5), 2.6 (9), 2.8 (7), 3.3 (5), 4.2 (8) and
Run
01
02
03
04
05
06
07
08
09
10
4.6 (6). For all of these questions, the input document contained sentences
that included one or multiple tokens from the question together with the
answer.
{ The proposed approach does not perform an analysis of the question
beforehand. Doing so could help in correctly answering a substantial number
of questions. For example, in question 4.4, the experimental technique that
was used to purify the -secretase complex is asked for. Since only 2 of the
5 questions deal with experimental techniques, the other answers could be
discarded beforehand.
{ 13 out of the 40 questions were incorrectly answered in all runs. A deeper
analysis should con rm this, but we suppose answering these questions
requires combining information from multiple sentences, performing some form
of inference.</p>
      <p>When the test set preparation procedure is made available, we would like to
analyze to which level of question di culty our system is able to answer questions
correctly. In order to do this, it should be known which facts are explicitly
stated in the text, which ones are present but are not explicitly related and
for which facts inference is needed to connect them and form the answer. This
will also clarify to what extent the reference background document collection is
a prerequisite to answer certain (types of) questions. In addition, this will also
give some insights on which kinds of textual inference and features are important
for the task, i.e., lexical information (e.g., acronym, synonymy,
hyperonymyhyponymy), syntactic structures (e.g., nominalization-verbalization, causative,
paraphrase, active-passive) or the identi cation of discourse relations (e.g.,
coreference, anaphora ellipsis).</p>
      <p>From our current system design, it is clear that only questions for which the
answer is mentioned in the same sentence can be answered, since it does not
combine information at the document level. Nor does it consult the background
collection or perform inference to reason about facts stated in the text. In a
more extensive error analysis, we will also analyze the in uence of the similarity
measure on the results.</p>
      <p>Furthermore, the answer level needs to be analyzed, in order to determine
which types of answers the system is able to distinguish. For example, can the
system discriminate between di erent answers if they are all enzymes, or only
answer questions that ask for an entity of which the type is unique in the
answer list? Currently, the only preprocessing that is done at the answer level
is the transformation of verbatim numbers to their numerical form, but other
transformations may be necessary.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and Future Work</title>
      <p>We presented the methodology of our system for machine reading of biomedical
texts about the Alzheimer disease. Question answering about a scienti c topic
poses additional challenges to a MR system, which is mainly caused by the
increased importance of the background knowledge. The main purpose of our
approach was to investigate how far-reaching this in uence was. Therefore we
developed a system using basic text mining techniques, without considering any
external resources.</p>
      <p>The proposed approach can be seen as a strong baseline for the task.
Futhermore, the error analysis of the results provides valuable insights for further
improvement. In future work, we want adopt this knowledge to develop a tailored
inference algorithm for this task.
5</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>Mathias Verbeke is funded by the Research Foundation Flanders (FWO-project
G.0478.10 - Statistical Relational Learning of Natural Language). Jesse Davis is
partially supported by the Research Foundation Flanders (FWO-project G.0356.12
- A Synergestic Approach to Extraction, Learning and Reasoning for Machine
Reading).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Allen</surname>
          </string-name>
          , J.:
          <source>Natural Language Understanding (2nd edition)</source>
          .
          <source>Benjamin/Cummings</source>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Question Answering for Machine Reading Evaluation, http://celct.fbk.eu/ QA4MRE</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Loper</surname>
          </string-name>
          , E.:
          <string-name>
            <surname>Natural Language Processing with Python. O'Reilly Media</surname>
          </string-name>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Jaccard</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Etude comparative de la distribution orale dans une portion des Alpes et des Jura</article-title>
          .
          <source>Bulletin del la Societe Vaudoise des Sciences Naturelles</source>
          ,
          <volume>547</volume>
          {
          <fpage>579</fpage>
          (
          <year>1901</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Passonneau</surname>
          </string-name>
          , R.:
          <article-title>Measuring agreement on set-valued items (MASI) for semantic and pragmatic annotation</article-title>
          .
          <source>In: Proceedings of the International Conference on Language Resources and Evaluation (LREC)</source>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Pen~as,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Rodrigo</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          :
          <article-title>A Simple Measure to Assess Non-response</article-title>
          .
          <source>In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies</source>
          , pp.
          <fpage>1415</fpage>
          -
          <lpage>1424</lpage>
          , ACL, Portland, Oregon, USA (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>