<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Biomedical Text Mining about Alzheimer's Diseases for Machine Reading Evaluation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bing-Han Tsai</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yu-Zheng Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wen-Juan Hou</string-name>
          <email>emilyhou@csie.ntnu.edu.tw</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Information Engineering, National Taiwan Normal University</institution>
          ,
          <addr-line>Taipei</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the experiments carried out for our participation in the pilot task on machine reading of biomedical texts about Alzheimer's disease for QA4MRE at CLEF 2012. We submitted a total of five unique runs to the pilot task. One run uses the Term Frequency (TF) of the query words to weight the sentences. Two runs use the Term Frequency-Inverse Document Frequency (TF-IDF) of the query words to weight the sentences; these two runs differ only in which answer is chosen when multiple answers receive the same score from our system. The last two runs use the TF or TF-IDF weighting scheme together with the OMIM terms about Alzheimer's disease for query expansion. Stopwords are removed from the query words and answers. Each sentence in the associated document is assigned a weighting score with respect to the query words, and a sentence with a higher weighting score is regarded as more relevant. Each answer option for a given question is scored according to the sentence weighting scores, and the highest-ranked answer is selected as the final answer.</p>
      </abstract>
      <kwd-group>
        <kwd>question-answering</kwd>
        <kwd>machine reading</kwd>
        <kwd>biomedical text mining</kwd>
        <kwd>QA4MRE</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The pilot task on machine reading of biomedical texts about Alzheimer’s disease follows the
same setup and principles as QA4MRE, except that it focuses on the biomedical domain.
Its motivation is to help researchers process Alzheimer-related literature more
efficiently. The task focuses on the reading of single documents and the identification
of the answers to a set of questions about information that is stated or implied in the
text. Questions are multiple choice, each with five options and only one correct answer.</p>
      <p>We submitted a total of five unique runs to the pilot task. One run uses the Term
Frequency (TF) of the query words to weight the sentences. Two runs use the Term
Frequency-Inverse Document Frequency (TF-IDF) of the query words to weight the
sentences. These two runs differ only in which answer is chosen when multiple answers
receive the same score from our system. The last two runs use the TF or TF-IDF
weighting scheme together with the OMIM terms about Alzheimer’s disease for query
expansion.</p>
      <p>The paper is organized as follows. Section 2 describes the corpus used in our
experiments. Section 3 introduces the system architecture and the methods we propose.
Section 4 presents and discusses the evaluation results. Finally, Section 5 draws
conclusions and outlines future directions.</p>
    </sec>
    <sec id="sec-2">
      <title>Corpus Statistics</title>
      <sec id="sec-2-1">
        <title>Background Collections</title>
        <p>We use the three background collections provided by the pilot task. Each is briefly
introduced below.</p>
        <p>Open Access Full Articles PMC. 7,512 articles are provided in text format from
PubMed Central. These articles were selected by performing a search and keeping the
full articles that belong to the PubMed Central Open Access subset.</p>
        <p>Open Access Full Articles PMC, Smaller Collection. There are 1,041 full-text
articles from PubMed Central. To select these documents, a search was performed for the
pilot task on PubMed using Alzheimer's disease related keywords, restricted to the last
three years.</p>
        <p>Elsevier Full Articles. There are 379 full-text articles and 103 abstracts from Elsevier.
The articles in this subset were selected from a list of articles provided by
Professor Tim Clark from the Massachusetts Alzheimer's Disease Research.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Test Data</title>
        <p>The test set is composed of four reading tests. Each reading test consists of one
document with ten questions and a set of five choices per question. In total, there are
forty questions and 200 answer options.</p>
      </sec>
      <sec id="sec-2-3">
        <title>OMIM Terms about Alzheimer</title>
        <p>We retrieved 1,549 entities and related genes about Alzheimer’s disease from the
OMIM website [1]. We use these terms for query expansion in Run 4 and Run 5.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Method</title>
      <p>The main system architecture, adopted in Runs 1, 2 and 3, is illustrated in Fig. 1.
In Run 4 and Run 5, we additionally use the OMIM terms about Alzheimer’s disease, as
well as other resources, for query expansion. The expanded system architecture adopted in
these two runs is shown in Fig. 2, where Part A is the main system architecture of
Fig. 1.</p>
      <p>[Fig. 1. The main system architecture. Box labels recovered from the figure: QA4MRE
Test Data; Documents; Questions; Preprocessing; Stopword Removal; Punctuation Removal;
Stemming; Porter’s Stemmer; Stemmed Documents; Query Words; Query Word Related Sentence
Retrieval; Retrieved Sentences; Query Word Weighting Scheme; Sentence Weighting Scores;
Answer Selection Algorithm; Produce Final Answers.]</p>
      <sec id="sec-3-17">
        <title>PART A</title>
        <p>We now explain the details of Fig. 1. Because the QA4MRE test data are provided in
XML format, some format cleaning is required. We therefore first split the data into
three parts: (1) documents, (2) questions and (3) answers.</p>
        <sec id="sec-3-17-1">
          <title>Preprocessing</title>
          <p>After splitting the QA4MRE test data into three parts, we apply the following
preprocessing steps so as to avoid implicit query handling during searching.</p>
          <p>Stopword Removal. Stopwords are removed from each question and answer
option using a stopword list [2].</p>
          <p>Punctuation Removal. Punctuation characters are removed from the questions and
answers. For example, “http://wt.jrc.it/” and “doug@nutch.org” are rephrased as “http
wt jrc it” and “doug nutch org”, respectively.</p>
          <p>Stemming. The standard Porter stemming algorithm [3] is used to stem the words in
documents, questions and answers.</p>
          <p>The remaining words in the question and answer are identified respectively as the
query words and answer words.</p>
          <p>We also expand some keywords in the questions. For example, when the word
“experiment” occurs in a question, we add the related word “show” to the question,
because we consider “experiment” and “show” to be highly related to each other.</p>
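          <p>As an illustration only, the three preprocessing steps can be sketched in Python as
follows. The tiny stopword list, the lower-casing and the function names are assumptions
made for this sketch, and the stemmer is left as a pluggable placeholder where an
implementation of Porter's algorithm [3] would be supplied.</p>
          <preformat>
```python
import re

# Tiny illustrative stopword list (assumption); the paper uses the list cited as [2].
STOPWORDS = {"a", "an", "and", "are", "as", "is", "of", "the", "to", "what"}


def identity_stem(word):
    """Placeholder stemmer (assumption); swap in a Porter stemmer in practice."""
    return word


def preprocess(text, stem=identity_stem, stopwords=STOPWORDS):
    """Return the words left after stopword removal, punctuation removal, stemming."""
    # Stopword removal on lower-cased, whitespace-separated tokens.
    kept = [t for t in text.lower().split() if t not in stopwords]
    # Punctuation removal: each run of non-alphanumeric characters becomes a space.
    cleaned = re.sub(r"[^0-9a-z]+", " ", " ".join(kept))
    # Stemming of the remaining words.
    return [stem(t) for t in cleaned.split()]


print(preprocess("http://wt.jrc.it/"))   # ['http', 'wt', 'jrc', 'it']
print(preprocess("doug@nutch.org"))      # ['doug', 'nutch', 'org']
```
</preformat>
          <p>With this ordering, a stopword that carries attached punctuation is not filtered out;
a fuller implementation would need to decide how to tokenize before stopword removal.</p>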
        </sec>
        <sec id="sec-3-17-2">
          <title>Retrieving Query Word Related Sentences</title>
          <p>After extracting the query words, we use them to retrieve sentences from the
documents. If a query word exactly matches a word in a sentence, we regard that
sentence as relevant and retrieve it.</p>
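          <p>A minimal Python sketch of this exact-match retrieval is given below. It assumes
that each sentence has already been preprocessed into a list of stemmed words; the
function name and the sample data are hypothetical.</p>
          <preformat>
```python
def retrieve_sentences(sentences, query_words):
    """Return the indices of sentences sharing at least one exact word with the query.

    `sentences` is a list of token lists (assumed to be stemmed already).
    """
    query = set(query_words)
    return [j for j, tokens in enumerate(sentences) if query.intersection(tokens)]


# Hypothetical stemmed sentences, for illustration only.
sentences = [
    ["amyloid", "beta", "accumul", "brain"],
    ["mice", "were", "hous", "standard", "cage"],
    ["app", "mutat", "increas", "amyloid"],
]
print(retrieve_sentences(sentences, ["amyloid", "mutat"]))  # [0, 2]
```
</preformat>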
        </sec>
        <sec id="sec-3-17-3">
          <title>Query Word Weighting Scheme</title>
          <p>Each query word is assigned a weight that reflects its importance for the sentences.
Depending on the run, we use either TF or TF-IDF as the weight of the query words.
In Run 1 and Run 4, we use TF to weight the query words. The remaining runs use
TF-IDF to weight the query words.</p>
          <p>TF Weighting. The formula of TF weighting is given in Equation (1):</p>
          <disp-formula><label>(1)</label><tex-math><![CDATA[\mathit{TF}_{Q_i} = 1 + \frac{f_{Q_i}}{\max_{Q_i} f_{Q_i}}]]></tex-math></disp-formula>
          <p>where TF<sub>Qi</sub> is the term frequency of query word Q<sub>i</sub> and
f<sub>Qi</sub> is the number of times Q<sub>i</sub> appears in the stemmed document. We
assume that the weight of each query word has a baseline of 1. If a query word does not
occur in the document, the formula gives TF<sub>Qi</sub> a value of 1.</p>
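          <p>The TF weighting can be sketched in Python as below. This is a sketch under stated
assumptions rather than the submitted implementation: the denominator is taken to be the
largest count among the query words in the document, and the function name and sample
data are hypothetical.</p>
          <preformat>
```python
from collections import Counter


def tf_weights(query_words, document_tokens):
    """TF weight per query word: 1 + f / max f, with a baseline of 1."""
    counts = Counter(document_tokens)
    freq = {q: counts[q] for q in set(query_words)}
    # Assumption: the denominator is the largest count among the query words.
    max_freq = max(freq.values(), default=0)
    if max_freq == 0:
        # No query word occurs in the document: every weight is the baseline.
        return {q: 1.0 for q in freq}
    return {q: 1.0 + f / max_freq for q, f in freq.items()}


# Hypothetical stemmed document, for illustration only.
doc = ["amyloid", "plaqu", "amyloid", "tau", "amyloid", "tau"]
weights = tf_weights(["amyloid", "tau", "presenilin"], doc)
print(weights["amyloid"])     # 2.0
print(weights["presenilin"])  # 1.0
```
</preformat>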
          <p>TF-IDF Weighting. The formula of the Inverse Document Frequency (IDF) is given in
Equation (2):</p>
          <disp-formula><label>(2)</label><tex-math><![CDATA[
\mathit{IDF}_{Q_i} =
\begin{cases}
\log_2 \dfrac{N}{n_{Q_i}} & \text{if } n_{Q_i} \neq 0 \\
0.1 & \text{if } n_{Q_i} = 0 \text{ and } f_{Q_i} \neq 0 \\
0 & \text{otherwise}
\end{cases}
]]></tex-math></disp-formula>
          <p>where IDF<sub>Qi</sub> is the inverse document frequency of query word
Q<sub>i</sub>, N is the total number of documents in the corpus (i.e., the QA4MRE
background collections), n<sub>Qi</sub> is the number of documents in the corpus in which
Q<sub>i</sub> appears, and f<sub>Qi</sub> is the number of times Q<sub>i</sub> appears in
the stemmed document. A query word that does not occur in the corpus should not be
ignored when computing TF-IDF. Therefore, when n<sub>Qi</sub> = 0 and Q<sub>i</sub>
occurs in the document, we set the inverse document frequency to 0.1 as a smoothing
value.</p>
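          <p>The piecewise IDF definition, including the smoothing value of 0.1, can be sketched
in Python as follows. The function and parameter names are hypothetical, and the document
counts in the example are invented for illustration.</p>
          <preformat>
```python
import math


def idf(n_docs_total, n_docs_with_word, freq_in_document):
    """Piecewise inverse document frequency with the 0.1 smoothing case."""
    if n_docs_with_word != 0:
        return math.log2(n_docs_total / n_docs_with_word)
    if freq_in_document != 0:
        # Word absent from the corpus but present in the test document.
        return 0.1
    return 0.0


print(idf(8, 2, 5))  # 2.0
print(idf(8, 0, 3))  # 0.1
print(idf(8, 0, 0))  # 0.0
```
</preformat>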
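          <p>The TF-IDF formula is given in Equation (3):</p>
          <disp-formula><label>(3)</label><tex-math><![CDATA[\mathit{TF}\text{-}\mathit{IDF}_{Q_i} = \mathit{TF}_{Q_i} \times \mathit{IDF}_{Q_i}]]></tex-math></disp-formula>
        </sec>
        <sec id="sec-3-17-4">
          <title>Sentence Weighting Scores</title>
          <p>If a query word matches a word in a relevant sentence found in the sentence
retrieval step, the sentence receives the weight of that word. The sentence weighting
score is calculated as in Equation (4) and Equation (5):</p>
          <disp-formula><label>(4)</label><tex-math><![CDATA[\mathit{SW\_TF}_j = \sum_{Q_i \in S_j} \mathit{TF}_{Q_i}]]></tex-math></disp-formula>
          <disp-formula><label>(5)</label><tex-math><![CDATA[\mathit{SW\_TFIDF}_j = \sum_{Q_i \in S_j} \left( \mathit{TF}_{Q_i} \times \mathit{IDF}_{Q_i} \right)]]></tex-math></disp-formula>
          <p>where SW_TF<sub>j</sub> is the sum of TF over all query words appearing in the
sentence S<sub>j</sub>, and SW_TFIDF<sub>j</sub> is the sum of TF-IDF over all query words
appearing in the sentence S<sub>j</sub>.</p>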
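          <p>The sentence weighting step can be sketched in Python as follows. The sketch
assumes that the per-word weights (TF or TF-IDF, depending on the run) have already been
computed and that each distinct query word counts once per sentence; the names and
numbers are hypothetical.</p>
          <preformat>
```python
def sentence_scores(sentences, word_weights):
    """Sum the weights of the distinct query words found in each sentence."""
    scores = []
    for tokens in sentences:
        # Query words that occur in this sentence (each counted once).
        present = set(tokens).intersection(word_weights)
        scores.append(sum((word_weights[w] for w in present), 0.0))
    return scores


# Hypothetical weights and stemmed sentences, for illustration only.
word_weights = {"amyloid": 2.0, "tau": 1.5}
sentences = [["amyloid", "tau", "amyloid"], ["tau"], ["mice"]]
print(sentence_scores(sentences, word_weights))  # [3.5, 1.5, 0.0]
```
</preformat>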
        </sec>
        <sec id="sec-3-17-5">
          <title>Answer Selection Algorithm</title>
          <p>In this phase, we compute each answer’s score from the sentence weighting scores.
If an answer word matches a word in the sentence S<sub>j</sub>, the answer word takes its
weighting value from that sentence. Each answer’s score is the sum of these values. We
choose the answer with the highest score as the final answer. If several answers share
the highest score, different runs select different answers among them.</p>
          <p>In this study, we use the OMIM Alzheimer-related terms as an extra knowledge base
in Run 4 and Run 5. As shown in Fig. 2, the OMIM terms are first preprocessed through
stopword removal, punctuation removal and stemming. We call the resulting terms expanded
query words. The expanded query words are combined with the query words to compute the
new weighting scores. The answer selection algorithm is the same as the approach in
Section 3.5.</p>
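          <p>The answer selection algorithm can be sketched in Python as below. This is one
possible reading of the description, not the submitted implementation: each answer word
found in a sentence is assumed to contribute that sentence's weighting score, and the
tie-break flag (first or last tied option) is a hypothetical stand-in for the
run-specific choice among tied answers.</p>
          <preformat>
```python
def select_answer(answers, sentences, scores, pick_last_on_tie=False):
    """Return the index of the best-scoring answer option.

    answers   : list of token lists, one per answer option
    sentences : list of token lists (retrieved sentences)
    scores    : sentence weighting scores, aligned with `sentences`
    """
    totals = []
    for answer_words in answers:
        total = 0.0
        for tokens, score in zip(sentences, scores):
            # Assumed reading: each answer word found in a sentence
            # contributes that sentence's weighting score.
            matches = set(answer_words).intersection(tokens)
            total += score * len(matches)
        totals.append(total)
    best = max(totals)
    tied = [i for i, t in enumerate(totals) if t == best]
    # Hypothetical tie-break rule: first or last tied option.
    return tied[-1] if pick_last_on_tie else tied[0]


# Hypothetical data, for illustration only.
sentences = [["amyloid", "beta", "plaqu"], ["tau", "tangl"]]
scores = [3.5, 1.5]
answers = [["tau"], ["amyloid", "beta"], ["insulin"], ["plaqu", "amyloid"], ["tangl"]]
print(select_answer(answers, sentences, scores))                         # 1
print(select_answer(answers, sentences, scores, pick_last_on_tie=True))  # 3
```
</preformat>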
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results and Discussion</title>
      <p>We submitted a total of five runs. Run 1 uses the TF of the query words to weight the
sentences. Run 2 and Run 3 use the TF-IDF of the query words to weight the sentences;
they differ only in which answer is selected when multiple answers receive the same
score from our system. Run 4 uses the TF weighting scheme with the OMIM terms about
Alzheimer’s disease for query expansion. Run 5 uses the TF-IDF weighting scheme with the
OMIM terms about Alzheimer’s disease for query expansion. In summary, the weighting
methods for each run are listed as follows.</p>
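      <list list-type="simple">
        <list-item><p>TF: Run 1</p></list-item>
        <list-item><p>TF-IDF: Run 2, Run 3</p></list-item>
        <list-item><p>OMIM+TF: Run 4</p></list-item>
        <list-item><p>OMIM+TF-IDF: Run 5</p></list-item>
      </list>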
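      <p>The main measure used in this evaluation campaign is called c@1, which is
defined in Equation (6):</p>
      <disp-formula><label>(6)</label><tex-math><![CDATA[\mathrm{c@1} = \frac{1}{n} \left( n_R + n_U \frac{n_R}{n} \right)]]></tex-math></disp-formula>
      <p>where n<sub>R</sub> is the number of correctly answered questions, n<sub>U</sub> is
the number of unanswered questions, and n is the total number of questions.</p>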
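      <p>As a worked illustration, the c@1 measure can be computed with the short Python
sketch below. The function name and the counts in the example are hypothetical and do not
correspond to any submitted run.</p>
      <preformat>
```python
def c_at_1(n_correct, n_unanswered, n_total):
    """c@1: correct answers plus unanswered questions credited at the accuracy rate."""
    return (n_correct + n_unanswered * n_correct / n_total) / n_total


# Hypothetical counts: 40 questions, 10 answered correctly, 4 left unanswered.
print(c_at_1(10, 4, 40))  # 0.275
```
</preformat>
      <p>When no question is left unanswered, the measure reduces to plain accuracy.</p>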
      <p>Table 1 presents the evaluation results at the question-answering level. In Table 1,
Column “Run ID” identifies the five runs we submitted. Column “C1” is the number of
questions our system answered. Column “C2” is the number of questions our system left
unanswered. Column “C3” is the number of questions answered with the right candidate
answer. Column “C4” is the number of questions answered with a wrong candidate answer.
Column “C5” is the number of questions unanswered with the right candidate answer.
Column “C6” is the number of questions unanswered with a wrong candidate answer.
Column “C7” is the number of questions unanswered with an empty candidate. Column “c@1”
is the value calculated with Equation (6).</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this study, we use TF, TF-IDF and OMIM terms together with the background
collections to support machine reading comprehension. We observe that the OMIM terms are
good features for answering questions in this task, and the best c@1 measure is 0.20.
The results also leave room for improvement.</p>
      <p>Our future work will focus on the query expansion part. Extracting words related
to the questions from the corpus may improve the performance of the system. We also plan
to consider anaphora resolution and some semantic inference in the future.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name><surname>Hamosh</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Scott</surname>, <given-names>A.F.</given-names></string-name>,
          <string-name><surname>Amberger</surname>, <given-names>J.S.</given-names></string-name>,
          <string-name><surname>Bocchini</surname>, <given-names>C.A.</given-names></string-name>,
          <string-name><surname>McKusick</surname>, <given-names>V.A.</given-names></string-name>:
          <article-title>Online Mendelian Inheritance in Man (OMIM), a Knowledgebase of Human Genes and Genetic Disorders</article-title>.
          <source>Nucleic Acids Res</source>. vol.
          <volume>30</volume>(<issue>1</issue>), pp.
          <fpage>52</fpage>-<lpage>55</lpage>
          (<year>2002</year>)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>Stopword list</article-title>, http://www.lextek.com/manuals/onix/stopwords1.html
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name><surname>Porter</surname>, <given-names>M.F.</given-names></string-name>:
          <article-title>An Algorithm for Suffix Stripping</article-title>. In:
          <string-name><surname>Jones</surname>, <given-names>K.S.</given-names></string-name>,
          <string-name><surname>Willett</surname>, <given-names>P.</given-names></string-name> (eds.)
          <source>Readings in Information Retrieval</source>. Morgan Kaufmann, San Francisco,
          <fpage>313</fpage>-<lpage>316</lpage>
          (<year>1997</year>)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>