<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>HPI question answering system in the BioASQ 2015 challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mariana Neves</string-name>
          <email>mariana.neves@hpi.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Hasso-Plattner-Institute at the University of Potsdam</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <volume>1</volume>
      <issue>0</issue>
      <abstract>
        <p>I describe my participation in the 2015 edition of the BioASQ challenge, in which I submitted results for the concept matching, document retrieval, passage retrieval, exact answer and ideal answer subtasks. My approach relies on an in-memory database (IMDB) and its built-in text analysis features, as well as on PubMed for retrieving relevant citations, and on predefined ontologies and terminologies necessary for matching concepts to the questions. Although the results are far below the ones obtained by other groups, I present a novel approach for answer extraction based on sentiment analysis.</p>
      </abstract>
      <kwd-group>
        <kwd>question answering</kwd>
        <kwd>biomedicine</kwd>
        <kwd>passage retrieval</kwd>
        <kwd>document retrieval</kwd>
        <kwd>concept extraction</kwd>
        <kwd>in-memory database</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        1 http://bioasq.org/
I participated with a system developed on top of an in-memory database (IMDB)
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the SAP HANA database, similar to the approach that I used in
the 2014 edition of the BioASQ challenge [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. I participated in phases A and B
of task 3b of the 2015 edition of the BioASQ challenge and submitted
predictions for potentially relevant concepts, documents, passages and answers.
      </p>
      <p>
        Similar to previous QA systems [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], my system is composed of the following
components: (a) question processing, for constructing a query from the
question; (b) concept mapping, for performing concept recognition on the question;
(c) document and passage retrieval, for ranking and retrieving relevant PubMed
documents and passages; and (d) answer extraction, for building the short and long
answers (summaries). Figure 1 illustrates the architecture of the system, and I
describe the various steps in detail below, including a short overview of the
IMDB technology.
The SAP HANA database relies on IMDB technology [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for fast access to data
directly from main memory, in contrast to approaches that process data from
files residing on disk, which require loading the data into main memory. It
also includes lightweight compression, i.e., a data storage representation that
consumes less space than the original format, and built-in parallelization. The
SAP HANA database comes with built-in text analysis which includes
language detection, sentence splitting, tokenization, stemming, part-of-speech
tagging, named-entity recognition based on pre-compiled dictionaries, information
extraction based on manually crafted rules, document indexing, approximate
searching and sentiment analysis.
      </p>
      <sec id="sec-1-1">
        <title>Question processing</title>
        <p>
          In this step, the system processes the questions using Stanford CoreNLP
[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] for sentence splitting, tokenization, part-of-speech tagging and chunking. The
system constructs two queries for each question by selecting its most
meaningful tokens. The first query is built by removing all tokens that match a
stopword list2 and connecting the remaining tokens with the "OR" operator, for more
flexibility of the query. Both the document and passage retrieval steps, as well as
the answer extraction step, made use of this high-recall query for ranking documents
and passages.
        </p>
        <p>The second query aims at higher precision and lower recall; it filters tokens
further based on a list of the 5,000 most popular English words3 and uses
the "AND" operator for connecting words. Only the document retrieval step
used this high-precision query for ranking relevant documents from PubMed.
For instance, for the question "What disease is mirtazapine predominantly used
for?", "disease OR mirtazapine OR predominantly OR used" is the resulting
high-recall query and "mirtazapine AND predominantly" is the higher-precision
query.
2 http://www.textfixer.com/resources/common-english-words.txt
3 https://www.englishclub.com/vocabulary/common-words-5000.htm</p>
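        <p>A minimal sketch of the two-query construction, assuming the two word lists
(footnotes 2 and 3) have been saved to local files with hypothetical names:</p>
        <preformat>
# Build the high-recall (OR) and high-precision (AND) queries for a question.
def load_wordlist(path):
    with open(path) as f:
        return set(line.strip().lower() for line in f if line.strip())

STOPWORDS = load_wordlist("common-english-words.txt")  # footnote 2
TOP_5000 = load_wordlist("common-words-5000.txt")      # footnote 3

def build_queries(question):
    tokens = [t.strip("?.,").lower() for t in question.split()]
    recall_terms = [t for t in tokens if t and t not in STOPWORDS]
    precision_terms = [t for t in recall_terms if t not in TOP_5000]
    return " OR ".join(recall_terms), " AND ".join(precision_terms)

hr, hp = build_queries("What disease is mirtazapine predominantly used for?")
# hr: "disease OR mirtazapine OR predominantly OR used"
# hp: "mirtazapine AND predominantly"
        </preformat>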
      </sec>
      <sec id="sec-1-2">
        <title>Concept mapping</title>
        <p>
          The approach is the same that I used in the 2014 edition of the challenge [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]:
I made use of the built-in named-entity recognition feature of the IMDB for
mapping the questions to concepts from the five required terminologies and
ontologies, which needed to be previously converted to dictionaries in an
appropriate XML format. Given the dictionaries, the IMDB automatically
matched terms to the words of the question, as illustrated in Figure 2.
        </p>
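        <p>The HANA dictionary XML format itself is not reproduced here; as a
simplified illustration only, a longest-match lookup of question words against a
toy dictionary (hypothetical entries) could look as follows:</p>
        <preformat>
# Simplified stand-in for the IMDB's dictionary-based concept matching:
# the longest span of question words found in a dictionary wins.
DICTIONARY = {
    "mirtazapine": "concept-001",  # hypothetical concept identifiers
    "disease": "concept-002",
}

def match_concepts(question):
    words = [w.strip("?.,").lower() for w in question.split()]
    hits = []
    for i in range(len(words)):
        for j in range(len(words), i, -1):  # try longer spans first
            term = " ".join(words[i:j])
            if term in DICTIONARY:
                hits.append((term, DICTIONARY[term]))
                break
    return hits

# [('disease', 'concept-002'), ('mirtazapine', 'concept-001')]
print(match_concepts("What disease is mirtazapine predominantly used for?"))
        </preformat>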
      </sec>
      <sec id="sec-1-3">
        <title>Document and passage retrieval</title>
        <p>
          The approach for retrieving relevant PubMed documents for each question is
similar to the one described in my recently submitted paper [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. It consisted in
first posing the two generated queries to the PubMed web services, retrieving up to
200 top-ranked documents for each query, and fetching the title and abstract
for each PMID using the BioASQ web services. When querying PubMed, I
restricted publication dates up to '2013/03/14' and required citations to have an
abstract available. This approach differs from the one of my last year's
participation [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] in that I did not perform synonym expansion for the
terms in the query, given the poor results obtained when relying on BioPortal for
this purpose. Finally, titles and abstracts were inserted into a table in the IMDB.
        </p>
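        <p>The exact endpoints called are not listed in the paper; a sketch of the
document query step using the public NCBI esearch service (an assumption), with
the date restriction and the abstract filter, might look like this:</p>
        <preformat>
# Query PubMed for up to 200 PMIDs per query, keeping only citations with an
# abstract and a publication date up to 2013/03/14.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def search_pubmed(query, retmax=200):
    params = {
        "db": "pubmed",
        "term": query + " AND hasabstract",  # keep citations with abstracts
        "retmax": str(retmax),
        "datetype": "pdat",
        "mindate": "1800/01/01",  # esearch requires mindate and maxdate together
        "maxdate": "2013/03/14",
        "retmode": "json",
    }
    with urlopen(ESEARCH + "?" + urlencode(params)) as response:
        result = json.load(response)
    return result["esearchresult"]["idlist"]

pmids = search_pubmed("mirtazapine AND predominantly")
        </preformat>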
        <p>I retrieved passages using the built-in information retrieval features
available in the IMDB, which are based on approximate string similarity to match
terms from the query to the words in the documents. The system then ranks
the passages (sentences) based on the TF-IDF metric, and I retrieve the top 10
sentences and the corresponding documents as answers for the passage and
document retrieval sub-tasks, respectively.</p>
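        <p>The internals of the IMDB's ranking are not described in the paper; the
following stand-in sketch ranks snippet sentences against the high-recall query
with a plain TF-IDF model (scikit-learn assumed):</p>
        <preformat>
# Rank candidate sentences by TF-IDF cosine similarity to the query and
# return the top 10, mimicking the passage retrieval step.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_passages(sentences, query, k=10):
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(sentences + [query])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    ranked = scores.argsort()[::-1][:k]  # highest-scoring sentences first
    return [(sentences[i], float(scores[i])) for i in ranked]
        </preformat>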
      </sec>
      <sec id="sec-1-4">
        <title>Answer extraction</title>
        <p>I extracted both exact and ideal answers based on the gold-standard snippets
that the organizers made available for phase B of task 3b. The process consisted
in inserting the snippets into the IMDB, and I utilized its built-in text
analysis features for extracting the answers, as described in detail below for
each question type.</p>
        <p>Yes/No: The decision between the answers "yes" and "no" was based on the
sentiment analysis predictions provided by the IMDB. The assumption was that all
snippets are somehow related to the question and that the detection of sentiments
in these passages can be used to distinguish between the two possible answers.
Figure 3 shows the sentiments which were detected for a certain question.</p>
        <p>The IMDB returns 10 types of sentiments, namely
"StrongPositiveSentiment", "StrongPositiveEmoticon", "WeakPositiveSentiment",
"WeakPositiveEmoticon", "StrongNegativeSentiment", "StrongNegativeEmoticon",
"MajorProblem", "WeakNegativeSentiment", "WeakNegativeEmoticon" and "MinorProblem". I
merged some of these sentiment types into coarser categories according to
simple rules (Table 1). The sentiments were first grouped into four coarse categories,
i.e., "positiveStrong", "positiveWeak", "negativeStrong" and "negativeWeak", and
then into the three final sentiments "positive", "negative" and "neutral". For the
rules shown in Table 1, I consider that the "positiveStrong" sentiment is stronger
than the "negativeStrong" one, and therefore I assign the "positive" sentiment in
such cases. Similarly, I consider "positiveWeak" weaker than "negativeWeak" when
both are returned for the same question. Cases which matched none of the
rules for "positive" or "negative" sentiments are classified as "neutral". The final
decision between the answers "yes" and "no" was based on these three coarse
sentiments. By default, I return the answer "no", unless I get "positive" or
"neutral" as output from the above rules.</p>
        <p>Table 1. Rules for merging the sentiment types returned by the IMDB:</p>
        <preformat>
coarse sentiment  rule
positiveStrong    StrongPositiveSentiment OR StrongPositiveEmoticon
positiveWeak      WeakPositiveSentiment OR WeakPositiveEmoticon
negativeStrong    StrongNegativeSentiment OR StrongNegativeEmoticon OR MajorProblem
negativeWeak      WeakNegativeSentiment OR WeakNegativeEmoticon OR MinorProblem
positive          (positiveStrong OR positiveWeak) AND (NOT(negativeStrong) AND NOT(negativeWeak))
positive          positiveStrong AND negativeStrong
positive          positiveStrong AND negativeWeak
negative          (negativeStrong OR negativeWeak) AND (NOT(positiveStrong) AND NOT(positiveWeak))
negative          positiveWeak AND negativeStrong
negative          positiveWeak AND negativeWeak
        </preformat>
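        <p>The rules of Table 1 can be written down directly; in the sketch below,
the variable types stands for the set of sentiment type names that the IMDB
detected in the snippets of a question:</p>
        <preformat>
# Apply the Table 1 merging rules and derive the yes/no answer.
def has_any(types, names):
    # True if any of the given sentiment type names was detected.
    return not types.isdisjoint(names)

def coarse_sentiment(types):
    ps = has_any(types, {"StrongPositiveSentiment", "StrongPositiveEmoticon"})
    pw = has_any(types, {"WeakPositiveSentiment", "WeakPositiveEmoticon"})
    ns = has_any(types, {"StrongNegativeSentiment", "StrongNegativeEmoticon",
                         "MajorProblem"})
    nw = has_any(types, {"WeakNegativeSentiment", "WeakNegativeEmoticon",
                         "MinorProblem"})
    if (ps or pw) and not ns and not nw:
        return "positive"
    if ps and (ns or nw):  # positiveStrong outweighs both negative categories
        return "positive"
    if (ns or nw) and not ps and not pw:
        return "negative"
    if pw and (ns or nw):  # positiveWeak loses to both negative categories
        return "negative"
    return "neutral"

def answer_yes_no(types):
    # Default answer is "no"; "positive" or "neutral" flips it to "yes".
    if coarse_sentiment(types) in ("positive", "neutral"):
        return "yes"
    return "no"
        </preformat>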
        <p>Factoid and list: I extracted factoid and list answers also based on the
built-in predictions provided by the IMDB, more specifically, on the annotations
of noun phrases and topics, as presented in Figure 4. Given that no semantic
processing was performed on either the question or the snippets, in order to tag
named entities and identify the entity type of the expected answer, I chose the
five top answers based on the order returned by the IMDB.</p>
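        <p>A minimal sketch of this selection step, with the IMDB's noun phrase
annotations stood in for by a plain list:</p>
        <preformat>
# Keep the first five distinct noun phrases in the order the engine returned.
def top_answers(noun_phrases, k=5):
    seen = set()
    answers = []
    for phrase in noun_phrases:
        key = phrase.lower()
        if key not in seen:
            seen.add(key)
            answers.append(phrase)
        if len(answers) == k:
            break
    return answers
        </preformat>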
        <p>Summary: I also built summaries for the ideal answers based on the phrases
which contain sentiments, as shown in Table 6. The assumption was that such
phrases are more informative and relevant than the ones in which no sentiments
were found. My approach consisted in concatenating the sentences up to a limit
of 200 words, as specified in the challenge's guidelines.</p>
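        <p>A sketch of the summary construction, where has_sentiment() stands in for
the IMDB's sentiment annotations:</p>
        <preformat>
# Concatenate sentiment-bearing sentences until the 200-word limit is reached.
def build_summary(sentences, has_sentiment, limit=200):
    chosen = []
    used = 0
    for sentence in sentences:
        if not has_sentiment(sentence):
            continue
        n = len(sentence.split())
        if used + n > limit:  # adding this sentence would exceed the limit
            break
        chosen.append(sentence)
        used += n
    return " ".join(chosen)
        </preformat>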
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Results and discussion</title>
      <p>I submitted results for all five batches of test questions of task 3b: (a) phase A,
i.e., concept mapping and document and passage retrieval, and (b) phase B, i.e.,
exact and ideal answers. Unlike previous editions of the BioASQ
challenge, in which participants were allowed to submit up to 100 entries per question
for each of the required sub-tasks, whether documents, concepts or exact
answers, this year's edition limited concepts, documents and passages to up to 10 per
question and factoid answers to up to 5. I present below the results I obtained, as
published by the organizers on the BioASQ website4. I do not show results for
concept matching because the organizers do not seem to have made them available
yet.</p>
      <p>Table 3 shows my results for document retrieval for each of the five test
batches. As discussed in the methods section, I did not implement any specific
approach for this task, and documents were ranked based on the relevancy of
the query to the passages rather than to the documents (abstracts) themselves.
4 http://participants-area.bioasq.org/results/3b/phaseA/; http://participants-area.bioasq.org/results/3b/phaseB/</p>
      <p>[Example snippets retrieved for questions on mirtazapine; figure residue.]</p>
      <p>In this year's edition of the challenge, organizers required participants to
submit up to 10 documents, which is a hard assignment given the millions of
citations in PubMed. Indeed, results have been lower than the ones obtained by
participants last year, and it is unclear whether we (the teams) performed better
than the baseline systems, as the organizers have not yet published results for
these systems.</p>
      <p>Table 4 shows my results for passage retrieval for each of the five test batches.
Few groups participated in this task, in comparison to the number of submissions
for the document retrieval task. A task which is already very complex was
made even more difficult this year by the limitation of providing only up to 10
top passages.</p>
      <p>Finally, Tables 5 and 6 show the results I obtained for the exact and ideal
answers in phase B of task 3b.</p>
      <p>Acknowledgements: MN would like to acknowledge funding from the HPI
Research School.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Stanford CoreNLP, http://nlp.stanford.edu/software/corenlp.shtml</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Neves</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>HPI in-memory-based database system in task 2b of BioASQ</article-title>
          .
          <source>In: Working Notes for CLEF 2014 Conference, Sheffield, UK, September 15-18, 2014</source>
          , pp.
          <fpage>1337</fpage>
          -
          <lpage>1347</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Neves</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>In-memory database for passage retrieval in biomedical question answering</article-title>
          .
          <source>Journal of Biomedical Semantics (submitted)</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Neves</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leser</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>Question answering for biology</article-title>
          .
          <source>Methods</source>
          <volume>74</volume>
          (
          <issue>0</issue>
          ),
          <fpage>36</fpage>
          -
          <lpage>46</lpage>
          (
          <year>2015</year>
          ), http://www.sciencedirect.com/science/article/pii/S1046202314003491
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Plattner</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>A Course in In-Memory Data Management: The Inner Mechanics of In-Memory Databases</article-title>
          . Springer, 1st edn. (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Tsatsaronis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balikas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malakasiotis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Partalas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zschunke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alvers</surname>
            ,
            <given-names>M.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weissenborn</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krithara</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Petridis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polychronopoulos</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , et al.:
          <article-title>An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition</article-title>
          .
          <source>BMC Bioinformatics 16(1)</source>
          ,
          <volume>138</volume>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>