<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A generic retrieval system for biomedical literatures: USTB at BioASQ2015 Question Answering Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zhi-Juan Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tian-Tian Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bo-Wen Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yan Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chun-Hua Zhao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shao-Hui Feng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xu-Cheng Yin</string-name>
          <email>xuchengyin@ustb.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fang Zhou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Technology, University of Science and Technology Beijing (USTB)</institution>
          ,
          <addr-line>Beijing 100083</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper we describe our participation in the question answering task (Phase A) of the 2015 BioASQ challenge. Participants are required to respond to natural language questions with relevant documents, snippets, concepts and RDF triples. For document retrieval, we build a generic retrieval model based on the sequential dependence model, Word Embedding and a Ranking Model. In addition, in view of the special significance of titles (Title Significance Validation), we re-rank the top-K results by counting the meaningful nouns in the titles. The top-K documents are then split into sentences and indexed for snippet retrieval, where similar models are applied. To extract biomedical concepts and the corresponding RDF triples, we use the concept recognition tools MetaMap and Banner. Statistics indicate that our system outperforms the other participating systems.</p>
      </abstract>
      <kwd-group>
        <kwd>generic retrieval</kwd>
        <kwd>sequential dependence model</kwd>
        <kwd>Word Embedding</kwd>
        <kwd>Ranking</kwd>
        <kwd>MetaMap</kwd>
        <kwd>Banner</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        The BioASQ challenge consists of two tasks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]: a large-scale semantic indexing task (Task 3a) and a question answering
task (Task 3b). We only focus on Phase A of Task 3b, which includes four parts:
retrieving the gold relevant articles and the most relevant snippets from the
benchmark datasets, retrieving relevant concepts from designated terminologies
and ontologies, and retrieving RDF triples from designated ontologies. For this
task, participants are provided with about 100 questions in each batch and are
required to return at most 10 answers for each part. In all of the following
experiments, we utilize the training datasets 3b, which include 810 queries.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Footnote 1: http://ikmbio.csie.ncku.edu.tw/GN/</title>
      <sec id="sec-2-1">
        <title>Methodology</title>
        <p>In our system, we deploy Galago, an open-source search engine developed as an
improved Java version of Indri, over large clusters for indexing and retrieval.
We lease the 2015 MEDLINE/PubMed Journal Citations from the U.S. National
Library of Medicine, composed of about 22 million MEDLINE citations.</p>
        <sec id="sec-2-1-1">
          <title>Data Pre-Processing</title>
          <p>For document retrieval, the fields of title and abstract are extracted from the
document resources and indexed with Galago. On the basis of the experimental results
of document retrieval, the top-K documents are chosen from the candidates as
the retrieval source for the snippet retrieval part. The titles and abstracts of these
articles are separated into sentences according to some specific rules. These
sentences make up a pile of new files with the field name Text for indexing. For
the concept retrieval part, participants are required to return relevant concepts from
five ontologies or terminologies: MeSH, GO, SwissProt, Jochem and DO. We
download all of these resources and index the fields of term and ID.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Query Pre-Processing</title>
          <p>Except for the triples retrieval experiment, the original queries are processed with
the same approach. The stop words in the queries are removed, and the queries
are case-folded, stemmed with the Porter Stemmer and tagged with Part-Of-Speech labels.
Finally, we filter out the special symbols.</p>
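<p>As a rough illustration, the query pre-processing pipeline can be sketched as follows. This is a minimal pure-Python stand-in: the tiny stopword list and crude suffix-stripping stemmer below are hypothetical placeholders for the full stopword list, the Porter Stemmer and the POS tagger used in our system.</p>

```python
import re

# Hypothetical, deliberately tiny stopword list (stand-in for the real one).
STOPWORDS = {"the", "is", "of", "in", "which", "what", "a", "an", "are", "and", "with"}

def simple_stem(token):
    # Crude suffix stripping; a stand-in for the Porter Stemmer.
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess_query(query):
    # Case-fold, drop special symbols, remove stop words, then stem.
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return [simple_stem(t) for t in tokens if t not in STOPWORDS and len(t) > 1]

print(preprocess_query("Which genes are associated with Alzheimer's disease?"))
```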
          <p>
            MetaMap [
            <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
            ], a semantic tool for medical text processing, maps
concepts in text to the UMLS Metathesaurus. Biomedical terminologies and ontologies
are identified in the queries by MetaMap and composed into new queries to retrieve
concepts. Linked Life Data is an aggregation of more than 25 popular biomedical
data sources; users are able to access 10 billion RDF statements through a single
SPARQL endpoint.
          </p>
          <p>In the following sections, the procedures of the retrieval models, the sequential
dependence model (SDM), Word Embedding (Word2Vec), the Ranking Model (RM) and
Title Significance Validation (TSV), are introduced in detail.</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>Searching</title>
        </sec>
        <sec id="sec-2-1-4">
          <title>Sequential Dependence Model</title>
          <p>Our baseline for document retrieval is the unigram language model, referred to as the
query likelihood model (QL). In this model, the likelihood of a query term qi
occurring is assumed to be unaffected by the occurrence of any other query
terms. But in a natural language query the terms depend on each other, so our
retrieval models should take the sequence of terms into account.</p>
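<p>The QL baseline scores a document by the smoothed probability of generating each query term. A minimal sketch with Dirichlet smoothing follows; the collection statistics are toy numbers, not MEDLINE's, and mu plays the role of the collection-smoothing scalar discussed in the experiments section.</p>

```python
import math

def ql_score(query_terms, doc_tf, doc_len, coll_tf, coll_len, mu=2000):
    """Query-likelihood score with Dirichlet smoothing; mu is the
    collection-smoothing hyper-parameter."""
    score = 0.0
    for q in query_terms:
        p_coll = coll_tf.get(q, 0) / coll_len          # collection language model
        p = (doc_tf.get(q, 0) + mu * p_coll) / (doc_len + mu)
        score += math.log(p) if p > 0 else float("-inf")
    return score

# Toy collection statistics (hypothetical numbers, not MEDLINE's).
coll_tf = {"gene": 500, "alzheimer": 80, "disease": 900}
coll_len = 1_000_000
doc_a = {"gene": 3, "alzheimer": 2}   # matches both query terms
doc_b = {"disease": 1}                # matches neither
print(ql_score(["gene", "alzheimer"], doc_a, 120, coll_tf, coll_len))
print(ql_score(["gene", "alzheimer"], doc_b, 120, coll_tf, coll_len))
```

A document containing the query terms receives a higher (less negative) log-likelihood than one that relies purely on collection smoothing.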
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Footnote 2: http://www.galagosearch.org/</title>
      <p>
        Metzler and Croft's Markov Random Field (MRF) model [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], also called an
undirected graphical model, is commonly used in the statistical machine
learning domain to succinctly model joint distributions. The sequential dependence
model (SDM) is a special case of the MRF; it assumes that the occurrences of
adjacent query terms are related.
      </p>
      <p>Three types of features are considered in SDM: single-term features (standard
unigram language model features, fT), exact ordered phrase features (words
appearing in sequence, fO) and unordered window features (words required to be
close together, but not necessarily in exact sequence order, fU).</p>
      <p>For the query Q after pre-processing, Q = q1, q2, ..., qi, .... Document D is
ranked according to the following equation (1):
scoreSDM(Q, D) = λT Σ_{qi∈Q} fT(qi, D) + λO Σ_{qi∈Q} fO(qi, qi+1, D) + λU Σ_{qi∈Q} fU(qi, qi+1, D)    (1)</p>
      <p>One of the most critical language issues for retrieval performance is the
term-mismatch problem. The 810 queries of the training datasets 3b contain 4609 terms
after pre-processing, about 5.7 terms on average per query. The queries are short,
and natural language is inherently ambiguous, so the queries may not use the same
terms as the retrieval sources. Query expansion is usually utilized to select terms
genuinely relevant to the original queries. However, the main challenge of query
expansion is to find the expansion terms, especially in specific areas such as
biomedicine.</p>
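<p>The three SDM feature types can be sketched as below. This is a simplified illustration that combines raw feature counts linearly with the weights lambda_T, lambda_O and lambda_U; the full model uses smoothed log-probability potentials, and the window size of 8 is an assumption.</p>

```python
def count_ordered(bigram, doc):
    # Exact ordered phrase occurrences of (w1, w2) in the token list.
    return sum(1 for i in range(len(doc) - 1) if (doc[i], doc[i + 1]) == bigram)

def count_unordered(bigram, doc, window=8):
    # Both words co-occur within a window, in any order.
    hits = 0
    for i in range(len(doc)):
        if doc[i] in bigram:
            other = bigram[1] if doc[i] == bigram[0] else bigram[0]
            if other in doc[i + 1 : i + window]:
                hits += 1
    return hits

def sdm_score(query, doc, lam=(0.85, 0.10, 0.05)):
    # lam holds (lambda_T, lambda_O, lambda_U), the three feature weights.
    lt, lo, lu = lam
    f_t = sum(doc.count(q) for q in query)
    bigrams = list(zip(query, query[1:]))
    f_o = sum(count_ordered(b, doc) for b in bigrams)
    f_u = sum(count_unordered(b, doc) for b in bigrams)
    return lt * f_t + lo * f_o + lu * f_u

doc = "insulin receptor signalling regulates insulin secretion".split()
print(sdm_score(["insulin", "secretion"], doc))
```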
      <p>The word vectors offered by the BioASQ officials can be used to
estimate the relatedness of two words. With the similarity of each pair of words,
query expansion can easily be applied. The vectors of 1,701,632
distinct words (types) were trained with the Word2Vec tool, which processes a large
corpus and maps the words in the corpus to vectors in a continuous space. We
use these word vectors on top of SDM: the feature fT is replaced with fW, which
represents the expansion-term feature.</p>
      <p>For a query Q = q1, q2, ..., qi, ..., we calculate the distance between the
term qi and every distinct term in the dictionary by cosine similarity. All the
terms are then sorted by their distance to qi, and the nearest k terms are</p>
    </sec>
    <sec id="sec-4">
      <title>Footnote 3: https://code.google.com/p/word2vec/</title>
      <p>chosen to enrich the original query. The original term qi, together with the
additional terms qi1, qi2, ..., qik, is used as a set of expansion terms with
corresponding weights wi (i = 1, 2, ...). A new query can be reformulated as
Qnew = (t1, t2, ..., ti, ...), where ti ∈ Ti = {qi, qi1, qi2, ..., qik}.</p>
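<p>Term expansion by cosine similarity over word vectors can be sketched as follows; the 3-dimensional toy vectors stand in for the BioASQ word2vec vectors of the 1,701,632 distinct words.</p>

```python
import math

def cosine(u, v):
    # Cosine similarity of two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def expand_term(term, vectors, k=2):
    # Return the k nearest terms to `term` by cosine similarity.
    sims = [(cosine(vectors[term], vec), w)
            for w, vec in vectors.items() if w != term]
    sims.sort(reverse=True)
    return [w for _, w in sims[:k]]

# Toy 3-d vectors (hypothetical; real word2vec vectors are much larger).
vectors = {
    "tumor":  [0.9, 0.1, 0.0],
    "tumour": [0.88, 0.12, 0.01],
    "cancer": [0.8, 0.3, 0.1],
    "kinase": [0.0, 0.2, 0.9],
}
print(expand_term("tumor", vectors, k=2))
```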
      <p>Documents are ranked by the enriched SDM query according to the
following scoring equation (2):
scoreWord2Vec(Q, D) = λW Σ_{Ti∈T} fW(Ti, D)    (2)</p>
      <p>
        In order to further improve the retrieval performance, we propose a Reranking Model
(RM) [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ] based on the Word2Vec results. For each query, the subset D of results
composed of the top-K documents is represented by vectors according to
TF-IDF. The similarity of each pair of documents is then calculated as the cosine
similarity of the corresponding vectors. These similarities make up the K*K
matrix M, where M[i][j] represents the similarity of Di and Dj. Via these
similarities, we update the score of the documents for each query by equation (3),
where scorei is the initial score of Di.
      </p>
      <p>The updated score of the document Di for the query Q is calculated by the
following equation, where λ balances the initial score against the similarity evidence:
scoreRM(Q, Di) = λ scorei + (1 - λ) Σ_{j≠i} M[i][j] scorej    (3)</p>
      <p>Given a specific request and several relevant literatures, people usually judge
the titles directly rather than carefully reading the full abstract. In order to
investigate the special significance of titles, we design an experiment to validate
it. We pick the top-K documents retrieved by the Word2Vec model and look up the
corresponding titles, then compare these titles with the processed query. Different
from other types of words, nouns are a meaningful linguistic unit and have a real
influence in natural language.</p>
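<p>A sketch of the reranking step follows. The TF-IDF vectors and cosine matrix M are built as described above; the score update uses a linear interpolation with weight lambda, which is one plausible reading of equation (3) (lambda = 1 recovers the initial Word2Vec ranking, matching the comparison reported later).</p>

```python
import math

def tfidf_cosine_matrix(docs):
    # docs: list of token lists. Build TF-IDF vectors, then pairwise cosine.
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    vecs = [[d.count(t) * math.log(n / df[t]) for t in vocab] for d in docs]

    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    return [[cos(u, v) for v in vecs] for u in vecs]

def rerank(scores, M, lam=0.8):
    # lam = 1 reduces to the initial (Word2Vec) ranking.
    return [lam * scores[i] + (1 - lam) *
            sum(M[i][j] * scores[j] for j in range(len(scores)) if j != i)
            for i in range(len(scores))]

docs = [["gene", "disease"], ["gene", "disease", "risk"], ["weather", "report"]]
M = tfidf_cosine_matrix(docs)
print(rerank([1.0, 0.9, 0.8], M, lam=0.8))
```

A document similar to other highly-scored documents gains score; an outlier like the third document gains nothing from its neighbours.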
      <p>Hence, when processing the queries we filter out all types of words other than
the nouns labeled by the Stanford POS tagger. The frequency with which the nouns
occur in the titles is counted as the title-hit, and we combine the title-hit and
the initial score by linear combination. We compare (stemmed query, stemmed titles)
and (non-stemmed query, non-stemmed titles) respectively to see whether the
title-hit can influence the performance.</p>
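<p>The title-hit computation and its linear combination with the initial score can be sketched as follows; the mixing weight alpha is a hypothetical value standing in for the weight tuned on the training set.</p>

```python
def title_hit(query_nouns, title_tokens):
    # Count how often the query nouns occur in the title.
    return sum(title_tokens.count(n) for n in query_nouns)

def tsv_score(initial_score, query_nouns, title_tokens, alpha=0.9):
    # Linear combination of the retrieval score and the title-hit count;
    # alpha is a hypothetical mixing weight, tuned on the training set.
    return alpha * initial_score + (1 - alpha) * title_hit(query_nouns, title_tokens)

nouns = ["insulin", "secretion"]
title = "regulation of insulin secretion by insulin receptors".split()
print(tsv_score(1.0, nouns, title))
```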
      <sec id="sec-4-1">
        <title>Experiments on Generic Retrieval Models</title>
        <p>
          We train and validate our methods on the training datasets 3b, which contain
810 queries, over the 22 million MEDLINE documents. We utilize trec_eval
[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] to evaluate the top 100 ranked search lists, and mean average precision (MAP) [
          <xref ref-type="bibr" rid="ref10 ref9">9,
10</xref>
          ] serves as our evaluation metric. In previous years, participants were required
to return at most 100 relevant results, but in 2015 the participating systems are
required to return at most 10. So we select the best parameters on the training
datasets 3b, and these parameters are then applied to the testsets 3b; the results
on the testsets 3b are offered by the BioASQ officials. The smoothing scalar, a
hyper-parameter, controls the amount of collection smoothing applied; we set its
value in the range between 500 and 5000. The following tables show only part of
our document-retrieval experiments for setting up the parameters on the training
datasets 3b.
        </p>
        <p>In SDM, there are three weighting parameters (λT, λO, λU) to be trained.
We set each of the parameter values from 0.00 to 1.00 in steps of 0.01. On this
basis, wi in the Word2Vec model, the weights for the expansion terms, needs to
be set. In addition, another issue for query expansion is to confirm how many
expansion terms are suitable for retrieval. As a comparison, on all the training
data, the performance with 1 to 10 expansion terms is measured to find an
optimal parameter. The results are shown in Table 1.</p>
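<p>The weight sweep can be sketched as a simple grid search under the constraint that the three weights sum to 1; the evaluation function below is a toy stand-in for computing MAP over the training queries.</p>

```python
def grid_search(evaluate, step=0.1):
    """Sweep SDM weights (lambda_T, lambda_O, lambda_U) on a grid and keep
    the combination that maximizes the evaluation metric (e.g. MAP).
    `evaluate` is a caller-supplied function: weights -> metric value."""
    best, best_w = float("-inf"), None
    steps = int(round(1 / step))
    for i in range(steps + 1):
        for j in range(steps - i + 1):
            lt, lo = i * step, j * step
            lu = 1.0 - lt - lo          # weights constrained to sum to 1
            score = evaluate((lt, lo, lu))
            if score > best:
                best, best_w = score, (lt, lo, lu)
    return best_w, best

# Toy evaluation: a made-up concave function peaking near (0.8, 0.1, 0.1).
toy_map = lambda w: -(w[0] - 0.8) ** 2 - (w[1] - 0.1) ** 2
print(grid_search(toy_map))
```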
        <p>Overall, the models work well when the parameters are optimized on the training
datasets 3b. Word2Vec clearly shows better performance than QL and SDM; in
particular, its average result is higher than the other two. The Reranking Model
and Title Significance Validation are therefore evaluated on top of this model.</p>
        <p>Afterwards, the top-K documents determined by the initial ranking are
reranked by RM. The value of K is trained through groups of experiments, and the
initial scores and similarities are also taken into account. The value of λ is
changed from 0.000 to 1.000 in steps of 0.001. After many experiments, we get
stable parameter values; part of the comparison results is shown in Table 2.
RM performs well compared with Word2Vec, which corresponds to a λ value of 1.</p>
        <p>Results for the TSV model, with non-stemmed queries and stemmed queries, are
presented in Table 3.</p>
        <p>Table 3. MAP on the training datasets 3b: Word2Vec 0.2878; TSV non-stemmed 0.2932; TSV stemmed 0.2988.</p>
        <p>The experimental results show that the effectiveness is improved when title
significance validation is applied appropriately.</p>
        <p>We choose the parameters of the RM model with the best result on each of the
five batches, and then compare with the official results of the top 3 winning
participants in BioASQ 2014. Table 4 shows the results of our system and the
top 3 participants.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Footnote 4: http://participants-area.bioasq.org/oracle/results/taskB/phaseA/</title>
      <p>From Table 4, we find our results are better than those of the Top 1 participant
except on Batch 3, due to the random nature of that data. So our generic retrieval
system is more effective in biomedical retrieval.</p>
      <sec id="sec-5-1">
        <title>Conclusion</title>
        <p>Due to limited time, we only participated in Phase A of Task 3b, but our
approaches perform competitively, especially on document and snippet retrieval.
We adopt various retrieval models and adjust almost all possible parameters to
improve the final performance. Although the trained system performs stably on
the 2015 training set (810 queries), the MAP value on Batch 3 (testsets 3b) is
unusual. After a deeper analysis of the query set of Batch 3, we think the cause
may be the count of terms and biomedical nouns in each query.</p>
        <p>In the future, we will focus on strategies of query expansion for biomedical
text and on possibilities of improving document retrieval accuracy through the
feedback results of snippet retrieval. Besides, our research will add natural
language processing (NLP) into our system to improve the performance.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Georgios</given-names>
            <surname>Balikas</surname>
          </string-name>
          , Ioannis Partalas,
          <string-name>
            <given-names>Axel-Cyrille</given-names>
            <surname>Ngonga Ngomo</surname>
          </string-name>
          , Anastasia Krithara, Georgios Paliouras:
          <article-title>Results of the BioASQ Track of the Question Answering Lab at CLEF 2014</article-title>
          . CLEF (Working Notes)
          <year>2014</year>
          :
          <fpage>1181</fpage>
          -
          <lpage>1193</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Aronson</surname>
            <given-names>A R</given-names>
          </string-name>
          .
          <article-title>MetaMap: Mapping Text to the UMLS Metathesaurus</article-title>
          .
          <source>Bethesda</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Aronson</surname>
            <given-names>A R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lang</surname>
            <given-names>F M</given-names>
          </string-name>
          .
          <article-title>An overview of MetaMap: historical perspective and recent advances[J]</article-title>
          .
          <source>J Am Med Inform Assoc</source>
          ,
          <year>2010</year>
          ,
          <volume>17</volume>
          (
          <issue>3</issue>
          ):
          <fpage>229</fpage>-<lpage>236</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Metzler</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Croft</surname>
            <given-names>W B</given-names>
          </string-name>
          .
          <article-title>A Markov random field model for term dependencies</article-title>
          . In: Proceedings of SIGIR
          <year>2005</year>
          :
          <fpage>472</fpage>
          -
          <lpage>479</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Sungbin</given-names>
            <surname>Choi</surname>
          </string-name>
          , Jinwook Choi:
          <article-title>Classification and Retrieval of Biomedical Literatures: SNUMedinfo at CLEF QA track BioASQ 2014</article-title>
          . CLEF (Working Notes)
          <year>2014</year>
          :
          <fpage>1283</fpage>
          -
          <lpage>1295</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Bo-Wen</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Xu-Cheng Yin, Xiao-Ping Cui, Bin Geng, Jiao Qu, Fang Zhou, Li Song and Hong-Wei Hao.
          <article-title>Social Book Search Reranking with Generalized Content-Based Filtering</article-title>
          . Submitted to CIKM 2014.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Bo-Wen</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Xu-Cheng Yin, Xiao-Ping Cui, Jiao Qu, Bin Geng, Fang Zhou, Hong-Wei Hao: USTB at INEX 2014:
          <article-title>Social Book Search Track</article-title>
          .
          <source>CLEF (Working Notes)</source>
          <year>2014</year>
          :
          <fpage>536</fpage>
          -
          <lpage>54</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Buckley</surname>
            <given-names>C.</given-names>
          </string-name>
          <article-title>trec_eval IR evaluation package; 1999.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Manning</surname>
            <given-names>CD</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavan</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schutze</surname>
            <given-names>H</given-names>
          </string-name>
          .
          <article-title>Introduction to information retrieval</article-title>
          , vol.
          <volume>1</volume>
          . Cambridge University Press Cambridge;
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Buckley</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Voorhees</surname>
            <given-names>EM</given-names>
          </string-name>
          .
          <article-title>Evaluating evaluation measure stability</article-title>
          .
          <source>In: Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval</source>
          . Athens, Greece: ACM;
          <year>2000</year>
          . p.
          <fpage>33</fpage>-<lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>