<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>KISTI at CLEF eHealth 2017 Patient-Centered Information Retrieval Task-1: Improving Medical Document Retrieval with Query Expansion</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Heung-Seon Oh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuchul Jung</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Korea Institute of Science and Technology Information</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this report, we describe our retrieval framework for participating in CLEF eHealth 2017 Patient-Centered Information Retrieval Task-1: Ad-hoc Search. Our retrieval framework is a query expansion approach that adopts relevance and pseudo-relevance feedback to improve retrieval performance.</p>
      </abstract>
      <kwd-group>
        <kwd>language model</kwd>
        <kwd>feedback model</kwd>
        <kwd>query expansion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>This report summarizes our approaches to CLEF eHealth 2017 [2] Patient-Centered
Information Retrieval Task-1, a standard ad-hoc search task [7]. As in 2016, the
task uses a large web corpus (ClueWeb12 B13) and topics developed by mining
health web forums where users seek advice about specific symptoms, diagnoses,
conditions, or treatments.</p>
      <p>The main goal of the task is to improve the relevance assessment pool and the
reusability of the collection. To meet this year's evaluation requirements, we explicitly
exclude documents that were already assessed in 2016 from our search results.
Meanwhile, to enhance the relevance of the search results, we utilize the already-assessed
documents in our proposed approaches, following the suggested guidelines.</p>
      <p>Based on the above considerations, we designed a medical information retrieval
framework characterized by relevance feedback for the initial search and query
expansion for re-ranking.</p>
    </sec>
    <sec id="sec-2">
      <title>Method</title>
      <sec id="sec-2-1">
        <title>Retrieval framework</title>
        <p>Our proposed framework performs selective query expansion in the initial retrieval
and re-ranks the retrieval results with more accurate query expansion methods.
Figure 1 shows an overview of our retrieval framework. First, we employ relevance
feedback (RF) based on the relevance judgements built last year, since participants are
encouraged to improve both retrieval performance and the relevance assessment pool.
For a query Q, a feedback model θ_F is constructed and combined with the original
query model to produce a new query model θ_Q'. Second, an initial search is performed
using θ_Q' and produces a set of documents D_init from the collection C.
For the retrieved documents, we perform re-ranking with new queries built via two
different query expansion methods.</p>
        <p>As summarized above, our framework starts with relevance feedback to
improve retrieval performance and the relevance assessment pool. Let R be the set of
documents relevant to a query Q. A relevance model, i.e. RM1 [4], is constructed
from R, with documents scored by the KL-divergence method (KLD) [3, 6, 9]. There are
two differences from standard RM1 because the model is built from the relevance
judgements. First, all documents in R are used in the feedback model because they are
explicitly relevant. Second, the relevance grades are employed as document priors.
From these differences, the query model is expected to capture all relevant
information in R. Finally, a new query model θ_Q' is constructed via RM3 [1]. After that, the
initial search is performed with the KLD method on the entire collection C, yielding
the set of retrieved documents D_init that is the target for re-ranking.</p>
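        <p>As an illustration, the feedback-model construction and RM3 combination described above can be sketched as follows. This is a minimal sketch in plain Python: the function names, the unsmoothed maximum-likelihood document models, and the interpolation weight are our assumptions, not the paper's exact implementation.</p>

```python
from collections import Counter

def relevance_model(rel_docs, priors):
    """RM1-style feedback model from explicitly relevant documents:
    P(w|F) = sum_D P(w|D) * prior(D), with relevance grades as priors."""
    model = Counter()
    total_prior = float(sum(priors))
    for doc, prior in zip(rel_docs, priors):
        n = len(doc)
        for w, c in Counter(doc).items():
            model[w] += (c / n) * (prior / total_prior)
    return dict(model)

def rm3(query_model, feedback_model, lam=0.5):
    """RM3: interpolate the original query model with the feedback model."""
    words = set(query_model) | set(feedback_model)
    return {w: (1 - lam) * query_model.get(w, 0.0)
               + lam * feedback_model.get(w, 0.0) for w in words}
```

        <p>For example, with a query model over {cough, fever} and two judged-relevant documents with grades 2 and 1, the expanded model redistributes mass toward feedback words while keeping the original query words dominant.</p>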
        <p>Before re-ranking, two different query expansion techniques are considered, both
based on D_init. The first adopts random-walk-based centrality scores [5] with a
different transition matrix. This strategy estimates the query model by considering the
associations among the words in a query. The major difference is that the association
between two words w and u (w ≠ u) is computed from their two corresponding word
vectors rather than from co-occurrences. The word vectors are representations
obtained with GloVe [8], an unsupervised learning algorithm for obtaining vector
representations of words, so-called word embeddings. GloVe is known to outperform
word2vec models on word similarity and named entity recognition tasks. The word
vectors were computed on the TREC CDS 2016 collection [8], which contains about
1.2M biomedical journal articles, so we expect them to be more representative of the
medical domain than vectors trained on other domains. Centrality scores are then
computed by a random walk on the transition matrix and regarded as a query model.
Similar to RM3 above, a query model θ_Q'' is generated by combining θ_Q' and the
centrality scores. Finally, the documents in D_init are re-ranked according to θ_Q''
with the KLD method.</p>
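        <p>The random-walk centrality step can be sketched as follows. Toy word vectors stand in for the GloVe embeddings here, and the damping factor, iteration count, and normalization details are our assumptions rather than the paper's settings.</p>

```python
import numpy as np

def centrality_query_model(words, vectors, alpha=0.85, iters=100):
    """PageRank-style centrality over a word graph whose transitions
    come from word-vector similarities instead of co-occurrences."""
    V = np.array([vectors[w] for w in words], dtype=float)
    V /= np.linalg.norm(V, axis=1, keepdims=True)       # unit vectors
    sim = np.clip(V @ V.T, 0.0, None)                   # non-negative cosine similarities
    np.fill_diagonal(sim, 0.0)                          # no self-transition (w != u)
    T = sim / (sim.sum(axis=1, keepdims=True) + 1e-12)  # row-stochastic transition matrix
    n = len(words)
    p = np.full(n, 1.0 / n)
    for _ in range(iters):                              # power iteration with damping
        p = (1 - alpha) / n + alpha * (T.T @ p)
    p /= p.sum()
    return dict(zip(words, p))
```

        <p>The resulting distribution over query words is then interpolated with the query model, as with RM3 in the text: words whose vectors are similar to many other query words receive higher centrality.</p>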
        <p>The second query expansion approach follows the cluster-based external expansion
model (CBEEM) [6], an advanced way of using external collections in
pseudo-relevance feedback (PRF). The key idea of CBEEM is to estimate an accurate
feedback model using not only the original collection but also other benchmark
collections. Again, the TREC CDS 2016 collection is employed as the external collection.
As a result, re-ranking is performed over D_init with the new query model estimated
by CBEEM.</p>
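        <p>The external-collection mixing at the heart of CBEEM can be sketched roughly as follows. This is a deliberate simplification: we mix per-collection feedback models with fixed collection weights, whereas CBEEM additionally applies cluster-based weighting; the function name and weights are illustrative.</p>

```python
def mix_feedback_models(feedback_models, weights):
    """Combine feedback models estimated on the target and external
    collections into one expansion model (simplified CBEEM idea)."""
    total = float(sum(weights))
    mixed = {}
    for fm, w in zip(feedback_models, weights):
        for word, p in fm.items():
            mixed[word] = mixed.get(word, 0.0) + (w / total) * p
    return mixed
```

        <p>With a target-collection model and an external TREC-CDS-style model weighted 3:1, external-only terms enter the expansion model at reduced mass while shared terms are reinforced.</p>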
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <sec id="sec-3-1">
        <title>Data</title>
        <p>Two different collections are used as the target and external collections,
respectively. The target collection is ClueWeb12-Disk-B (ClueWeb12B), containing
about 52M web pages, while the external collection is TREC CDS 2016, containing
about 1.2M biomedical journal articles. In both collections, page text was extracted
by removing HTML and XML tags with the JSOUP parser (https://jsoup.org/).
Table 1 summarizes the data statistics (#Docs, vocabulary size, tokens, and average
document length) of ClueWeb12B and TREC CDS 2016. Words occurring fewer than
5 times or longer than 100 characters are replaced with &lt;UNK&gt;. Numbers are
normalized to &lt;NUx&gt;, where x is the length of the number. Finally, all words are
lowercased. This normalization reduces noisy words. Stop-words were removed at
query time, but not at indexing time, using the 419 INQUERY stop-words
(http://sourceforge.net/p/lemur/galago/ci/default/tree/core/src/main/resources/stopwords/inquery).</p>
      </sec>
      <sec id="sec-3-2">
        <title>Settings and Runs</title>
        <p>All mixture weights for combining the query and feedback models are set to 0.5.
The Dirichlet prior is set to 2500. In relevance feedback (RF), the number of
feedback words is set to 50, while the number of feedback documents equals the
number of relevant documents. In the two query expansion approaches, they are
fixed at 5 and 50, respectively. Word vectors are estimated using GloVe with the
ADAM optimizer, with a vector size of 200.</p>
        <p>We submitted three runs for this task. Run1, considered our baseline, is the
result of applying RF. Run2 and Run3 additionally employ centrality scores and
CBEEM, respectively. Table 2 summarizes the three runs.</p>
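        <p>The corpus normalization described above can be sketched as follows. The frequency threshold is applied against precomputed vocabulary counts; the exact number pattern and the function name are our assumptions.</p>

```python
import re

def normalize_tokens(tokens, vocab_counts, min_count=5, max_len=100):
    """Replace numbers with <NUx> (x = length of the number), rare or
    overlong words with <UNK>, and lowercase everything else."""
    out = []
    for tok in tokens:
        t = tok.lower()
        if re.fullmatch(r"\d+", t):
            out.append(f"<NU{len(t)}>")
        elif vocab_counts.get(t, 0) < min_count or len(t) > max_len:
            out.append("<UNK>")
        else:
            out.append(t)
    return out
```

        <p>For instance, a frequent word is kept (lowercased), a word seen fewer than 5 times becomes &lt;UNK&gt;, and "2017" becomes &lt;NU4&gt;.</p>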
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Abdul-Jaleel</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          et al.:
          <article-title>UMass at TREC 2004: Novelty and HARD</article-title>
          .
          <source>In: Proceedings of the Text REtrieval Conference (TREC)</source>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          et al.:
          <article-title>CLEF 2017 eHealth Evaluation Lab Overview</article-title>
          .
          <source>In: CLEF 2017 - 8th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS)</source>
          . Springer (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Kurland</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>PageRank without hyperlinks: Structural re-ranking using links induced by language models</article-title>
          .
          <source>In: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '05</source>
          . pp.
          <fpage>306</fpage>
          -
          <lpage>313</lpage>
          ACM Press, New York, New York, USA (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>Inf. Process. Manag.</source>
          <volume>46</volume>
          (
          <issue>4</issue>
          ),
          <fpage>448</fpage>
          -
          <lpage>469</lpage>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Oh</surname>
            ,
            <given-names>H.-S.</given-names>
          </string-name>
          et al.:
          <article-title>A Multiple-Stage Approach to Re-ranking Medical Documents</article-title>
          .
          <source>In: Proceedings of CLEF</source>
          . pp.
          <fpage>166</fpage>
          -
          <lpage>177</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Oh</surname>
            ,
            <given-names>H.-S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jung</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Cluster-based query expansion using external collections in medical information retrieval</article-title>
          .
          <source>J. Biomed. Inform.</source>
          <volume>58</volume>
          ,
          <fpage>70</fpage>
          -
          <lpage>79</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Palotti</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          et al.:
          <article-title>CLEF 2017 Task Overview: The IR Task at the eHealth Evaluation Lab</article-title>
          .
          <source>In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Roberts</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          et al.:
          <article-title>Overview of the TREC 2016 Clinical Decision Support Track</article-title>
          .
          <source>In: Proceedings of the Twenty-Fifth Text REtrieval Conference (TREC 2016)</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lafferty</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Model-based feedback in the language modeling approach to information retrieval</article-title>
          .
          <source>In: Proceedings of the Tenth International Conference on Information and Knowledge Management</source>
          . pp.
          <fpage>403</fpage>
          -
          <lpage>410</lpage>
          ACM, New York, New York, USA (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>