<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>KISTI at CLEF eHealth 2016 Task 3: Ranking Medical Documents using Word Vectors</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Heung-Seon Oh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuchul Jung</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Korea Institute of Science and Technology Information</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Searching for relevant medical information has become very common as the general public uses the Web as a source of health information. In response to this phenomenon, there have been a number of approaches to finding useful information for diagnosing or understanding health conditions on the Web or in the medical literature. As an ongoing effort to deliver useful medical information, we attempted two different approaches using word vectors learned by Word2Vec on Wikipedia. First, initial documents are obtained using a search engine. Based on the retrieved documents, pseudo-relevance feedback is applied with two different usages of the word vectors. In the first approach, a feedback model is constructed from new relevance scores computed with the word vectors, while in the second it is constructed from an expanded query.</p>
      </abstract>
      <kwd-group>
        <kwd>medical information retrieval</kwd>
        <kwd>language models</kwd>
        <kwd>pseudo relevance feedback</kwd>
        <kwd>word vectors</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Laypeople use the Web to acquire medical information such as symptoms, diagnoses,
treatments, diseases, and hospitals. Unfortunately, they may fail to find relevant
information due to the difficulty of representing their information needs. This happens because they
are often not only unfamiliar with medical terminology but also uncertain about their
exact questions. To mitigate this problem, CLEF eHealth [2, 4] aims to support
laypeople in finding and understanding medical documents on the Web by leveraging
medical text processing techniques.</p>
      <p>CLEF 2016 eHealth [3] continues this effort with the same purpose. We
participate in Task 3 (patient-centred information retrieval), which focuses on evaluating the
effectiveness of medical information retrieval on the Web [10]. This task utilizes a vast
Web document collection, ClueWeb12-B, while the previous tasks employed about
1M Web documents collected from several health-related web sites. In this paper, we
propose two different approaches that use word vectors obtained from Word2Vec to
perform pseudo-relevance feedback.</p>
    </sec>
    <sec id="sec-2">
      <title>Method</title>
      <sec id="sec-2-1">
        <title>Ranking framework</title>
        <p>Our method ranks medical documents using word vectors constructed from a
medical resource, specifically medical Wikipedia. The aim of using the word vectors is to
capture the information need of a query properly. For a query Q, a set of documents
S = {D_1, D_2, …, D_|S|} is retrieved from a collection C using a search engine. As the
retrieval model, the query-likelihood method with Dirichlet smoothing (QLD) is chosen [8].
Based on S, pseudo-relevance feedback (PRF) using the word vectors is performed to
re-rank the documents in S with a feedback model. In this step, the word vectors are
adopted in two different ways. In the first approach, they are used to compute
relevance scores score_WV(Q, D) between Q and D, while, in the second approach, they
are used to directly expand Q to Q_WV by adding words that do not appear in Q.
For each approach, final scores are computed by the KL-divergence method with a
feedback model constructed using score_WV(Q, D) or Q_WV.</p>
        </sec>
        <sec id="sec-2-2">
          <title>Basic Foundation</title>
          <p>The KL-divergence method (KLD) is adopted to compute a relevance score between Q and
D by estimating language models [5, 7, 9], because it offers a principled way to incorporate
feedback information into a query in PRF:</p>
          <p>score(Q, D) = exp(−KL(θ_Q ‖ θ_D)) = exp(−∑_w p(w|θ_Q) log [p(w|θ_Q) / p(w|θ_D)]) (1)</p>
          <p>where θ_Q and θ_D are the query and document unigram language models, respectively.</p>
          <p>A query model is estimated by maximum likelihood estimation (MLE), as shown below:</p>
          <p>p(w|θ_Q) = c(w, Q) / |Q| (2)</p>
          <p>where c(w, Q) is the count of a word w in query Q and |Q| is the number of words in Q.</p>
          <p>A document model is estimated using Dirichlet smoothing to improve retrieval
performance [8]:</p>
          <p>p(w|θ_D) = (c(w, D) + μ · p(w|C)) / (∑_w c(w, D) + μ) (3)</p>
          <p>where c(w, D) is the count of a word w in document D, p(w|C) is the probability of
a word w in collection C, and μ is the Dirichlet prior parameter.</p>
          <p>Pseudo-relevance feedback (PRF) is a popular query expansion approach to update
a query. It assumes that the top-ranked documents F = {D_1, D_2, …, D_|F|} are relevant to a
given query and that the words in F are useful for revealing hidden information needs. A
relevance model (RM) is a multinomial distribution p(w|Q), the likelihood of a
word w given a query Q based on F. The first version of the relevance model (RM1) is
defined as follows:</p>
          <p>p_RM1(w|Q) = ∑_{D∈F} p(w|θ_D) p(θ_D|Q) = ∑_{D∈F} p(w|θ_D) p(Q|θ_D) p(θ_D) / p(Q) ∝ ∑_{D∈F} p(w|θ_D) p(θ_D) p(Q|θ_D) (4)</p>
          <p>RM1 is composed of three components: the document prior p(θ_D), the document
weight p(Q|θ_D), and the term weight in a document p(w|θ_D). In general, p(θ_D) is
assumed to have a uniform distribution without prior knowledge of document D. p(Q|θ_D) =
∏_{q∈Q} p(q|θ_D)^c(q,Q) indicates the query-likelihood score.</p>
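          <p>A minimal Python sketch of the scoring in Equations 1–3; the toy term counts, the uniform background model, and the μ value are illustrative assumptions, not the paper's settings:</p>

```python
import math
from collections import Counter

def dirichlet_doc_model(doc_counts, coll_prob, mu=2000.0):
    """Equation 3: Dirichlet-smoothed document model p(w|theta_D)."""
    doc_len = sum(doc_counts.values())
    return lambda w: (doc_counts.get(w, 0) + mu * coll_prob(w)) / (doc_len + mu)

def kld_score(query, doc_counts, coll_prob, mu=2000.0):
    """Equation 1: score(Q, D) = exp(-KL(theta_Q || theta_D))."""
    q_counts = Counter(query)
    p_d = dirichlet_doc_model(doc_counts, coll_prob, mu)
    kl = 0.0
    for w, c in q_counts.items():
        p_q = c / len(query)                # Equation 2: MLE query model
        kl += p_q * math.log(p_q / p_d(w))  # per-word KL contribution
    return math.exp(-kl)

# Toy data: a uniform background model over a 1,000-word vocabulary.
coll_prob = lambda w: 1.0 / 1000
doc = Counter({"diabetes": 3, "treatment": 2, "insulin": 1})
print(kld_score(["diabetes", "treatment"], doc, coll_prob))
```

          <p>Because p(w|θ_Q) is fixed for a given query, ranking by exp(−KL) is equivalent to ranking by query likelihood, which is why QLD can produce the initial ranking.</p>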
        <p>Finally, a new query model is estimated by combining the original query model and
RM1. Documents are re-scored and re-ranked using the new query model. RM3 [1] is
a variant of the relevance model, used here to estimate a new query model with
RM1:</p>
        <p>p(w|θ_Q′) = (1 − λ) · p(w|θ_Q) + λ · p_RM1(w|Q) (5)</p>
        <p>where λ is a parameter controlling the balance between the original query model and the feedback
model.</p>
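        <p>Equations 4 and 5 can be sketched as below; the feedback documents, the lightly smoothed document model, and the interpolation weight lam are illustrative assumptions:</p>

```python
from collections import Counter

def rm1(query, feedback_docs, doc_model):
    """Equation 4: p_RM1(w|Q) proportional to
    sum over D in F of p(w|theta_D) * p(theta_D) * p(Q|theta_D),
    with a uniform document prior p(theta_D)."""
    weights = {}
    for d in feedback_docs:
        ql = 1.0
        for q in query:            # p(Q|theta_D): query likelihood
            ql *= doc_model(d, q)
        weights[d] = ql
    vocab = {w for d in feedback_docs for w in d}
    rm = Counter()
    for d in feedback_docs:
        for w in vocab:
            rm[w] += doc_model(d, w) * weights[d]
    z = sum(rm.values())
    return {w: v / z for w, v in rm.items()}  # normalize to a distribution

def rm3(query_model, rm1_model, lam=0.5):
    """Equation 5: p(w|theta_Q') = (1-lam)*p(w|theta_Q) + lam*p_RM1(w|Q)."""
    vocab = set(query_model) | set(rm1_model)
    return {w: (1 - lam) * query_model.get(w, 0.0) + lam * rm1_model.get(w, 0.0)
            for w in vocab}

# Toy feedback documents as word tuples, with a lightly smoothed MLE model.
docs = [("diabetes", "insulin", "treatment"), ("diabetes", "diet")]
def doc_model(d, w):
    return (d.count(w) + 0.01) / (len(d) + 0.1)

fb = rm1(("diabetes",), docs, doc_model)
new_q = rm3({"diabetes": 1.0}, fb, lam=0.5)
```

        <p>Words that co-occur with the query across the feedback documents (here, "insulin" and "diet") receive probability mass in the new query model even though they never appear in the query itself.</p>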
        <sec id="sec-2-1-1">
          <title>Word Vectors</title>
          <p>Word2Vec [6] learns a vector representation for a word using a neural network
language model. The resulting vector representations for words (i.e., word vectors) can be
used in various tasks because a word is represented by a small-size vector. Learning the
word vectors is entirely unsupervised and it can be computed on the text corpus
according to purposes.</p>
          <p>In our approach, Wikipedia was chosen as the input to train Word2Vec. We
assumed that non-medical pages are not useful for learning medical-related word vectors.
Therefore, we focused on medical pages by filtering out non-medical pages. To this end,
categories were first collected from a root to leaves. We set
Health/Diseases_and_disorders and Health/Health_care/Medicine as the roots because it is assumed that general
medical queries seek information about diseases and treatments. This
filtering procedure produced 7,672 categories. Then, all pages associated with those
categories were used as input. The details of the medical Wikipedia pages are summarized in
Table 1.</p>
          <p>In the first approach, a similarity sim_WV(Q, D) between Q and D is computed using
the word vectors. To do that, cosine similarity is computed between Q and D by averaging
their associated word vectors, respectively. Then, a new relevance score score_WV(Q, D) is
computed by multiplying score(Q, D) and sim_WV(Q, D), and PRF is performed with
score_WV(Q, D). In detail, p_RM1(w|Q) is estimated in Equation 4 with score_WV(Q, D).
In Equation 5, p(w|θ_Q′) is constructed by combining p_RM1(w|Q) and p(w|θ_Q).
Finally, re-ranking is performed with p(w|θ_Q′) using Equation 1.</p>
          <p>In the second approach, a query Q is directly expanded to Q_WV using the word
vectors. To do that, the average word vector over all query words is computed. Then,
cosine similarity is computed between this average vector and the word vector of each
candidate word w. The top-5 words with the highest cosine similarity that do not appear
in Q are chosen and added to Q_WV. Then, PRF is performed with Q_WV using
Equations 1, 4, and 5.</p>
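          <p>Both uses of the word vectors reduce to cosine similarity over averaged vectors. The sketch below uses toy 3-dimensional vectors in place of the 200-dimensional Word2Vec output; the vocabulary and values are invented for illustration:</p>

```python
import math

def avg_vector(words, vectors):
    """Average the word vectors of all in-vocabulary words."""
    vecs = [vectors[w] for w in words if w in vectors]
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def sim_wv(query, doc, vectors):
    """First approach: cosine between averaged query and document vectors;
    the new score is score_WV(Q, D) = score(Q, D) * sim_WV(Q, D)."""
    return cosine(avg_vector(query, vectors), avg_vector(doc, vectors))

def expand_query(query, vectors, k=5):
    """Second approach: add the top-k most similar words not already in Q."""
    q_vec = avg_vector(query, vectors)
    cands = [(cosine(q_vec, v), w) for w, v in vectors.items() if w not in query]
    return list(query) + [w for _, w in sorted(cands, reverse=True)[:k]]

# Toy vectors (illustrative; the paper trains 200-d vectors on medical Wikipedia).
vectors = {"diabetes": [0.9, 0.1, 0.0], "insulin": [0.8, 0.2, 0.1],
           "glucose": [0.7, 0.3, 0.0], "football": [0.0, 0.1, 0.9]}
print(expand_query(["diabetes"], vectors, k=2))
```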
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>This task used the ClueWeb12-Disk-B (ClueWeb12B) collection, which contains about
50M pages. The text of the pages was extracted by removing HTML tags with the JSOUP
parser (https://jsoup.org/). Table 2 shows a summary of data statistics of ClueWeb12B.
Lucene (http://lucene.apache.org/) was exploited to index the collection and retrieve the
initial documents S. For text processing, stop-words were removed using the 419 stop-words
of the INQUERY list (http://sourceforge.net/p/lemur/galago/ci/default/tree/core/src/main/resources/stopwords/inquery).
|S| was set to 2500 and obtained using QLD.</p>
      <p>To generate the word vectors, a Java version of Word2Vec
(https://github.com/medallia/Word2VecJava) was used. The CBOW architecture was used with
200-dimensional word vectors. For input, we removed all punctuation and lowercased words
without removing stop-words.</p>
      <p>We submitted three runs for this task. Run1 is our baseline, while the other two runs are our
proposed approaches using the word vectors. Run2 is PRF with new relevance scores
using the word vectors. Run3 is PRF with an expanded query using the word vectors.</p>
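      <p>The input preparation described above (strip punctuation, lowercase, keep stop-words) can be sketched as follows; the example sentence is invented, and the resulting token lists would then be fed to a CBOW Word2Vec trainer such as the Java port cited above:</p>

```python
import re

def preprocess(text):
    """Word2Vec input as described: remove punctuation, lowercase,
    and keep stop-words (they are NOT removed for vector training)."""
    text = re.sub(r"[^\w\s]", " ", text.lower())  # punctuation -> spaces
    return text.split()

# Hypothetical sentence; the real input is the filtered medical Wikipedia pages.
print(preprocess("Diabetes mellitus, often called diabetes, is a disease."))
```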
      <sec id="sec-3-1">
        <title>1 https://jsoup.org/</title>
      </sec>
      <sec id="sec-3-2">
        <title>2 http://lucene.apache.org/</title>
        <p>3 http://sourceforge.net/p/lemur/galago/ci/default/tree/core/src/main/resources/stopwords/inquer
y</p>
      </sec>
      <sec id="sec-3-3">
        <title>4 https://github.com/medallia/Word2VecJava</title>
        <p>Run
1
2</p>
        <sec id="sec-3-3-1">
          <title>Description</title>
          <p>Scoring by KLD with RM1
Scoring by KLD with RM1 using 
Scoring by KLD with RM1 using</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Abdul-Jaleel</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          et al.:
          <source>UMass at TREC</source>
          <year>2004</year>
          :
          <article-title>Novelty and HARD</article-title>
          .
          <source>In: Proceedings of Text REtrieval Conference (TREC)</source>
          .
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          et al.:
          <article-title>Overview of the CLEF eHealth Evaluation Lab 2015</article-title>
          .
          <source>In: CLEF 2015 - 6th Conference and Labs of the Evaluation Forum. Lecture Notes in Computer Science (LNCS)</source>
          , Springer (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , Goeuriot, L., Suominen, H., Névéol, A., Palotti, J., Zuccon, G.:
          <article-title>Overview of the CLEF eHealth Evaluation Lab 2016</article-title>
          .
          <source>In: CLEF 2016 - 7th Conference and Labs of the Evaluation Forum</source>
          . Springer (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          et al.:
          <source>Overview of the ShARe/CLEF eHealth Evaluation Lab</source>
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>In: Proceedings of CLEF 2014</source>
          . Springer (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Kurland</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>PageRank without hyperlinks: Structural re-ranking using links induced by language models</article-title>
          .
          <source>In: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '05</source>
          . pp.
          <fpage>306</fpage>
          -
          <lpage>313</lpage>
          ACM Press, New York, New York, USA (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          et al.:
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          .
          <source>In: Proceedings of the International Conference on Learning Representations (ICLR</source>
          <year>2013</year>
          ). pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Oh</surname>
            ,
            <given-names>H.-S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jung</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Cluster-based query expansion using external collections in medical information retrieval</article-title>
          .
          <source>J. Biomed. Inform</source>
          .
          <volume>58</volume>
          ,
          <fpage>70</fpage>
          -
          <lpage>79</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lafferty</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>A study of smoothing methods for language models applied to information retrieval</article-title>
          .
          <source>ACM Trans. Inf. Syst</source>
          .
          <volume>22</volume>
          ,
          <issue>2</issue>
          ,
          <fpage>179</fpage>
          -
          <lpage>214</lpage>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lafferty</surname>
          </string-name>
          , J.:
          <article-title>Model-based feedback in the language modeling approach to information retrieval</article-title>
          .
          <source>In: Proceedings of the tenth international conference on Information and knowledge management</source>
          . pp.
          <fpage>403</fpage>
          -
          <lpage>410</lpage>
          ACM, New York, New York, USA (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Zuccon</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          et al.:
          <source>The IR Task at the CLEF eHealth Evaluation Lab</source>
          <year>2016</year>
          :
          <article-title>User-centred Health Information Retrieval</article-title>
          . In:
          <article-title>CLEF 2016 Evaluation Labs</article-title>
          and Workshop: Online Working Notes. (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>