<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CSKU GPRF-QE for Medical Topic Web Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ornuma Thesprasith</string-name>
          <email>ornuma.thesprasith@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chuleerat Jaruskulchai</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Faculty of Science, Kasetsart University</institution>
          ,
          <addr-line>Bangkok</addr-line>
          ,
          <country country="TH">Thailand</country>
        </aff>
      </contrib-group>
      <fpage>260</fpage>
      <lpage>268</lpage>
      <abstract>
        <p>Patients and their relatives increasingly have access to their health information in the form of discharge summaries, but most of them do not fully understand the contents. The ShARe/CLEF eHealth Evaluation Lab organized a shared task on improving the retrieval of medical information from the web, in which queries are formulated from information in discharge summaries. This paper investigates the effectiveness of query expansion using an external collection. Terms that co-occur in the pseudo-relevance feedback set drawn from the Genomics collection are selected and re-weighted with Rocchio's formula, using dynamically tuned parameters for the pseudo-relevance part. Lucene, a vector space model retrieval tool, serves as the baseline. The proposed expansion method improves on the baseline at every nDCG cut-off level and achieves the best P@10 on 3 topics. A biomedical collection such as Genomics is therefore useful for medical topic retrieval.</p>
      </abstract>
      <kwd-group>
        <kwd>Genomics Track 2004</kwd>
        <kwd>pseudo-relevance feedback</kwd>
        <kwd>re-weighting scheme</kwd>
        <kwd>medical terminology retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Patients and their relatives often have questions when reading a discharge
summary, because medical terminology belongs to a very specific domain and is not easy
for laypeople to understand. The ShARe/CLEF eHealth Evaluation Lab was established to help
these users better comprehend health information [1]. In particular, Task 3,
User-centred health information retrieval [2], focuses on a web collection, since search
engines are commonly used to find further explanation of a medical subject. The
retrieved health information should be understandable by general users and should come
from reliable resources. This means that relevant web pages contain medical terminology
together with general terms or common words that explain the medical terms in more
detail.</p>
      <p>Query expansion techniques are widely used to improve retrieval performance.
Several factors affect the expansion results: the source of expansion terms, the term
selection method, and the re-weighting method. Expansion terms may come from sources
such as the local collection, an external standard collection, or an ontology. External
collections such as English Wikipedia and TREC (disks 1-5) have been used for expansion, as
reported in [3]. Reliable and frequently used biomedical ontologies include the UMLS
Metathesaurus [4], the MeSH ontology [5], and SNOMED-CT [6].</p>
      <p>Research work [7] proposed a method for selecting the most effective expansion
source based on a query performance prediction technique. The objective of this
technique is to estimate the performance of a retrieval system without relevance judgments [8].
It either analyzes the collection before retrieval or examines the returned results
[9]. Query performance prediction can also estimate the degree of relatedness between
different collections. We follow this idea to select the expansion source from candidates such as
med [3], OHSUMED [10], and the Genomics collection [11].</p>
      <p>Expansion terms from an external collection should be similar to the indexing terms of
the local collection. Our previous work [12] used the internal MeSH (Medical Subject
Headings) terms of the local collection (OHSUMED) for expansion based on the
pseudo-relevance feedback (PRF) method. Since users of the MEDLINE collection need
information that describes diseases and treatments, expanding queries with medical vocabulary
can be beneficial there. In contrast, the ShARe/CLEF Task 3a [2] queries
are specific medical terms from discharge summaries, whereas the collection contains health
information web pages written for laypeople. We believe that there is a gap between the specific
medical terms in user’s queries and the general words used in relevant web pages. To
expand the medical terms in a query, we therefore select terms from the title and abstract
fields instead of the medical controlled vocabulary field (MeSH terms). We expect that
candidate terms derived in this way appear more often in relevant web pages
and therefore boost retrieval scores.</p>
      <p>Research work [13] expanded queries based on the pseudo-relevance feedback (PRF)
method and adapted Rocchio’s formula for re-weighting terms. We adapt the PRF
method differently, using the results of external PRF instead of the local PRF used in the
traditional PRF paradigm, and we adapt the re-weighting formula accordingly.
The details of our expansion method and the results are described in the next
sections.</p>
    </sec>
    <sec id="sec-2">
      <title>Method</title>
      <sec id="sec-2-1">
        <title>Vector Space Model Method and Tool</title>
        <p>The collection is represented by a matrix of terms and documents in which each row
represents a document and consists of weighted term values. The query is
represented by a vector of weighted term values in the same way as a document. The similarity
between the query vector and each document vector is used to rank the returned results. The
classical vector similarity measure is the cosine similarity, defined as follows.</p>
        <disp-formula id="eq-1">
          <label>(1)</label>
          <tex-math><![CDATA[\cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{\lVert \vec{q} \rVert \cdot \lVert \vec{d} \rVert}]]></tex-math>
        </disp-formula>
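        <p>The cosine measure can be illustrated with a minimal sketch over sparse term-weight vectors. The function names, the dictionary representation, and the toy weights are assumptions made for illustration; they are not taken from Lucene or from the submitted system.</p>
        <preformat><![CDATA[
```python
import math


def cosine_similarity(q, d):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * d.get(term, 0.0) for term, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_q * norm_d)


def rank(query, docs):
    """Return document ids sorted by decreasing similarity to the query."""
    return sorted(docs, key=lambda doc_id: cosine_similarity(query, docs[doc_id]),
                  reverse=True)


# Toy weights, chosen only to make the example easy to follow.
query = {"hepatic": 1.0, "encephalopathy": 1.0}
docs = {
    "d1": {"hepatic": 2.0, "encephalopathy": 1.0, "liver": 1.0},
    "d2": {"liver": 3.0, "diet": 1.0},
}
ranking = rank(query, docs)
```
]]></preformat>
        <p>Here d1 shares two terms with the query and is ranked first, while d2 shares none and receives a similarity of zero.</p>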
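        <p>Lucene is a vector space model retrieval tool [14] that implements the cosine
similarity measure in a more sophisticated way. The Lucene similarity measure is defined as
follows.</p>
        <disp-formula id="eq-2">
          <label>(2)</label>
          <tex-math><![CDATA[\mathrm{score}(q, d) = \mathrm{coord}(q, d) \times \mathrm{queryNorm}(q) \times \sum_{t \in q} \left( \mathrm{tf}(t \in d) \times \mathrm{idf}(t)^{2} \times \mathrm{boost}(t) \times \mathrm{norm}(t, d) \right)]]></tex-math>
        </disp-formula>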
        <p>Lucene allows the user to boost individual query terms to a specific weight with the
caret mark (“^”), for example, “hepatic^3.0 encephalopathy^4.6 liver”. These boost
values enter the score through the per-term boost factor and are normalized by the
QueryNorm(q) function.</p>
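        <p>A boosted query string of this form can be assembled from a table of term weights. The following is a minimal sketch; the helper name and the one-decimal formatting are assumptions, and terms containing Lucene query syntax characters would need escaping first.</p>
        <preformat><![CDATA[
```python
def boosted_query(term_weights):
    """Format {term: boost} as a query string using the caret boost syntax."""
    parts = []
    for term, boost in term_weights.items():
        if boost == 1.0:
            parts.append(term)  # default boost, no caret needed
        else:
            parts.append(f"{term}^{boost:.1f}")
    return " ".join(parts)


query = boosted_query({"hepatic": 3.0, "encephalopathy": 4.6, "liver": 1.0})
```
]]></preformat>
        <p>With these weights the helper reproduces the example string given in the text.</p>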
      </sec>
      <sec id="sec-2-2">
        <title>Genomics Pseudo-Relevance Feedback Expansion Method</title>
      </sec>
      <sec id="sec-2-3">
        <title>Indexing Process.</title>
        <p>In this work, the documents are web pages collected from many
medical-related resources [2]. The queries for this collection are formulated from medical
terminology in discharge summaries. We use 5 training queries to choose among three indexing
methods: a) all documents as raw web pages, b) documents with HTML tags removed
(some pages are missing), and c) documents with HTML tags removed, with the missing pages
compensated by their original pages. In our preliminary experiment, we evaluate MAP
using the training relevance judgments. The results are mixed and inconsistent. We therefore
select the compensated indexing method to avoid losing potentially underestimated web pages.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Expansion Source Selection.</title>
        <p>The original purpose of query performance prediction (QPP) is to estimate the performance
of a retrieval system without relevance judgments [8]. The technique either analyzes the collection
before retrieval or examines the returned results [9]. Research work [7] used QPP to select
appropriate expansion sources by comparing the average query term frequency in the
local and external collections. Our work uses the simplest method: comparing the number
of documents returned by retrieval from three TREC standard sub-collections, namely
med [3], OHSUMED [10], and Genomics 2004 [11]. For the 5 training queries, the
Genomics collection returns more documents than OHSUMED and med. In this work, we
assume that a larger returned set provides more useful expansion terms. Although the
Genomics collection is based on genomics information, we expect the biological terms in
genomics documents to be related to medical terminology.</p>
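        <p>The selection rule described above (prefer the collection that returns the most documents for the training queries) can be sketched as follows. The function name is hypothetical, summing the counts over queries is an assumption, and the counts are made-up placeholders, not figures from our experiments.</p>
        <preformat><![CDATA[
```python
def select_expansion_source(hit_counts):
    """Pick the collection with the largest total number of returned documents.

    hit_counts maps a collection name to a list of returned-document counts,
    one entry per training query.
    """
    totals = {name: sum(counts) for name, counts in hit_counts.items()}
    return max(totals, key=totals.get)


# Made-up counts for five training queries, for illustration only.
hit_counts = {
    "med": [3, 0, 5, 1, 2],
    "OHSUMED": [40, 12, 55, 9, 31],
    "Genomics2004": [210, 95, 330, 48, 127],
}
source = select_expansion_source(hit_counts)
```
]]></preformat>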
      </sec>
      <sec id="sec-2-5">
        <title>Term Selection.</title>
        <p>We hypothesize that relevant documents contain more general terms that are
easy for laypeople to understand. Expansion terms can therefore bridge the specific
terms in queries and the general terms in web pages. We select the terms that co-occur most often
in the Genomics-PRF set for expansion. Term selection proceeds in three
steps. First, we retrieve from the Genomics collection, which is indexed on title and
abstract. Second, the top-k documents that contain any query term are included in the
Genomics-PRF set. Third, terms from the title and abstract of the documents in this set are
selected by term frequency to form the candidate set.</p>
        <p>Since candidate terms are derived only from the Genomics collection (the Genomics-PRF set),
they may either repeat query terms or be newly added terms. Each candidate
term should therefore be assigned a different weight depending on which case applies.</p>
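        <p>The term selection steps can be sketched as below. This is a minimal illustration under stated assumptions: the ranked list stands in for the output of a retrieval run over the Genomics collection, the tokenizer is a simple lowercase split with no stop-word removal, and all names are hypothetical.</p>
        <preformat><![CDATA[
```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def select_candidates(query, ranked_docs, k, m):
    """Return the m most frequent terms in the top-k documents matching the query."""
    query_terms = set(tokenize(query))
    prf_set = []
    for doc in ranked_docs:
        if len(prf_set) >= k:
            break
        tokens = tokenize(doc["title"] + " " + doc["abstract"])
        # Keep only documents that contain at least one query term.
        if query_terms.intersection(tokens):
            prf_set.append(tokens)
    # Term frequency over the title and abstract of the PRF set.
    counts = Counter(term for tokens in prf_set for term in tokens)
    return counts.most_common(m)


# Toy ranked list standing in for a retrieval result.
docs = [
    {"title": "Anoxia and brain injury",
     "abstract": "Anoxia reduces oxygen supply to the brain."},
    {"title": "Gene expression in yeast",
     "abstract": "A study of transcription factors."},
    {"title": "Anoxic brain damage",
     "abstract": "Oxygen deprivation causes brain damage."},
]
candidates = select_candidates("anoxic brain injury", docs, k=2, m=3)
```
]]></preformat>
        <p>The second toy document shares no term with the query, so the PRF set consists of the first and third documents, and the most frequent candidate term is "brain".</p>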
      </sec>
      <sec id="sec-2-6">
        <title>Re-weighting Method.</title>
        <p>Rocchio’s formula is widely used in PRF-based query expansion re-weighting
schemes. The formula is composed of three parts: the original weight, the relevance-based
weight, and the non-relevance-based weight.</p>
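        <disp-formula id="eq-3">
          <label>(3)</label>
          <tex-math><![CDATA[W'_{Q} = \alpha W_{Q} + \frac{\beta}{|D_R|} \sum_{d_r \in D_R} d_r - \frac{\gamma}{|D_N|} \sum_{d_n \in D_N} d_n]]></tex-math>
        </disp-formula>
        <p>where W<sub>Q</sub> is the weight of a term in the initial query,
α is the tunable weight of the initial query,
β is the tunable weight of the relevant documents (d<sub>r</sub>),
|D<sub>R</sub>| is the number of relevant documents,
γ is the tunable weight of the non-relevant documents (d<sub>n</sub>), and
|D<sub>N</sub>| is the number of non-relevant documents.</p>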
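        <p>The traditional pseudo-relevance re-weighting formula replaces the relevant part
with a pseudo-relevance part and ignores the non-relevant part by setting γ to 0. It is defined
as follows,</p>
        <disp-formula id="eq-4">
          <label>(4)</label>
          <tex-math><![CDATA[W'_{Q} = \alpha W_{Q} + \beta W_{PRF}]]></tex-math>
        </disp-formula>
        <p>where W<sub>PRF</sub> is the weight of the term in the pseudo-relevant documents.</p>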
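        <p>The Rocchio update and its pseudo-relevance special case can be sketched over sparse term-weight vectors. This is a minimal illustration, assuming the standard formulation in which the non-relevant centroid is subtracted; the function name and the toy weights are hypothetical. With the default γ = 0 the function reduces to the pseudo-relevance form.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict


def rocchio(query, relevant, non_relevant=(), alpha=1.0, beta=1.0, gamma=0.0):
    """Re-weight a query vector from (pseudo-)relevant and non-relevant documents.

    All vectors are dicts mapping a term to its weight.
    """
    new_query = defaultdict(float)
    for term, weight in query.items():
        new_query[term] += alpha * weight
    for doc in relevant:
        for term, weight in doc.items():
            new_query[term] += beta * weight / len(relevant)
    for doc in non_relevant:
        for term, weight in doc.items():
            new_query[term] -= gamma * weight / len(non_relevant)
    return dict(new_query)


# Pseudo-relevance case: the top-ranked documents are assumed relevant, gamma is 0.
prf_docs = [
    {"anoxia": 2.0, "oxygen": 1.0},
    {"oxygen": 1.0, "brain": 2.0},
]
expanded = rocchio({"anoxia": 1.0}, prf_docs, alpha=1.0, beta=0.5)
```
]]></preformat>
        <p>In this toy case the original term "anoxia" is reinforced to 1.5, while "oxygen" and "brain" enter the query as new terms with weight 0.5 each.</p>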
        <p>Our method divides the original query terms into two parts according to whether they
appear in the candidate set: non-candidate terms (W<sub>NQ</sub>) and candidate terms (W<sub>CQ</sub>).
These two parts have corresponding tunable parameters α<sub>1</sub> and α<sub>2</sub>, respectively. Our
pseudo-relevant part uses the top-k documents returned from the Genomics collection;
the weights of terms from these documents are denoted W<sub>GPRF</sub> (Genomics-PRF terms). We do not use
pseudo-relevance feedback from the local collection. The re-weighting formula is defined as follows,</p>
        <disp-formula id="eq-5">
          <label>(5)</label>
          <tex-math><![CDATA[W'_{Q} = \alpha_{1} W_{NQ} + \alpha_{2} W_{CQ} + \beta W_{GPRF}]]></tex-math>
        </disp-formula>
        <p>where W<sub>NQ</sub> is the weight of an initial query term that does not appear in the PRF set,
α<sub>1</sub> is the tunable parameter for an initial query term that does not appear in the PRF set,
W<sub>CQ</sub> is the weight of an initial query term that appears in the PRF set,
α<sub>2</sub> is the tunable parameter for an initial query term that appears in the PRF set,
W<sub>GPRF</sub> is the weight of a Genomics expansion term in the PRF set, and
β is the dynamic tunable parameter for a new expansion term in the PRF set.</p>
        <p>We assign a higher value to original query terms that do not appear in the external
expansion source in order to prevent “query drift”. We use the external resource to find new
terms that increase recall. For candidate terms that are also original query terms, we set the offset
of the W<sub>CQ</sub> parameter lower than the W<sub>NQ</sub> parameter and use the term frequency to boost the weight
from that offset. This approach reduces the effect of over-weighting terms.</p>
        <p>A query log is useful information for relevance judgment [15]. We assume that the results
of the training queries act as a query log, and we use the training queries and their relevance
judgments to set the tunable parameter values. Term frequency in the Genomics-PRF set is used to set
these parameters. The optimized value of each weight is 1.0, and the tunable
parameter values (α<sub>1</sub>, α<sub>2</sub>, β) are 3.0, (2.0 + log<sub>2</sub>(tf)), and (0.5 + log<sub>2</sub>(tf)),
respectively, where tf is the frequency of the term in the Genomics-PRF set. These settings were
determined heuristically and perform well on the training set. We expect
that these values will also work well on the test query set.</p>
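        <p>The re-weighting scheme with the reported parameter values can be sketched as follows. This is an illustration under stated assumptions, not the submitted implementation: the argument of the logarithm is taken to be the term frequency in the Genomics-PRF set, the new expansion terms are taken to be the m most frequent PRF terms that are not already in the query, and every base weight is 1.0. The function and variable names are hypothetical.</p>
        <preformat><![CDATA[
```python
import math

ALPHA_1 = 3.0        # parameter for query terms absent from the PRF set
ALPHA_2_OFFSET = 2.0  # offset for query terms present in the PRF set
BETA_OFFSET = 0.5     # offset for new expansion terms
BASE_WEIGHT = 1.0     # reported value of each term weight


def reweight(query_terms, prf_tf, m):
    """Return {term: boost} for the expanded query.

    query_terms: list of original query terms.
    prf_tf: {term: frequency in the Genomics-PRF set}, frequencies >= 1.
    m: number of new expansion terms to add.
    """
    boosts = {}
    for term in query_terms:
        if term in prf_tf:
            # Candidate query term: lower offset, boosted by its frequency.
            boosts[term] = (ALPHA_2_OFFSET + math.log2(prf_tf[term])) * BASE_WEIGHT
        else:
            # Non-candidate query term: fixed higher weight to limit drift.
            boosts[term] = ALPHA_1 * BASE_WEIGHT
    new_terms = sorted((t for t in prf_tf if t not in boosts),
                       key=lambda t: prf_tf[t], reverse=True)[:m]
    for term in new_terms:
        boosts[term] = (BETA_OFFSET + math.log2(prf_tf[term])) * BASE_WEIGHT
    return boosts


# Toy frequencies, for illustration only.
prf_tf = {"liver": 8, "cirrhosis": 4, "ammonia": 2, "patients": 1}
boosts = reweight(["hepatic", "encephalopathy", "liver"], prf_tf, m=2)
```
]]></preformat>
        <p>In this toy case the two query terms missing from the PRF set keep the fixed weight 3.0, "liver" is raised to 5.0 by its frequency, and the two new terms "cirrhosis" and "ammonia" receive 2.5 and 1.5.</p>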
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <sec id="sec-3-1">
        <title>Experimental Setup</title>
        <p>The collection is distributed as 8 .zip files [2]. The HTML content of each web page is
enclosed between “#UID” and “#EOR” tags. The collection contains 875,486 HTML pages in total. Jsoup [16],
an HTML parser, is used to extract the major content of each web page: title,
description, keywords, headers, and bold and strong text. After content extraction,
some files are very small (less than 200 bytes). Excluding these leaves 460,279
qualified files. In our preliminary experiment, we indexed the collection in three ways:
a) raw HTML (whole collection), b) major content only, without HTML tags (460,279
files), and c) major content, with missing major-content files compensated by raw HTML (whole collection).</p>
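        <p>The compensated indexing method (type c) amounts to a per-page fallback rule, sketched below. The function name is hypothetical, and applying the 200-byte threshold to the UTF-8 size of the extracted content is an assumption about how the size check is made.</p>
        <preformat><![CDATA[
```python
MIN_BYTES = 200  # size below which extracted content is treated as missing


def text_to_index(raw_html, extracted):
    """Return the text indexed for one page under the compensated method."""
    if len(extracted.encode("utf-8")) >= MIN_BYTES:
        return extracted  # extracted major content is large enough to use
    return raw_html       # otherwise fall back to the original page


raw = "<html><head><title>Anoxia</title></head><body>...</body></html>"
short_page = text_to_index(raw, "Anoxia")
long_page = text_to_index(raw, "anoxia " * 40)
```
]]></preformat>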
        <p>Lucene version 4 is used as the indexing and retrieval tool. We use Lucene’s Standard
Analyzer to index the three collection types [14]. We retrieved the 5 training topics and
evaluated the results with the training relevance judgments. Since the MAP results are mixed, we
index with the compensated method (type c) to avoid losing potentially underestimated web pages.</p>
        <p>Our research focuses on finding a suitable source for query expansion. As a
preliminary step, we compare the results returned for the 5 training queries. These preliminary
results show that the Genomics 2004 collection returns the largest number of
documents for all training queries. This collection contains more biomedical terms and
gene information, so we believe that the returned documents are likely to contain more
related medical terms.</p>
        <p>We assume that the keywords or information needs of users are similar to the
keywords used in the training queries. This paper investigates the effectiveness of using the Genomics
collection to expand medical topic queries. In the preliminary experiment, we
retrieve the 5 training queries and vary the number of documents (top k) in the Genomics-PRF set and
the number of expansion terms (top m) used in equation (5). Based on the MAP
results of these variations, we found that the optimized values for top-k and top-m
are 19 documents and 8 terms, respectively. We expect that the test queries do not
differ from the training queries used to set these parameters. In the expansion
process, candidate terms are terms that co-occur with query terms in the same document
of the pseudo-relevance feedback (PRF) set.</p>
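        <p>The parameter search over top-k and top-m can be sketched as a small grid search that maximizes MAP on the training queries. This is a minimal illustration: run_for is a hypothetical hook that would run the expanded retrieval for a given (k, m), and the toy run and judgments below are placeholders rather than our data.</p>
        <preformat><![CDATA[
```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    hits, total = 0, 0.0
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(run, qrels):
    """Mean of the per-query average precision over all judged queries."""
    scores = [average_precision(run.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores)


def grid_search(run_for, qrels, k_values, m_values):
    """Return the (k, m) pair whose run has the highest MAP."""
    grid = [(k, m) for k in k_values for m in m_values]
    return max(grid, key=lambda km: mean_average_precision(run_for(*km), qrels))


# Toy judgments and a stand-in retrieval hook, for illustration only.
qrels = {"q1": {"d1", "d3"}}


def toy_run(k, m):
    if (k, m) == (19, 8):
        return {"q1": ["d1", "d3", "d2"]}
    return {"q1": ["d2", "d1", "d3"]}


best = grid_search(toy_run, qrels, k_values=[10, 19], m_values=[5, 8])
```
]]></preformat>
        <p>An exhaustive grid is affordable here because only 5 training queries are evaluated for each parameter pair.</p>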
      </sec>
      <sec id="sec-3-2">
        <title>Remarks before Discussion</title>
        <p>Our official baseline run is missing the result for topic no. 50 because of a program
error, which makes the evaluated baseline performance lower than it should be. We therefore
re-ran the corrected baseline (including the returned result for topic no. 50) and
re-evaluated the retrieval performance. The MAP values for the corrected baseline run and the
expansion run are 0.1820 and 0.2076, respectively.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Results and Discussion</title>
        <p>The results of all runs are presented in this section. Table 1 gives the detailed nDCG
comparison. At every nDCG cut-off level, the expansion run scores higher than both
baselines (the official and the corrected version). This indicates that terms from Genomics documents
occur in relevant web pages and that re-weighting these terms affects the result ranking.
Other TREC evaluation metrics for our runs are shown in Table 2.</p>
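        <p>For reference, nDCG at a cut-off level k can be computed as in the sketch below. It assumes one common formulation, in which the gain at rank i is discounted by log2(i + 1); the official evaluation tool may differ in details, and the function names and toy judgments are hypothetical.</p>
        <preformat><![CDATA[
```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain of the first k gains."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranking, judgments, k):
    """nDCG@k of a ranked list given {doc_id: graded relevance}."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranking]
    ideal = sorted(judgments.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy judgments: d1 is relevant, d2 is not judged.
judgments = {"d1": 1}
perfect = ndcg_at_k(["d1", "d2"], judgments, k=10)
swapped = ndcg_at_k(["d2", "d1"], judgments, k=10)
```
]]></preformat>
        <p>Moving the only relevant document from rank 1 to rank 2 lowers the score from 1.0 to 1/log2(3), which is about 0.63.</p>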
        <p>For precision at 10 (P@10), our baseline run is above the median on 10 topics, whereas the
expansion run is above the median on 14 topics. Our proposed expansion achieves the best performance on 3
of the 50 topics (topics 4, 9, and 17). For these topics, the terms in the pseudo-relevance feedback
set are closely related to the main keyword, such as “anoxic” vs. “anoxia”, “pneumonia”
vs. “lung”, and “duodenal” vs. “gastric”. These expansion terms are very helpful.</p>
        <p>The expansion run improves on the official baseline for 8 topics, whereas the official
baseline outperforms the expansion run for 4 topics, as shown in the following figures.</p>
        <p>Fig. 1. nDCG of the baseline runs compared with the expansion run</p>
        <p>Medical terms in discharge summaries are difficult for laypeople because they
are highly domain-specific. Retrieval with queries constructed
from a discharge summary returns web pages that are too specific, so users still need
more explanation and information about the subject.</p>
        <p>We believe that relevant web pages contain both medical terminology and general
terms. We use a query expansion technique to discover useful terms and increase the
chance of retrieving more relevant documents. Our query expansion approach is based
on pseudo-relevance feedback from an external biological collection (genomics
literature). We use the training queries and training relevance judgments to set the optimized
parameters of the proposed expansion method.</p>
        <p>The important issues for query expansion are the source of terms, the type of terms used for
expansion, and the re-weighting scheme. We determine the expansion source with a query
performance prediction technique, estimating the usefulness of an external collection by
the size of its returned set. The biomedical references in the Genomics collection contain
disease and related gene information; terms in these references are selected and
re-weighted based on their frequency in the PRF set. Although we use only statistical information
from the pseudo-relevance feedback set, the proposed method shows a MAP improvement
over the baseline.</p>
        <p>In future work, we will develop more sophisticated criteria for selecting the external
collection used to expand queries and will experiment with various external collections.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <label>1.</label>
        <mixed-citation>L., Velupillai, S., Chapman, W. W., Martinez, D., Zuccon, G., and Palotti, J.: <source>Overview of the ShARe/CLEF eHealth Evaluation Lab 2014</source>. Springer (<year>2014</year>)</mixed-citation>
      </ref>
      <ref id="ref2">
        <label>2.</label>
        <mixed-citation>Goeuriot, L., Kelly, L., Li, W., Palotti, J., Pecina, P., Zuccon, G., Hanbury, A., Jones, G., and Mueller, H.: <article-title>ShARe/CLEF eHealth Evaluation Lab 2014, Task 3: User-centred health information retrieval</article-title>, In <source>CLEF 2014</source>. (<year>2014</year>)</mixed-citation>
      </ref>
      <ref id="ref3">
        <label>3.</label>
        <mixed-citation>Voorhees, E. M., and Harman, D.: <article-title>Overview of the Fifth Text REtrieval Conference (TREC-5)</article-title>, In <source>TREC</source>. (<year>1996</year>)</mixed-citation>
      </ref>
      <ref id="ref4">
        <label>4.</label>
        <mixed-citation>Unified Medical Language Systems, http://www.nlm.nih.gov/research/umls</mixed-citation>
      </ref>
      <ref id="ref5">
        <label>5.</label>
        <mixed-citation>The Basics of Medical Subject Headings (MeSH®), http://www.nlm.nih.gov/bsd/disted/mesh/</mixed-citation>
      </ref>
      <ref id="ref6">
        <label>6.</label>
        <mixed-citation>SNOMED Clinical Terms® (SNOMED CT®), http://www.nlm.nih.gov/research/umls/Snomed/snomed_main.html</mixed-citation>
      </ref>
      <ref id="ref7">
        <label>7.</label>
        <mixed-citation>He, B., and Ounis, I.: <article-title>Combining fields for query expansion and adaptive query expansion</article-title>. <volume>43</volume>, <fpage>1294</fpage>-<lpage>1307</lpage> (<year>2007</year>)</mixed-citation>
      </ref>
      <ref id="ref8">
        <label>8.</label>
        <mixed-citation>Kurland, O., Raiber, F., and Shtok, A.: <article-title>Query-performance prediction and cluster ranking: Two sides of the same coin</article-title>, In <source>Proceedings of the 21st ACM international conference on Information and knowledge management</source>. pp. <fpage>2459</fpage>-<lpage>2462</lpage>. ACM (<year>2012</year>)</mixed-citation>
      </ref>
      <ref id="ref9">
        <label>9.</label>
        <mixed-citation>Cummins, R., Jose, J., and O'Riordan, C.: <article-title>Improved query performance prediction using standard deviation</article-title>. In <source>Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval</source>. pp. <fpage>1089</fpage>-<lpage>1090</lpage>. ACM (<year>2011</year>)</mixed-citation>
      </ref>
      <ref id="ref10">
        <label>10.</label>
        <mixed-citation>Hersh, W., Buckley, C., Leone, T. J., and Hickam, D.: <article-title>OHSUMED: An Interactive Retrieval Evaluation and New Large Test Collection for Research</article-title>, in <source>SIGIR '94</source>, B. Croft and C.J. Rijsbergen (eds). p. <fpage>192</fpage>-<lpage>201</lpage>. Springer London (<year>1994</year>)</mixed-citation>
      </ref>
      <ref id="ref11">
        <label>11.</label>
        <mixed-citation>William R. Hersh, R. T. B., Laura Ross, Phoebe Johnson, Aaron M. Cohen, Dale F. Kraemer: <article-title>TREC 2004 genomics track overview</article-title>, In <source>The 13th Text REtrieval Conference</source>. (<year>2004</year>)</mixed-citation>
      </ref>
      <ref id="ref12">
        <label>12.</label>
        <mixed-citation>Thesprasith, O., and Jaruskulchai, C.: <article-title>Query Expansion Using Medical Subject Headings Terms in the Biomedical Documents</article-title>, in <source>Intelligent Information and Database Systems</source>. p. <fpage>93</fpage>-<lpage>102</lpage>. Springer (<year>2014</year>)</mixed-citation>
      </ref>
      <ref id="ref13">
        <label>13.</label>
        <mixed-citation>Abdou, S., and Savoy, J.: <article-title>Searching in Medline: Query expansion and manual indexing evaluation</article-title>. <volume>44</volume>, <fpage>781</fpage>-<lpage>789</lpage> (<year>2008</year>)</mixed-citation>
      </ref>
      <ref id="ref14">
        <label>14.</label>
        <mixed-citation>Apache Lucene - Apache Lucene Core, http://lucene.apache.org/core/</mixed-citation>
      </ref>
      <ref id="ref15">
        <label>15.</label>
        <mixed-citation>Cui, H., Wen, J.-R., Nie, J.-Y., and Ma, W.-Y.: <article-title>Query expansion by mining user logs</article-title>. <volume>15</volume>, <fpage>829</fpage>-<lpage>839</lpage> (<year>2003</year>)</mixed-citation>
      </ref>
      <ref id="ref16">
        <label>16.</label>
        <mixed-citation>jsoup: Java HTML Parser, http://jsoup.org/</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>