<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>QUT ielab at CLEF 2017 e-Health IR Task: Knowledge Base Retrieval for Consumer Health Search</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jimmy</string-name>
          <email>jimmy@hdr.qut.edu.au</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guido Zuccon</string-name>
          <email>g.zuccon@qut.edu.au</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bevan Koopman</string-name>
          <email>bevan.koopman@csiro.au</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Australian E-Health Research Centre, CSIRO</institution>
          ,
          <addr-line>Brisbane</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Queensland University of Technology</institution>
          ,
          <addr-line>Brisbane</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Surabaya (UBAYA)</institution>
          ,
          <addr-line>Surabaya</addr-line>
          ,
          <country country="ID">Indonesia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we describe our participation in the CLEF 2017 e-Health IR Task [6]. This track aims to evaluate and advance search technologies that support consumers in finding health advice online. Our solution addressed this challenge by developing a knowledge base (KB) query expansion method. We found that the two best KB query expansion methods map entity mentions to KB entities either by exact matching of entity mentions to the KB aliases (EM-Aliases) or by multi-matching entity mentions to all KB features (Title, Categories, Links, Aliases, and Body) (EM-All). Once the mapping between entity mentions and KB entities was established, we found the Title of the mapped KB entities to be the best source of expansion terms, compared to the aliases or the combination of both features. Finally, we also found that Relevance Feedback and Pseudo Relevance Feedback are effective in further improving query effectiveness.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        A major challenge for users in consumer health search (CHS) is how to effectively
represent complex and ambiguous information needs as a query [
        <xref ref-type="bibr" rid="ref10 ref11 ref13 ref9">11, 9, 10, 13</xref>
        ]. In
this work we seek to overcome this problem by reformulating the consumer's
health query with more effective terms (e.g., less ambiguous terms or synonyms).
Previous work has shown that manually replacing query terms with those from
medical terminologies (e.g., UMLS) can be effective [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]; but can it be
done automatically?
      </p>
      <p>
        This work addressed the ad hoc search task defined in the CLEF eHealth
2017 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], Task 3: Patient-Centred Information Retrieval, sub task IRTask1 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
In 2017, this task uses the same set of queries as the CLEF eHealth task in
2016. However, only results that were un-judged in 2016 will be considered.
      </p>
      <p>
        [...] such as Wikipedia and Freebase and then used related entities for query
expansion. The approach of Bendersky et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] involved linking the query to concepts in
Wikipedia. Concepts from the query, denoted as Q, were weighted; the same
was done for concepts in each of the documents in the corpus, denoted as D.
The relevance score sc(Q, D) between query Q and document D was calculated
as a relatedness measure between Q and D [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Later, the Entity Query Feature Expansion model [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] extended this previous
work by automatically expanding queries through links to Wikipedia. Instead
of just using entities from Wikipedia (as done by Bendersky et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]), the
Entity Query Feature Expansion model labelled words in the user query and in
each document with sets of entity mentions <inline-formula><tex-math>M_Q</tex-math></inline-formula> and <inline-formula><tex-math>M_d</tex-math></inline-formula> [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Each entity mention
was related to KB entities <inline-formula><tex-math>e \in E</tex-math></inline-formula>, with different relationship types. The queries
were expanded by including entity aliases, categories, words, and types from
Wikipedia articles. The expanded query was then matched against documents
in the corpus using the query likelihood model with Dirichlet smoothing.
      </p>
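      <p>As a rough illustration of the final matching step, the sketch below scores documents with a query likelihood model with Dirichlet smoothing. The function name, the toy collection, and the value mu = 2000 are assumptions made for this example; they are not taken from [2].</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def dirichlet_ql_score(query_terms, doc_terms, collection_tf, collection_len, mu=2000.0):
    """Log query likelihood of a document under Dirichlet smoothing.

    P(w|D) = (tf(w, D) + mu * P(w|C)) / (|D| + mu), where P(w|C) is the
    relative frequency of w in the whole collection. mu is an assumed value.
    """
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for w in query_terms:
        p_wc = collection_tf.get(w, 0) / collection_len
        if p_wc == 0.0:
            continue  # term never seen in the collection: skipped in this sketch
        score += math.log((doc_tf[w] + mu * p_wc) / (doc_len + mu))
    return score


# Toy collection of two tokenised documents (hypothetical data).
docs = {
    "d1": "headache is a common symptom of migraine".split(),
    "d2": "the knee joint connects the thigh and the leg".split(),
}
collection_tf = Counter(t for terms in docs.values() for t in terms)
collection_len = sum(collection_tf.values())

# An expanded query is just a longer list of terms scored the same way.
expanded_query = ["headache", "migraine"]
scores = {d: dirichlet_ql_score(expanded_query, terms, collection_tf, collection_len)
          for d, terms in docs.items()}
best = max(scores, key=scores.get)
```
]]></preformat>
      <p>In this toy setting the document that contains the query terms receives the higher score; with a real collection, the collection statistics would come from the index rather than from an in-memory counter.</p>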
      <p>
        We posit that this Entity Query Feature Expansion model would have merit
in CHS. It provides a means of mapping health queries to health entities in a
health-related subset of a general KB (Wikipedia). The initial query can then
be expanded based on related entities. Our decision to use a general KB differs
from other approaches in health search, which typically expand the query using
a specialised medical KB (e.g., MeSH, UMLS) [
        <xref ref-type="bibr" rid="ref3 ref8">3, 8</xref>
        ]. Our rationale for this was
the observation that consumers tend to submit queries using general terms and
that these are covered by Wikipedia entities. Moreover, Wikipedia also covers
many of the medical entities found in specialised medical KBs. More importantly,
there are links between the general and specialised entities in Wikipedia, links
that can be exploited for query expansion. We therefore adopt the Entity
Query Feature Expansion model for our empirical evaluation, to determine whether
such a KB retrieval approach is effective for CHS. Note, however, that while
Wikipedia content is manually curated by a large, active community, its editors
may not include medical experts or clinical terminologists. Thus, some of the
information included for medical entities in Wikipedia may contain errors or
be incomplete.
      </p>
      <p>3 Our KB Query Expansion Model for CLEF 2017</p>
      <p>We use the Entity Query Feature Expansion model for retrieval and Wikipedia
as the KB. A single Wikipedia page represents a single entity (the page title
identifies the entity). Beyond titles, Wikipedia pages also contain many features
useful in a retrieval scenario. Figure 1 shows those we used to map the queries
to entities in the KB and as the source of expansion terms: entity title (E),
categories (C), links (L), aliases (A), and body (B).</p>
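      <p>We formally define the query expansion model as:
        <disp-formula id="eq1"><label>(1)</label><tex-math>\vartheta_{\hat{q}} = \sum_{M} \sum_{f} \lambda_f \, \vartheta_{f(EM;SE)}</tex-math></disp-formula>
      </p>
    </sec>
    <sec id="sec-2">
      <title>Figure 1</title>
      <p>[Figure 1: Wikipedia page features used to map queries to KB entities and as the source of expansion terms. Legend: Q: Query; W: Words in query; M: Entity mention; E: Entity; C: Categories; L: Links; A: Aliases; B: Body.]</p>
    </sec>
    <sec id="sec-3">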
      <p>In Equation 1, M denotes the entity mentions, comprising the uni-, bi-, and tri-grams generated
from the query; f is a function used to extract the expansion terms; <inline-formula><tex-math>\lambda_f \in (0, 1)</tex-math></inline-formula>
is a weighting factor; and <inline-formula><tex-math>\vartheta_{f(EM;SE)}</tex-math></inline-formula> is a function that maps entity mention M to the
Wikipedia features EM (i.e., "Title", "Aliases", "Links", "Body", "Categories",
"All") and extracts expansion terms from the source of expansion SE (i.e., "Title",
"Aliases", "Title and Aliases").</p>
      <p>3.1 Relevance Feedback and Pseudo Relevance Feedback</p>
      <p>
On top of the KB query expansion, we also performed relevance feedback (RF)
and pseudo relevance feedback (PRF). We performed RF by extracting the
ten most important health-related words (based on tf.idf) from the top three
relevant documents (i.e., documents with a relevance score greater than 0 in the CLEF 2016 qrels).
A word is considered health related if it exactly matches a title or an alias of
a Wikipedia health page. We consider a Wikipedia page to be health related
if it contains an infobox of a health type or links to medical vocabulary resources,
e.g., MeSH.
      </p>
      <p>3.2 Runs</p>
      <p>
We submitted 7 runs, as described in Table 1. The runs included a baseline, which
consists of submitting the original, unexpanded queries to a system
implementing BM25F. To produce this submission, we indexed the ClueWeb12-B13
collection using Elasticsearch 5.1.1, with stopping and stemming. For BM25F,
we set b = 0.75 and k1 = 1.2. BM25F allows boosting factors to be specified for
matches occurring in different fields of the indexed web page. We consider only
the title field and the body field, with boost factors 1 and 3, respectively. These
were found to be the optimal weights for BM25F for this test collection in
previous work [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This is a strong baseline as it outperforms all runs submitted to
CLEF 2016 (excluding the organisers' relevance feedback baselines) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
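      <p>A minimal sketch of how the baseline configuration described above could be written as Elasticsearch request bodies. The field names title and body, the helper names, the built-in "english" analyzer, the mapping type name "doc", and the use of a multi_match query with per-field boosts are all assumptions; the text does not show the actual configuration.</p>
      <preformat><![CDATA[
```python
import json


def baseline_index_settings(k1=1.2, b=0.75):
    """Index body: BM25 parameters plus an analyzer with stopping and stemming.

    The "english" analyzer and the mapping type name "doc" are assumptions;
    the paper only states that stopping and stemming were used.
    """
    text_field = {"type": "text", "analyzer": "english"}
    return {
        "settings": {
            "index": {"similarity": {"default": {"type": "BM25", "k1": k1, "b": b}}}
        },
        "mappings": {"doc": {"properties": {"title": text_field, "body": text_field}}},
    }


def baseline_query(query_text, title_boost=1, body_boost=3, size=1000):
    """Search body matching the query against title and body with field boosts."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": query_text,
                "fields": [f"title^{title_boost}", f"body^{body_boost}"],
            }
        },
    }


settings = baseline_index_settings()
body = baseline_query("headache relief options")  # hypothetical query text
print(json.dumps(body, indent=2))
```
]]></preformat>
      <p>The two dictionaries would be sent as the index-creation and search request bodies, respectively. Whether boosted per-field matching of this kind corresponds exactly to the BM25F variant evaluated in [5] is not stated in the text.</p>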
      <p>For constructing the KB, we considered candidate pages from the English
subset of Wikipedia (dump 1/12/2016), limited to current revisions only and
without talk or user pages. Of the 17 million entries, we filtered out pages that
were redirects; this resulted in a Wikipedia corpus of 9,195,439 pages.</p>
      <p>These candidate pages were then filtered by retaining only pages that
contain a health infobox type and links to medical terminologies such as MeSH, UMLS,
SNOMED CT, and ICD. This choice was found to be more effective than retaining
all Wikipedia pages. The retained pages were then indexed using Elasticsearch
5.1.1 with field-based indexing (fields: title, links, categories, types, aliases, and
body), to support the use of different fields as the source of query expansion
terms.</p>
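      <p>The filtering step can be sketched as follows. The page representation (a dict with the keys is_redirect, infobox_type, and external_ids), the list of health infobox types, and the identifier value are hypothetical. The text states the health criterion once with "or" and once with "and", so the sketch exposes that choice as the require_both flag rather than assuming one reading.</p>
      <preformat><![CDATA[
```python
MEDICAL_VOCABULARIES = ("mesh", "umls", "snomed ct", "icd")

# Assumed list of infobox types; the list used by the authors is not given.
HEALTH_INFOBOXES = {"medical condition", "disease", "drug", "symptom"}


def is_health_page(page, require_both=False):
    """Decide whether a parsed Wikipedia page belongs in the health KB.

    `page` is a hypothetical dict with keys "is_redirect", "infobox_type" and
    "external_ids" (vocabulary name -> identifier). Redirects are always dropped.
    """
    if page.get("is_redirect", False):
        return False
    has_infobox = (page.get("infobox_type") or "").lower() in HEALTH_INFOBOXES
    vocabularies = {name.lower() for name in page.get("external_ids", {})}
    has_vocab_link = any(v in vocabularies for v in MEDICAL_VOCABULARIES)
    if require_both:
        return has_infobox and has_vocab_link
    return has_infobox or has_vocab_link


pages = [
    {"title": "Migraine", "infobox_type": "Medical condition",
     "external_ids": {"MeSH": "placeholder-id"}},
    {"title": "Brisbane", "infobox_type": "Settlement", "external_ids": {}},
    {"title": "Headache (redirect)", "is_redirect": True,
     "infobox_type": "Medical condition", "external_ids": {}},
    {"title": "Aspirin", "infobox_type": "Drug", "external_ids": {}},
]
kb_titles = [p["title"] for p in pages if is_health_page(p)]
kb_titles_strict = [p["title"] for p in pages if is_health_page(p, require_both=True)]
```
]]></preformat>
      <p>With the permissive reading both health pages are kept, while the strict reading keeps only the page that has both an infobox and a vocabulary link.</p>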
      <p>Once the knowledge base was constructed, we expanded the initial query
by first extracting mentions (all uni-, bi-, and tri-grams of the query). Next, we
mapped the extracted mentions to KB entities by exact matching of the query
mentions to terms in the KB aliases field (EM-Aliases) and to all KB fields
(EM-All). Finally, we expanded the initial query with the titles of the mapped entities.</p>
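      <p>The three steps above (mention extraction, mapping, and expansion with entity titles) can be sketched on a toy in-memory KB. The KB entries, function names, and the whole-value string comparison are assumptions for illustration; the authors matched mentions against Elasticsearch fields, which this sketch does not reproduce.</p>
      <preformat><![CDATA[
```python
def extract_mentions(query, max_n=3):
    """Return all uni-, bi- and tri-grams of the query as candidate mentions."""
    tokens = query.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def map_mentions(mentions, kb, fields):
    """Return KB entities for which some mention equals a value in `fields`."""
    matched = []
    for entity in kb:
        values = {v.lower() for f in fields for v in entity.get(f, [])}
        if any(m in values for m in mentions):
            matched.append(entity)
    return matched


def expand_query(query, kb, fields=("aliases",)):
    """Append the titles of the mapped entities to the original query."""
    entities = map_mentions(extract_mentions(query), kb, fields)
    expansion = [e["title"][0] for e in entities]
    return query + " " + " ".join(expansion) if expansion else query


# Hypothetical two-entity KB; every field holds a list of strings.
kb = [
    {"title": ["Myocardial infarction"], "aliases": ["heart attack", "MI"],
     "categories": ["Medical emergencies"]},
    {"title": ["Heart"], "aliases": ["cardiac muscle organ"],
     "categories": ["Organs"]},
]
ALL_FIELDS = ("title", "categories", "links", "aliases", "body")

mentions = extract_mentions("heart attack symptoms")
em_aliases = expand_query("heart attack symptoms", kb, fields=("aliases",))
em_all = expand_query("heart attack symptoms", kb, fields=ALL_FIELDS)
```
]]></preformat>
      <p>In this example the alias-only mapping links the bigram "heart attack" to a single entity, whereas matching against all fields also maps the unigram "heart" to a second entity and therefore adds a second title to the query.</p>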
      <p>We further expanded the queries from EM-Aliases and EM-All by performing
Relevance Feedback (RF) and Pseudo Relevance Feedback (PRF). Our RF used
the top ten health-related words from the top three relevant results. Health-related
words are words that match the title of a Wikipedia health page (i.e., the title of
a page in the KB). Relevant results are documents that are judged relevant according to
the CLEF 2016 qrels. In this work, PRF used the top ten health-related words from the top
three results (regardless of whether they were judged or not).</p>
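      <p>The difference between RF and PRF described above lies only in how the three feedback documents are chosen. The sketch below makes this explicit; the function name, the toy documents and document frequencies, and the idf variant log(N / df) are assumptions, since the text only says that words were ranked by tf.idf.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def feedback_terms(ranked_docs, health_terms, doc_freq, n_docs,
                   relevant=None, top_docs=3, top_terms=10):
    """Pick expansion terms from feedback documents by tf.idf.

    ranked_docs: list of (doc_id, tokens) in rank order.
    relevant: set of doc ids judged relevant (RF); None means PRF, i.e. the
    top-ranked documents are used regardless of judgements.
    health_terms: lower-cased KB titles used as the health vocabulary.
    """
    if relevant is None:
        feedback = ranked_docs[:top_docs]
    else:
        feedback = [d for d in ranked_docs if d[0] in relevant][:top_docs]
    tf = Counter(t for _, tokens in feedback for t in tokens if t in health_terms)
    weights = {t: f * math.log(n_docs / doc_freq.get(t, 1)) for t, f in tf.items()}
    # Sort by decreasing weight; ties are broken alphabetically.
    ranked = sorted(weights, key=lambda t: (-weights[t], t))
    return ranked[:top_terms]


# Hypothetical ranking, vocabulary and collection statistics.
ranked_docs = [
    ("d1", "cheap flights and hotels".split()),
    ("d2", "migraine causes severe headache and nausea".split()),
    ("d3", "headache treatment with aspirin".split()),
    ("d4", "aspirin dosage for migraine".split()),
]
health_terms = {"migraine", "headache", "nausea", "aspirin"}
doc_freq = {"migraine": 20, "headache": 50, "nausea": 40, "aspirin": 10}
n_docs = 1000

rf_terms = feedback_terms(ranked_docs, health_terms, doc_freq, n_docs,
                          relevant={"d2", "d3", "d4"})
prf_terms = feedback_terms(ranked_docs, health_terms, doc_freq, n_docs)
```
]]></preformat>
      <p>Here RF skips the non-relevant top result and draws its terms from the three judged-relevant documents, while PRF takes the top three results as they are, so the two variants rank the candidate terms differently.</p>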
      <p>[Figure 2: un-judged documents in the top 10 search results for each run. Run labels: baseline, baseline Rf, EM-Aliases, EM-Aliases PRF, EM-Aliases RF, EM-All, EM-All Prf, EM-All Rf.]</p>
      <p>4 Results</p>
      <p>
        Runs produced with the methods outlined above were stripped of any documents
assessed in CLEF 2016, as per instructions for the CLEF 2017 submissions [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Before the removal of these documents, we evaluated the results with respect
to NDCG@10, BPref and RBP@10. Note that BPref results are based on the
top 1,500 results for each query (we needed to retrieve more
documents than the 1,000-document threshold so that, after removing
documents assessed in CLEF 2016, we could still retain 1,000 documents). Results
according to the CLEF 2016 relevance assessments are reported in Table 2.
      </p>
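      <p>The removal of previously assessed documents and the RBP@10 evaluation mentioned above can be sketched as follows. The function names and toy judgements are hypothetical, the persistence value p = 0.8 is an assumption (the text does not state it), and the residual computed here only accounts for un-judged documents within the evaluated depth.</p>
      <preformat><![CDATA[
```python
def strip_judged(ranking, judged, keep=1000):
    """Drop documents assessed in the earlier campaign and cut to `keep`."""
    return [d for d in ranking if d not in judged][:keep]


def rbp_at_k(ranking, qrels, k=10, p=0.8):
    """Rank-biased precision at depth k, plus the residual from un-judged docs.

    qrels maps doc id to a graded judgement; any grade above 0 counts as
    relevant. p = 0.8 is an assumed persistence value.
    """
    score, residual = 0.0, 0.0
    for i, doc in enumerate(ranking[:k]):
        weight = (1 - p) * p ** i
        if doc not in qrels:
            residual += weight  # un-judged: could be relevant or not
        elif qrels[doc] > 0:
            score += weight
    return score, residual


qrels = {"a": 1, "b": 0, "c": 2}          # hypothetical 2016-style judgements
ranking = ["a", "x", "b", "c", "y", "z"]  # retrieved list, deeper than needed

score, residual = rbp_at_k(ranking, qrels, k=4)
submission = strip_judged(ranking, judged=set(qrels), keep=3)
```
]]></preformat>
      <p>Retrieving a deeper list than required, as done with the top 1,500 results, leaves enough documents after filtering to fill the submission. The residual shows how much of the score is left undetermined by un-judged documents.</p>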
      <p>We further analysed the runs with respect to the number of un-judged
documents retrieved (using the CLEF 2016 relevance assessments). Figure 2 shows
that our expansion retrieved many un-judged documents in the top 10 search
results. This observation, along with the large RBP residuals reported in Table 2,
suggests that the evaluation of our runs may be affected by the large number of
un-judged documents. The new assessments in CLEF 2017 may provide a fairer
estimate of the effectiveness of the considered KB query expansion approaches.</p>
      <p>5 Future Work and Conclusion</p>
      <p>Future work will seek to further improve the effectiveness of the expanded queries
by exploring post-processing of the results, for example by promoting documents
that are more likely to be health related.</p>
      <p>
        In conclusion, using the CLEF 2016 dataset, we found that the Entity Query Feature
Expansion model [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] can improve query effectiveness. The
expanded queries can then be further improved by performing Relevance Feedback
and Pseudo Relevance Feedback.
      </p>
      <p>Acknowledgment: Jimmy conducted this research as part of his doctoral study,
which is sponsored by the Indonesia Endowment Fund for Education (Lembaga
Pengelola Dana Pendidikan / LPDP).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bendersky</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Metzler</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Croft</surname>
          </string-name>
          , W.:
          <article-title>Effective query formulation with multiple information sources</article-title>
          .
          <source>In: WSDM'12</source>
          . pp.
          <fpage>443</fpage>
          –
          <lpage>452</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Dalton</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dietz</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Allan</surname>
          </string-name>
          , J.:
          <article-title>Entity Query Feature Expansion Using Knowledge Base Links</article-title>
          .
          <source>In: SIGIR'14</source>
          . pp.
          <fpage>365</fpage>
          –
          <lpage>374</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Díaz-Galiano</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martín-Valdivia</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ureña-López</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Query expansion with a medical ontology to improve a multimodal information retrieval system</article-title>
          .
          <source>Journal of Computers in Biology and Medicine</source>
          <volume>39</volume>
          (
          <issue>4</issue>
          ),
          <fpage>396</fpage>
          –
          <lpage>403</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suominen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neveol</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robert</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanoulas</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spijker</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palotti</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zuccon</surname>
          </string-name>
          , G.:
          <article-title>CLEF 2017 eHealth Evaluation Lab Overview</article-title>
          .
          <source>In: CLEF 2017 - 8th Conference and Labs of the Evaluation Forum. Lecture Notes in Computer Science (LNCS)</source>
          , Springer (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Jimmy</surname>
          </string-name>
          ,
          <string-name>
            <surname>Zuccon</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koopman</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Boosting Titles Does Not Generally Improve Retrieval Effectiveness</article-title>
          .
          <source>In: ADCS'16</source>
          . pp.
          <fpage>25</fpage>
          –
          <lpage>32</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Palotti</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zuccon</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jimmy</surname>
          </string-name>
          ,
          <string-name>
            <surname>Pecina</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lupu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hanbury</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>CLEF 2017 Task Overview: The IR Task at the eHealth Evaluation Lab</article-title>
          .
          <source>In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum</source>
          . CEUR Workshop Proceedings (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Plovnick</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zeng</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          :
          <article-title>Reformulation of consumer health queries with professional terminology: a pilot study</article-title>
          .
          <source>JMIR</source>
          <volume>6</volume>
          (
          <issue>3</issue>
          ) (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Silva</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopes</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>The effectiveness of query expansion when searching for health related content: InfoLab at CLEF eHealth 2016</article-title>
          . In: CLEF'
          <volume>16</volume>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Toms</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Latter</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>How consumers search for health information</article-title>
          .
          <source>Health Informatics Journal</source>
          <volume>13</volume>
          (
          <issue>3</issue>
          ),
          <fpage>223</fpage>
          –
          <lpage>235</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Zeng</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kogan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ash</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Greenes</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boxwala</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Characteristics of consumer terminology for health information retrieval</article-title>
          .
          <source>Journal of Methods of Information in Medicine</source>
          <volume>41</volume>
          (
          <issue>4</issue>
          ),
          <fpage>289</fpage>
          –
          <lpage>298</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          , Y.:
          <article-title>Searching for specific health-related information in MedlinePlus: Behavioral patterns and user experience</article-title>
          .
          <source>JASIST</source>
          <volume>65</volume>
          (
          <issue>1</issue>
          ),
          <fpage>53</fpage>
          –
          <lpage>68</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Zuccon</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palotti</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lupu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pecina</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mueller</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Budaher</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deacon</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>The IR Task at the CLEF eHealth evaluation lab 2016: user-centred health information retrieval</article-title>
          .
          <source>In: CLEF'16</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Zuccon</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koopman</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palotti</surname>
          </string-name>
          , J.:
          <article-title>Diagnose this if you can: On the effectiveness of search engines in finding medical self-diagnosis information</article-title>
          .
          <source>In: Advances in Information Retrieval</source>
          , pp.
          <fpage>562</fpage>
          –
          <lpage>567</lpage>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>