<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>IIIT-H at CLEF eHealth 2017 Task 2: Technologically Assisted Reviews in Empirical Medicine</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jaspreet Singh</string-name>
          <email>jaspreet.singh@research.iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lini Thomas</string-name>
          <email>lini.thomas@research.iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DSAC, IIIT Hyderabad</institution>
          ,
          <addr-line>Hyderabad</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>KCIS, IIIT Hyderabad</institution>
          ,
          <addr-line>Hyderabad</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Observational evidence in clinical practice is critical in healthcare and policy making. Researchers spend a lot of time searching for relevant published articles in order to write a systematic review of a topic. In this paper, we present our participation, as the team of IIIT Hyderabad, in Task 2: Technologically Assisted Reviews in Empirical Medicine, as an effort to automate this task and deliver relevant information from the medical literature. We base our approach on query expansion driven by relevance feedback. Query expansion is a standard technique in information retrieval, with growing use in medical literature [1, 2]. Articles returned by the PubMed query performed during a systematic review are first indexed using Lucene's inverted index. The query is processed for term boosting and fuzzy search, and is then used to score documents according to TF-IDF similarity. Relevance feedback is used to update the query so that it becomes more pragmatic.</p>
      </abstract>
      <kwd-group>
        <kwd>medical information retrieval</kwd>
        <kwd>relevance feedback</kwd>
        <kwd>query expansion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Diagnostic tests are critical to healthcare. Well-designed reviews of results from
diagnostic test accuracy (DTA) studies help decision making in the medical
domain [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. However, an enormous number of articles is published every year.
Information retrieval in medicine has attracted attention because of the significant
implications of evidence-based medicine and rapidly expanding medical libraries.
Automatic screening of medical literature will also help evolve retrieval techniques
applicable in other domains. CLEF eHealth Task 2 [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] is an effort
towards this purpose.
      </p>
      <p>
        We participate in Task 2: Technologically Assisted Reviews in Empirical
Medicine, which evaluates information retrieval of medical documents. The task
focuses on ranking and thresholding methods for DTA reviews. We propose a
system based on query expansion using fuzzy logic and relevance
feedback to retrieve relevant documents. Relevance feedback has been used earlier in various
information retrieval systems [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6–8</xref>
        ]. Fuzzy search makes the query flexible and helps
improve recall. Relevance feedback helps reconstruct the query to deal with any
ambiguous information need [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Thus, we use both techniques in our system.
      </p>
      <p>Each query is initially converted into a fuzzy query. The documents
pertaining to each topic are indexed using Lucene<sup>3</sup>. These indexed documents are
searched using the query provided by Cochrane experts. Lucene returns an initial set of
ranked documents; the query is then updated to include more terms from the relevant
documents in this set and to exclude terms from the irrelevant ones. Since the
initial ranking of a few documents gives high average precision, the idea is to
let unique terms be picked from them to better represent the query. The updated
query is used to rank the remaining documents.</p>
    </sec>
    <sec id="sec-methodology">
      <title>2 Methodology</title>
      <p>In this section we explain our methodology in detail. For simple evaluation runs,
we try to optimize recall by ranking approximately half of the documents.
However, for cost-effective measures, we stop when we do not find any query updates
or when the average precision in the last set of ranked results falls below a threshold (0.1
in most cases). A summary of the runs submitted to the task is shown in Table 1.</p>
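      <p>The stopping rule for the cost-effective runs can be illustrated with a short Python sketch. It is a minimal illustration under assumptions, not the code behind the submitted runs: the function names are hypothetical, relevance labels are taken to be binary, and average precision is normalized by the number of relevant documents found in the batch itself.</p>
      <preformat>
```python
def average_precision(relevance):
    """Average precision of one ranked batch of 0/1 relevance labels."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Assumption: normalize by the relevant documents seen in this batch.
    return precision_sum / hits if hits else 0.0


def should_stop(new_query_terms, last_batch_relevance, threshold=0.1):
    """Stop when feedback adds no terms or the last batch scores too low."""
    if not new_query_terms:
        return True
    return threshold > average_precision(last_batch_relevance)


print(should_stop(["biopsy"], [1, 0, 1, 0]))  # False
print(should_stop(["biopsy"], [0, 0, 0, 0]))  # True
print(should_stop([], [1, 1, 1, 1]))          # True
```
      </preformat>
      <p>The default threshold of 0.1 is the value quoted above; the sample term and labels are made up for the example.</p>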
    </sec>
    <sec id="sec-2">
      <title>2.1 Indexing</title>
      <p>We let Lucene index each topic's documents. Lucene breaks each document into
words to create an inverted index. This index maps each term to the set of
documents that contain it, and is later used for efficient search. To reduce noise
and false positives, we remove stop words from the documents at the time of
indexing. Lucene separates document information into fields. We create fields
for title, abstract, etc. from the PubMed documents, as the queries specify terms
along with the fields in which to search for them.</p>
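      <p>The idea of a fielded inverted index with stop-word removal can be sketched in plain Python. This is only an illustration of the data structure, not Lucene's implementation: the stop list, the tokenizer, the document identifiers and the sample texts are all assumptions made for the example.</p>
      <preformat>
```python
from collections import defaultdict

# Tiny illustrative stop list; a real analyzer ships a longer one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


def build_index(documents):
    """Map (field, term) to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, fields in documents.items():
        for field, text in fields.items():
            for term in tokenize(text):
                index[(field, term)].add(doc_id)
    return index


# Hypothetical documents with the two fields mentioned in the text.
docs = {
    "pmid-1": {"title": "Ultrasound for the diagnosis of dysplasia",
               "abstract": "Accuracy of ultrasound in infants."},
    "pmid-2": {"title": "Screening with MRI",
               "abstract": "Dysplasia was detected in a cohort."},
}
index = build_index(docs)
print(sorted(index[("title", "dysplasia")]))     # ['pmid-1']
print(sorted(index[("abstract", "dysplasia")]))  # ['pmid-2']
```
      </preformat>
      <p>Keying the index by field and term is what allows a query to restrict a term to the title or to the abstract.</p>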
    </sec>
    <sec id="sec-3">
      <title>2.2 Query Reformulation</title>
      <p>The queries provided by the Cochrane experts vary in length and contain complex
Boolean logic. We use a fuzzy search system to expand them. The system allows
terms close to the base term to be included in the expanded query. For
example, a search term like "dysplasia" also matches terms like "dysplastic" and
"dysplasias". Although the OVID MEDLINE search syntax already allows some
regex-like patterns in the query, we pass every term through the fuzzy search system
before adding it to the expanded query.</p>
      <p><sup>3</sup> https://lucene.apache.org/core/</p>
      <p>After building the document index and reformulating the query, we make use of the
TF-IDF scoring model. The vector space model lets us reweight search terms quickly
and uses cosine similarity between document and query. Four
scoring factors are incorporated: tf, idf, coord and lengthNorm, where
coord reflects the number of query terms that were found in the document and
lengthNorm is a measure of the importance of a term according to the total
number of terms in the field.</p>
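      <p>The two steps of this section, fuzzy expansion followed by TF-IDF scoring, can be made concrete with a Python sketch. It is an illustration under assumptions rather than the submitted system: the function names, the toy vocabulary and the document frequencies are hypothetical; the limit of two edits mirrors the maximum edit distance supported by Lucene's fuzzy queries; and the formulas follow Lucene's classic TF-IDF similarity as we understand it (tf as the square root of the term frequency, idf as 1 + ln(N / (df + 1)), lengthNorm as 1 / sqrt(field length), coord as the fraction of query terms matched), with query normalization and boosts left out.</p>
      <preformat>
```python
import math


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def fuzzy_expand(term, vocabulary, max_edits=2):
    """Indexed terms within max_edits of the base term."""
    return sorted(w for w in vocabulary if max_edits >= edit_distance(term, w))


def classic_score(query_terms, doc_terms, doc_freq, num_docs):
    """TF-IDF score of one field of one document for a bag of query terms."""
    unique_terms = set(query_terms)
    if not doc_terms or not unique_terms:
        return 0.0
    length_norm = 1.0 / math.sqrt(len(doc_terms))  # shorter fields weigh more
    matched = 0
    total = 0.0
    for term in unique_terms:
        freq = doc_terms.count(term)
        if freq == 0:
            continue
        matched += 1
        tf = math.sqrt(freq)
        idf = 1.0 + math.log(num_docs / (doc_freq.get(term, 0) + 1))
        total += tf * idf * idf * length_norm
    coord = matched / len(unique_terms)  # fraction of query terms matched
    return coord * total


vocabulary = {"dysplasia", "dysplasias", "dysplastic", "diagnosis", "ultrasound"}
expanded = fuzzy_expand("dysplasia", vocabulary)
print(expanded)  # ['dysplasia', 'dysplasias', 'dysplastic']

doc_freq = {"dysplasia": 1, "dysplastic": 1, "ultrasound": 3}
doc_a = ["ultrasound", "diagnosis", "dysplastic", "hip"]
doc_b = ["ultrasound", "screening", "cohort", "outcome"]
query = expanded + ["ultrasound"]
score_a = classic_score(query, doc_a, doc_freq, num_docs=4)
score_b = classic_score(query, doc_b, doc_freq, num_docs=4)
print(score_a > score_b)  # True
```
      </preformat>
      <p>In this toy example the first document never contains the literal term "dysplasia"; it outranks the second only because the expanded query also matches "dysplastic", which is the recall gain that fuzzy expansion is meant to provide.</p>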
      <p>Initially, we request a small set of ranked and scored documents from Lucene.
This initial set is inspected for relevance. We found from our experiments on the
training data that about half of this set is relevant. Let rd be the set of relevant
documents and nrd be the set of non-relevant documents in the initial ranking.
The top occurring terms from rd are appended to the search query with Boolean OR,
and the top occurring terms from nrd with Boolean NOT, provided that they do not
already occur in the query. To avoid overpopulating the query with terms and
drifting away from the desired result, we restrict the number of new terms to five percent
of the average article size. Once updated, the new query is used to rank the remaining
documents.</p>
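      <p>One round of this query update can be sketched in Python as follows. The sketch rests on several assumptions that the description above leaves open: documents are represented as lists of terms, article size is measured in terms, the cap is applied separately to the added and to the negated terms, and a term that also occurs in rd is never negated. All function names and sample data are hypothetical.</p>
      <preformat>
```python
from collections import Counter


def new_term_budget(docs, fraction=0.05):
    """Cap on new terms: a fraction of the average document length."""
    average_length = sum(len(doc) for doc in docs) / len(docs)
    return max(1, int(fraction * average_length))


def top_terms(docs, exclude, limit):
    """Most frequent terms across docs, skipping excluded ones."""
    counts = Counter(term for doc in docs for term in doc)
    # Sort by descending count, then alphabetically, for a stable order.
    ranked = sorted(counts, key=lambda term: (-counts[term], term))
    return [term for term in ranked if term not in exclude][:limit]


def update_query(query_terms, rd, nrd, limit):
    """One feedback round: returns (terms to OR in, terms to NOT out)."""
    known = set(query_terms)
    add = top_terms(rd, exclude=known, limit=limit)
    # Assumption: never negate a term that relevant documents also use.
    relevant_vocab = {term for doc in rd for term in doc}
    remove = top_terms(nrd, exclude=known | relevant_vocab, limit=limit)
    return add, remove


query = ["dysplasia", "ultrasound"]
rd = [["dysplasia", "hip", "infant", "ultrasound"], ["hip", "infant", "screening"]]
nrd = [["ultrasound", "cardiac", "adult"], ["cardiac", "hip"]]
add, remove = update_query(query, rd, nrd, limit=2)
print(add)     # ['hip', 'infant']
print(remove)  # ['cardiac', 'adult']
print(new_term_budget([["t"] * 60, ["t"] * 20]))  # 2
```
      </preformat>
      <p>In the example, "hip" occurs in a non-relevant document as well, but it is kept out of the negated terms because relevant documents use it too.</p>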
      <p>We boost a term for scoring if it occurs in rd over multiple iterations. Such a
term gets n times the weight of any other term if it occurs again in the nth
iteration. Incorporating this, we found that although we provide only binary
relevance feedback, our system has the advantages of graded feedback. The relevance
feedback system is applied to queries with more than 1500 documents.
Apart from the submitted runs, we found that this technique was also effective on
queries having fewer documents.</p>
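      <p>The boosting rule admits more than one reading; the Python sketch below follows the literal one, in which a term that was already seen in rd and occurs again in iteration n receives weight n, while a first-time term keeps weight 1. The function names and sample terms are hypothetical, and the caret notation is borrowed from the boost syntax of Lucene's classic query parser.</p>
      <preformat>
```python
def boosts_per_iteration(rounds):
    """For each iteration n, boost terms already seen in rd by a factor n."""
    seen = set()
    history = []
    for n, terms in enumerate(rounds, start=1):
        history.append({term: (n if term in seen else 1) for term in terms})
        seen.update(terms)
    return history


def to_boosted_clause(boosts):
    """Render boosts with the caret syntax; a weight of 1 stays implicit."""
    parts = [f"{term}^{b}" if b > 1 else term for term, b in boosts.items()]
    return " OR ".join(parts)


# Top rd terms picked in three successive feedback iterations (made up).
rounds = [["hip", "infant"], ["hip", "screening"], ["hip", "infant"]]
history = boosts_per_iteration(rounds)
print(to_boosted_clause(history[0]))  # hip OR infant
print(to_boosted_clause(history[1]))  # hip^2 OR screening
print(to_boosted_clause(history[2]))  # hip^3 OR infant^3
```
      </preformat>
      <p>Because the weight grows with the iteration in which a term recurs, repeated binary judgments accumulate into something close to a graded signal.</p>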
    </sec>
    <sec id="sec-4">
      <title>2.4 Results</title>
      <p>We submitted eight runs for this task: four for simple evaluation
and four for cost-based evaluation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Zhenyu</given-names>
            <surname>Liu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Wesley W.</given-names>
            <surname>Chu</surname>
          </string-name>
          .
          <article-title>Knowledge-based query expansion to support scenario-specific retrieval of medical free text</article-title>
          .
          <source>Information Retrieval</source>
          ,
          <volume>10</volume>
          (
          <issue>2</issue>
          ):
          <fpage>173</fpage>
          –
          <lpage>202</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.C.</given-names>
            <surname>Díaz-Galiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.T.</given-names>
            <surname>Martín-Valdivia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.A.</given-names>
            <surname>Ureña-López</surname>
          </string-name>
          .
          <article-title>Query expansion with a medical ontology to improve a multimodal information retrieval system</article-title>
          .
          <source>Computers in Biology and Medicine</source>
          ,
          <volume>39</volume>
          (
          <issue>4</issue>
          ):
          <fpage>396</fpage>
          –
          <lpage>403</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Gah Juan</given-names>
            <surname>Ho</surname>
          </string-name>
          , Su May Liew, Chirk Jenn Ng, Ranita Hisham Shunmugam, and
          <string-name>
            <given-names>Paul</given-names>
            <surname>Glasziou</surname>
          </string-name>
          .
          <article-title>Development of a search strategy for an evidence based retrieval service</article-title>
          .
          <source>PLOS ONE</source>
          ,
          <volume>11</volume>
          (
          <issue>12</issue>
          ):
          <fpage>1</fpage>
          –
          <lpage>14</lpage>
          , December
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Hanna</given-names>
            <surname>Suominen</surname>
          </string-name>
          , Liadh Kelly, Lorraine Goeuriot, Evangelos Kanoulas, Rene Spijker, Aurelie Neveol, Guido Zuccon, and
          <string-name>
            <given-names>João R. M.</given-names>
            <surname>Palotti</surname>
          </string-name>
          .
          <article-title>Overview of the CLEF eHealth evaluation lab 2017</article-title>
          . In
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction - 8th International Conference of the CLEF Association, CLEF 2017, Dublin, Ireland, September 11-14, 2017, Proceedings</source>
          , Lecture Notes in Computer Science. Springer,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Evangelos</given-names>
            <surname>Kanoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dan</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Leif</given-names>
            <surname>Azzopardi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Rene</given-names>
            <surname>Spijker</surname>
          </string-name>
          .
          <article-title>Overview of the CLEF technologically assisted reviews in empirical medicine</article-title>
          .
          In
          <source>Working Notes of CLEF 2017 - Conference and Labs of the Evaluation forum, Dublin, Ireland, September 11-14, 2017</source>
          , CEUR Workshop Proceedings. CEUR-WS.org,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Pragati</given-names>
            <surname>Bhatnagar</surname>
          </string-name>
          and
          <string-name>
            <given-names>Narendra</given-names>
            <surname>Pareek</surname>
          </string-name>
          .
          <article-title>Improving pseudo relevance feedback based query expansion using genetic fuzzy approach and semantic similarity notion</article-title>
          .
          <source>Journal of Information Science</source>
          ,
          <volume>40</volume>
          (
          <issue>4</issue>
          ):
          <fpage>523</fpage>
          –
          <lpage>537</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Jagendra</given-names>
            <surname>Singh</surname>
          </string-name>
          and
          <string-name>
            <given-names>Aditi</given-names>
            <surname>Sharan</surname>
          </string-name>
          .
          <article-title>Relevance feedback based query expansion model using borda count and semantic similarity approach</article-title>
          .
          <source>Intell. Neuroscience</source>
          ,
          <volume>2015</volume>
          :
          <fpage>96:96</fpage>
          –
          <lpage>96:96</lpage>
          , January
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Paul Alexandru</given-names>
            <surname>Chirita</surname>
          </string-name>
          , Claudiu S. Firan, and
          <string-name>
            <given-names>Wolfgang</given-names>
            <surname>Nejdl</surname>
          </string-name>
          .
          <article-title>Personalized query expansion for the web</article-title>
          .
          In
          <source>Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '07</source>
          , pages
          <fpage>7</fpage>
          –
          <lpage>14</lpage>
          , New York, NY, USA,
          <year>2007</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Ian</given-names>
            <surname>Ruthven</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mounia</given-names>
            <surname>Lalmas</surname>
          </string-name>
          .
          <article-title>A survey on the use of relevance feedback for information access systems</article-title>
          .
          <source>Knowl. Eng. Rev.</source>
          ,
          <volume>18</volume>
          (
          <issue>2</issue>
          ):
          <fpage>95</fpage>
          –
          <lpage>145</lpage>
          , June
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>