<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">KISTI at CLEF eHealth 2017 Patient-Centered Information Retrieval Task-1: Improving Medical Document Retrieval with Query Expansion</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Heung-Seon</forename><surname>Oh</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Institute of Science and Technology Information</orgName>
								<orgName type="institution">Korea</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yuchul</forename><surname>Jung</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Institute of Science and Technology Information</orgName>
								<orgName type="institution">Korea</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">KISTI at CLEF eHealth 2017 Patient-Centered Information Retrieval Task-1: Improving Medical Document Retrieval with Query Expansion</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">C9F8FC6F271674685C8499A5A2B50E3B</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:28+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>language model</term>
					<term>feedback model</term>
					<term>query expansion</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this report, we describe our retrieval framework for participating in CLEF eHealth 2017 Patient-Centered Information Retrieval Task-1: Ad-hoc Search. Our retrieval framework is a query expansion approach which adopts relevance and pseudo relevance feedback to improve retrieval performance.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>This report summarizes our approaches to the CLEF eHealth 2017 <ref type="bibr" target="#b1">[2]</ref> Patient-Centered Information Retrieval Task-1, a standard ad-hoc search task <ref type="bibr" target="#b6">[7]</ref>. As in 2016, this task utilizes a large web corpus (ClueWeb12 B13) and topics developed by mining health web forums where users were seeking advice about specific symptoms, diagnoses, conditions, or treatments.</p><p>The main goal of the task is to improve the relevance assessment pool and the reusability of the collection. To meet this year's evaluation requirements, we explicitly exclude documents that were already assessed in 2016 from our search results. Meanwhile, to enhance the relevance of the search results, we utilize the already assessed documents in our proposed approaches, following the suggested guidelines.</p><p>Based on the above considerations, we designed a medical information retrieval framework characterized by relevance feedback for the initial search and query expansion for re-ranking.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Method</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Retrieval framework</head><p>Our proposed framework performs selective query expansion in the initial retrieval and then re-ranks the retrieved results with more accurate query expansion methods. Figure <ref type="figure">1</ref> shows an overview of our retrieval framework. First, we employ relevance feedback (RF) based on the relevance judgements built last year, since doing so is encouraged in order to improve both retrieval performance and the relevance assessment pool. For a query, a feedback model is constructed and combined with the original query model to produce a new query model. Second, an initial search is performed with the new query model and produces a set of documents from the collection. For the retrieved documents, we perform re-ranking with new queries built via two different query expansion methods.</p></div>
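The interpolation of the original query model with a feedback model described above can be sketched as follows. This is a minimal illustration, not our actual implementation: it assumes maximum-likelihood unigram query models and the 0.5 mixing weight from our evaluation settings, and all function names are illustrative.

```python
from collections import Counter

def query_model(query_terms):
    """Maximum-likelihood unigram model of the original query."""
    counts = Counter(query_terms)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def interpolate(p_query, p_feedback, alpha=0.5):
    """RM3-style mixture: alpha * P(w|Q) + (1 - alpha) * P(w|F).

    Both inputs are word -> probability dicts; the result is again a
    proper distribution when both inputs sum to one.
    """
    vocab = set(p_query) | set(p_feedback)
    return {w: alpha * p_query.get(w, 0.0) + (1 - alpha) * p_feedback.get(w, 0.0)
            for w in vocab}
```

Feedback words absent from the original query (e.g. terms contributed by the judged relevant documents) enter the new query model through the second mixture component.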
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Fig. 1. Overview of retrieval framework</head><p>As summarized above, our framework starts with relevance feedback to improve both retrieval performance and the relevance assessment pool. Given the set of documents judged relevant to a query, a relevance model, i.e. RM1 <ref type="bibr" target="#b3">[4]</ref>, is constructed with the documents scored by the KL-divergence method (KLD) <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b8">9]</ref>. There exist two differences compared to standard RM1, since our model is built from the relevance judgements. First, all judged relevant documents are involved in the feedback model because they are explicitly relevant. Second, the relevance scores are employed as document priors. Because of these differences, the resulting query model is expected to capture all of the relevant information in the judged set. Finally, a new query is constructed via RM3 <ref type="bibr" target="#b0">[1]</ref>. After that, the initial search is performed with the KLD method on the entire collection to obtain a set of retrieved documents, which are the target for re-ranking. Before re-ranking, two different query expansion techniques are considered based on this document set. The first query expansion approach adopts random-walk-based centrality scores <ref type="bibr" target="#b4">[5]</ref> with a different transition matrix. This strategy estimates the query model by considering the associations of words in a query. The major difference is that the association between two words w and u is computed using the two corresponding word vectors rather than co-occurrences. The word vectors are an accurate representation obtained through GloVe <ref type="bibr" target="#b7">[8]</ref>, an unsupervised learning algorithm for obtaining vector representations of words, i.e. so-called word embeddings. GloVe is known to outperform word2vec models on word similarity and named entity recognition tasks. The word vectors were computed on the TREC CDS 2016 collection <ref type="bibr" target="#b7">[8]</ref>, which contains about 1.2M biomedical journal articles; we expect these vectors to be more representative of the medical domain than vectors trained on other domains. Then, centrality scores are computed via a random walk on the transition matrix and regarded as a query model. Similar to RM3 above, a new query model is generated by combining the original query model and the centrality scores. Finally, the retrieved documents are re-ranked according to the new query model with the KLD method. The second query expansion approach follows the cluster-based external expansion model (CBEEM) <ref type="bibr" target="#b5">[6]</ref>, an advanced method for using external collections in pseudo-relevance feedback (PRF). The key idea of CBEEM is to estimate an accurate feedback model using not only the original collection but also other benchmark collections. Again, the TREC CDS 2016 collection was employed as the external collection. As a result, re-ranking is performed with the resulting new query.</p></div>
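The random-walk centrality step above can be sketched as follows. This is a simplified illustration, not our actual implementation: it assumes cosine similarities of the word vectors as transition weights, a PageRank-style damping factor of 0.85 (the paper does not specify one), and that every word has at least one positively similar neighbour; all names are illustrative.

```python
import numpy as np

def centrality_scores(vectors, damping=0.85, iters=100, tol=1e-10):
    """Random-walk (PageRank-style) centrality over a word graph whose
    transition probabilities come from cosine similarities of word vectors.

    vectors: dict mapping word -> embedding (list or array of floats).
    Returns a dict mapping word -> stationary probability.
    """
    words = list(vectors)
    V = np.array([vectors[w] for w in words], dtype=float)
    V /= np.linalg.norm(V, axis=1, keepdims=True)   # unit-length rows
    sim = np.clip(V @ V.T, 0.0, None)               # keep non-negative weights
    np.fill_diagonal(sim, 0.0)                      # no self-transitions
    # Row-normalize into a stochastic transition matrix
    # (assumes no row is all zeros, i.e. every word has a similar neighbour).
    P = sim / sim.sum(axis=1, keepdims=True)
    n = len(words)
    r = np.full(n, 1.0 / n)                         # uniform start
    for _ in range(iters):
        r_new = damping * (P.T @ r) + (1.0 - damping) / n
        if np.abs(r_new - r).sum() < tol:
            r = r_new
            break
        r = r_new
    r /= r.sum()
    return dict(zip(words, r))
```

Words tightly connected to many similar words accumulate probability mass, so domain-central query terms receive higher weights than isolated ones.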
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Data</head><p>Two different collections are used as the target and external collections, respectively. The target collection is ClueWeb12-Disk-B (ClueWeb12B), which includes about 52M web pages, while the external collection is TREC CDS 2016, which includes about 1.2M biomedical journal articles. In both collections, the text of the pages was extracted by removing HTML and XML tags with the JSOUP<ref type="foot" target="#foot_0">1</ref> parser. Table 1 shows the summary of the data statistics; the average document length is 850.9 tokens for ClueWeb12B and 4,511.9 tokens for TREC CDS 2016.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Evaluation Settings</head><p>All mixture weights for combining the query and feedback models are set to 0.5, and the Dirichlet prior is set to 2500. In relevance feedback (RF), the number of feedback words is set to 50, while the number of feedback documents corresponds to the number of relevant documents. In the two query expansion approaches, they are fixed at 5 and 50, respectively. Word vectors are estimated using GloVe with the ADAM optimizer, with a vector size of 200.</p></div>
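The Dirichlet prior above enters the KLD retrieval score used throughout the framework. A minimal, rank-equivalent sketch under our settings (mu = 2500), assuming unigram models; the function names are illustrative, not from our actual implementation:

```python
import math

def dirichlet_prob(word, doc_counts, doc_len, coll_prob, mu=2500):
    """P(w|D) with Dirichlet prior smoothing:
    (c(w; D) + mu * P(w|C)) / (|D| + mu)."""
    return (doc_counts.get(word, 0) + mu * coll_prob.get(word, 1e-12)) / (doc_len + mu)

def kld_score(query_model, doc_counts, coll_prob, mu=2500):
    """Rank-equivalent KL-divergence score: sum_w P(w|Q) * log P(w|D).

    query_model: word -> probability; doc_counts: word -> term frequency;
    coll_prob: word -> collection language-model probability.
    """
    doc_len = sum(doc_counts.values())
    return sum(p * math.log(dirichlet_prob(w, doc_counts, doc_len, coll_prob, mu))
               for w, p in query_model.items() if p > 0)
```

A document that actually contains the weighted query terms receives a higher score than one that relies only on the smoothed collection probabilities.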
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Submitted Runs</head><p>We submitted three runs for this task. Run1, considered our baseline, is the result of applying RF. Run2 and Run3 employ centrality scores and CBEEM, respectively. Table <ref type="table" target="#tab_2">2</ref> summarizes the three runs.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1 .</head><label>1</label><figDesc>Data Statistics</figDesc><table><row><cell></cell><cell>ClueWeb12B</cell><cell>TREC CDS 2016</cell></row><row><cell>#Docs</cell><cell>52,051,844</cell><cell>1,255,260</cell></row><row><cell>Voc. Size</cell><cell>20,139,450</cell><cell>2,938,617</cell></row><row><cell>Tokens</cell><cell>44,291,018,290</cell><cell>5,663,660,754</cell></row><row><cell>Avg. Doc. Len</cell><cell>850.9</cell><cell>4,511.9</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2 .</head><label>2</label><figDesc>Descriptions of our Submitted Runs</figDesc><table><row><cell>Run</cell><cell>Description</cell></row><row><cell>1</cell><cell>Relevance feedback (RF)</cell></row><row><cell>2</cell><cell>RF + Random-walk based centrality scores</cell></row><row><cell>3</cell><cell>RF + Cluster-based external expansion model</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://jsoup.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://sourceforge.net/p/lemur/galago/ci/default/tree/core/src/main/resources/stopwords/inquery</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">UMass at TREC 2004: Novelty and HARD</title>
		<author>
			<persName><forename type="first">N</forename><surname>Abdul-Jaleel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of Text REtrieval Conference (TREC)</title>
				<meeting>Text REtrieval Conference (TREC)</meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">CLEF 2017 eHealth Evaluation Lab Overview</title>
		<author>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF 2017 -8th Conference and Labs of the Evaluation Forum</title>
		<title level="s">Lecture Notes in Computer Science (LNCS</title>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">PageRank without hyperlinks: Structural re-ranking using links induced by language models</title>
		<author>
			<persName><forename type="first">O</forename><surname>Kurland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval -SIGIR &apos;05</title>
				<meeting>the 28th annual international ACM SIGIR conference on Research and development in information retrieval -SIGIR &apos;05<address><addrLine>New York, New York, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM Press</publisher>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="306" to="313" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Conceptual language models for domain-specific retrieval</title>
		<author>
			<persName><forename type="first">E</forename><surname>Meij</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Inf. Process. Manag</title>
		<imprint>
			<biblScope unit="volume">46</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="448" to="469" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A Multiple-Stage Approach to Re-ranking Medical Documents</title>
		<author>
			<persName><forename type="first">H.-S</forename><surname>Oh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of CLEF</title>
				<meeting>CLEF</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="166" to="177" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Cluster-based query expansion using external collections in medical information retrieval</title>
		<author>
			<persName><forename type="first">H.-S</forename><surname>Oh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Biomed. Inform</title>
		<imprint>
			<biblScope unit="volume">58</biblScope>
			<biblScope unit="page" from="70" to="79" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">CLEF 2017 Task Overview: The IR Task at the eHealth Evaluation Lab</title>
		<author>
			<persName><forename type="first">J</forename><surname>Palotti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Overview of the TREC 2016 Clinical Decision Support Track</title>
		<author>
			<persName><forename type="first">K</forename><surname>Roberts</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of The Twenty-Fifth Text REtrieval Conference</title>
				<meeting>The Twenty-Fifth Text REtrieval Conference<address><addrLine>TREC</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">2016. 2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Model-based feedback in the language modeling approach to information retrieval</title>
		<author>
			<persName><forename type="first">C</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lafferty</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the tenth international conference on Information and knowledge management</title>
				<meeting>the tenth international conference on Information and knowledge management<address><addrLine>New York, New York, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="403" to="410" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
