
CSKU GPRF-QE for Medical Topic Web Retrieval

Ornuma Thesprasith

ornuma.thesprasith@gmail.com

Chuleerat Jaruskulchai

Department of Computer Science, Faculty of Science, Kasetsart University, Bangkok, Thailand


Patients and their relatives increasingly have access to their health information in the form of discharge summaries, but most of them do not fully understand the contents. The ShARe/CLEF eHealth Evaluation Lab organized a shared task on improving the retrieval of medical information from the web, with queries formulated from information in discharge summaries. This paper investigates the effectiveness of query expansion using an external collection. Co-occurring terms in the pseudo-relevance feedback set retrieved from the Genomics collection are selected and re-weighted with Rocchio's formula, using dynamically tuned parameters for the pseudo-relevance part. Lucene, a vector space model retrieval tool, serves as the baseline. The proposed expansion method improves on the baseline at all nDCG cut-off levels and achieves the best P@10 performance on 3 topics. Using a biomedical collection such as Genomics is therefore useful for medical topic retrieval.

Keywords: Genomics Track 2004, pseudo-relevance feedback, re-weighting scheme, medical terminology retrieval

Most patients or their relatives may have questions when reading their discharge summary, because medical terminology is highly domain-specific and hard for laypeople to understand. The ShARe/CLEF eHealth Evaluation Lab was established to help these users better comprehend health information [1]. In particular, Task 3: User-centred health information retrieval [2] focuses on a web collection, since search engines are commonly used to find further explanation of medical subjects. The expected results should be understandable by general users and come from reliable sources. This means that relevant web pages contain medical terminology together with general terms or common words that explain the medical terms in more detail.

Query expansion techniques are widely used to improve retrieval performance. Several factors affect expansion results: the source of expansion, term selection, and the re-weighting method. Expansion terms may come from many sources, such as the local collection, an external standard collection, or an ontology. External collections such as English Wikipedia and TREC (disks 1-5) are used for expansion as reported in [3]. Reliable and frequently used biomedical ontologies are the UMLS Metathesaurus [4], the MeSH ontology [5], and SNOMED CT [6].

The work in [7] proposed a method for selecting the most effective expansion source based on query performance prediction. The objective of this technique is to estimate the performance of a retrieval system without relevance judgments [8]. It either analyzes the collection without retrieval or examines the returned results [9]. Query performance prediction can also estimate the degree of relatedness between different collections. We follow this idea to select a source for expansion from among med [3], OHSUMED [10], and the Genomics collection [11].

Expansion terms from an external collection should be similar to the indexing terms of the local collection. Our previous work [12] used the MeSH (Medical Subject Headings) terms of the local collection (OHSUMED) for expansion based on the pseudo-relevance feedback (PRF) method. Since users of the MEDLINE collection need information describing diseases and treatments, expanding queries with medical vocabulary can be beneficial there. In contrast, the ShARe/CLEF Task 3a [2] queries are specific medical terms from discharge summaries, whereas the collection contains health information web pages for laypeople. We believe that there is a gap between the specific medical terms in users' queries and the general words used in relevant web pages. To expand the medical terms in a query, we therefore select terms from the title and abstract parts instead of the controlled medical vocabulary part (MeSH terms). We expect that candidate terms derived in this way appear more often in relevant web pages and thus boost retrieval scores.

The work in [13] expanded queries based on the pseudo-relevance feedback (PRF) method and adapted Rocchio's formula for re-weighting terms. We adapt the PRF method in a different way, using the results of external PRF instead of the local PRF used in the traditional PRF paradigm, and we adapt the re-weighting formula accordingly. The details of our expansion method and results are described in the following sections.

2 Method

2.1 Vector Space Model Method and Tool

The collection is represented by a matrix of terms and documents in which each row represents a document and consists of weighted term values. The query is represented by a vector of weighted term values in the same way as a document vector. The similarity between the query and each document vector is used to rank the returned results. The classical vector similarity measure is cosine similarity, defined as follows.

$$\mathrm{sim}(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{\lVert \vec{q} \rVert \cdot \lVert \vec{d} \rVert} \qquad (1)$$
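As a minimal sketch of cosine ranking, the snippet below scores two toy documents against a query. The sparse dictionary representation, the term weights, and the document names are illustrative assumptions, not data from the experiments.

```python
import math


def cosine_similarity(query_vec, doc_vec):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    shared = set(query_vec) & set(doc_vec)
    dot = sum(query_vec[t] * doc_vec[t] for t in shared)
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (q_norm * d_norm)


# Toy data (hypothetical weights, for illustration only).
query = {"hepatic": 1.0, "encephalopathy": 1.0}
docs = {
    "d1": {"hepatic": 2.0, "encephalopathy": 1.0, "liver": 1.0},
    "d2": {"liver": 3.0, "diet": 1.0},
}

# Rank documents by decreasing similarity to the query.
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked)
```

Because the measure normalizes by vector length, a long document is not favored merely for containing more terms.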

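Lucene is a vector space model retrieval tool [14]. It implements the cosine similarity measure in a more sophisticated way. The Lucene similarity measure is defined as follows.

$$\mathrm{Score}(q, d) = \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \times \mathrm{idf}(t)^2 \times \mathrm{Boost}(t) \times \mathrm{Norm}(t, d) \Big) \times \mathrm{Coord}(q, d) \times \mathrm{QueryNorm}(q) \qquad (2)$$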

Lucene allows the user to boost individual query terms with a specific weight via the caret mark ("^"), for example, "hepatic^3.0 encephalopathy^4.6 liver". These boost weights are used in the scoring function and normalized by the QueryNorm(q) function.
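The boosted query string in the example can be assembled from a list of term weights. The helper below is a hypothetical illustration (the name `build_boosted_query` is not part of Lucene). It assumes plain single-word terms, so it does no escaping of Lucene's special query characters.

```python
def build_boosted_query(term_weights):
    """Format (term, weight) pairs using Lucene's caret boost syntax.

    Terms with weight 1.0 are emitted without a boost, since 1.0 is the default.
    Assumes terms contain no characters that need escaping in the query syntax.
    """
    parts = []
    for term, weight in term_weights:
        if weight == 1.0:
            parts.append(term)
        else:
            parts.append(f"{term}^{weight:.1f}")
    return " ".join(parts)


query_string = build_boosted_query(
    [("hepatic", 3.0), ("encephalopathy", 4.6), ("liver", 1.0)]
)
print(query_string)  # hepatic^3.0 encephalopathy^4.6 liver
```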

2.2 Genomics Pseudo-Relevance Feedback Expansion Method

Indexing Process.

In the current work, the documents are web pages collected from many medical-related resources [2]. Queries for this collection are formulated from medical terminology in discharge summaries. We use 5 training queries to choose among three indexing methods: a) all documents (web pages as raw data), b) documents with HTML tags removed (some pages are missing), and c) documents with HTML tags removed, with missing pages replaced by the original pages. In our preliminary experiment, we evaluate MAP using the training relevance judgments. The results are mixed and inconsistent. We therefore select the compensating indexing method (c) to avoid losing web pages that might otherwise be underestimated.

Expansion Source Selection.

The original purpose of query performance prediction (QPP) is to estimate the performance of a retrieval system without relevance judgments [8]. The technique either analyzes the collection without retrieval or examines the returned results [9]. The work in [7] used QPP to select appropriate expansion sources by comparing the average term frequency of the query in the local and external collections. Our work uses a simpler method: we compare the number of documents returned by retrieval in three standard TREC sub-collections, namely med [3], OHSUMED [10], and Genomics 2004 [11]. For the 5 training queries, the Genomics collection returns more documents than OHSUMED and med. In the current work, we assume that more returned documents provide more useful expansion terms. Even though the Genomics collection is based on genomics information, we expect the biology terms in genomics-based documents to be related to medical terminology.

Term Selection.

We hypothesize that relevant documents contain more general terms that are easy for laypeople to understand. Expansion terms could therefore bind the specific terms in queries to the general terms in web pages. We select for expansion the terms that co-occur most often in the Genomics-PRF set. Term selection proceeds in the following steps. First, we retrieve from the Genomics collection (indexed on title and abstract). Second, the top-k documents that contain any query term are included in the Genomics-PRF set. Third, terms in the title and abstract parts of this set are selected by term frequency as the candidate set.

Since candidate terms are derived only from the Genomics collection (the Genomics-PRF set), they can be either redundant with query terms or newly added terms. Each candidate term should be assigned a different weight depending on which of these cases applies.
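The three selection steps can be sketched as follows. This is an illustration under stated assumptions: the ranked list stands in for the output of a Genomics retrieval run, each document is its title and abstract as one string, and the tokenizer, stopword list, and function names are hypothetical simplifications rather than the actual Lucene analysis chain.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def select_candidates(query, ranked_docs, k, m, stopwords=frozenset()):
    """Return the m most frequent terms in the top-k documents sharing a query term."""
    query_terms = set(tokenize(query))
    # Step 2: keep the top-k ranked documents that contain any query term.
    prf_set = [d for d in ranked_docs if query_terms & set(tokenize(d))][:k]
    # Step 3: count term frequency over the title and abstract text of the set.
    tf = Counter()
    for doc in prf_set:
        tf.update(t for t in tokenize(doc) if t not in stopwords)
    return tf.most_common(m)


# Step 1 is assumed done: a toy ranked list of "title. abstract" strings.
ranked_docs = [
    "Anoxia and brain damage. Cerebral anoxia reduces oxygen supply to the brain.",
    "Gene expression in yeast. Unrelated abstract about yeast.",
    "Brain oxygen levels. Low oxygen causes anoxia in brain tissue.",
]
stop = frozenset({"and", "the", "to", "in", "of", "about"})
candidates = select_candidates("anoxic brain injury", ranked_docs, k=2, m=3, stopwords=stop)
print(candidates)
```

In this toy run the candidate set contains both a term already in the query ("brain") and new terms ("anoxia", "oxygen"), which are the two cases that the re-weighting step treats differently.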

Re-weighting Method.

Rocchio's formula is widely used in PRF-based query expansion re-weighting schemes. The formula is composed of three parts: the original weight, the relevance-based weight, and the non-relevance-based weight.

$$W'_Q = \alpha W_Q + \frac{\beta}{|D_R|} \sum_{d_r \in D_R} d_r + \frac{\gamma}{|D_N|} \sum_{d_n \in D_N} d_n \qquad (3)$$

where $\alpha$ is the tunable weight of the initial query, $W_Q$ is the weight of a term in the initial query, $\beta$ is the tunable weight of the relevant documents ($d_r$), $|D_R|$ is the number of relevant documents, $\gamma$ is the tunable weight of the non-relevant documents ($d_n$), and $|D_N|$ is the number of non-relevant documents.

The traditional pseudo-relevance re-weighting formula replaces the relevant part with the pseudo-relevance part and ignores the non-relevant part by setting $\gamma$ to 0. It is defined as follows,

$$W'_Q = \alpha W_Q + \beta \times W_{PRF} \qquad (4)$$

where $W_{PRF}$ is the weight of a term in the pseudo-relevance documents.
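A minimal sketch of the traditional pseudo-relevance re-weighting, assuming the pseudo-relevance weight of a term is its average weight over the feedback documents. The values of `alpha` and `beta` and the toy vectors are illustrative assumptions, not settings used in this paper.

```python
def rocchio_prf(query_weights, prf_docs, alpha=1.0, beta=0.5):
    """Re-weight query terms as alpha * W_Q + beta * W_PRF (gamma fixed to 0).

    query_weights: dict of term -> weight in the initial query.
    prf_docs: list of dicts (term -> weight), one per pseudo-relevant document.
    """
    # W_PRF: centroid of the pseudo-relevant document vectors.
    centroid = {}
    for doc in prf_docs:
        for term, weight in doc.items():
            centroid[term] = centroid.get(term, 0.0) + weight / len(prf_docs)
    # Combine the initial query weight and the pseudo-relevance weight per term.
    terms = set(query_weights) | set(centroid)
    return {
        t: alpha * query_weights.get(t, 0.0) + beta * centroid.get(t, 0.0)
        for t in terms
    }


new_weights = rocchio_prf(
    {"a": 1.0},
    [{"a": 2.0}, {"a": 2.0, "b": 4.0}],
    alpha=1.0,
    beta=0.5,
)
print(new_weights)
```

Terms that occur only in the feedback documents (here "b") enter the query with a weight that depends on beta alone.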

Our method divides the original query terms into two parts according to their appearance in the candidate set: non-candidate terms ($W_{NQ}$) and candidate terms ($W_{CQ}$). These two parts have corresponding tunable parameters $\alpha_1$ and $\alpha_2$, respectively. Our pseudo-relevance part uses the top-k documents returned from the Genomics collection, whose terms are defined as Genomics-PRF terms ($W_{GPRF}$). We do not use pseudo-relevance feedback from the local collection. The re-weighting formula is defined as follows,

$$W'_Q = \alpha_1 W_{NQ} + \alpha_2 W_{CQ} + \beta W_{GPRF} \qquad (5)$$

where $W_{NQ}$ is the weight of an initial query term that does not appear in the PRF set, $\alpha_1$ is the tunable parameter for an initial query term that does not appear in the PRF set, $W_{CQ}$ is the weight of an initial query term that appears in the PRF set, $\alpha_2$ is the tunable parameter for an initial query term that appears in the PRF set, $W_{GPRF}$ is the weight of a Genomics expansion term in the PRF set, and $\beta$ is the dynamic tunable parameter for a new expansion term in the PRF set.

We assign a larger value to original query terms that do not appear in the external expansion source, to prevent query drift. We use the external resource to find new terms that increase recall. If a candidate term is also an original query term, we set the offset value for $W_{CQ}$ lower than that for $W_{NQ}$ and use the term frequency to boost the weight up from the offset. This approach reduces the effect of over-weighting terms.

Query logs are useful information for relevance judgment [15]. We assume that the results from the training queries act as a query log. We use the training queries and their relevance judgments to set the tunable parameter values, using term frequency in the Genomics-PRF set. The derived value of each weight is 1.0, and the tunable parameter values ($\alpha_1$, $\alpha_2$, $\beta$) are 3.0, $(2.0 + \log_2(tf))$ and $(0.5 + \log_2(tf))$, respectively, where $tf$ is the term frequency in the Genomics-PRF set. These settings were chosen heuristically and perform well on the training set. We expect that they will work well on the test query set also.
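The parameter settings can be illustrated with a short sketch. It assumes that the frequency-dependent part of the parameters is the base-2 logarithm of the term frequency in the Genomics-PRF set and that every base weight is 1.0. The function name, the toy query, and the frequencies are hypothetical.

```python
import math

ALPHA_1 = 3.0  # query terms that do not appear in the Genomics-PRF candidate set


def gprf_reweight(query_terms, candidate_tf, base_weight=1.0):
    """Assign a boost to each query term and each new expansion term.

    candidate_tf maps each candidate term to its frequency (>= 1) in the PRF set.
    The log2(tf) form is an assumption made for this sketch.
    """
    weights = {}
    for term in query_terms:
        if term in candidate_tf:
            # Candidate query term: lower offset (2.0), boosted by frequency.
            weights[term] = (2.0 + math.log2(candidate_tf[term])) * base_weight
        else:
            # Non-candidate query term: fixed, larger value to limit query drift.
            weights[term] = ALPHA_1 * base_weight
    for term, tf in candidate_tf.items():
        if term not in weights:
            # New expansion term: smallest offset (0.5), boosted by frequency.
            weights[term] = (0.5 + math.log2(tf)) * base_weight
    return weights


weights = gprf_reweight(
    ["hepatic", "encephalopathy"],
    {"encephalopathy": 6, "liver": 4},
)
print({t: round(w, 1) for t, w in weights.items()})
```

With these toy frequencies a new expansion term stays below both kinds of original query term, which is the ordering the offsets are meant to encourage at comparable frequencies.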

3 Experiments

3.1 Experimental Setup

The collection consists of 8 .zip files [2]. The HTML content of each web page lies between the "#UID" and "#EOR" tags. In total there are 875,486 HTML pages. Jsoup [16] is an HTML parser that we use to extract the major content of a web page, such as the title, description, keywords, headers, and bold and strong text. After this content extraction, some files are very small (less than 200 bytes). Excluding these leaves 460,279 qualified files. In our preliminary experiment, we indexed the collection in three ways: a) raw HTML (whole collection), b) major content only, without HTML tags (460,279 files), and c) major content with missing files replaced by the raw HTML (whole collection).

Lucene version 4 is the indexing and retrieval tool. We use Lucene's StandardAnalyzer to index the three collection types [14]. We retrieved the 5 training topics and evaluated them with the training relevance judgments. Since the MAP results are mixed, we index with the compensating method (type c) to avoid losing web pages that might otherwise be underestimated.

Our work focuses on finding a suitable source for query expansion. In a preliminary step, we compare the results returned when retrieving the 5 training queries. The preliminary results show that the Genomics 2004 collection returns the largest number of documents for all training queries. This collection contains more biomedical terms and gene information, so we believe that the returned documents are likely to contain more related medical terms.

We assume that the keywords or information needs of users are similar to the keywords used in the training queries. This paper investigates the effectiveness of using the Genomics collection to expand medical topic queries. In the preliminary experiment, we retrieve the 5 training queries and vary the number of documents (top k) in the Genomics-PRF set and the number of expansion terms (top m) according to equation (5). Based on the MAP results of these variations, we found that the optimized values for top-k and top-m are 19 documents and 8 terms, respectively. We expect that the test queries do not differ from the training queries used to set these parameters. In the expansion process, candidate terms are terms that co-occur with query terms in the same document of the pseudo-relevance feedback (PRF) set.

3.2 Remarks before Discussion

Our official baseline is missing the result for topic no. 50 because of a program error, which makes the evaluated baseline performance lower than it should be. We therefore produced a corrected baseline (including the returned results for topic no. 50) and re-evaluated retrieval performance. The MAP values for the corrected baseline run and the expansion run are 0.1820 and 0.2076, respectively.

3.3 Results and Discussion

The results of all runs are reported in this section. Table 1 details the nDCG comparison. At every nDCG cut-off level, the expansion run scores higher than both baselines (the official and the corrected version). This suggests that terms in Genomics documents also occur in relevant web pages and that re-weighting these terms affects the result ranking. Other trec_eval metrics for our runs are shown in Table 2.

For precision at 10 (P@10), our baseline run is above the median on 10 topics, whereas the expansion run is above the median on 14 topics. The proposed expansion achieves the best performance on 3 of the 50 topics (topics 4, 9, and 17). For these topics, the terms in the pseudo-relevance feedback set are closely related to the main keyword, such as "anoxic" vs. "anoxia", "pneumonia" vs. "lung", and "duodenal" vs. "gastric". These expansion terms are very helpful.
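For readers unfamiliar with the two metrics, the sketch below computes P@k and nDCG@k for a single ranked list of relevance grades. It is a simplification: the ideal ranking is derived from the retrieved list only, whereas an official evaluation uses all judged documents for the topic, and the grades shown are invented for illustration.

```python
import math


def precision_at_k(ranked_rels, k=10):
    """Fraction of the top-k results with a relevance grade above zero."""
    return sum(1 for rel in ranked_rels[:k] if rel > 0) / k


def dcg_at_k(ranked_rels, k):
    """Discounted cumulative gain: the grade at rank r is divided by log2(r + 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_rels[:k]))


def ndcg_at_k(ranked_rels, k):
    """DCG normalized by the DCG of the best possible ordering of the same list."""
    ideal = dcg_at_k(sorted(ranked_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance grades of the first ten results for one topic.
rels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(precision_at_k(rels, 10), round(ndcg_at_k(rels, 10), 4))
```

P@10 ignores the order of results within the top ten, while nDCG rewards placing relevant pages nearer the top, so the two metrics can rank runs differently.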

The expansion run improves on the official baseline for 8 topics, whereas the official baseline outperforms the expansion run on 4 topics, as shown in the following figures.

Fig. 1. nDCG of the baseline runs compared with the expansion run

Medical terms in discharge summaries are difficult for laypeople because they are highly domain-specific terminology. Retrieval with queries constructed from a discharge summary returns web pages that are too specific, and users still need more explanation and information about the subject.

We believe that relevant web pages contain both medical terminology and general terms. We use query expansion to find useful terms and increase the chance of retrieving more relevant documents. Our query expansion approach is based on pseudo-relevance feedback using an external biological (genomics literature) collection. We use the training queries and training relevance judgments to set the optimized parameters for our proposed expansion method.

The important issues for query expansion are the source of terms, the type of terms used for expansion, and the re-weighting scheme. We determine the expansion source based on the query performance prediction technique, estimating the usefulness of an external collection from the size of the returned set. Since the biomedical references in the Genomics collection contain disease and related-gene information, terms in these references are selected and re-weighted based on their frequency in the PRF set. Although we use only statistical information from the pseudo-relevance feedback set, the proposed method shows a MAP improvement over the baseline.

In future work, we will investigate more sophisticated criteria for selecting an external collection for query expansion and experiment with various external collections.

References

1. L., Velupillai, S., Chapman, W. W., Martinez, D., Zuccon, G., and Palotti, J.: Overview of the ShARe/CLEF eHealth Evaluation Lab 2014. Springer (2014)
2. Goeuriot, L., Kelly, L., Li, W., Palotti, J., Pecina, P., Zuccon, G., Hanbury, A., Jones, G., and Mueller, H.: ShARe/CLEF eHealth Evaluation Lab 2014, Task 3: User-centred health information retrieval. In CLEF 2014. (2014)
3. Voorhees, E. M., and Harman, D.: Overview of the Fifth Text REtrieval Conference (TREC-5). In TREC. (1996)
4. Unified Medical Language Systems, http://www.nlm.nih.gov/research/umls
5. The Basics of Medical Subject Headings (MeSH®), http://www.nlm.nih.gov/bsd/disted/mesh/
6. SNOMED Clinical Terms® (SNOMED CT®), http://www.nlm.nih.gov/research/umls/Snomed/snomed_main.html
7. He, B., and Ounis, I.: Combining fields for query expansion and adaptive query expansion. 43, 1294-1307 (2007)
8. Kurland, O., Raiber, F., and Shtok, A.: Query-performance prediction and cluster ranking: Two sides of the same coin. In Proceedings of the 21st ACM international conference on Information and knowledge management. pp. 2459-2462. ACM (2012)
9. Cummins, R., Jose, J., and O'Riordan, C.: Improved query performance prediction using standard deviation. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. pp. 1089-1090. ACM (2011)
10. Hersh, W., Buckley, C., Leone, T. J., and Hickam, D.: OHSUMED: An Interactive Retrieval Evaluation and New Large Test Collection for Research. In SIGIR '94, Croft and C.J. Rijsbergen (eds.). pp. 192-201. Springer London (1994)
11. William R. Hersh, R. T. B., Laura Ross, Phoebe Johnson, Aaron M. Cohen, Dale F. Kraemer: TREC 2004 genomics track overview. In The 13th Text REtrieval Conference. (2004)
12. Thesprasith, O., and Jaruskulchai, C.: Query Expansion Using Medical Subject Headings Terms in the Biomedical Documents. In Intelligent Information and Database Systems. pp. 93-102. Springer (2014)
13. Abdou, S., and Savoy, J.: Searching in Medline: Query expansion and manual indexing evaluation. 44, 781-789 (2008)
14. Apache Lucene - Apache Lucene Core, http://lucene.apache.org/core/
15. Cui, H., Wen, J.-R., Nie, J.-Y., and Ma, W.-Y.: Query expansion by mining user logs. 15, 829-839 (2003)
16. jsoup: Java HTML Parser, http://jsoup.org/