=Paper=
{{Paper
|id=Vol-1391/17-CR
|storemode=property
|title=KISTI at CLEF eHealth 2015 Task 2
|pdfUrl=https://ceur-ws.org/Vol-1391/17-CR.pdf
|volume=Vol-1391
|dblpUrl=https://dblp.org/rec/conf/clef/OhJK15a
}}
==KISTI at CLEF eHealth 2015 Task 2==
Heung-Seon Oh, Yuchul Jung, and Kwang-Young Kim
Korea Institute of Science and Technology Information
{ohs, jyc77, glorykim}@kisti.re.kr

Abstract. Laypeople (e.g., patients and their caregivers) usually use queries which describe a sign, symptom or condition to obtain relevant medical information on the Web. They can fail to find useful information for diagnosing or understanding their health conditions because the search results delivered by existing medical search engines do not fit the information needs of users. To deliver useful medical information, we attempted to combine multiple ranking methods, explicit semantic analysis (ESA), a cluster-based external expansion model (CBEEM), and concept-based document centrality (CBDC), using external medical resources to improve retrieval performance. As a first step, initial documents are retrieved using a baseline method. Based on the initial documents, ranking methods are selectively applied. Our experiments with combinations of ranking methods aim to find the best means of computing accurate similarity scores using different external medical resources. The best performance was obtained when the CBEEM and the CBDC were used together.

Keywords: medical information retrieval, external expansion model, concept-based retrieval

===1 Introduction===
The general public searches the Web to acquire medical information to diagnose their symptoms and find related health information. Unfortunately, laypeople without medical knowledge often fail to find the necessary information because they are not only unfamiliar with medical terminology but also uncertain about their exact questions. Handling laypeople's queries has been a challenging issue in medical information retrieval (IR): existing Web search engines often fail to deliver satisfactory results because the underlying information need is not properly understood. To mitigate the difficulties of laypeople (e.g., patients and their relatives), the Conference and Labs of the Evaluation Forum (CLEF) launched the eHealth Evaluation Lab [4]. Specifically, Task 2 of CLEF 2015 eHealth [10] explores circumlocutory queries consisting of the signs and symptoms of a medical condition.

As a participant in Task 2, this paper introduces a re-ranking framework which attempts to selectively combine different ranking components, namely explicit semantic analysis (ESA), a cluster-based external expansion model (CBEEM), and concept-based document centrality (CBDC). The main goal of our framework is an accurate estimation of the similarity score by combining different ranking methods using external medical resources.

Within our re-ranking framework, a query-likelihood method with Dirichlet smoothing is utilized as a baseline to obtain the initial document set $D_{init}$. $D_{init}$ is then re-ranked with the help of ranking components that use external medical resources: two biomedical collections (i.e., TREC CDS [11] and OHSUMED [5]) and concepts from ICD-10 (http://apps.who.int/classifications/icd10/browse/2015/en) extracted from Wikipedia. In our experiments, we designed eight runs which combine one or more re-ranking components, except run 1, which represents the baseline. Among the eight runs, the best performance was observed in runs 6 and 8, where the CBEEM and the CBDC were combined; these runs achieved 0.3864 (P@10) and 0.3464 (NDCG@10).

The rest of this paper is organized as follows. Section 2 presents our re-ranking framework in detail. The experimental results are described in Section 3. Section 4 concludes with a short summary.

===2 Method===
====2.1 Re-ranking framework====
The key idea of our method is to devise a re-ranking framework which estimates an accurate similarity score between a query and a document using external medical resources. To do this, we build a pool of re-ranking components backed by external resources. Figure 1 shows an overview of the framework. For a given query Q, a set of documents $D_{init} = \{D_1, D_2, \dots, D_n\}$ is retrieved from collection C using a search engine. In this paper, a query-likelihood method with Dirichlet smoothing (QLD) [14] is utilized to obtain $D_{init}$. We then focus on re-ranking $D_{init}$ using external resources to improve the performance. Specifically, two biomedical collections, TREC CDS and OHSUMED, and ICD-10 concepts extracted from Wikipedia were used as external resources. Based on $D_{init}$, re-ranking is performed through a series of ranking components in the pool.

Fig. 1. Overview of the re-ranking framework
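The paper itself publishes no code; purely as a rough illustration of the flow in Figure 1, the following minimal Python sketch shows how the baseline retrieval and a selectively chosen sequence of re-ranking components could be chained. All function and parameter names here are hypothetical, not the authors' implementation.

<syntaxhighlight lang="python">
# Hypothetical sketch of the re-ranking pipeline in Figure 1.

def rerank_pipeline(query, searcher, components, k=1000):
    """Retrieve an initial ranking and pass it through re-ranking components.

    searcher   -- baseline retrieval (e.g., query likelihood with Dirichlet smoothing)
    components -- ordered list of re-ranking functions; each maps
                  (query, ranked_docs) -> re-scored ranked_docs
    """
    d_init = searcher(query, k)      # initial document set D_init
    ranked = d_init
    for component in components:     # e.g., CBEEM followed by CBDC
        ranked = component(query, ranked)
    return ranked
</syntaxhighlight>

Under this sketch, run 6 of Section 3 would correspond to components = [cbeem, cbdc].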
====2.2 Basic Foundation====
Before explaining the details of the three re-ranking components, we introduce the basic foundation of the language modeling framework for IR. In language modeling for IR, the KL-divergence method (KLD) is a popular scoring function which computes similarity scores by estimating unigram language models for a query Q and a document D [6, 7, 9]:

$score_{KLD}(Q, D) = \exp\big(-KL(\theta_Q \,\|\, \theta_D)\big) = \exp\Big(-\sum_{w} P(w|\theta_Q) \log \frac{P(w|\theta_Q)}{P(w|\theta_D)}\Big)$   (1)

where $\theta_Q$ and $\theta_D$ are the query and document unigram language models, respectively. KLD has been attractive because effective pseudo-relevance feedback methods have been proposed to estimate more accurate query language models in an effort to improve performance. The research question is how to estimate accurate query and document language models so as to improve the retrieval performance. In general, a query model is estimated by maximum likelihood estimation (MLE), as shown below:

$P(w|\theta_Q) = \frac{c(w, Q)}{|Q|}$   (2)

where $c(w, Q)$ is the count of a word w in query Q and $|Q|$ is the number of words in Q. A document model is estimated using Dirichlet smoothing to avoid zero probabilities and to improve the retrieval performance through a more accurate estimation [14]:

$P(w|\theta_D) = \frac{c(w, D) + \mu \cdot P(w|C)}{\sum_{t} c(t, D) + \mu}$   (3)

where $c(w, D)$ is the count of a word w in document D, $P(w|C)$ is the probability of a word w in collection C, and $\mu$ is the Dirichlet prior parameter.
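As a concrete reading of Equations (1)-(3), the sketch below (ours, not the authors') scores one document against a query using an MLE query model and a Dirichlet-smoothed document model. Document and collection statistics are assumed to be available as plain dictionaries.

<syntaxhighlight lang="python">
import math
from collections import Counter

def kld_score(query_tokens, doc_tf, doc_len, coll_prob, mu=2000.0):
    """score_KLD(Q, D) = exp(-KL(theta_Q || theta_D)), Eqs. (1)-(3).

    query_tokens -- list of query words
    doc_tf       -- dict: word -> count c(w, D)
    doc_len      -- total number of words in D
    coll_prob    -- dict: word -> P(w|C), the collection language model
    mu           -- Dirichlet prior parameter
    """
    q_tf = Counter(query_tokens)
    q_len = len(query_tokens)
    neg_kl = 0.0
    for w, cwq in q_tf.items():
        p_w_q = cwq / q_len                                       # Eq. (2): MLE query model
        p_w_c = coll_prob.get(w, 1e-12)                           # background probability
        p_w_d = (doc_tf.get(w, 0) + mu * p_w_c) / (doc_len + mu)  # Eq. (3): Dirichlet smoothing
        neg_kl -= p_w_q * math.log(p_w_q / p_w_d)
    return math.exp(neg_kl)                                       # Eq. (1)
</syntaxhighlight>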
Query expansion aims to reveal information needs not expressed in Q by adding more useful words. Pseudo-relevance feedback (PRF) is a popular query expansion approach to update a query. PRF assumes that the top-ranked documents $F = \{D_1, D_2, \dots, D_{|F|}\}$ in the initial search results are relevant to a given query and that the words in F are useful for modifying the query toward a better representation. A relevance model (RM) serves to estimate a multinomial distribution $P(w|Q)$, the likelihood of a word w given query Q. The first version of the relevance model (RM1) is defined as follows:

$P_{RM1}(w|Q) = \sum_{D \in F} P(w|\theta_D) P(\theta_D|Q) = \sum_{D \in F} P(w|\theta_D) \frac{P(Q|\theta_D) P(\theta_D)}{P(Q)} \propto \sum_{D \in F} P(w|\theta_D) P(\theta_D) P(Q|\theta_D)$   (4)

RM1 is composed of three components: the document prior $P(\theta_D)$, the document weight $P(Q|\theta_D)$, and the term weight in a document $P(w|\theta_D)$. In general, $P(\theta_D)$ is assumed to follow a uniform distribution without knowledge of document D. $P(Q|\theta_D) = \prod_{w \in Q} P(w|\theta_D)^{c(w,Q)}$ is the query-likelihood score. $P(w|\theta_D)$ can be estimated using various smoothing methods, such as Dirichlet smoothing. Various strategies are applicable to estimate these components.

To improve the retrieval performance, a new query model can be estimated by combining the relevance model and the original query model. RM3 [1] is a variant of the relevance model which is used here to estimate a new query model with RM1:

$P(w|\theta_Q') = (1 - \beta) \cdot P(w|\theta_Q) + \beta \cdot P_{RM1}(w|Q)$   (5)

where $\beta$ is a control parameter between the original query model and the feedback model.
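The following sketch is an illustrative reading of Equations (4) and (5), not the authors' code: it builds RM1 from the feedback documents with uniform document priors and interpolates it with the original query model to form an RM3-style expanded query.

<syntaxhighlight lang="python">
import math
from collections import Counter, defaultdict

def rm3_query_model(query_tokens, feedback_docs, coll_prob, mu=2000.0, beta=0.5):
    """Estimate P(w|theta_Q') via RM1 plus interpolation (Eqs. (4)-(5)).

    feedback_docs -- list of (doc_tf, doc_len) pairs for the top-ranked set F
    coll_prob     -- dict: word -> P(w|C)
    """
    def p_w_d(w, doc_tf, doc_len):
        return (doc_tf.get(w, 0) + mu * coll_prob.get(w, 1e-12)) / (doc_len + mu)

    q_tf = Counter(query_tokens)

    # Document weights P(Q|theta_D), with a uniform document prior P(theta_D).
    doc_weights = []
    for doc_tf, doc_len in feedback_docs:
        log_like = sum(c * math.log(p_w_d(w, doc_tf, doc_len)) for w, c in q_tf.items())
        doc_weights.append(math.exp(log_like))

    # RM1: P_RM1(w|Q) proportional to sum_D P(w|theta_D) * P(Q|theta_D).
    rm1 = defaultdict(float)
    for (doc_tf, doc_len), w_d in zip(feedback_docs, doc_weights):
        for w in doc_tf:
            rm1[w] += p_w_d(w, doc_tf, doc_len) * w_d
    total = sum(rm1.values()) or 1.0

    # RM3: interpolate the normalized RM1 model with the MLE query model (Eq. (5)).
    q_len = len(query_tokens)
    expanded = defaultdict(float)
    for w, v in rm1.items():
        expanded[w] += beta * v / total
    for w, c in q_tf.items():
        expanded[w] += (1 - beta) * (c / q_len)
    return dict(expanded)
</syntaxhighlight>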
====2.3 Re-ranking Components====
Component 1 - Explicit Semantic Analysis: Concept-based IR using explicit semantic analysis (ESA) [3] is a well-known approach to the vocabulary mismatch problem between a query and a document, in which the words in the query and the document are mapped to concepts. In medical IR, existing methods [2, 12] employ MetaMap to map words to concepts in the Unified Medical Language System (UMLS). However, processing millions of documents in a collection with MetaMap takes a considerable amount of time. To avoid this difficulty, concepts relevant to the International Classification of Diseases (ICD-10) were used as a concept resource because they are closely related to diseases. These concepts were collected from Wikipedia: articles linked from the section and sub-section names of ICD-10 were crawled. As a result, 3,784 articles with 93,756 unique words were obtained. The title of an article was used as a medical concept. Figure 2 shows an example of the medical concept Bubonic plague (http://en.wikipedia.org/wiki/Bubonic_plague) in Wikipedia. Based on the concepts, a word-concept matrix filled with standard TF-IDF values was constructed. Then, a similarity score between a query and a document is computed after concept mapping, as shown in Figure 3. Cosine similarity was utilized as the scoring function.

Fig. 2. An example of the Wikipedia article of the medical concept bubonic plague

Fig. 3. Similarity computation using concept mapping
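A minimal sketch of this component (our own reconstruction, assuming the word-concept TF-IDF weights over the crawled ICD-10 articles are already precomputed) maps the query and document word vectors into concept space and scores them with cosine similarity, as in Figure 3.

<syntaxhighlight lang="python">
import numpy as np

def esa_similarity(query_vec, doc_vec, word_concept):
    """Concept-level cosine similarity after ESA concept mapping (cf. Fig. 3).

    query_vec, doc_vec -- 1-D arrays of word weights (length = vocabulary size)
    word_concept       -- TF-IDF word-concept matrix, shape (vocab, n_concepts)
    """
    q_concepts = query_vec @ word_concept   # project the query into concept space
    d_concepts = doc_vec @ word_concept     # project the document into concept space
    denom = np.linalg.norm(q_concepts) * np.linalg.norm(d_concepts)
    return float(q_concepts @ d_concepts / denom) if denom > 0 else 0.0
</syntaxhighlight>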
Component 2 - Cluster-based External Expansion Model: Several medical collections, such as TREC CDS and OHSUMED, are available to researchers, as medical collections have been developed for different purposes. For re-ranking purposes, these collections can be used as textual resources to build more robust external expansion models [13]. To this end, we revised an existing external expansion model (EEM) by combining it with a cluster-based document model [8]. The key idea of the EEM is to generate a feedback model by determining the proper contributions of multiple collections for a given query. Formally, the EEM is defined as follows:

$P_{EEM}(w|Q) \propto \sum_{C \in E} \sum_{D \in C} P(Q|\theta_C) \cdot P(\theta_C) \cdot P(w|\theta_D) \cdot P(Q|\theta_D) \cdot P(D|\theta_C)$   (6)

Specifically, the EEM consists of five components: the prior collection probability, document relevance, collection relevance, document importance, and word probability. The prior collection probability $P(\theta_C)$ is the prior importance of a collection among all the collections in use. Without prior knowledge of the collections, it can be ignored by setting a uniform probability $P(\theta_C) = \frac{1}{|E|}$. Document relevance $P(Q|\theta_D)$ is the relevance of a document D to a given query Q; precisely, it is the query-likelihood score given to the document. Collection relevance $P(Q|\theta_C)$ is the relevance of a query Q with respect to a collection C. This component determines the query-dependent contribution of a collection when constructing the EEM. To avoid a time-consuming iteration over a collection C, it can be estimated from the most highly relevant documents under the assumption that documents are equally important in a given collection C; thus, it is the average score of the feedback documents in $D_{init}$. Document importance $P(D|\theta_C)$ refers to the importance of a document D in a collection C. This is also ignored by setting a uniform probability $P(D|\theta_C) = \frac{1}{|C|}$ without prior knowledge of the documents in a collection C. Word probability $P(w|\theta_D)$ is the probability of observing a word w in a document D. In [13], MLE is utilized to estimate this component.

In the cluster-based document model [8], a document model is smoothed with cluster and collection models, where the clusters are generated with the K-means algorithm. Therefore, we can obtain more accurate document models because the probabilities of words which occur frequently in a cluster or a collection are decreased. Similarly, we can assume that each collection corresponds to a cluster explicitly partitioned over E. This assumption allows the use of the cluster-based document model without any additional K-means clustering, as K is determined by |E| and each collection is a cluster. All that is required is to use the statistics of a collection C for a cluster. Then, a document model is defined as follows:

$P(w|\theta_D) = (1 - \lambda_E) \cdot \frac{c(w, D) + \mu \cdot P(w|C)}{|D| + \mu} + \lambda_E \cdot P(w|E) = (1 - \lambda_E) \cdot \left[ \frac{|D|}{|D| + \mu} P(w|D) + \frac{\mu}{|D| + \mu} P(w|C) \right] + \lambda_E \cdot P(w|E)$   (7)

where $\lambda_E$ is a control parameter for all collections in E.

Our CBEEM is defined by replacing $P(w|\theta_D)$ in Equation (6) with that of Equation (7). Based on this revision, the CBEEM is expected to be a probability distribution over topical words because the probability of common words in the feedback documents is decreased when the individual relevance models are combined. Then, a new query model is estimated with the CBEEM as follows:

$P(w|\theta_Q') = (1 - \beta) \cdot P(w|\theta_Q) + \beta \cdot P_{CBEEM}(w|Q)$   (8)
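The sketch below is our own simplified reading of Equations (6)-(8), under the uniform-prior assumptions stated above: it builds the CBEEM feedback model from per-collection feedback documents using the cluster-smoothed document model of Equation (7), and interpolates it with the original query model. The data-structure layout is an assumption for illustration.

<syntaxhighlight lang="python">
import math
from collections import Counter, defaultdict

def cbeem_query_model(query_tokens, ext_feedback, coll_stats, all_prob,
                      mu=2000.0, lam_e=0.1, beta=0.5):
    """CBEEM feedback model (Eqs. (6)-(8)) with uniform P(theta_C) and P(D|theta_C).

    ext_feedback -- dict: collection name -> list of (doc_tf, doc_len) feedback docs
    coll_stats   -- dict: collection name -> dict word -> P(w|C)
    all_prob     -- dict: word -> P(w|E), the model over all external collections
    """
    q_tf = Counter(query_tokens)
    q_len = len(query_tokens)

    def p_w_d(w, doc_tf, doc_len, coll_prob):
        # Eq. (7): Dirichlet-smoothed model further smoothed with the model over E.
        dirichlet = (doc_tf.get(w, 0) + mu * coll_prob.get(w, 1e-12)) / (doc_len + mu)
        return (1 - lam_e) * dirichlet + lam_e * all_prob.get(w, 1e-12)

    feedback = defaultdict(float)
    for coll, docs in ext_feedback.items():
        coll_prob = coll_stats[coll]
        doc_scores = []
        for doc_tf, doc_len in docs:
            log_like = sum(c * math.log(p_w_d(w, doc_tf, doc_len, coll_prob))
                           for w, c in q_tf.items())
            doc_scores.append(math.exp(log_like))                 # P(Q|theta_D)
        coll_rel = sum(doc_scores) / max(len(doc_scores), 1)      # P(Q|theta_C): avg feedback score
        for (doc_tf, doc_len), s in zip(docs, doc_scores):
            for w in doc_tf:
                feedback[w] += coll_rel * p_w_d(w, doc_tf, doc_len, coll_prob) * s
    total = sum(feedback.values()) or 1.0

    # Eq. (8): interpolate the normalized CBEEM model with the MLE query model.
    new_query = {w: beta * v / total for w, v in feedback.items()}
    for w, c in q_tf.items():
        new_query[w] = new_query.get(w, 0.0) + (1 - beta) * (c / q_len)
    return new_query
</syntaxhighlight>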
Component 3 - Concept-based Document Centrality: To utilize external resources, we designed a concept-based document centrality (CBDC) method as an additional re-ranking component. The key idea originated from centrality-based document scoring, which utilizes the associations among the documents in the search results [6]. The centralities are computed in two steps: similarity matrix construction and a random-walk step. Because there are no explicit links among the initial documents, implicit links are generated among them. Then, the documents are re-ranked by combining the initial and centrality scores, as follows:

$score(Q, D) = score_{QLD}(Q, D) \cdot score_{DC}(Q, D)$   (9)

However, the CBDC differs from previous approaches [6] in two aspects. First, we attempted to capture the associations between a query and the documents explicitly when computing document centralities, while the previous method only considered the associations among documents. Second, the CBDC captures the associations at the concept level, while the previous method focused on the word level. The CBDC is estimated as follows. First, the document-concept weight matrix is constructed by concept mapping; in this matrix, the query is appended as an additional row. Then, a document-document similarity matrix is computed from the document-concept weight matrix. Because the query is included, the CBDC considers the associations of documents with respect to the query. Next, a random walk is performed to compute centrality scores. Only the centrality scores of the documents are used.

Fig. 4. Computation of concept-based document centralities
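The following sketch is our reconstruction of the procedure around Equation (9) and Figure 4; the parameter names are illustrative. It appends the query to the document-concept matrix, builds a cosine-similarity graph, runs a damped power-iteration random walk, and multiplies the resulting centralities into the initial QLD scores.

<syntaxhighlight lang="python">
import numpy as np

def cbdc_rerank(qld_scores, doc_concepts, query_concept, alpha=0.85, iters=50):
    """Concept-based document centrality re-ranking (Eq. (9)).

    qld_scores    -- array of initial query-likelihood scores, shape (n_docs,)
    doc_concepts  -- concept-space document vectors, shape (n_docs, n_concepts)
    query_concept -- concept-space query vector, shape (n_concepts,)
    """
    # Append the query as an extra row so query-document associations are captured.
    m = np.vstack([doc_concepts, query_concept])
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    sim = (m / norms) @ (m / norms).T                  # cosine similarity matrix
    np.fill_diagonal(sim, 0.0)

    # Row-normalize into a transition matrix and run a damped random walk.
    row_sums = sim.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0
    trans = sim / row_sums
    n = trans.shape[0]
    cent = np.full(n, 1.0 / n)
    for _ in range(iters):
        cent = (1 - alpha) / n + alpha * trans.T @ cent

    doc_centrality = cent[:-1]                         # keep only the document nodes
    return qld_scores * doc_centrality                 # Eq. (9): combine the scores
</syntaxhighlight>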
===3 Experiments===
====3.1 Data====
We utilized three external medical resources: TREC CDS, OHSUMED, and ICD-10 concepts extracted from Wikipedia. Table 1 gives summary statistics of the CLEF eHealth, TREC CDS, and OHSUMED collections. TREC CDS consists of biomedical literature, specifically a subset of PubMed Central; a document is the full-text XML of a journal article. OHSUMED consists of biomedical literature which is a subset of the clinically oriented MEDLINE. Accordingly, $E = \{C_{eHealth}, C_{CDS}, C_{OHSUMED}\}$ for the CBEEM.

Table 1. Data statistics (lengths are counted after stop-word removal)
{|
!  !! CLEF eHealth !! TREC CDS !! OHSUMED
|-
| #Docs || 1,102,289 || 732,451 || 348,566
|-
| Voc. size || 2,647,062 || 6,931,356 || 122,512
|-
| Avg. doc. len || 540.0 || 1779.0 || 68.0
|}

====3.2 Evaluation Settings====
Lucene (http://lucene.apache.org/) was exploited to index and search the initial documents $D_{init}$. Stop-words were removed using the 419 INQUERY stop-words (http://sourceforge.net/p/lemur/galago/ci/default/tree/core/src/main/resources/stopwords/inquery). In addition, numbers were normalized to NU<# of DIGITS>. A query-likelihood method with Dirichlet smoothing was chosen as the scoring function, and $|D_{init}|$ was set to 1000. Based on $D_{init}$, we performed eight runs by varying the combination of components in our re-ranking framework. Table 2 describes the submitted runs, and Tables 3 and 4 summarize their performance. Performance was measured by P@10, NDCG@10, rank-biased precision (RBP), and two readability-based variants of RBP (uRBP and uRBPgr). In contrast to the evaluation settings used in previous years, the readability of the retrieved medical content is added as a new evaluation measure alongside the common topical assessments of relevance [15].

====3.3 Results====
Table 2 describes our submitted runs for CLEF 2015 eHealth Task 2, and Table 3 summarizes the results obtained with the task's official standard evaluation set. Runs 7 and 8 differ from runs 5 and 6 in that ESA and CBDC were applied to expanded queries produced by the CBEEM, while runs 5 and 6 used the original queries.

Table 2. Descriptions of the submitted runs
{|
! Run !! Description
|-
| 1 || Query-likelihood method with Dirichlet smoothing (QLD)
|-
| 2 || QLD + explicit semantic analysis (ESA)
|-
| 3 || QLD + concept-based document centrality (CBDC) using ESA
|-
| 4 || QLD + cluster-based external expansion model (CBEEM)
|-
| 5 || QLD + CBEEM + ESA
|-
| 6 || QLD + CBEEM + CBDC
|-
| 7 || QLD + CBEEM + ESA with expanded query
|-
| 8 || QLD + CBEEM + CBDC with expanded query
|}

According to Table 3, ESA and CBDC using the concepts relevant to ICD-10 are not helpful, as a comparison of runs 1, 2 and 3 shows. It can be concluded that the reduction of the concept space without precise ICD-10 concepts resulted in low discrimination power. On the other hand, the CBEEM showed consistent improvements over QLD. The best performance was obtained in runs 6 and 8, where the CBEEM and the CBDC were combined. This finding indicates that using external medical resources while also considering concept-level associations can have synergetic effects on the re-ranking of documents when the components are applied in the proper sequence. Moreover, the CBDC does not appear to be affected by the query expansion results, as runs 6 and 8 yield identical scores.

Table 3. Performance of the submitted runs for topical relevance
{|
! Run !! P@10 !! NDCG@10
|-
| 1 || 0.3606 || 0.3352
|-
| 2 || 0.3455 || 0.3223
|-
| 3 || 0.3591 || 0.3395
|-
| 4 || 0.3788 || 0.3424
|-
| 5 || 0.3606 || 0.3362
|-
| 6 || 0.3864 || 0.3464
|-
| 7 || 0.3727 || 0.3459
|-
| 8 || 0.3864 || 0.3464
|}

Turning to the readability-based measures (i.e., uRBP and uRBPgr), the best RBP results were again obtained from runs 6 and 8, whereas the best performance on the two readability-based measures was observed in run 7.

Table 4. Performance of the submitted runs for readability-biased relevance
{|
! Run !! RBP !! uRBP !! uRBPgr
|-
| 1 || 0.3222 || 0.2593 || 0.2646
|-
| 2 || 0.3038 || 0.2607 || 0.2614
|-
| 3 || 0.3295 || 0.2596 || 0.2666
|-
| 4 || 0.3306 || 0.2644 || 0.2709
|-
| 5 || 0.3203 || 0.2702 || 0.2725
|-
| 6 || 0.3332 || 0.2607 || 0.2695
|-
| 7 || 0.3299 || 0.2703 || 0.2739
|-
| 8 || 0.3332 || 0.2607 || 0.2695
|}

The results show that the selection of re-ranking components is important because some of them can degrade the performance achieved by the preceding components. In addition, we can expect additional performance improvements by combining two different re-ranking components if their application sequence is appropriate.

===4 Conclusion===
This working note describes our efforts to find high-performance combinations of different re-ranking components which utilize external medical resources. Among the runs we attempted, runs 6 and 8 (where the proposed CBEEM and CBDC were used) showed the best performance in P@10, NDCG@10, and RBP. These results imply that the effective use of external medical resources for re-ranking can overcome the innate limitations of naïve queries issued by laypeople. As future work, to enhance the proposed re-ranking components, we plan to systematically analyze symptom-wise evidence residing in promising external medical resources.

===References===
# Abdul-Jaleel, N. et al.: UMass at TREC 2004: Novelty and HARD. Proceedings of the Text REtrieval Conference (TREC) (2004).
# Choi, S. et al.: Semantic concept-enriched dependence model for medical information retrieval. Journal of Biomedical Informatics 47, 18-27 (2014).
# Egozi, O. et al.: Concept-Based Information Retrieval Using Explicit Semantic Analysis. ACM Transactions on Information Systems 29(2), 1-34 (2011).
# Goeuriot, L. et al.: Overview of the CLEF eHealth Evaluation Lab 2015. CLEF 2015 - 6th Conference and Labs of the Evaluation Forum. Lecture Notes in Computer Science (LNCS), Springer (2015).
# Hersh, W. et al.: OHSUMED: an interactive retrieval evaluation and new large test collection for research. Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '94), pp. 192-201 (1994).
# Kurland, O., Lee, L.: PageRank without hyperlinks: Structural re-ranking using links induced by language models. Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '05), pp. 306-313. ACM Press, New York (2006).
# Lafferty, J., Zhai, C.: Document language models, query models, and risk minimization for information retrieval. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '01), pp. 111-119. ACM Press, New York (2001).
# Liu, X., Croft, W.B.: Cluster-based retrieval using language models. Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '04), pp. 186-193. ACM Press, New York (2004).
# Oh, H.-S., Myaeng, S.-H.: Utilizing global and path information with language modelling for hierarchical text classification. Journal of Information Science 40(2), 127-145 (2014).
# Palotti, J. et al.: CLEF eHealth Evaluation Lab 2015, Task 2: Retrieving Information about Medical Symptoms. CLEF 2015 Online Working Notes. CEUR-WS (2015).
# Simpson, M.S. et al.: Overview of the TREC 2014 Clinical Decision Support Track. Proceedings of the Text REtrieval Conference (TREC) (2014).
# Wang, Y. et al.: A Study of Concept-based Weighting Regularization for Medical Records Search. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pp. 603-612 (2014).
# Weerkamp, W. et al.: Exploiting External Collections for Query Expansion. ACM Transactions on the Web 6(4), 1-29 (2012).
# Zhai, C., Lafferty, J.: A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems 22(2), 179-214 (2004).
# Zuccon, G., Koopman, B.: Integrating Understandability in the Evaluation of Consumer Health Search Engines. Proceedings of the SIGIR Workshop on Medical Information Retrieval (MedIR) (2014).