A language-modelling approach to User-Centred Health Information Retrieval

Suzan Verberne
Institute for Computing and Information Sciences, Radboud University Nijmegen
s.verberne@cs.ru.nl

Abstract. In this working notes paper we present our methodology and the results we obtained in task 3a of the CLEF eHealth lab 2014. In the set-up of our experiments we assumed that the discharge summary provides the context of the patient's query, and therefore may contain useful background information that can be used to retrieve more relevant results. The central component in our approach is the Indri search engine, with its Language Modelling-based retrieval model. We experimented with query expansion using terms extracted from the discharge summary, and using terms extracted from the UMLS thesaurus. We obtained a small positive effect from expanding the Indri queries with terms from both sources. Our future work is directed at improving our term extraction and query expansion strategies.

1 Introduction

In this working notes paper we present our methodology and the results we obtained in task 3a of the CLEF eHealth lab 2014 (http://clefehealth2014.dcu.ie/task-3). The aim of the eHealth track is "to evaluate systems that support laypeople in searching for and understanding their health information" [1, 3]. The goal of task 3 is "to provide valuable and relevant documents to patients, so as to satisfy their health-related information needs" [2]. The situation that the task simulates is a patient who has learned from his physician what his diagnosis is, and then starts searching the internet for medical information about his illness. The physician's information about the patient has been registered in the patient's discharge summary.

The data that was distributed for task 3 consists of:
– the 2012 crawl of approximately one million medical documents made available by the EU-FP7 Khresmoi project (http://www.khresmoi.eu/) in plain text form;
– a set of English 'layman' queries that individuals may realistically pose based on the content of their discharge summaries. The query creators used the disorder diagnosed in the discharge summary as the main query term. The training set contained 5 queries and the test set 50;
– (optionally, after following human subjects training and obtaining the certificate) the collection of 299 discharge summaries from task 2, among which are the discharge summaries belonging to the train and test queries for task 3.

In the set-up of our experiments we assumed that the discharge summary provides the context of the patient's query, and therefore may contain useful background information that can be used to retrieve more relevant results. Our research question is: Does query expansion with terms from the medical context of a patient's query lead to better results?

The central component in our approach is the Indri search engine (http://www.lemurproject.org/indri/), with its Language Modelling-based retrieval model. We experiment with query expansion using terms extracted from the discharge summary, and using terms extracted from the UMLS thesaurus.

2 Our approach

2.1 Data preprocessing

Document collection. We preprocessed the document collection by splitting the corpus files into separate documents and saving for each document its uid, date, url and content.

Discharge summaries. We obtained the corpus of 299 discharge summaries that was distributed for CLEF eHealth task 2. We processed all discharge summaries that are referred to by a query in the query set from task 3. We cleaned the discharge summaries by removing all variables of the form [** ... **] (e.g. [**MD Number 2860**]), all abbreviations (using the regular expression [a-z]\.([a-z]\.)+), and all numbered lists (sequences of lines matching the regular expression ^[0-9]+\..*$).
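As an illustration, a minimal Python sketch of these three cleaning steps could look as follows; the function name and the decision to replace matches by a space or drop whole lines are illustrative simplifications, not our actual preprocessing code:

import re

def clean_discharge_summary(text):
    # Remove anonymisation variables of the form [** ... **],
    # e.g. [**MD Number 2860**]
    text = re.sub(r'\[\*\*.*?\*\*\]', ' ', text)
    # Remove abbreviations matching [a-z]\.([a-z]\.)+ , e.g. 'p.o.'
    text = re.sub(r'[a-z]\.([a-z]\.)+', ' ', text)
    # Remove numbered lists: drop lines matching ^[0-9]+\..*$
    lines = [line for line in text.split('\n')
             if not re.match(r'^[0-9]+\..*$', line)]
    return '\n'.join(lines)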
2.2 Indexing and Retrieval

We used the Indri API to index the CLEF collection and set up a query interface to the index. We applied a stopword list to the CLEF collection at indexing time. As ranking model, we use the Indri LM with Dirichlet smoothing and Pseudo-Relevance feedback (PRF) using the Ponte Expander. As parameters for the PRF, we used: number of feedback documents: 20; number of feedback terms: 3. We did not optimize these parameters for the current task but used the optimal settings from a previous task.

2.3 Query construction

All characters that are not alphanumeric, a hyphen or whitespace are removed from the query, and all letters are lowercased. The words in the query are concatenated into one string and combined using the #combine operator in the Indri query language. We used two types of queries: short queries, consisting of a concatenation of the title and description fields, and long queries, consisting of a concatenation of all four fields title, description, profile and narrative. For example, for this query:

QTRAIN2014.1
08114-027513-DISCHARGE SUMMARY.txt
title: MRSA and wound infection
description: What is MRSA infection and is it dangerous?
profile: This 60 year old lady has had coronary artery bypass grafting surgery and during recovery her wound has been infected. She wants to know how dangerous her infection is, where she got it and if she can be infected again with it.
narrative: Documents should contain information about sternal wound infection by MRSA. They should describe the causes and the complications.

the following short query is created:

#combine(mrsa and wound infection what is mrsa infection and is it dangerous)

2.4 Query expansion with terms from discharge summaries

We used the discharge summaries to create more informative queries, covering some of the medical background information of the patient. To this end, we aimed to extract the most relevant terms from the discharge summary belonging to each query. We extracted all n-grams with n = {1, 2, 3} from the discharge summary and ranked them by their relevance. For calculating the relevance of each n-gram, we used Kullback-Leibler divergence for informativeness and phraseness [4]. In this method, term relevance is based on the expected loss between two language models, measured with point-wise Kullback-Leibler divergence. Tomokiyo and Hurst propose to mix two models for term scoring:
– phraseness (how tight are the words in the multi-word term):

  kldiv_P = P(t) * log( P(t) / (P(u_1) * ... * P(u_n)) )    (1)

  in which P(u_i) is the probability of the i-th unigram inside the n-gram t;
– informativeness (how informative is the term for the foreground corpus):

  kldiv_I = P(t)_fg * log( P(t)_fg / P(t)_bg )    (2)

  in which P(t)_fg is the probability of the n-gram t in a foreground collection and P(t)_bg is the probability of t in the background collection.

For expanding query q, we used the discharge summary belonging to q as foreground collection, and all 299 discharge summaries in the task 2 corpus as background collection. The parameter γ is the weight of the informativeness score relative to the phraseness score:

  TermRelevance = γ * kldiv_I + (1 − γ) * kldiv_P    (3)

We used γ = 0.9, giving a higher weight to informativeness than to phraseness. We sort the n-grams by their TermRelevance.
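To make the scoring concrete, the following Python sketch computes the mixed phraseness/informativeness score for all n-grams of a single discharge summary (foreground) against the full collection of discharge summaries (background). It uses plain maximum-likelihood probability estimates and a crude count-of-one fallback for n-grams unseen in the background; these simplifications, as well as the function names, are illustrative and do not reproduce our actual implementation:

import math
from collections import Counter

def ngrams(tokens, n_values=(1, 2, 3)):
    # All n-grams (as tuples) for n = 1, 2, 3
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def ranked_terms(fg_tokens, bg_tokens, gamma=0.9):
    # Kullback-Leibler based term scoring (Tomokiyo and Hurst, 2003):
    # TermRelevance = gamma * kldiv_I + (1 - gamma) * kldiv_P
    fg_counts = Counter(ngrams(fg_tokens))
    bg_counts = Counter(ngrams(bg_tokens))
    fg_total = sum(fg_counts.values())
    bg_total = sum(bg_counts.values())
    unigram_counts = Counter(fg_tokens)

    scores = {}
    for term, count in fg_counts.items():
        p_fg = count / fg_total
        # informativeness: foreground probability vs. background probability
        p_bg = bg_counts.get(term, 1) / bg_total   # fallback for unseen terms
        kldiv_i = p_fg * math.log(p_fg / p_bg)
        # phraseness: n-gram probability vs. product of its unigram probabilities
        p_independent = 1.0
        for word in term:
            p_independent *= unigram_counts[word] / len(fg_tokens)
        kldiv_p = p_fg * math.log(p_fg / p_independent)
        scores[' '.join(term)] = gamma * kldiv_i + (1 - gamma) * kldiv_p
    # highest-scoring n-grams first
    return sorted(scores, key=scores.get, reverse=True)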
Table 1 shows the top-5 terms extracted for a few training queries.

Table 1. The top-5 informative terms that we extracted from the discharge summaries belonging to training queries 2, 3 and 4 using the method by [4]

QTRAIN2014.2              QTRAIN2014.3        QTRAIN2014.4
mcg unit stay             aortic dissection   mute
medical intensive care    lake                thrush
end-stage renal disease   type                levothyroxine sodium
department                by                  nodule
mesylate                  saturating          right kidney

Note that not all terms are of good quality; some seem too generic to give information about this specific patient. This is caused by the relative sparseness of the data from which the terms are extracted: only one document.

In the runs that use the discharge summaries (runs 2–4), we added the top-k (k = {2, 5}) terms extracted from the discharge summary to the query, again using unstructured queries with the #combine operator. This implies that multi-word terms are treated as separate words in the query, not as phrases.

2.5 Query expansion with terms from a thesaurus

We use thesaurus expansion to enrich the queries with non-personalized information. The idea is that the patient might use layman's terminology and that the technical-medical synonyms of these terms might give better retrieval results. For this purpose, we use an old version of the UMLS thesaurus that we have stored locally. In the UMLS, medical terms are ordered in synonym sets. An example is shown in Figure 1.

C UMG6-MRSA-METHICILLIN-RESISTANT-STAPHYLOCOCCUS-AUREUS-INFECTION:
*0 #{{{ S
*0 MRSA - Methicillin resistant Staphylococcus aureus infection
*0 Methicillin-resistant staphylococcus aureus (MRSA)
*0 .
-#}}}
C UMG9-METHICILLIN-RESISTANT-STAPHYLOCOCCUS-AUREUS:
*0 #{{{ S
*0 Methicillin resistant Staphylococcus aureus
*0 methicillin resistant Staphylococcus aureus
*0 MRSA
*0 .
-#}}}

Fig. 1. Two thesaurus entries from UMLS for the query title "MRSA and wound infection"

We looked up all query titles from the train and test set in the UMLS. First we preprocessed the query titles so that only the first content word is kept (e.g. "MRSA and wound infection" becomes "MRSA"). Then we processed all synonym sets that
– contain one of the preprocessed query titles, or
– contain, between brackets, a word from the query title that consists of 3 to 5 uppercase characters (presumably an abbreviation), e.g. Methicillin-resistant staphylococcus aureus (MRSA).
For each query title, we created one expansion set by merging all synonym sets in which the preprocessed query title occurs, disregarding synonym sets with 30 or more terms in them (because generic synonym sets such as "UMG2-BODY-PART-ORGAN-OR-ORGAN-COMPONENT" contain dozens and sometimes hundreds of terms). From the expansion set, we removed duplicates and near-duplicates (terms whose only difference is a plural -s, hyphens and underscores, or capitalization). Then we sort the expansion terms by the number of times they occur in synonym sets together with the query title and select the top-k terms to be added to the query.

In the runs that use the thesaurus (runs 2, 3, 5 and 6), we added the top-k (k = 5) terms extracted from the thesaurus to the query, again using unstructured queries with the #combine operator. This implies that multi-word terms are treated as separate words in the query, not as phrases.
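As a rough illustration of this look-up and merging procedure, the Python sketch below assumes that the locally stored UMLS synonym sets have already been read into a list of term lists; that data structure, the function names and the exact near-duplicate normalisation are our own simplifications, not our actual implementation:

import re
from collections import Counter

def normalise(term):
    # key for duplicate / near-duplicate detection:
    # ignore case, hyphens, underscores and a plural -s
    key = re.sub(r'[-_]', ' ', term.lower()).strip()
    return key[:-1] if key.endswith('s') else key

def thesaurus_expansion(query_title, synonym_sets, k=5, max_set_size=30):
    # query_title is assumed to be already reduced to its first content word,
    # e.g. 'MRSA' for the title 'MRSA and wound infection'
    title = query_title.lower()
    abbreviations = re.findall(r'\b[A-Z]{3,5}\b', query_title)

    counts = Counter()
    for synset in synonym_sets:
        if len(synset) >= max_set_size:
            continue                      # skip overly generic synonym sets
        lowered = [s.lower() for s in synset]
        abbrev_in_brackets = any('(%s)' % a.lower() in s
                                 for a in abbreviations for s in lowered)
        if title in lowered or abbrev_in_brackets:
            for term in synset:
                if normalise(term) != normalise(query_title):
                    counts[normalise(term)] += 1
    # terms that co-occur most often with the query title across synonym sets
    return [term for term, _ in counts.most_common(k)]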
3 Submitted runs

1. Baseline: Indri LM retrieval with Pseudo-Relevance feedback (PRF). Queries are an unstructured combination of title and description, using the #combine operator in the Indri query language.
2. Same as Run 1, and:
   – We expanded each query with terms from the discharge summary (k = 5)
   – We expanded each query with maximally 5 terms from the UMLS thesaurus
3. Same as Run 2, but with k = 2 for the terms from the discharge summary.
4. Same as Run 2 (k = 5), but without the thesaurus expansion. Note that in our original submission, there was an error in this run, as a result of which its results were equal to Run 3. We reproduced the run and included the correct results in the results section below.
5. Same as Run 1, and:
   – We expanded each query with maximally 5 terms from the UMLS thesaurus
6. Same as Run 5 but with long queries: an unstructured combination of title, description, profile and narrative.
7. Same as Run 6 but without the thesaurus expansion.

4 Results

4.1 Evaluation of the runs

Table 2 shows a summary of the results obtained for our runs.

Table 2. Evaluation of our runs in terms of Precision (P), nDCG, MAP and the number of relevant results retrieved (ret rel). For each evaluation measure (column), the highest scoring run is marked with an asterisk.

Run ID           P@10     nDCG@10  MAP      ret rel
NIJM EN Run.1    0.5740   0.5708   0.3036*  2330*
NIJM EN Run.2    0.6180*  0.6149*  0.2825   2190
NIJM EN Run.3    0.5960   0.5772   0.2606   2154
NIJM EN Run.4    0.5680   0.5669   0.2695   2176
NIJM EN Run.5    0.5880   0.5773   0.2609   2165
NIJM EN Run.6    0.5220   0.5302   0.2180   1939
NIJM EN Run.7    0.5220   0.5302   0.2180   1939

From the results, we make the following observations:
– Run 1, with short queries and no query expansion, gives the best results in terms of MAP and the number of relevant results retrieved.
– Run 2, with 5 terms from the discharge summary and maximally 5 terms from the thesaurus, gives the best results in terms of nDCG@10 and P@10. This suggests that a combination of terms from the discharge summary and from the thesaurus added to the query can lead to better results in the top-10.
– Run 3, with 2 expansion terms from the discharge summary and maximally 5 terms from the thesaurus, gives worse results than Run 2, with 5 terms from the discharge summary. This suggests that the most informative terms from the discharge summaries are not always ranked at the top, or that a combination of terms is needed to retrieve more relevant results.
– Run 4, with 5 terms from the discharge summary and no terms from the thesaurus, does not beat the baseline Run 1. This suggests that without the thesaurus terms, the discharge summary terms do not lead to improvement.
– Run 5, with maximally 5 terms from the thesaurus and no terms from the discharge summary, gives almost the same results as Run 3, with 2 terms from the discharge summary. This suggests that if there is relevant information in the discharge summary terms, it only leads to improvement if we add enough terms (2 is not enough).
– Run 6, with long queries, no terms from the discharge summary and maximally 5 terms from the thesaurus, gives worse results than Run 5, with short queries and maximally 5 terms from the thesaurus. This suggests that short queries are better than long queries.
– Run 7, with long queries and no thesaurus expansion, gives the exact same results as Run 6, with long queries and maximally 5 terms from the thesaurus. This suggests that when the long queries are used, with the fields 'profile' and 'narrative' fully included, adding a few thesaurus terms makes no difference. Since runs 6 and 7 are the poorest of all our runs, the longer queries presumably lead to more irrelevant results than the shorter queries. Even if the thesaurus terms are informative, they cannot compensate for this.

4.2 Per-query analysis

Figure 2 shows the per-query results for our best run (of the runs submitted), Run 2. It shows a large variation between queries. For some queries (the ones where there is no white bar), our run (the grey bar) scores the best of all runs, but for others, our run scores far below the median. In follow-up work we will investigate the query characteristics that cause our method to be more or less successful. We suspect that the success depends at least partly on the quality of the extracted additional terms.

Fig. 2. Per-query results for Precision@10 for our best run (of the runs submitted), Run 2. For each query, the height of a bar represents the gain/loss of our run (grey) and the best run (white) over the median run of all submitted runs.

5 Conclusions

Since we did not do comprehensive analyses of our results, we can only draw preliminary conclusions. We worked with a Language Modelling approach, and although the Indri retrieval model can handle long queries (it will find the best possible matches for the combination of all query terms; not all query terms are necessarily present in the retrieved documents), we found that expanding the query consisting of the fields title and description with the fields profile and narrative has a negative effect on the retrieval results. However, we did get a positive effect from adding the 5 most informative terms from the discharge summary and maximally 5 synonyms from the thesaurus to the queries.

In the near future, we plan to do more analyses in order to find out what factors play a role in the success of query expansion. Our current line of work focuses on personalization of IR through terms extracted from personal documents. The discharge summary is a good example of a document that may provide additional information about the context of a user's query. For that purpose, we aim to improve our term extraction and query expansion strategies. In addition, we will investigate how thesaurus expansion could be applied successfully (selecting the query terms to look up, selecting the synonym sets to expand with).

Acknowledgements

This publication was supported by the Dutch national program COMMIT (project P7 SWELL).

References

1. Lorraine Goeuriot, Gareth Jones, Liadh Kelly, Johannes Leveling, Allan Hanbury, Henning Müller, Sanna Salanterä, Hanna Suominen, and Guido Zuccon. ShARe/CLEF eHealth Evaluation Lab 2013, Task 3: Information retrieval to address patients' questions when reading clinical reports. In Online Working Notes of CLEF 2013, 2013.
2. Lorraine Goeuriot, Liadh Kelly, Wei Li, Joao Palotti, Pavel Pecina, Guido Zuccon, Allan Hanbury, Gareth Jones, and Henning Mueller. ShARe/CLEF eHealth Evaluation Lab 2014, Task 3: User-centred health information retrieval. In Proceedings of CLEF 2014, 2014.
3. Liadh Kelly, Lorraine Goeuriot, Hanna Suominen, Tobias Schrek, Gondy Leroy, Danielle L. Mowery, Sumithra Velupillai, Wendy W. Chapman, David Martinez, Guido Zuccon, and Joao Palotti. Overview of the ShARe/CLEF eHealth Evaluation Lab 2014. In Proceedings of CLEF 2014, Lecture Notes in Computer Science (LNCS). Springer, 2014.
4. Takashi Tomokiyo and Matthew Hurst. A language model approach to keyphrase extraction.
In Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment - Volume 18, pages 33–40. Association for Computational Linguistics, 2003.