
The University of Iowa at CLEF 2014: eHealth Task 3

Chao Yang

chao-yang@uiowa.edu

Sanmitra Bhattacharya

sanmitra-bhattacharya@uiowa.edu

Padmini Srinivasan

padmini-srinivasan@uiowa.edu

Department of Computer Science, University of Iowa, Iowa City, IA, USA


Task 3 of the CLEF eHealth Evaluation Lab aims to help laypeople get more accurate information from health-related documents. For this task, we ran several experiments and tried different techniques to improve retrieval performance. We cleaned the original dataset and tried sentence-level retrieval. We explored different parameter settings for pseudo relevance feedback. We also used the description and narrative fields to expand the query, and we modified the Markov Random Field (MRF) model to expand the query using medical phrases only. On our training set (the 2013 test set), these methods significantly improve retrieval performance, by 8-15% over the baseline. We submitted 4 runs. Results on the 2014 test set suggest that the techniques we used, except MRF, have the potential to improve performance for the top 5 retrieved results.

Keywords: Information Retrieval, Query Expansion, Pseudo Relevance Feedback, Markov Random Field

1 Introduction

The ShARe/CLEF eHealth Evaluation Lab [3] is part of the CLEF 2014 Conference and Labs of the Evaluation Forum. It aims to help laypeople better understand health-related documents. We participated in Task 3: User-centred health information retrieval [2]. Its goal is to develop more accurate retrieval strategies for health-related documents. Specifically, participants were required to submit a list of relevant health-related document ids for each query (topic). In 2014, Task 3 included a monolingual IR task (Task 3a) and a multilingual IR task (Task 3b). We participated in Task 3a only.

In particular, we asked the following questions: 1) Does sentence splitting on documents help improve retrieval performance? 2) How does one optimize the parameters for pseudo relevance feedback? 3) Is query expansion using descriptions and narratives more effective than using titles only? 4) Can we incorporate medical phrase detection to build a better Markov random field (MRF) model?

2 Dataset

The dataset for Task 3 is provided by the Khresmoi project. It is a set of medical-related documents in HTML format, drawn from well-known health and medical sites and databases. The dataset is about 41G uncompressed and contains 1,103,450 documents.

2.1 HTML to Text

Since the documents are in HTML, they contain many HTML tags and other noise that may affect retrieval performance if indexed directly. We used Lynx, a command-line browser, to convert the HTML files to plain text. This reduced the size of the dataset to 8.6G. We then replaced frequently occurring broken UTF-8 characters. We named this text-only dataset "All Text".

2.2 Content Cleaning

Ideally, the text we extract would be just the main article of the webpage. However, a typical webpage has several other sections, such as site structure information, contact information, headlines, and even advertisements. The example in Figure 1 shows the beginning of one text output from Lynx. Except for the last two lines, all of the information is unrelated to the main article.

* Home * About * Ask A Question * Attract CME Attract NPHS Logo Search Clinical Questions Enter search details Search A total of 1713 clinical questions available Quick Guide to ATTRACT
What is the evidence for betamethasone cream versus circumcision in phimosis?
Associated tags: child health, men's health, circumcision, phimosis, treatment, corticosteroid ...

Figure 1. The beginning of one text output from Lynx.

However, removing all of this irrelevant information is not trivial. To keep only the main article, we used simple rules to remove headlines and titles. In particular, we removed all lines with three or fewer tokens. We also removed all lines that start with '*', '+', '-', 'o', '#', or '@', which are the headline start symbols produced by Lynx.
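The two rules above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the experiments: it assumes whitespace tokenization and assumes that a headline symbol counts only when it is the first standalone token of a line (so that an ordinary word such as "our" is not mistaken for the 'o' bullet). The function name `clean_lynx_text` is hypothetical.

```python
# Headline start symbols produced by Lynx, as listed in the text.
HEADLINE_SYMBOLS = {"*", "+", "-", "o", "#", "@"}


def clean_lynx_text(text, max_short_tokens=3):
    """Drop short lines and lines that begin with a Lynx headline symbol."""
    kept = []
    for line in text.splitlines():
        tokens = line.split()  # assumption: whitespace tokenization
        if len(tokens) <= max_short_tokens:
            continue  # rule 1: three tokens or fewer (this also drops blank lines)
        if tokens[0] in HEADLINE_SYMBOLS:
            continue  # rule 2: assumption: the symbol is a standalone first token
        kept.append(line)
    return "\n".join(kept)


sample = "\n".join([
    "   * Home and other navigation links",
    "Contact us",
    "Betamethasone cream is an alternative to circumcision in phimosis.",
    "     o nested list entry with several words",
])
print(clean_lynx_text(sample))
```

Only the sentence about betamethasone survives in this toy input; the navigation bullet, the short line, and the nested bullet are removed.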

After this cleaning, the dataset is about 5.4G. We named this collection "Text Clean".

2.3 Sentence Splitting

Besides indexing whole documents, we also explored sentence-level retrieval. We used the GENIA Sentence Splitter (GeniaSS) [6] to split each text document in "All Text" into sentences. This sentence splitter is optimized for biomedical documents and performs well on them. Keeping track of the original text document id, we created 3 sentence-level datasets: "Sent 1", "Sent 2", and "Sent 3".
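As a rough sketch of how such sentence-level units could be built from a document that has already been split into sentences: the code below assumes overlapping sliding windows of adjacent sentences and a hypothetical unit id of the form `docid#start`, which keeps the link back to the original document. Whether the windows overlapped in the actual experiments is not stated, so this is an assumption.

```python
def sentence_windows(doc_id, sentences, size):
    """Return (unit_id, text) pairs, one per window of `size` adjacent sentences.

    Assumptions: windows overlap (slide by one sentence), and a document
    shorter than `size` is kept as a single unit.
    """
    if not sentences:
        return []
    n_windows = max(1, len(sentences) - size + 1)
    return [
        (f"{doc_id}#{start}", " ".join(sentences[start:start + size]))
        for start in range(n_windows)
    ]


sentences = ["First sentence.", "Second sentence.", "Third sentence."]
for size in (1, 2, 3):  # one pass per sentence-level dataset
    print(size, sentence_windows("doc1", sentences, size))
```

The document id embedded in each unit id makes it possible to map a retrieved unit back to the document that must be reported in the run.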

"Sent 1" has only single sentences; in other words, we treat each sentence as a logical 'document'. "Sent 2" has pairs of adjacent sentences. "Sent 3" has sequences of 3 adjacent sentences.

2.4 Training Topic Set

We did not use the training topics provided in CLEF eHealth 2014 because there were only 5 topics and the coverage of the qrels file is small. Instead, we used the CLEF eHealth 2013 test topics, a set of 50 topics, as our training topics. Figure 2 shows an example training topic.

<query>
  <id>qtest1</id>
  <discharge_summary>00098-016139-DISCHARGE_SUMMARY.txt</discharge_summary>
  <title>Hypothyreoidism</title>
  <desc>What is hypothyreoidism</desc>
  <narr>description of what type of disease hypothyreoidism is</narr>
  <profile>A forty year old woman, who seeks information about her condition</profile>
</query>

Figure 2. An example training topic.

3 Baseline

To establish our baseline, we created separate indexes from the different datasets ("All Text", "Text Clean", "Sent 1", "Sent 2", and "Sent 3") using Indri [7]. We filtered out stopwords during indexing and in the queries. We ran Indri's query likelihood model, using only the title as the query, against each index, and took the best-performing configuration as our baseline. For instance, the query for the example in Section 2.4 is "#combine(Hypothyreoidism)".

The evaluation focused on P@5, P@10, NDCG@5, and NDCG@10. These results, together with MAP, are shown in Table 1. We also include the official baselines and the best-performing runs from 2013. Bolded scores are the best for that measure in the table.
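For readers less familiar with these measures, the sketch below computes P@k and NDCG@k for a single topic. It uses one common NDCG formulation (gain divided by log2 of rank + 1, normalized by the ideal ordering of all judged documents); the official evaluation tool may differ in details, and the document ids and relevance grades in the example are made up.

```python
import math


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranked_ids[:k] if doc in relevant_ids) / k


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_ids, grades, k):
    """DCG of the ranking divided by the DCG of the ideal ranking."""
    gains = [grades.get(doc, 0) for doc in ranked_ids]  # unjudged counts as 0
    ideal_dcg = dcg_at_k(sorted(grades.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical ranking and judgments for one topic.
ranking = ["d1", "d2", "d3", "d4", "d5"]
grades = {"d1": 1, "d3": 1, "d9": 1}
relevant = {doc for doc, grade in grades.items() if grade > 0}
print(precision_at_k(ranking, relevant, 5))
print(ndcg_at_k(ranking, grades, 5))
```

Scores reported per run are averages of these per-topic values over all topics.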

Title All Text is the retrieval strategy described above, which uses the title as the query and All Text as the index. BM25 and BM25 FB (with pseudo relevance feedback) are the official baselines from 2013. Like Title All Text, the two official baselines use only the title as the query. Mayo2 and Mayo3 are the two best runs from last year, by Zhu et al. at Mayo Clinic [8]. Our Title All Text is better than BM25 on all measures, possibly because we used Lynx to produce the text. It even outperforms Mayo2 and Mayo3 in terms of NDCG@5 and NDCG@10 (but not P@5, P@10, or MAP). However, title-only retrieval from the Text Clean and Sent 1/2/3 indexes did not improve performance. With Sent 1/2/3 in particular, performance dropped significantly on all measures.

Therefore, we use Title All Text as the baseline for the later experiments. We drop Text Clean and the three sentence-level datasets since they do not improve retrieval performance.

4 Optimize Pseudo Relevance Feedback

Pseudo relevance feedback is a popular and successful method for expanding queries. As Table 1 shows, the official baseline BM25 FB outperforms BM25 on almost all measures. We tried to improve on our Title All Text baseline by optimizing the parameters of pseudo relevance feedback (Lavrenko's relevance models [4]) in Indri. Three parameters need to be set: 1) the weight of the original query (Weight), where the expanded query is weighted by 1-Weight; 2) the number of documents used for pseudo relevance feedback; and 3) the number of terms selected for the feedback query.
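To make the three parameters concrete, the following sketch writes them into an IndriRunQuery parameter file. It assumes the standard IndriRunQuery parameter names `fbOrigWeight`, `fbDocs`, and `fbTerms`; the text does not say how the parameters were passed, and the index path and helper name here are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_prf_params(index_path, queries, orig_weight, fb_docs, fb_terms):
    """Build an IndriRunQuery parameter file with pseudo relevance feedback settings."""
    root = ET.Element("parameters")
    ET.SubElement(root, "index").text = index_path
    for number, text in queries:
        query = ET.SubElement(root, "query")
        ET.SubElement(query, "number").text = number
        ET.SubElement(query, "text").text = text
    # Weight of the original query; the expanded query gets 1 - orig_weight.
    ET.SubElement(root, "fbOrigWeight").text = str(orig_weight)
    ET.SubElement(root, "fbDocs").text = str(fb_docs)    # feedback documents
    ET.SubElement(root, "fbTerms").text = str(fb_terms)  # feedback terms
    return ET.tostring(root, encoding="unicode")


params = build_prf_params(
    "/path/to/all_text_index",  # hypothetical index location
    [("qtest1", "#combine(hypothyreoidism)")],
    orig_weight=0.6, fb_docs=5, fb_terms=20,
)
print(params)
```

The resulting file would be passed to `IndriRunQuery` as its argument; one file per parameter setting makes the later sweeps easy to script.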

One important note: in the later experiments, if a document retrieved in the top 10 was not in the 2013 test qrels (recall that the 2013 test topics are our training topics), we judged it ourselves and added it to the qrels. When judging, we always tried to follow how similar documents were labeled in the official qrels. (In fact, many documents are almost identical, but only some of them were labeled because of pooling.) By the end of our experiments, we had added a total of 310 documents to the qrels (80 relevant and 230 non-relevant). Admittedly, extending the qrels may make the later comparison against the 2013 official submitted runs and 2013 baselines unfair. However, it would be impossible to improve our retrieval strategies without labeling the unjudged documents in the top 10.

4.1 Weight of Original Query

We experimented with Weight values from 0.1 to 0.9, with # terms and # docs initially set to 20 and 5 respectively. The results are shown in Table 2.

Weights between 0.6 and 0.9 are strong across the measures. We favor 0.6 and 0.7 because they emphasize precision at high ranks.

4.2 Number of Documents

We explored values for the number of documents from 5 to 50. We tried both 0.6 and 0.7 for Weight, the best values from the previous experiment. Again, the number of terms was set to 20. Table 3 shows the results for Weight=0.6, which performed better than 0.7 in this experiment.

The optimal number of documents is 10 (for both Weight=0.6 and Weight=0.7).

4.3 Number of Terms

Next we explored values for the number of terms from 5 to 50. We set Weight and # Docs to 0.6 and 10 respectively, based on the previous experiments. Table 4 shows the results, along with the baseline results (without pseudo relevance feedback). Both 40 and 45 are good values for # Terms. We chose 45 for the later experiments since we wanted to focus on top 5 performance. (The later official evaluation used top 10 for its primary measures.) Our final parameters for pseudo relevance feedback are therefore Weight=0.6, # Docs=10, and # Terms=45.
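The one-parameter-at-a-time procedure of Sections 4.1 to 4.3 can be sketched as follows. The `evaluate` callable stands in for running retrieval and scoring it on the training topics; the toy objective below is invented purely so that the example runs and is not derived from the results in Tables 2 to 4. Unlike the procedure described above, this simplified sketch carries only the single best value forward at each step.

```python
def tune_sequentially(evaluate, grids, initial):
    """Tune one parameter at a time, fixing each at its best value before moving on."""
    best = dict(initial)
    for name, values in grids:
        scored = [(evaluate({**best, name: value}), value) for value in values]
        best[name] = max(scored, key=lambda pair: pair[0])[1]
    return best


def toy_score(params):
    """Made-up objective standing in for, e.g., P@5 on the training topics."""
    return -(abs(params["weight"] - 0.6)
             + abs(params["docs"] - 10) / 100
             + abs(params["terms"] - 45) / 1000)


grids = [
    ("weight", [round(0.1 * i, 1) for i in range(1, 10)]),  # 0.1 .. 0.9
    ("docs", list(range(5, 55, 5))),                        # 5 .. 50
    ("terms", list(range(5, 55, 5))),                       # 5 .. 50
]
print(tune_sequentially(toy_score, grids, {"weight": 0.5, "docs": 5, "terms": 20}))
```

A sequential search like this needs far fewer retrieval runs than a full grid, at the cost of possibly missing interactions between parameters.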

5 Expanding the Query Using Description & Narrative

As the topic example in Section 2.4 shows, the title contains only the minimum information about the topic. To better describe the user's information need, we can expand the query using the description or narrative field of the topic.

We explored linear combinations of title and description, and of title and narrative, to improve retrieval performance. Specifically, we weight the title by WeightT and the description or narrative by 1-WeightT. (We also filtered out stopwords from the description and narrative fields.)
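One way to express such a linear combination in the Indri query language is a `#weight` over two `#combine` clauses, as sketched below. The exact query form used in the experiments is not given, so this form is an assumption, as are the tiny stopword list, the lowercasing, and the helper names.

```python
# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "of", "is", "what", "and"}


def _fmt(weight):
    """Format a weight compactly, hiding floating-point noise such as 0.30000000000000004."""
    return f"{round(weight, 4):g}"


def _terms(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def title_field_query(title, field_text, weight_t):
    """Weight the title by weight_t and the other field by 1 - weight_t."""
    title_part = "#combine(" + " ".join(_terms(title)) + ")"
    field_part = "#combine(" + " ".join(_terms(field_text)) + ")"
    return f"#weight( {_fmt(weight_t)} {title_part} {_fmt(1 - weight_t)} {field_part} )"


print(title_field_query(
    "Hypothyreoidism",
    "description of what type of disease hypothyreoidism is",
    0.7,
))
```

For the example topic, the narrative contributes the extra terms "description", "type", and "disease", which illustrates how such fields can add generic words alongside the key concept.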

The results for the linear combinations of title and description, and of title and narrative, are shown in Table 5 and Table 6 respectively. In both tables, performance increases as WeightT increases. But even with WeightT=0.9, performance is still not as good as the baseline. Therefore, using the description or narrative fields did not significantly improve retrieval performance. These fields may require more sophisticated methods to extract keywords and combine them with the title.

6 Markov Random Field Model

Inspired by Zhu et al. [8], we also explored the Markov Random Field (MRF) model [5]. Zhu et al. used the parameter settings described in [5]. For example, if the topic title is "Coronary artery disease", the expanded Indri query using the MRF model is:

#weight( 0.8 #combine(coronary artery disease)
         0.1 #combine( #1(coronary artery) #1(artery disease) )
         0.1 #combine( #uw8(coronary artery) #uw8(artery disease) ) )

In this section, we describe how we modified the MRF model and explored its parameters. To distinguish the original MRF from our modified version, we call the original MRF Bigram, since it expands the query using the bigrams in the query, and we call our modified version MRF MedPhrase.

6.1 MRF Bigram

The MRF Bigram model has 3 parameters: the weight of the title (WeightT), where the weights of the #1 part and the #uw8 part are both (1-WeightT)/2; the Window Type (uw or od, meaning an unordered or ordered window over the terms); and the Window Size (e.g., uw8 means an unordered window of size 8 in Indri). We began with WeightT, with Window Type and Size initially set to uw and 8 respectively. The results are shown in Table 7.
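The way these three parameters shape the query can be sketched as a small query builder. This is an illustrative reconstruction from the example query shown earlier, not the code used in the experiments; the function name and the fallback for single-term titles are assumptions.

```python
def _fmt(weight):
    """Format a weight compactly, hiding floating-point noise."""
    return f"{round(weight, 4):g}"


def mrf_bigram_query(terms, weight_t=0.8, window_type="uw", window_size=8):
    """Build an MRF Bigram query: unigrams, exact bigrams, and windowed bigrams."""
    unigrams = "#combine(" + " ".join(terms) + ")"
    if len(terms) < 2:
        return unigrams  # assumption: no bigrams to add for a single-term title
    side = _fmt((1 - weight_t) / 2)  # shared weight of the #1 and window parts
    bigrams = [f"{a} {b}" for a, b in zip(terms, terms[1:])]
    exact = " ".join(f"#1({b})" for b in bigrams)
    window = " ".join(f"#{window_type}{window_size}({b})" for b in bigrams)
    return (f"#weight( {_fmt(weight_t)} {unigrams} "
            f"{side} #combine( {exact} ) "
            f"{side} #combine( {window} ) )")


print(mrf_bigram_query(["coronary", "artery", "disease"]))
```

With the default settings this reproduces the "coronary artery disease" query above; changing `window_type` to "od" or `window_size` to another value gives the variants explored in the window experiments.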

The MRF Bigram model does improve retrieval performance compared to our baseline (Title All Text). The optimal value for WeightT is 0.8 or 0.9. We chose 0.8 since we focused more on top 5 performance. (Again, the later official evaluation focused on the top 10.)

Next, we examined whether changing the Window Type and Size affects retrieval performance. The results are shown in Table 8.

Therefore, WeightT=0.8 and uw5 are our optimal parameters for the MRF Bigram model.

6.2 MRF MedPhrase

MRF Bigram does improve retrieval performance, but using bigrams does not always make sense. For example, the topic "facial cuts and scar tissue" should ideally be interpreted as the phrases "facial cuts" and "scar tissue". The bigram "cuts scar" (ignoring stopwords) does not make sense. Therefore, we modified the original MRF model to use only medical phrases to expand the query. For the "coronary artery disease" example above, the MRF MedPhrase model generates the following query, because "coronary artery disease" is a medical phrase:

#weight( 0.8 #combine(coronary artery disease)
         0.1 #combine( #1(coronary artery disease) )
         0.1 #combine( #uw5(coronary artery disease) ) )

As another example, for the topic "shortness breath swelling" the MRF MedPhrase model generates:

#weight( 0.8 #combine(shortness breath swelling)
         0.1 #combine( #1(shortness breath) swelling )
         0.1 #combine( #uw5(shortness breath) swelling ) )

To identify the medical phrases, we use MetaMap [1] to parse the topic title. As with MRF Bigram, we found that the optimal value for WeightT is 0.8 and that the Window Type and Size should be set to uw5.
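A sketch of the corresponding query builder is given below. It assumes the title has already been segmented into phrases and single terms (for example from MetaMap output, which is not called here), represented as a hypothetical list of term lists. The fallback to a plain `#combine` when no multi-word phrase is found is also an assumption.

```python
def _fmt(weight):
    """Format a weight compactly, hiding floating-point noise."""
    return f"{round(weight, 4):g}"


def mrf_medphrase_query(segments, weight_t=0.8, window="uw5"):
    """Build an MRF MedPhrase query from pre-segmented title terms.

    `segments` is a list of term lists: multi-term lists are medical phrases,
    single-term lists are terms outside any phrase.
    """
    terms = [term for segment in segments for term in segment]
    unigrams = "#combine(" + " ".join(terms) + ")"
    if all(len(segment) == 1 for segment in segments):
        return unigrams  # assumption: nothing to expand without a phrase

    def render(operator):
        # Wrap phrases in the operator; leave single terms as they are.
        return " ".join(
            f"#{operator}({' '.join(segment)})" if len(segment) > 1 else segment[0]
            for segment in segments
        )

    side = _fmt((1 - weight_t) / 2)
    return (f"#weight( {_fmt(weight_t)} {unigrams} "
            f"{side} #combine( {render('1')} ) "
            f"{side} #combine( {render(window)} ) )")


print(mrf_medphrase_query([["shortness", "breath"], ["swelling"]]))
```

Compared with the bigram version, only terms inside a detected phrase are tied together, so a pair such as "breath swelling" never becomes a proximity constraint.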

To extract medical phrases correctly, we also enabled spell checking (SC) for the MRF models. Table 9 compares MRF Bigram and MRF MedPhrase. In this comparison, we also combined MRF with pseudo relevance feedback (RF).

Supporting our intuition, the MRF MedPhrase model outperforms MRF Bigram on all measures.

7 Expand Medical Abbreviation

Our best run, using MRF MedPhrase with spell checking and pseudo relevance feedback, is significantly better than the best runs from last year. One more important step remains. The medical topics contain several abbreviations, and expanding them would be very helpful. However, expanding medical abbreviations is also not trivial. We tried several medical abbreviation lists and found that the one from Wikipedia (footnote 5) is probably the most appropriate for our task. Even so, some abbreviations are still missed. For example, in the 2014 test data we found that "L" could mean "left", which our method cannot expand.

The result is shown in Table 10. Using medical abbreviation expansion does help achieve higher performance.
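A dictionary-based expansion of this kind might look like the sketch below. The two entries are illustrative stand-ins for a full abbreviation list, and the choices to match abbreviations case-sensitively and to keep the original abbreviation next to its expansion are assumptions; the text does not specify either.

```python
# Illustrative entries only; a real run would load a complete abbreviation list.
ABBREVIATIONS = {
    "MI": "myocardial infarction",
    "SOB": "shortness of breath",
}


def expand_abbreviations(terms, abbreviations):
    """Append the expansion after each known abbreviation (exact-case match)."""
    expanded = []
    for term in terms:
        expanded.append(term)  # assumption: the abbreviation itself is kept
        expansion = abbreviations.get(term)
        if expansion:
            expanded.extend(expansion.split())
    return expanded


print(expand_abbreviations(["MI", "L", "leg"], ABBREVIATIONS))
```

As in the "L" example above, any abbreviation missing from the list simply passes through unchanged.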

So far, we have run several experiments, covering cleaning of the web text, sentence-level retrieval, pseudo relevance feedback, linear combination of title and description/narrative, the MRF model, spell checking, and abbreviation expansion. The comparison between our best strategy and our baseline is shown in Table 11. Our best strategy improves the top 5 measures by about 15% over the baseline, and the top 10 measures by about 8-9%.

Table 11. Best strategy, with the improvement over the baseline in parentheses.

Run                        P@5               P@10             NDCG@5            NDCG@10          MAP
MRF MedPhrase RF SC Abbr   0.5520 (14.05%↑)  0.5120 (7.56%↑)  0.5498 (15.41%↑)  0.5257 (9.27%↑)  0.2625 (11.7%↑)

8 Submitted Runs And Results

Because the discharge summaries are very noisy, we did not develop retrieval strategies that use them. We submitted 4 runs in our final submission. (The baseline is Run 1, and the experiments without discharge summaries are Runs 5-7, where 5 has the highest priority and 7 the lowest.) Table 12 shows our runs and the technologies used.

Footnote 5: http://en.wikipedia.org/wiki/List_of_medical_abbreviations:_A

Run 1 is our baseline, which uses only the title to retrieve medical documents. Run 5 is our best run: it uses the Markov Random Field (MRF) model that expands queries using only medical phrases, together with abbreviation expansion, pseudo relevance feedback, and spell checking. Run 6 is the same as Run 5 but without pseudo relevance feedback. Run 7 is the same as Run 5 but without the MRF model.

Table 13 shows the final performance from the official evaluation. Unfortunately, the runs do not differ significantly from each other. Our Run 7 has better scores for P@5 and NDCG@5, which were our original focus. This suggests that pseudo relevance feedback can achieve high-accuracy retrieval, especially for the top 5 results. (Run 7 was not in the judged pool for the final judgement, so its real performance could be even higher.) However, our baseline (Run 1) performs better on P@10 and NDCG@10, which are the primary official measures. The MRF model we tuned on the 2013 test data does not improve retrieval performance on the 2014 test dataset. One possible reason is that we overfitted the model, even though we tried to avoid that pitfall.

Figure 3 compares our Run 1 (which has the best top 10 performance among our runs) against the median and best performance (P@10) across all systems submitted to CLEF, for each query topic. Topics 8, 13, 15, 28, 34, 44, and 50 are handled easily by Run 1, but topics 7, 11, 22, 32, 38, 40, and 47 are difficult for it.

9 Conclusion

We explored cleaning of the dataset and sentence-level retrieval, and showed that neither method improved retrieval performance. We also tried linear combinations of title and description/narrative; making effective use of these fields appears to be a non-trivial task. We ran experiments to find the optimal parameters for pseudo relevance feedback and showed that it can achieve higher performance for the top 5 retrieved items. We modified the Markov Random Field model to expand the query using medical phrases. This method achieves higher performance on the 2013 queries but fails on the 2014 test dataset. Planned future work includes a more sophisticated method for combining the title with the description, narrative, and discharge summary, and avoiding overfitting of the MRF model.

References

[1] A. R. Aronson and F.-M. Lang. An overview of MetaMap: historical perspective and recent advances. Journal of the American Medical Informatics Association, 17(3):229-236, 2010.

[2] Goeuriot, Kelly, Li, Palotti, Pecina, Zuccon, Hanbury, G. Jones, and Mueller. ShARe/CLEF eHealth Evaluation Lab 2014, Task 3: User-centred health information retrieval. In Proceedings of CLEF 2014, 2014.

[3] Kelly, Goeuriot, Suominen, Schrek, Leroy, D. L. Mowery, Velupillai, W. W. Chapman, Martinez, G. Zuccon, and Palotti. Overview of the ShARe/CLEF eHealth Evaluation Lab 2014. In Proceedings of CLEF 2014, Lecture Notes in Computer Science (LNCS). Springer, 2014.

[4] Lavrenko and W. B. Croft. Relevance based language models. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 120-127. ACM, 2001.

[5] Metzler and W. B. Croft. A Markov random field model for term dependencies. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 472-479. ACM, 2005.

[6] Saetre, Yoshida, Yakushiji, Miyao, Matsubayashi, and Ohta. Akane system: protein-protein interaction pairs in BioCreative2 challenge, PPI-IPS subtask. In Proceedings of the Second BioCreative Challenge Workshop, pages 209-212, 2007.

[7] Strohman, Metzler, Turtle, and W. B. Croft. Indri: A language model-based search engine for complex queries. In Proceedings of the International Conference on Intelligent Analysis, volume 2, pages 2-6. Citeseer, 2005.

[8] Zhu, Wu, James, Carterette, and Liu. Using discharge summaries to improve information retrieval in clinical domain. In Proceedings of the ShARe/CLEF eHealth Evaluation Lab, 2013.