Improving Understandability in Consumer Health Information Search: UEVORA @ 2016 FIRE CHIS

Hua Yang
Computer Science Department, University of Évora, Évora, Portugal
huayangchn@gmail.com

Teresa Gonçalves
Computer Science Department, University of Évora, Évora, Portugal
tcg@uevora.pt

ABSTRACT
This paper presents our work at 2016 FIRE CHIS. Given a CHIS query and a document associated with that query, the task is to classify the sentences in the document as relevant to the query or not, and to further classify the relevant sentences as supporting, neutral or opposing the claim made in the query. We present two different approaches to this classification. In the first approach, we implement two models, one per sub-task: an information retrieval model retrieves the sentences that are relevant to the query, and a supervised classification model then labels the relevant sentences as support, oppose or neutral. In the second approach, we use machine learning techniques only, training a single model that classifies the sentences into four classes (relevant & support, relevant & neutral, relevant & oppose, irrelevant & neutral). Our submission to CHIS uses the first approach.

CCS Concepts
• Information systems➝Data management system engines

Keywords
Health information search; machine learning; IR

1. INTRODUCTION
Online search engines have become a common way of obtaining health information; a Pew Internet & American Life Project report shows that about 69% of U.S. adults have used the Internet as a tool for health information such as weight, diet and symptoms [4]. In the meanwhile, research interest in health information retrieval (HIR) has also grown over the past years. Health information is of interest to a wide variety of users: physicians, specialists, practitioners, nurses, patients and their families, biomedical researchers and consumers (the general public). It is also available from diverse sources, such as electronic health records, personal health records, the general web, social media, journal articles, and wearable devices and sensors [5].

While factual health information search has matured considerably, complex health information search, where there is more than one single correct answer, still remains elusive. Consumer Health Information Search (CHIS) at FIRE 2016 was proposed to investigate complex health information search by laypeople. In this scenario, laypeople search for health information with multiple perspectives, drawn from sources ranging from medical research to real-world patient narratives.

There are two sub-tasks:

A) Given a CHIS query and a document (or set of documents) associated with that query, classify the sentences in the document as relevant to the query or not. The relevant sentences are those from the document that are useful in providing an answer to the query.

B) The relevant sentences need to be further classified as supporting or opposing the claim made in the query.

The five queries proposed in the task are shown in Figure 1, and Figure 2 gives an example of the output expected from a system. An annotated data set is provided to the participants.

    Q1: Does sun exposure cause skin cancer?
    Q2: Are e-cigarettes safer than normal cigarettes?
    Q3: Can Hormone Replacement Therapy (HRT) cause cancer?
    Q4: Can MMR Vaccine lead to children developing autism?
    Q5: Should I take vitamin C for common cold?

Figure 1. 2016 FIRE CHIS queries

    Example query: Are e-cigarettes safer than normal cigarettes?

    S1: Because some research has suggested that the levels of most toxicants in vapor are lower than the levels in smoke, e-cigarettes have been deemed to be safer than regular cigarettes.
    A) Relevant, B) Support

    S2: David Peyton, a chemistry professor at Portland State University who helped conduct the research, says that the type of formaldehyde generated by e-cigarettes could increase the likelihood it would get deposited in the lung, leading to lung cancer.
    A) Relevant, B) Oppose

    S3: Harvey Simon, MD, Harvard Health Editor, expressed concern that the nicotine amounts in e-cigarettes can vary significantly.
    A) Irrelevant, B) Neutral

Figure 2. 2016 FIRE CHIS task description
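To make the input and output of Figure 2 concrete, the following sketch shows one possible way of representing the annotated sentences for the two sub-tasks. The record layout and field names ("query", "sentence", "relevance", "stance") are our own illustration and are not prescribed by the task.

```python
# A minimal, hypothetical representation of the Figure 2 example; the field
# names are illustrative only and are not part of the CHIS data format.
examples = [
    {
        "query": "Are e-cigarettes safer than normal cigarettes?",
        "sentence": "Because some research has suggested that the levels of most "
                    "toxicants in vapor are lower than the levels in smoke, "
                    "e-cigarettes have been deemed to be safer than regular cigarettes.",
        "relevance": "relevant",   # sub-task A label
        "stance": "support",       # sub-task B label
    },
    {
        "query": "Are e-cigarettes safer than normal cigarettes?",
        "sentence": "Harvey Simon, MD, Harvard Health Editor, expressed concern "
                    "that the nicotine amounts in e-cigarettes can vary significantly.",
        "relevance": "irrelevant",
        "stance": "neutral",
    },
]
```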
This paper is divided into four sections. In the first section, we briefly introduce the background and the 2016 FIRE CHIS task. The second section describes the two approaches we experimented with to accomplish the task. Experiments and results are presented in the third section. Finally, conclusions are drawn.

2. METHODS
We propose two different approaches to accomplish the task; to make the explanation easier, we name them program A and program B. In program A, two different models are trained, using state-of-the-art information retrieval and machine learning techniques. In program B, we treat the task as a whole and use machine learning techniques only, training one single classification model. Each approach is discussed in detail below.

2.1 Program A
Since the task is divided into sub-tasks, we implement two different models, one per sub-task. For task A, we implement an information retrieval (IR) model to retrieve relevant sentences: retrieved sentences are regarded as relevant to the query, and non-retrieved ones as irrelevant. For task B, we use a supervised learning algorithm to train a classification model; the sentences retrieved in the first part are then classified as support, oppose or neutral with respect to the claim made in the query.

2.1.1 An IR model for task A
In task A, the sentences provided by the organizers should be classified as relevant to the queries or not. We implement an IR model to do this classification: retrieved sentences are regarded as relevant to the query, and non-retrieved ones as irrelevant.

Figure 3 depicts our model for task A. First, the original task queries and the provided sentences are given to the IR model. The relevant sentences are retrieved and ranked according to the weighting method. The top-ranked relevant sentences (in our experiments, the top 3) are used as the source to expand the original queries, and the IR model is run again with the expanded queries. All sentences retrieved by the IR model are regarded as relevant to the query and are used as the input of the classification model in task B.

Figure 3. Information retrieval model for task A

Terrier (terrier.org) is used to implement a baseline IR model. All queries and sentences are pre-processed: stop-words are removed, and stemming and normalization are applied. The TF*IDF weighting model is used to compute sentence scores with respect to the query. The queries can be processed one by one or in batch. We use pseudo relevance feedback to expand the original queries, and all Terrier parameters are kept at their default values.

Pseudo relevance feedback (also known as blind relevance feedback) is a way of improving retrieval performance without user interaction [1]; previous work has shown its effectiveness [2][3]. Figure 4 depicts how this technique can be used in an IR model. We use it in our experiments to expand the original query: the most informative terms are extracted from the top-returned documents and added as expanded query terms, as shown in Figure 4. We use Bo1 [6] as the expansion term weighting model; Bo1 is based on Bose-Einstein statistics and weights the terms found in the top retrieved documents. In our experiments, 10 expansion terms are extracted from the top 3 retrieved documents. No other query expansion techniques are used.

Figure 4. Pseudo relevance feedback (image from http://www.slideshare.net/LironZighelnic/querydrift-prevention-for-robust-query-expansion-presentation-43186077)
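The sketch below illustrates the retrieve-expand-retrieve loop of Figure 3. It is only an approximation of the Terrier pipeline described above, written with scikit-learn: TfidfVectorizer stands in for Terrier's TF*IDF weighting and a summed-TF*IDF heuristic stands in for Bo1, but the loop keeps the same settings (10 expansion terms drawn from the top 3 retrieved sentences). The sentences and query are placeholders.

```python
# Approximate sketch of the retrieve-expand-retrieve loop of Figure 3 (NOT the
# Terrier/Bo1 pipeline used in the paper).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank(query, vectorizer, doc_matrix, k=None):
    """Rank sentence indices by cosine similarity between TF-IDF vectors."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix).ravel()
    order = np.argsort(scores)[::-1]
    return order if k is None else order[:k]

def expand_query(query, sentences, feedback_ids, n_terms=10):
    """Append the n_terms highest-weighted terms from the feedback sentences
    (a crude stand-in for Bo1 term weighting)."""
    fb_vec = TfidfVectorizer(stop_words="english")
    fb_matrix = fb_vec.fit_transform([sentences[i] for i in feedback_ids])
    weights = np.asarray(fb_matrix.sum(axis=0)).ravel()
    terms = np.array(fb_vec.get_feature_names_out())
    top_terms = terms[np.argsort(weights)[::-1][:n_terms]]
    return query + " " + " ".join(top_terms)

# Placeholder sentences; in the task these are the sentences provided per query.
sentences = [
    "e-cigarettes have been deemed to be safer than regular cigarettes",
    "formaldehyde generated by e-cigarettes could lead to lung cancer",
    "vitamin C has long been a popular remedy for the common cold",
]
query = "Are e-cigarettes safer than normal cigarettes?"

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(sentences)

first_pass = rank(query, vectorizer, doc_matrix, k=3)       # initial retrieval
expanded = expand_query(query, sentences, first_pass)       # PRF-style expansion
final_ranking = rank(expanded, vectorizer, doc_matrix)      # second retrieval
print(expanded, final_ranking)
```

In the submitted runs it is Terrier's retrieved versus non-retrieved split, rather than the full ranking shown here, that determines which sentences are passed on as relevant.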
2.1.2 A classification model for task B
For task B, we propose the classification model presented in Figure 5, which further classifies the retrieved sentences into different classes.

The annotated dataset provided by the organizers is first pre-processed. A TF*IDF scheme is then used to extract features from the text. These features are used as the input of the learning system to train a classification model, which is able to classify the relevant sentences retrieved by the IR model as support, oppose or neutral with respect to the claim stated in the query.

The TextBlob tool (https://textblob.readthedocs.io/en/dev/) is used for text processing. Naïve Bayes and decision tree classifiers are used as learning methods. Only TF*IDF features are extracted; no other features are used in our experiments.

Figure 5. Classification model for task B

2.1.3 Integration
The sentences retrieved by the IR model are regarded as relevant to the query and are further labelled as 'neutral', 'support' or 'oppose' by the classification model. The sentences not retrieved by the IR model are regarded as irrelevant to the query, and all of them are assigned the 'neutral' label.

2.2 Program B
As another approach to the problem, and to provide multiple perspectives to the users, we treat the task as a whole and re-organize the annotated data with four different labels:

- irrelevant & neutral
- relevant & support
- relevant & oppose
- relevant & neutral

All the provided sentences are pre-processed and used to train a classification model with supervised machine learning techniques; as before, features are extracted with the TF*IDF scheme, and the test data is pre-processed before classification. The resulting model is used to classify the test sentences into the four classes above. The approach is the same as the one used in program A, but here all the sentences are classified directly and there are four classes instead of three, as Figure 6 shows. The output is a sentence with one of the four labels listed above, for example:

    Sentence: Harvey Simon, MD, Harvard Health Editor, expressed concern that the nicotine amounts in e-cigarettes can vary significantly.
    Output: Irrelevant & Neutral

Figure 6. Classification model for program B
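The following is a minimal sketch of the kind of TF*IDF + Naïve Bayes sentence classifier used in sub-section 2.1.2 (three stance classes) and in program B (four combined classes). It uses scikit-learn rather than the TextBlob-based processing described above, and the training sentences and labels are placeholders, not the CHIS data.

```python
# Minimal sketch of a TF*IDF + Naive Bayes sentence classifier; scikit-learn is
# used here for illustration, the training examples below are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train_sentences = [
    "e-cigarettes have been deemed to be safer than regular cigarettes",
    "formaldehyde generated by e-cigarettes could lead to lung cancer",
    "the nicotine amounts in e-cigarettes can vary significantly",
]
train_labels = ["relevant & support", "relevant & oppose", "irrelevant & neutral"]

model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),  # TF*IDF features only
    ("nb", MultinomialNB()),                            # Naive Bayes learner
])
model.fit(train_sentences, train_labels)

print(model.predict(["are e-cigarettes safer than smoking?"]))
```

For program A, this kind of pipeline would be trained on the relevant sentences with the three stance labels only; program B trains it once over all sentences with the four combined labels.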
3. EXPERIMENTS AND RESULTS
In this section we report the results of our experiments, presented separately for each of the two programs described in the previous section.

3.1 Experiments of Program A

3.1.1 Runs for task A
The results of the different runs are shown in Table 1. The TrecEval program (http://trec.nist.gov/trec_eval/) is used to evaluate the performance, and the F1 score is used as the evaluation measure. We produced four runs:

- taskA.run1: all queries processed in batch, without pseudo relevance feedback
- taskA.run2: all queries processed in batch, with pseudo relevance feedback
- taskA.run3: queries processed individually, without pseudo relevance feedback
- taskA.run4: queries processed individually, with pseudo relevance feedback

Table 1. Comparison of task A runs (F1 score)

We obtained our best results with run4, with an average F1 score of 0.73. The results show that our IR model works well on query 3, query 4 and query 5. Regarding the way the queries are processed, processing them one by one is clearly better than processing them all in batch.

As a query expansion technique, PRF noticeably improves recall, meaning that more relevant documents are returned. Its effect, however, depends on the processing mode: when all queries are processed in batch, using PRF decreases the F1 score compared with the results obtained without it; when the queries are processed one by one, PRF increases the overall performance, although some queries show a slightly lower score than without PRF. We can also see that for query 1 and query 2 the score improves sharply when PRF is used. Given the task and our system, we adopt PRF as a way to improve the system performance.

3.1.2 Run for task B
For task B, we use the traditional TF*IDF scheme to extract features and Naïve Bayes as the learning method. Table 2 presents our results for this part. The average F1 score for this classification is 0.28, which is very low. The classification is based on the output of the IR model: some sentences that are in fact irrelevant to the query are classified as relevant, and these sentences are then passed to the classification model as if they were relevant, which hurts the performance of the system.

Table 2. Results of task B (F1 score)

3.2 Experiments of Program B
Table 3 gives the final results for this program. In this program, we treat the task as a whole and train only one classification model, so we evaluate the final output of the program with a single score covering task A and task B together. The average score for this model is 0.64; the highest score is obtained for query 3 and the lowest for query 5.

Table 3. Results of program B (F1 score)
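The per-query scores reported in this section were obtained with the TrecEval program. The sketch below is only a simplified illustration of a per-query F1 computation for the task A relevance labels, using scikit-learn and made-up gold and predicted labels; it is not the exact evaluation protocol.

```python
# Simplified illustration of per-query F1 for task A relevance labels
# (placeholder labels; the paper's scores come from the TrecEval program).
from sklearn.metrics import f1_score

# gold[q] and predicted[q] hold one relevance label per sentence of query q.
gold = {
    "Q1": ["relevant", "irrelevant", "relevant"],
    "Q2": ["irrelevant", "relevant", "relevant"],
}
predicted = {
    "Q1": ["relevant", "relevant", "relevant"],
    "Q2": ["irrelevant", "relevant", "irrelevant"],
}

per_query = {q: f1_score(gold[q], predicted[q], pos_label="relevant") for q in gold}
average_f1 = sum(per_query.values()) / len(per_query)
print(per_query, average_f1)
```

The program B scores are computed analogously but over the four combined labels, which is why they measure task A and task B together.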
4. CONCLUSION
In this paper, we presented two different approaches to the 2016 FIRE CHIS task. In the first approach, we implemented both an IR model and a classification model; the results show that the IR model works well in general, except on query 2, while the classification model shows low performance across the board. In the second approach, we treated the task as a whole and used machine learning techniques only to do the classification.

Although we developed two different approaches to the task, their output formats differ, so we do not compare their performance directly. The second approach presented in this paper is simply another possible way of solving the problem proposed by the organizers. Program A was used as our final submission to the challenge.

5. REFERENCES
[1] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[2] Yang Song, Yun He, Qinmin Hu, Liang He, and E. Mark Haacke. ECNU at 2015 eHealth Task 2: User-centred health information retrieval. In Proceedings of the ShARe/CLEF eHealth Evaluation Lab, 2015.
[3] Ellen M. Voorhees and Donna K. Harman (eds.). TREC: Experiment and Evaluation in Information Retrieval. MIT Press, Cambridge, 2005.
[4] Susannah Fox and Maeve Duggan. Tracking for Health. Pew Research Center's Internet & American Life Project, 2013.
[5] Lorraine Goeuriot, Gareth J. F. Jones, Liadh Kelly, Henning Müller, and Justin Zobel. Medical information retrieval: introduction to the special issue. Information Retrieval Journal, 19(1):1-5, 2016.
[6] Giambattista Amati. Probability Models for Information Retrieval Based on Divergence from Randomness. PhD thesis, University of Glasgow, 2003.