JU_KS_Group@FIRE 2016: Consumer Health Information Search

Kamal Sarkar, Debanjan Das, Indra Banerjee, Mamta Kumari, Prasenjit Biswas
Dept. of Computer Sc. & Engg., Jadavpur University, Kolkata, WB 700032
jukamal2001@yahoo.com, dasdebanjan624@gmail.com, ardnibanerjee@gmail.com, mamta.mk222@gmail.com, p.biswas.ju94@gmail.com

ABSTRACT
In this paper, we describe the methodology used and the results obtained by us for the shared task on Consumer Health Information Search (CHIS) collocated with the Forum for Information Retrieval Evaluation (FIRE) 2016, ISI Kolkata. The shared task consists of two sub-tasks: (1) task 1: given a query and a document/set of documents associated with that query, classify the sentences in the document as relevant to the query or not, and (2) task 2: further classify the relevant sentences as supporting the claim made in the query or opposing it. We participated in both sub-tasks. The percentage accuracy obtained by our system for task 1 was 73.39, the third highest among the 9 teams that participated in the shared task.

Categories and Subject Descriptors
H.1.2 [Information Systems]: User/Machine Systems - human factors, human information processing.

Keywords
Consumer health information search, searching behavior, search tasks, user query, document sentences.

1. INTRODUCTION

1.1 Our Motivation
A large number of websites provide health related information [1][2]. Consumer use of the Internet for seeking health information is rapidly growing [3].
By 1997, nearly half of Internet users in the US had sought health information [4]. Expressed in raw numbers, an estimated 18 million adults in the US sought health information online in 1998. The majority of consumers seek health information for themselves, mostly related to diseases, for consultation with their physicians [5][6]. Information found through search on the web may influence medical decision making and help consumers manage their own care [7]. The most common topics searched on the web are the leading causes of death (heart disease and cancer) and children's health.

Information access mechanisms for factual health information retrieval have matured considerably, with search engines providing fact-checked Health Knowledge Graph results for factual health queries. It is straightforward to get an answer to the query "what are the symptoms of Diabetes" from these search engines [8][9][10]. But most general purpose search engines can hardly find answers to complex health search queries which do not have a single definitive answer and whose answers have multiple perspectives. There may be search queries for which there are a large number of search results reflecting different perspectives and view-points in favor of or against the query.

The term "Consumer Health Information Search" (CHIS) has been used by the organizers of the shared task on Consumer Health Information Search @FIRE 2016 to denote such information retrieval search tasks for which there is no "Single Correct Answer" and, instead, multiple and diverse perspectives/points of view, which very often are contradictory in nature, are available on the web regarding the queried information¹.
¹ https://sites.google.com/site/multiperspectivehealthqa/

1.2 Problem Statement
The shared task on Consumer Health Information Search @FIRE 2016 has the following two sub-tasks:

A) Task 1 - Given a CHIS query and a document/set of documents associated with that query, classify the sentences in the document as relevant to the query or not. Relevant sentences in the document are those which are useful in providing the answer to the query.

B) Task 2 - These relevant sentences have to be further classified as supporting the claim made in the query or opposing it.

1.2.1 Examples
E.g. Query - Are e-cigarettes safer than normal cigarettes?

S1: Because some research has suggested that the levels of most toxicants in vapor are lower than the levels in smoke, e-cigarettes have been deemed to be safer than regular cigarettes.
A) Relevant, B) Support

S2: David Peyton, a chemistry professor at Portland State University who helped conduct the research, says that the type of formaldehyde generated by e-cigarettes could increase the likelihood it would get deposited in the lung, leading to lung cancer.
A) Relevant, B) Oppose

S3: Harvey Simon, MD, Harvard Health Editor, expressed concern that the nicotine amounts in e-cigarettes can vary significantly.
A) Irrelevant, B) Neutral

2. METHODOLOGY

2.1 Description
For both tasks, Task 1 and Task 2, we have used support vector machines (SVM) as the classifier, but the feature sets for task 1 and task 2 were different. We discuss the feature sets used for task 1 and task 2 in sub-sections 2.1.1 and 2.1.3 respectively.

2.1.1 Our Used Features for Task 1
For task 1, the organizers of the shared task gave us a set of Excel files where the heading of each Excel file was a user query. Each Excel file contained a set of training sentences already labeled as 'relevant' or 'irrelevant' to the user query. We took each sentence from each Excel file, paired it with the corresponding query, and calculated the set of five features discussed in this sub-section.

2.1.1.1 Exact Matching: We matched each sentence with the user query, word by word, and calculated the similarity between the user query and the current sentence in the Excel file; e.g.
Let the user query be "Ram is a good boy" and the current sentence be "Shyam is a bad boy". Between the user query and the current sentence there are three words which match exactly, i.e. "is", "a" and "boy". The similarity between these two strings is given as:

Similarity = {2 * (No. of Common Words)} / {(No. of words in the user query) + (No. of words in the current sentence)} --- (i)

where No. of Common Words = number of words common to both the user query and the current sentence.
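As a minimal sketch, equation (i) can be computed as follows; the function name and the whitespace tokenization are our own illustrative assumptions, not details taken from the original system:

    def overlap_similarity(query, sentence):
        # Dice-style word overlap of equation (i):
        # 2 * |common words| / (|query words| + |sentence words|)
        q_words = query.lower().split()
        s_words = sentence.lower().split()
        common = set(q_words) & set(s_words)
        return 2 * len(common) / (len(q_words) + len(s_words))

    # Example from the text: "is", "a" and "boy" match,
    # so the score is 2 * 3 / (5 + 5) = 0.6
    print(overlap_similarity("Ram is a good boy", "Shyam is a bad boy"))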
2.1.1.2 Stemmed Word Matching: We stemmed both the user query and the current sentence using a stemming tool available in the Python programming language. Stemming normalizes a word by cutting off the excess part of the word arising from pluralization or adverb formation; e.g. mangoes → mango, highly → high, etc. After stemming, we again calculated the similarity between the two strings using equation (i).

2.1.1.3 Noun Matching: We found, on a perusal of the initial sample data, that the nouns present in each sentence largely influenced whether a search result was relevant or irrelevant to the user query. So we isolated the nouns present in the user query, checked whether any of these nouns matched any word present in the current sentence, and by this process found the number of nouns in the current sentence that exactly matched nouns in the user query. We calculated the noun matching similarity using the following formula:

Noun Similarity = (No. of nouns that exactly match the nouns in the query) / (No. of nouns present in the user query) --- (ii)

2.1.1.4 Neighborhood Matching: Some words present in the sentences do not match the words of the user query exactly, but are semantically similar to the user query words; e.g.
Let 'skin cancer' be present in the user query and 'melanoma' be present in the current sentence. The two words are spelt differently but their meanings are similar, i.e. they are meaningfully similar. To check whether two words were equivalent, we took each word from the current sentence, looked it up in our self-made Wikipedia Dictionary [11], and extracted the first three sentences describing that word's meaning. We then matched the user query words with the words present in the extracted sentences; if a word is present, we consider it a match, and finally we calculate the similarity between the user query and the current sentence again using equation (i).
We created the Wikipedia Dictionary by saving words along with their meanings, which were extracted from Wikipedia, using a Python script we developed.

2.1.1.5 Cosine Similarity: We represent both the query and a sentence using the bag-of-words model, i.e. each query as well as each sentence is represented as a vector. The component of each vector is the TFIDF weight of a word t, which is calculated as follows:

TF(t) = (Number of times word 't' appears in a sentence) / (Total number of words in the sentence)
IDF(t) = log(N / DF)

where N = total number of sentences and DF = number of sentences containing word 't'.

After calculating the vectors for the query and the sentence, the cosine similarity between the query vector and the sentence vector is calculated. The cosine similarity value is used as one of the feature values for relevance checking.
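A minimal sketch of this TFIDF-plus-cosine computation is given below; the helper names, the whitespace tokenization, and the direct implementation of the stated TF and IDF formulas are our own illustrative assumptions:

    import math
    from collections import Counter

    def build_vocab_df(sentences):
        # Vocabulary and document frequency DF(t) over the sentence collection
        vocab = sorted({w for s in sentences for w in s.lower().split()})
        df = {t: sum(1 for s in sentences if t in s.lower().split()) for t in vocab}
        return vocab, df

    def tfidf_vector(text, vocab, df, n_sentences):
        # TF(t) = count(t) / len(text); IDF(t) = log(N / DF(t)), as in 2.1.1.5
        words = text.lower().split()
        counts = Counter(words)
        return [
            (counts[t] / len(words)) * math.log(n_sentences / df[t])
            if words and df.get(t) else 0.0
            for t in vocab
        ]

    def cosine(u, v):
        # Standard cosine similarity between two equal-length vectors
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0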
2.1.2 Search as Classification
For task 1, we represent each training sentence as a vector of the five feature values mentioned above and label each vector as "relevant" or "not relevant". With this labeled training data, we train the support vector machine (SVM). For the SVM, we used the SVC tool available in Python scikit-learn to generate a model. Since no development set was available, for parameter tuning we split the training data into two parts: (1) the first part containing 60% of the training data and (2) the second part containing the remaining 40%. We train the SVM on the 60% portion and then test the obtained model on the remaining part of the training data. In this way we tune the parameters to obtain the best parameter settings. Finally, we obtained the best results with the cost parameter C set to 107, gamma set to 0.006, and the kernel set to "poly".

Like the training data, we represent the unlabeled test data released by the organizers of the shared task in the same way, using the five features mentioned in sub-section 2.1.1, and then submit it to the trained classifier. The classifier, using its knowledge from the training data, predicts the label for each of the sentences present in the test data.

2.1.3 Our Used Features for Task 2
After relevancy checking (task 1), task 2 is carried out. By task 1, all the sentences in the Excel file have been divided into two classes: (a) relevant and (b) irrelevant. Now the task is to determine whether a relevant sentence supports the user query, opposes it, or is neutral with regard to it. For this task we calculated a set of N+4 features, where N = number of distinct words present in the entire training files. Here the feature set includes the N distinct unigrams present in the training data and four other features discussed in the following sub-sections.

2.1.3.1 Number of Positive Words: We counted the number of positive words present in each sentence of the Excel file. We recognized the positive words in a particular sentence by using the SentiWordNet lexicon available through the Python NLTK package².
² http://www.nltk.org/howto/sentiwordnet.html, http://www.nltk.org/_modules/nltk/corpus/reader/sentiwordnet.html

2.1.3.2 Number of Negative Words: We counted the number of negative words present in each sentence of the Excel file. We recognized the negative words in a particular sentence by using SentiWordNet.

2.1.3.3 Number of Neutral Words: Having already found the positive and negative words of a particular sentence, the words that were neither negative nor positive were classified as neutral words, and their occurrences in the current sentence were counted.
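A minimal sketch of how such positive/negative/neutral counts can be obtained with NLTK's SentiWordNet interface follows; taking the first synset and comparing its scores is our own illustrative choice, not necessarily the exact rule used in the original system:

    import nltk
    from nltk.corpus import sentiwordnet as swn

    # One-time downloads of the required corpora
    nltk.download('sentiwordnet')
    nltk.download('wordnet')

    def sentiment_counts(sentence):
        # Returns (positive, negative, neutral) word counts for a sentence
        pos = neg = neu = 0
        for word in sentence.lower().split():
            synsets = list(swn.senti_synsets(word))
            if not synsets:
                neu += 1
                continue
            s = synsets[0]  # first sense used as an approximation
            if s.pos_score() > s.neg_score():
                pos += 1
            elif s.neg_score() > s.pos_score():
                neg += 1
            else:
                neu += 1
        return pos, neg, neu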
2.1.3.4 Relevant or Irrelevant: In task 1 we have already labeled each sentence as either relevant or irrelevant. We took this label into consideration for this task. It is a binary feature, as the current sentence can only be relevant or irrelevant.

2.1.3.5 'N' Features: We represent each sentence using the bag-of-words model. According to the vector space model, a sentence is represented as an N-dimensional vector, where N is the number of distinct unigrams present in the training data. The weight of a word used as the component of a vector is calculated using the TFIDF formula.

2.1.4 Sentiment Classification
We represent each sentence in the Excel file as a vector using the above-mentioned N+4 features and label each vector with the label of the corresponding training sentence. The label can be one of three types: "Support", "Oppose" and "Neutral". Finally, we submit the labeled vectors to the SVM classifier specified for task 1 and train it using them. The model is generated after training. As in task 1, we split the training data into two parts: (1) the first part, containing 60% of the training data, which is used to develop the initial model, and (2) the remaining 40% of the training data, which is used to test the model while tuning the parameters. After tuning the parameters of the SVC tool available in Python scikit-learn, we obtain the best model with the cost parameter C set to 107, gamma set to 0.005, and the kernel set to "rbf".

We also represent the unlabeled test data released by the organizers for task 2 as vectors using the same feature set consisting of N+4 features and submit them to the trained model, which in turn predicts the label 'supporting'/'opposing'/'neutral' for each sentence present in the test Excel file.
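Putting the task 2 pieces together, the following is a minimal sketch of the classifier stage with scikit-learn's SVC; the build_feature_vectors helper is hypothetical, the random 60/40 split is an assumption (the text does not say how the split was made), and the hyperparameter values are those reported above. The task 1 pipeline is analogous, with the five features of sub-section 2.1.1 and the "poly" kernel.

    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    def train_task2_classifier(X, y):
        # X: list of N+4-dimensional feature vectors
        # y: corresponding labels, "Support" / "Oppose" / "Neutral"
        X_train, X_dev, y_train, y_dev = train_test_split(
            X, y, train_size=0.6, random_state=0)
        clf = SVC(C=107, gamma=0.005, kernel='rbf')
        clf.fit(X_train, y_train)
        print('held-out accuracy:', clf.score(X_dev, y_dev))
        return clf

    # Usage (hypothetical helper):
    # X, y = build_feature_vectors(training_sentences)
    # clf = train_task2_classifier(X, y)
    # predictions = clf.predict(X_test)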
2.2 Architecture
The architectures of our systems for task 1 and task 2 are shown in Figure 1 and Figure 2 respectively. For both systems, the important modules are feature extraction and the classifier. For task 1, we use the 5 features discussed in the earlier sections, and for task 2 we use the N+4 features, also discussed earlier.

For task 1, after feature extraction from each query-sentence pair, each sentence is represented as a vector which is labeled with the label of the corresponding training sentence. The labeled vectors are then given to the classifier to produce a model. Finally, the learned model is used to determine the relevancy of the test sentences given a query. For task 2, we extract features from the sentences and represent the sentences as vectors labeled with one of the categories "oppose", "support" and "neutral". The classifier is trained with the labeled training pattern vectors, and the learned model is used to classify the test sentences into one of the categories "oppose", "support" and "neutral".

[Figure 1. System Architecture for Task 1]
[Figure 2. System Architecture for Task 2]

3. DATA SETS, RESULTS, EVALUATION

3.1 Data Sets
For the training data, we were given five user queries along with a number of sentences per query [12]:

• "does_sun_exposure_cause_skin_cancer" -- 68 sentences
• "e – cigarettes" -- 83 sentences
• "HRT_cause_cancer" -- 61 sentences
• "MMR_vaccine_lead_to_autism" -- 71 sentences
• "vitamin_C_common_cold" -- 65 sentences

A total of 348 sentences were present in the training data set.

For the test data, the queries were the same as for the training data, and the number of unlabeled sentences per query was as follows:

• "does_sun_exposure_cause_skin_cancer" -- 342 sentences
• "e – cigarettes" -- 414 sentences
• "HRT_cause_cancer" -- 260 sentences
• "MMR_vaccine_lead_to_autism" -- 279 sentences
• "vitamin_C_common_cold" -- 247 sentences

A total of 1542 sentences were present in the test data set.

3.2 Results
We developed our systems for both task 1 and task 2 using the training data [12] supplied to us by the organizers of the contest. After the release of the test data by the organizers, we ran our system on the test data and sent the result files, along with the complete system, to the organizers. They evaluated the results using the traditional percentage accuracy and published the results, which were sent to us by e-mail.

We show the officially published results of task 1 and task 2 for the 9 participating teams in Table 1 and Table 2 respectively. The results shown in red bold font are the performances of the top systems for each task.

[Table 1. Performance of the participating systems for Task 1]
[Table 2. Performance of the participating systems for Task 2]

Out of the 9 participants, our system (JU_KS_Group) achieves the third highest average accuracy for task 1, i.e. 73.39257557%. We can also evaluate the results for task 1 from a different angle: it is evident from Table 1 that our system performs better for 3 queries out of 5, whereas the system SSN_NLP, which has the best average accuracy (78.10%), performs better for 2 queries out of 5. The main reason our system gives better results for task 1 is the use of two novel features, noun matching and neighborhood matching.

For task 2, our system achieves an average accuracy of 33.63514278%, which is relatively poor. One of the reasons for the poor performance on task 2 is that we considered the "neutral" class along with the other two classes, "oppose" and "support", while classifying the relevant sentences. It is evident from the training data that only the irrelevant sentences were assigned the "neutral" class. Actually, task 2 was to classify the relevant sentences into two categories, "support" and "oppose", but we mistakenly treated task 2 as a 3-class problem instead of a 2-class problem. We are working to improve our proposed methods so that our systems can perform more accurately for both tasks.

4. CONCLUSION
There has been a dearth of proper search systems for medical queries, and our work on the CHIS tasks puts us on the path to filling this void. The methodology we used can be improved upon and extended to create a novel search method not only for medical queries, but for specific search queries in any field. What we have done, and are continuing to improve on, is a logical way of searching through data that is already available to the public. We sincerely believe that machine learning and natural language processing will shape the future of online search, and we have tried to contribute towards this goal through our work; we expect this to be especially useful in the medical field.

For future work, we would incorporate a word sense disambiguation module to disambiguate the query words. We also hope that our system will give more accurate results for task 2 if we treat the classification of relevant sentences as a 2-class problem ("support" and "oppose") instead of the 3-class problem ("support", "oppose" and "neutral") that we addressed during the contest.

ACKNOWLEDGMENTS
We would like to thank the Forum for Information Retrieval Evaluation (FIRE) 2016, ISI Kolkata, for providing us the tasks and data sets for CHIS.
5. REFERENCES
[1] Cline, R. J. and Haynes, K. M. 2001. Consumer health information seeking on the Internet: the state of the art. Health Education Research, 16(6), 671-692.
[2] Grandinetti, D. A. 2000. Doctors and the Web: help your patients surf the Net safely. Medical Economics, April, 28-34.
[3] Lacroix, E. M., Backus, J. E. and Lyon, B. J. 1994. Service providers and users discover the Internet. Bulletin of the Medical Library Association, 82, 412-418.
[4] Eng, T. R., Maxfield, A., Patrick, K., Deering, M. J., Ratzan, S. C. and Gustafson, D. H. 1998. Access to health information and support: a public highway or a private road? Journal of the American Medical Association, 280, 1371-1375.
[5] Chi-Lum, B. 1999. Friend or foe: consumers using the Internet for medical information. Journal of Medical Practice Management, 14, 196-198.
[6] Boyer, C., Selby, M. and Appel, R. D. 1998. The Health on the Net Code of Conduct for medical and health Web sites. Medinfo, 9 (Part 2), 1163-1166.
[7] Wilkins, A. S. 1999. Expanding Internet access for health care consumers. Health Care Management Review, 24, 30-41.
[8] Hong, Y., Cruz, N., Marnas, G., Early, E. and Gillis, R. 2002. A query analysis of consumer health information retrieval. In Proceedings of AMIA 2002, 1046.
[9] Belkin, N. J., Oddy, R. N. and Brooks, H. M. 1982. ASK for information retrieval: Part I. Background and theory. Journal of Documentation, 38, 61-71.
[10] Keselman, A., Browne, A. C. and Kaufman, D. R. 2008. Consumer health information seeking as hypothesis testing. JAMIA, 15, 484-495.
[11] Efthimiadis, E. N. 2009. How students search for consumer health information on the web. In Proceedings of the 42nd Hawaii International Conference on System Sciences, 1-8.
[12] Sinha, M., Mannarswamy, S. and Roy, S. 2016. CHIS@FIRE: Overview of the CHIS Track on Consumer Health Information Search. Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings, CEUR-WS.org.