Decision Tree Approach for Consumer Health Information Search

D. Thenmozhi, P. Mirunalini, Chandrabose Aravindan
Department of CSE, SSN College of Engineering, Kalavakkam, Chennai
theni_d@ssn.edu.in, miruna@ssn.edu.in, aravindanc@ssn.edu.in

ABSTRACT

Health information search (HIS) is the process of seeking health-related information on the Internet by public health professionals and consumers. The abundance of health-related information on the Internet may help a consumer in the self-management of illness. Present-day search engines retrieve information for consumer queries, but not all of the retrieved information may be relevant to the given query. Identifying the relevant information for a query from the results is a challenging task. In this paper, we present our methodology for the task of identifying whether the available information is relevant or irrelevant to a given query using a machine learning approach. The lexical features extracted from the text are used by a classifier to predict whether the text is relevant to the query or not. We have also included a statistical feature selection methodology to select the features that contribute significantly to the classification. We have evaluated our two variations using the data set given by the CHIS@FIRE2016 shared task. The performance is measured in terms of accuracy, and we have obtained an overall accuracy of 75.87% for the method without feature selection and 78.1% for the method using $\chi^2$ feature selection. Statistical tests confirm that feature selection significantly reduces the sizes of the models without affecting the performance.

Keywords

Consumer Health Information Search; Machine Learning; Classification; Decision Tree; Feature Selection

1. INTRODUCTION

Information retrieval (IR) is the process of obtaining information relevant to a given query from a collection of resources. The Internet is the major source of information for all domains. Health care is one of the domains in which public health professionals and consumers seek information from the Internet. Consumer Health Information Search (CHIS) is the process by which common people retrieve health-related information from the Internet to make health-related decisions and for the self-management of diseases. Surveys on CHIS have been reported by Cline et al. [2], Zhang et al. [22] and Fiksdal et al. [3]. They have analyzed the diverse purposes and the diverse users of CHIS. Goeuriot et al. [4] analyzed CHIS users based on varying information needs, varying medical knowledge and varying language skills. The existing search engines retrieve information based on keywords, resulting in a large amount of irrelevant information which may not satisfy the diverse users of CHIS. The retrieval performance may be improved either by assisting the consumers to reformulate the query with more precise and domain-specific terms [20, 13, 18], or by categorizing the retrieved information as relevant or irrelevant [9].

In this work, we have focused on the shared task of CHIS@FIRE2016 [12], which aims to identify text as relevant or irrelevant to a query. CHIS@FIRE2016 is a shared task on Consumer Health Information Search (CHIS) collocated with the Forum for Information Retrieval Evaluation (FIRE). The goal of the CHIS track is to research and develop techniques to support users in complex multi-perspective health information queries¹. This track has two tasks. Given a CHIS query and a document associated with that query, the first task is to classify whether the sentences in the document are relevant to the CHIS query or not. The relevant sentences are those sentences of the document that are useful in providing an answer to the query. The second task is to further classify the relevant sentences as supporting or opposing the claim made in the query. Our focus is on the first task of CHIS@FIRE2016.

¹ https://sites.google.com/site/multiperspectivehealthqa/home

2. RELATED WORK

Several studies have been carried out on consumer health information search (CHIS) in recent years. Researchers have analyzed the behaviour of CHIS users [2, 22, 3] and the issues in searching for information [4]. Query construction, query reformulation and ranking of search results may improve the performance of CHIS. This section reviews the related work on CHIS.

2.1 Query Reformulation

Many researchers have analyzed the behaviour of the user in CHIS, which helps in reformulating the query to improve retrieval performance. Zeng et al. [19] analyzed the query terms based on the query length, the presence of stop words and the frequency distribution, and characterized the queries as short and simple. Hong et al. [5] analyzed HealthLink search logs to find the behaviour of the user and found that the average length of the queries submitted was 2.1 words. They suggested that the use of retrieval feedback may improve consumer health information search performance. Spink et al. [14] analyzed the query logs of the commercial web search engines Alltheweb.com and Excite.com to find the behaviour of health care users. They reported that the average length of queries was 2.2 words.

Several researchers analyzed how consumers try to reformulate queries to improve the search performance. Toms and Latter [17] reported that consumers follow a trial-and-error process to formulate queries. Sillence et al. [11] stated that consumers reformulate queries using Boolean operators to alter the search terms.

Several researchers presented algorithms for reformulating queries to improve health information search. Zeng et al. [20] recommended additional query terms by computing the semantic distance among concepts related to the user's initial query, based on concept co-occurrences in the medical domain. Soldaini et al. [13] proposed a methodology to bridge the gap between layperson and expert vocabularies by providing appropriate medical expressions for unfamiliar terms. The approach adds the expert expressions to the queries submitted by the users, which they call query clarifications. They used a supervised approach to select the most appropriate synonym mapping for each query to improve the performance. Keselman et al. [7] supported the users with query formulation support tools and by suggesting additional or alternative query terms to make the query more specific. They also help consumers learn medical terms by providing interactive tools. Yunzhi et al. [18] proposed a methodology for query expansion using a hepatitis ontology. They compute the semantic similarity of retrieval terms using the ontology to improve retrieval performance.

2.2 Machine Learning Approaches for Health Information Search

Several researchers have used machine learning approaches in health information search. Zhang et al. [21] used a machine learning approach for rating the quality of depression treatment web pages using evidence-based health care guidelines. They used a Naïve Bayes classifier to rate the web pages. Nerkar and Gharde [9] proposed a supervised approach using a support vector machine to classify the semantic relations between disease and treatment. The best treatment for a disease is identified by applying a voting algorithm. Automatic mapping of concepts from the text of a clinical report to a reference terminology is an important task in health information search systems. Castano et al. [1] presented a machine learning approach to the normalization of bio-medical terms, for which they used a hospital thesaurus database.

Many works have been reported on query construction and query reformulation to improve the performance of consumer health information search. However, very few works have been reported on categorizing the retrieved information as relevant or irrelevant. Our focus is to categorize the information as relevant or irrelevant to the given query using a machine learning approach in the health care domain.

3. PROPOSED APPROACH

We have implemented a supervised approach for this CHIS task. The steps used in our approach are given below.

• Preprocess the given text

• Extract features for training data

• Build a model using a classifier from the features of training data

• Predict class label for the instance as “relevant” or “irrelevant” using the model

The steps are explained in detail in the sequel.

3.1 Feature Extraction

The given text is preprocessed before extracting the features by removing punctuation marks such as “, ”, –, ‘ and ’, and by replacing contracted terms: n’t with not, & with and, ’m with am, and ’ll with will. The terms of each sentence in the given training text are annotated with parts of speech information such as noun, verb, determiner, adjective and adverb. In general, key terms (features) are extracted from the nouns. However, in the medical domain, adjectives may also contribute to the key terms. For example, the sentence “Skin cancer is more common in people with light colored skin who have spent a lot of time in the sunlight.” is relevant to the query “skin cancer”. In this sentence, the adjective “light colored” is important, along with the nouns cancer, skin and sunlight, for identifying the sentence as relevant. Hence, all the nouns and adjectives from the training data are extracted as features. We have considered all forms of nouns (NN*), namely NN, NNS and NNP, and all forms of adjectives (JJ*), namely JJ, JJR and JJS, to extract the features. The extracted terms are lemmatized to bring them to their root forms. The feature set is constructed by eliminating all duplicate terms from the extracted terms.

We have used a machine learning approach with two variations to identify whether the given text is relevant or not. The variations are

1. Approach without feature selection

2. Approach using $\chi^2$ feature selection

The two variations are described in the following subsections.

3.2 Approach without Feature Selection

In this variation, the model is built from the extracted linguistic features without explicit feature selection.

The set of extracted features, along with the class labels “relevant” and “irrelevant” from the training data, is used to build a model using a classifier. We have used a decision tree based classifier called J48 to build the model. The J48 classifier uses the C4.5 algorithm to represent classification rules [10]. With J48, the model is constructed as a tree during the learning phase.

The features are extracted for each instance of the test data, whose class label is unknown (“?”), in the same way as for the training data, using the feature vector of the training data. The class label, either “relevant” or “irrelevant”, is predicted for the test data instances using the built model.

3.3 Approach using $\chi^2$ Feature Selection

The number of features extracted by the above methodology may be large, and not all of them may be helpful in classifying the text as “relevant” or “irrelevant”. We have used a methodology which computes the chi-square value to select features from the linguistic features. This $\chi^2$ method selects the features that have a strong dependency on the categories by using the average or maximum $\chi^2$ statistic value.

Since we have only two categories, we form a 2x2 feature-category contingency table, which we call the CHI table, for every feature $f_i$.
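To make the CHI table concrete, the following is a minimal Java sketch of how the four cells could be tallied for one feature and how a chi-square score could then be obtained from them. It is an illustration only: the names `ChiTableSketch`, `Instance`, `tally` and `chiSquare` are hypothetical and are not taken from our implementation, and the expected count of each cell is assumed to be the standard Pearson form (row total times column total, divided by the number of instances), without any rounding.

```java
import java.util.List;
import java.util.Set;

/** Illustrative helper (hypothetical names): CHI table tally and chi-square score for one feature. */
final class ChiTableSketch {

    /** A labelled training instance: its extracted features and whether it belongs to category C. */
    static final class Instance {
        final Set<String> features;
        final boolean inCategory;

        Instance(Set<String> features, boolean inCategory) {
            this.features = features;
            this.inCategory = inCategory;
        }
    }

    /** Tallies {{O(f,C), O(f,notC)}, {O(notF,C), O(notF,notC)}} for the given feature. */
    static long[][] tally(String feature, List<Instance> data) {
        long[][] observed = new long[2][2];
        for (Instance x : data) {
            int row = x.features.contains(feature) ? 0 : 1; // feature present or absent
            int col = x.inCategory ? 0 : 1;                 // instance in C or not in C
            observed[row][col]++;
        }
        return observed;
    }

    /** Pearson chi-square statistic of a 2x2 table of observed counts. */
    static double chiSquare(long[][] o) {
        double n = o[0][0] + o[0][1] + o[1][0] + o[1][1];
        double stat = 0.0;
        for (int i = 0; i < 2; i++) {
            for (int j = 0; j < 2; j++) {
                double rowTotal = o[i][0] + o[i][1];
                double colTotal = o[0][j] + o[1][j];
                double expected = rowTotal * colTotal / n;
                if (expected == 0.0) {
                    continue; // empty row or column: the cell contributes nothing
                }
                double diff = o[i][j] - expected;
                stat += diff * diff / expected;
            }
        }
        return stat;
    }

    public static void main(String[] args) {
        // Toy data, invented for illustration only.
        List<Instance> data = List.of(
                new Instance(Set.of("estrogen", "therapy"), true),
                new Instance(Set.of("therapy"), false),
                new Instance(Set.of("estrogen"), true),
                new Instance(Set.of("risk"), false));
        long[][] table = tally("estrogen", data);
        System.out.println("chi-square = " + chiSquare(table));
    }
}
```

A feature would be kept when the returned statistic reaches the critical value 3.841 (α = 0.05, one degree of freedom). Because this sketch does not round the expected counts, its value can differ from a computation that truncates them to whole numbers.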
The CHI table records the observed co-occurrence frequency ($O$) of $f_i$ with each of the categories $C$ and $\neg C$. Each cell at position $(i, j)$ contains the observed frequency $O(i, j)$, where $i \in \{f_i, \neg f_i\}$ and $j \in \{C, \neg C\}$. Table 1 shows the 2x2 feature-category contingency table, in which $O(f_i, C)$ denotes the number of instances that contain the feature $f_i$ and belong to category $C$, $O(f_i, \neg C)$ denotes the number of instances that contain the feature $f_i$ and are not in category $C$, $O(\neg f_i, C)$ denotes the number of instances that do not contain the feature $f_i$ but belong to category $C$, and $O(\neg f_i, \neg C)$ denotes the number of instances that neither contain the feature $f_i$ nor belong to category $C$.

Table 1: Feature-Category Contingency Table

             C                  ¬C
  f_i        O(f_i, C)          O(f_i, ¬C)
  ¬f_i       O(¬f_i, C)         O(¬f_i, ¬C)

The expected frequencies ($E$) for every feature $f_i$, under the assumption that the feature and the category are independent, can be calculated from the observed frequencies ($O$). The observed frequencies are compared with the expected frequencies to measure the dependency between the feature and the category. The expected frequency $E(i, j)$ is the product of the column total and the row total of the cell, divided by the number of instances:

$$E(i, j) = \frac{\left(\sum_{a \in \{f_i, \neg f_i\}} O(a, j)\right)\left(\sum_{b \in \{C, \neg C\}} O(i, b)\right)}{n} \tag{1}$$

where $i$ represents whether the feature $f_i$ is present or not, $j$ represents whether the instance belongs to $C$ or not, and $n$ is the total number of instances.

The expected frequencies $E(f_i, C)$, $E(f_i, \neg C)$, $E(\neg f_i, C)$ and $E(\neg f_i, \neg C)$ are calculated using the above equation. Then the $\chi^2$ statistic for each feature $f_i$ is calculated using the equation

$$\chi^2_{stat}(f_i) = \sum_{i \in \{f_i, \neg f_i\}} \sum_{j \in \{C, \neg C\}} \frac{\left(O(i, j) - E(i, j)\right)^2}{E(i, j)} \tag{2}$$

The features whose $\chi^2_{stat}$ value is greater than the critical value $\chi^2_{crit}(\alpha = 0.05, df = 1) = 3.841$ are considered to be significant, and those features are selected for building a model using a classifier. The process of selecting the $\chi^2$ features from the linguistic features is given in Algorithm 1.

Algorithm 1: $\chi^2$ Feature Selection

Input: Training data $T$, set of linguistic features $F$
Output: Set of $\chi^2$ features $F_{chi}$

  Let the Chi feature set $F_{chi} = \emptyset$
  for each $f_i \in F$ do
      for each category $C \in$ [relevant, irrelevant] do
          Construct the 2x2 feature-category contingency table (CHI table) with the observed co-occurrence frequencies ($O$) of $f_i$ and $C$ using $T$ and $F$
          Calculate the expected frequencies ($E$) from the CHI table using Equation 1
          Calculate the $\chi^2$ value of $f_i$ for $C$ using Equation 2
      end for
      if $\chi^2_{stat}(f_i) \geq \chi^2_{crit}(\alpha = 0.05, df = 1) = 3.841$ then
          Add $f_i$ to $F_{chi}$
      end if
  end for
  Return the feature set $F_{chi}$

The model $M_{chi}$ for the classification is built from the training data by considering the selected feature set $F_{chi}$ instead of $F$. The class label, either “relevant” or “irrelevant”, is then predicted for the test data instances using the built model $M_{chi}$.

4. IMPLEMENTATION

We have implemented our methodologies in Java for Task 1 of the Shared Task on Consumer Health Information Search (CHIS). The data set used to evaluate the task consists of five queries and a set of training data and test data for each query. The queries, the number of training instances and the number of test instances are given in Table 2.

Table 2: Data Set for CHIS task

  Query           Training   Test
  Skin Cancer     341        88
  E-Cigarettes    413        64
  Vitamin-C       278        74
  HRT             246        72
  MMR-Vaccine     259        58

4.1 Approach without Feature Selection

We have annotated the given sentences using the Stanford POS tagger², which uses the Penn Treebank tag set. For example, for the sentence “Skin cancer is more common in people with light colored skin who have spent a lot of time in the sunlight.”, the Stanford POS tagger annotates the sentence as “Skin NN cancer NN is VBZ more RBR common JJ in IN people NNS with IN light JJ colored VBN skin NN who WP have VBP spent VBN a DT lot NN of IN time NN in IN the DT sunlight NN”. All forms of nouns and adjectives are considered as features. In this example, “skin, cancer, common, people, light, time, sunlight” are extracted as features. Then the features are lemmatized. We have used the Stanford lemmatizer to bring the features to their root forms. Likewise, the features are extracted from all the training instances. Duplicates are eliminated to obtain the set of features for building a model. The number of features extracted for each query by this method is given in Table 4.

We have used J48 as the classifier to build the model with the extracted features. To implement the classifier, we have used the Weka API³. Since Weka reads feature vectors in “arff” format, we have prepared the feature vector files in “arff” format. The model is built by training the classifier using the training data feature vectors.

The class labels, “1” for “relevant” or “0” for “irrelevant”, are predicted for the test instances using the model.

² http://nlp.stanford.edu/software/tagger.shtml
³ http://www.java2s.com/Code/Jar/w/Downloadwekajar.htm

4.2 Approach using $\chi^2$ Feature Selection

In this variation, we have selected, from the linguistic features, the set of features that contribute significantly to identifying the classes. To select the features, we have used the statistical $\chi^2$ method. We have constructed the CHI table for each feature $f_i$. For example, the CHI table showing the observed frequencies for the feature “Estrogen” with respect to the query “HRT” is given in Table 3.

Table 3: CHI Table for the feature “Estrogen” with respect to the query “HRT”

               Relevant   Irrelevant
  Estrogen     39         14
  ¬Estrogen    167        26

The total number of training instances is 246 for the query “HRT”. The expected frequencies are calculated from the CHI table values using Equation 1. The expected frequencies obtained for the feature “Estrogen” are 44.0, 8.0, 161.0 and 31.0. These correspond to the values of Equation 1 truncated to whole numbers; untruncated, they are 44.38, 8.62, 161.62 and 31.38. $\chi^2_{stat}(\text{Estrogen})$ is computed using Equation 2 as 6.098236 (about 5.116 with the untruncated expected frequencies), which is greater than $\chi^2_{crit}(\alpha = 0.05, df = 1) = 3.841$. Thus, the “Estrogen” feature is selected as a candidate feature for building the model using the classifier. The number of features selected by this statistical method for all the queries given in the task is shown in Table 4.

Table 4: Number of features for the queries

  Query           Without Feature Selection   χ² Feature Selection
  Skin Cancer     742                         31
  E-Cigarettes    1014                        36
  Vitamin-C       715                         25
  HRT             547                         10
  MMR-Vaccine     751                         12

Further, the feature vectors for the training data are constructed in “arff” format as in our first approach, and the model is built by the J48 classifier using the Weka API.

Table 5 shows the size of the tree, in terms of the number of nodes, that describes the model created for both variations of our approach. It is observed from Table 5 that the number of nodes used in the decision tree by the J48 classifier is considerably reduced when the $\chi^2$ feature selection method is used.

Table 5: Size of the Tree

  Query           Without Feature Selection   χ² Feature Selection
  Skin Cancer     57                          29
  E-Cigarettes    51                          27
  Vitamin-C       21                          3
  HRT             23                          7
  MMR-Vaccine     39                          3

To show that this reduction is statistically significant, we have applied a t-test to the tree sizes of the two models. A k-fold paired t-test with a one-tailed distribution is used to show that the reduction is significant when features are selected using $\chi^2$. The p-value obtained for the size of the tree with the paired t-test (one-tailed, 95% confidence) is 0.001236616, which is less than 0.05. This shows that the reduction in the size of the tree is statistically significant.

The prediction is done for the test data as in our first approach, to identify whether each test instance belongs to the category “relevant” or “irrelevant”.

4.3 Results

We have evaluated the performance of our methodologies using accuracy as the metric. We have performed 10-fold cross validation on the training data. The cross validation accuracies given by the methodologies for the queries are summarized in Table 6.

Table 6: 10-fold cross validation accuracy (%) for the queries

  Query           Without Feature Selection   χ² Feature Selection
  Skin Cancer     92.96                       85.34
  E-Cigarettes    84.26                       76.27
  Vitamin-C       88.49                       82.37
  HRT             93.09                       86.9
  MMR-Vaccine     93.05                       80.31

The performance of both methods on the test data is shown in Table 7. It is observed from Table 7 that the average accuracy obtained with $\chi^2$ feature selection is higher than that of the method without feature selection by 2.23 percentage points.

Table 7: Test data accuracy (%) for the queries

  Query              Without Feature Selection   χ² Feature Selection
  Skin Cancer        86.36                       79.54
  E-Cigarettes       65.25                       64.06
  Vitamin-C          73.0                        78.38
  HRT                87.5                        87.5
  MMR-Vaccine        67.24                       81.03
  Average Accuracy   75.87                       78.1

We have compared our two approaches using a k-fold paired t-test and the McNemar test to check whether the difference in performance is statistically significant. We have applied a 5-fold paired t-test (one-tailed, 95% confidence, 5 data sets) to our two approaches and have obtained a p-value of 0.278 for accuracy. Since this p-value is greater than 0.05, the difference in accuracy is not statistically significant; that is, we find no evidence that $\chi^2$ feature selection reduces the performance of our system. When we apply the McNemar test across all data sets, we obtain a p-value of 0.5186, which is also greater than 0.05. Together with the tree sizes in Table 5, these results show that our feature selection approach significantly reduces the size of the model without compromising the performance.

5. CONCLUSIONS

We have presented a system for identifying whether the given text is relevant or irrelevant to a query. We have proposed two variations of our methodology, namely an approach with all features and an approach with features selected based on the chi-square statistic. In both methods, we have identified the features and constructed the feature vectors from the training data. We have used the J48 classifier to build a model with these feature vectors, and the model is used to predict whether the test instances are “relevant” or “irrelevant” to the query. We have used the data set given by the CHIS@FIRE2016 shared task to evaluate our methodology. We have performed a statistical t-test which shows that our $\chi^2$ feature selection method significantly reduces the size of the model for the CHIS@FIRE2016 data set. We have measured the performance of our approaches using accuracy as the metric. We have obtained accuracies of 75.87% and 78.1% for the method without feature selection and the method using $\chi^2$ feature selection, respectively, for Task 1 of the CHIS@FIRE2016 shared task. Statistical tests, namely the k-fold paired t-test and the McNemar test, confirm that feature selection has significantly reduced the sizes of the models without affecting the performance. At present, we have used parts of speech (POS) information and the $\chi^2$ value to extract and select the features, respectively. In future, the features may be extracted based on the predicate information of the text [15, 16], and the CHIR value [8, 6] may be calculated from the $\chi^2$ value to select the features.

Acknowledgments

We would like to thank the management of SSN Institutions for funding the High Performance Computing (HPC) lab where this work is being carried out.

6. REFERENCES

[1] J. Castano, H. Berinsky, H. Park, D. Pérez, P. Avila, L. Gambarte, S. Benítez, D. Luna, F. Campos, and S. Zanetti. A machine learning approach to clinical terms normalization. ACL 2016, page 1, 2016.

[2] R. J. Cline and K. M. Haynes. Consumer health information seeking on the internet: the state of the art. Health education research, 16(6):671–692, 2001.

[3] A. S. Fiksdal, A. Kumbamu, A. S. Jadhav, C. Cocos, L. A. Nelsen, J. Pathak, and J. B. McCormick. Evaluating the process of online health information searching: a qualitative approach to exploring consumer perspectives. Journal of medical Internet research, 16(10):e224, 2014.

[4] L. Goeuriot, G. J. Jones, L. Kelly, H. Müller, and J. Zobel. Medical information retrieval: Introduction to the special issue. Inf. Retr., 19(1-2):1–5, April 2016.

[5] Y. Hong, N. de la Cruz, G. Barnas, E. Early, and R. Gillis. A query analysis of consumer health information retrieval. In Proceedings of the AMIA Symposium, page 1046. American Medical Informatics Association, 2002.

[6] M. Janaki Meena and K. Chandran. Naive bayes text classification with positive features selected by statistical method. In International Conference on Autonomic Computing and Communications, ICAC 2009, pages 28–33. IEEE, 2009.

[7] A. Keselman, A. C. Browne, and D. R. Kaufman. Consumer health information seeking as hypothesis testing. Journal of the American Medical Informatics Association, 15(4):484–495, 2008.

[8] C. L. Li Yanjun and S. M. Chung. Text clustering with feature selection by using statistical data. IEEE Transactions on Knowledge and Data Engineering, 20(5):641–652, 2008.

[9] B. E. Nerkar and S. S. Gharde. Best treatment identification for disease using machine learning approach in relation to short text. IOSR Journal of Computer Engineering (IOSR-JCE), 16(3):5–12, 2014.

[10] P. A. F. Pavel, Yonghong Peng and B. C. Soares. Decision tree-based data characterization for meta-learning. IDDM-2002, page 111, 2002.

[11] E. Sillence, P. Briggs, L. Fishwick, and P. Harris. Trust and mistrust of online health sites. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 663–670. ACM, 2004.

[12] M. Sinha, S. Mannarswamy, and S. Roy. CHIS@FIRE: overview of the CHIS track on consumer health information search. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.

[13] L. Soldaini, A. Yates, E. Yom-Tov, O. Frieder, and N. Goharian. Enhancing web search in the medical domain via query clarification. Inf. Retr. Journal, 19(1-2):149–173, 2016.

[14] A. Spink, Y. Yang, J. Jansen, P. Nykanen, D. P. Lorence, S. Ozmutlu, and H. C. Ozmutlu. A study of medical and health queries to web search engines. Health Information & Libraries Journal, 21(1):44–51, 2004.

[15] D. Thenmozhi and C. Aravindan. An automatic and clause based approach to learn relations for ontologies. The Computer Journal, Accepted for Publication, DOI: 10.1093/comjnl/bxv071, 2015.

[16] D. Thenmozhi and C. Aravindan. Paraphrase identification by using clause based similarity features and machine translation metrics. The Computer Journal, Accepted for Publication, DOI: 10.1093/comjnl/bxv083, 2015.

[17] E. G. Toms and C. Latter. How consumers search for health information. Health informatics journal, 13(3):223–235, 2007.

[18] C. Yunzhi, L. Huijuan, L. Shapiro, and L. Travillian, Ravensara S. and Lanjuan. An approach to semantic query expansion system based on hepatitis ontology. Journal of Biological Research-Thessaloniki, 23(1):11, 2016.

[19] Q. Zeng, S. Kogan, N. Ash, R. Greenes, A. Boxwala, et al. Characteristics of consumer terminology for health information retrieval. Methods of information in medicine, 41(4):289–298, 2002.

[20] Q. T. Zeng et al. Assisting consumer health information retrieval with query recommendations. Journal of the American Medical Informatics Association, 13(1):80–90, 2006.

[21] Y. Zhang, H. Cui, J. Burkell, and R. E. Mercer. A machine learning approach for rating the quality of depression treatment web pages. iConference 2014 Proceedings, 2014.

[22] Y. Zhang, P. Wang, A. Heaton, and H. Winkler. Health information searching behavior in medlineplus and the impact of tasks. In Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium, pages 641–650. ACM, 2012.