Decision Tree Approach for Consumer Health Information
                         Search

               D. Thenmozhi                            P. Mirunalini               Chandrabose Aravindan
            Department of CSE                     Department of CSE                    Department of CSE
         SSN College of Engineering            SSN College of Engineering           SSN College of Engineering
           Kalavakkam, Chennai                   Kalavakkam, Chennai                  Kalavakkam, Chennai
           theni_d@ssn.edu.in                     miruna@ssn.edu.in                 aravindanc@ssn.edu.in

ABSTRACT                                                          irrelevant information which may not satisfy diverse users
Health information search (HIS) is the process of seeking         of CHIS. The retrieval performance may be improved either
health related information on the Internet by public health       by assisting the consumers to reformulate the query with
professionals and consumers. Abundance of health related          more precise and domain specific terms [20, 13, 18], or by
information on the Internet may help a consumer for self-         categorizing the retrieved information into relevant or irrel-
management of illness. Present day search engines retrieve        evant [9]. In this work, we have focused on the shared task
information on consumer queries, but all of the retrieved         of CHIS@FIRE2016 [12] which aims to identify text as rele-
information may not be relevant to the given query. It is         vant or irrelevant for a query. CHIS@FIRE2016 is a shared
a challenging task to identify the relevant information for a     Task on Consumer Health Information Search (CHIS) collo-
query from the result. In this paper, we present our method-      cated with the Forum for Information Retrieval Evaluation
ology for a task to identify whether the available information    (FIRE). The goal of the CHIS track is to research and develop
is relevant or irrelevant to a given query using a machine        techniques to support users in complex multi-perspective
learning approach. The lexical features that are extracted        health information queries¹. This track has two tasks. Given
from the text are used by a classifier to predict whether the     a CHIS query, and a document associated with that query,
text is relevant to the query or not. We have also in-            the first task is to classify whether the sentences in the doc-
cluded a statistical feature selection methodology to select      ument are relevant to the CHIS query or not. The relevant
the significantly contributing features for the classification.   sentences are those from that document, which are useful in
We have evaluated our two variations using the data set           providing an answer to the query. The second task is to fur-
given by CHIS@FIRE2016 shared task. The performance is            ther classify the relevant sentences as supporting the claim
measured in terms of accuracy and we have obtained overall        made in the query, or opposing the claim made in the query.
accuracy of 75.87% for the method without feature selec-          Our focus is on the first task of CHIS@FIRE2016.
tion and 78.1% for the method using χ2 feature selection.
Statistical t-tests confirm that feature selection has signifi-   2.     RELATED WORK
cantly reduced the sizes of the models without affecting the
                                                                     Several studies have been carried out in consumer health
performance.
                                                                  information search (CHIS) in recent years. Researchers an-
                                                                  alyzed the behaviour of the CHIS users [2, 22, 3] and the
Keywords                                                          issues in searching for information [4]. The query construc-
Consumer Health Information Search; Machine Learning;             tion, query reformulation and ranking of search result may
Classification; Decision Tree; Feature Selection                  improve the performance of CHIS. This section reviews the
                                                                  related work for CHIS.

1.   INTRODUCTION                                                 2.1      Query Reformulation
   Information retrieval (IR) is the process of obtaining in-       Many researchers have analyzed the behaviour of the user
formation relevant to a given query from a collection of re-      in CHIS which help to reformulate the query for improv-
sources. Internet is the major source of retrieving infor-        ing the performance of the retrieval. Zeng et al. [19] ana-
mation for all domains. Health care is one of the domains         lyzed the query terms based on the query length, presence
where public health professionals and consumers seek for in-      of stop words and frequency distribution and characterized
formation from the Internet. Consumer Health Information          the query as short and simple. Hong et al. [5] analyzed
Search (CHIS) is the process of retrieving health related in-     HealthLink search logs to find the behaviour of the user and
formation from Internet by common people to make some             found that the average length of queries submitted was 2.1
health related decisions and for self-management of diseases.     words. They suggested that using retrieval feed-
Surveys on CHIS have been reported by Cline et al. [2],           back may improve the consumer health information search
Zhang et al. [22] and Fiksdal et al. [3]. They have an-           performance. Spink et al. [14] analyzed the query logs of
alyzed diverse purposes and diverse users on CHIS. Goeu-          Alltheweb.com and Excite.com commercial web search en-
riot et al. [4] analyzed the CHIS users based on varying          gines to find the behaviour of health care users. They have
information needs, varying medical knowledge and varying          reported that the average length of queries was 2.2 words.
language skills. The existing search engines retrieve infor-
mation based on keywords resulting in a large number of           ¹ https://sites.google.com/site/multiperspectivehealthqa/home
Several researchers analyzed how consumers try to reformu-           • Predict class label for the instance as “relevant” or “ir-
late queries to improve the search performance. Toms and               relevant” using the model
Latter [17] reported that consumers follow a trial-and-error
process to formulate queries. Sillence et al. [11] stated           The steps are explained in detail in the sequel.
that the queries are reformulated using Boolean operators         3.1    Feature Extraction
by the consumers to alter search terms.
   Several researchers presented algorithms for reformulat-          The given text is preprocessed before extracting the fea-
ing queries to improve health information search. Zeng [20]       tures by removing punctuation such as “, ”, –, ‘ and ’, and by
recommended additional query terms by computing the se-           replacing terms such as n’t with not, & with and, ’m with
mantic distance among concepts related to the user’s ini-         am, and ’ll with will. The terms of each sentence in the
tial query based on concept co-occurrences in the medical         given training text are annotated with parts of speech infor-
domain. Soldaini et al. [13] proposed a methodology to            mation such as noun, verb, determiner, adjectives and ad-
bridge the gap between layperson and expert vocabularies          verbs. In general, keyterms/features are extracted from the
by providing appropriate medical expressions for their un-        noun information. However, in medical domain, adjectives
familiar terms. The approach adds the expert expression           may also contribute to the keyterms. For example, the
to the queries submitted by the users which they call as          sentence “Skin cancer is more common in people with light
query clarifications. They have used a supervised approach        colored skin who have spent a lot of time in the sunlight.” is
to select the most appropriate synonym mapping for each           relevant to the query “skin cancer”. In this sentence, the ad-
query to improve the performance. Keselman et al. [7] sup-        jective “light colored” is also important along with the nouns
ported the users with query formulation support tools and         namely cancer, skin and sunlight to identify the sentence as
suggesting additional or alternative query terms to make          relevant. Hence, all the nouns and adjectives from training
the query more specific. They also educate the consumers          data are extracted as features. We have considered all forms
to learn medical terms by providing interactive tools. Yun-       of nouns (NN*), namely NN, NNS and NNP, and all forms
zhi et al. [18] proposed a methodology for query expansion        of adjectives (JJ*), namely JJ, JJR and JJS, to extract the features.
using hepatitis ontology. They compute semantic similarity        The extracted terms are lemmatized to bring them to their
using ontology for finding the similarity of retrieval terms to   root forms. The feature set is constructed by eliminating all
improve retrieval performance.                                    duplicate terms from the extracted terms.
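The feature extraction step described above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sentence is assumed to be POS-tagged already (no tagger is invoked), the word/tag pairs are supplied by hand, lower-casing stands in for lemmatization, and the names FeatureExtractionSketch, normalize and extractFeatures are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class FeatureExtractionSketch {

    // Penn Treebank tags kept as features: the noun and adjective forms listed in the text.
    private static final Set<String> KEPT_TAGS =
            new HashSet<>(Arrays.asList("NN", "NNS", "NNP", "JJ", "JJR", "JJS"));

    /** Expands the contractions and the ampersand named in the text (straight and curly quotes). */
    static String normalize(String text) {
        return text.replace("n\u2019t", " not").replace("n't", " not")
                   .replace("\u2019m", " am").replace("'m", " am")
                   .replace("\u2019ll", " will").replace("'ll", " will")
                   .replace("&", " and ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    /**
     * Collects nouns and adjectives from {word, tag} pairs and drops duplicates.
     * Lower-casing is only a stand-in for lemmatization (an assumption of this sketch).
     */
    static Set<String> extractFeatures(List<String[]> taggedTokens) {
        Set<String> features = new LinkedHashSet<>();
        for (String[] pair : taggedTokens) {
            if (KEPT_TAGS.contains(pair[1])) {
                features.add(pair[0].toLowerCase());
            }
        }
        return features;
    }

    public static void main(String[] args) {
        // Hand-tagged opening of the example sentence from the text.
        List<String[]> tagged = Arrays.asList(
                new String[] {"Skin", "NN"}, new String[] {"cancer", "NN"},
                new String[] {"is", "VBZ"}, new String[] {"more", "RBR"},
                new String[] {"common", "JJ"}, new String[] {"in", "IN"},
                new String[] {"people", "NNS"}, new String[] {"with", "IN"},
                new String[] {"light", "JJ"}, new String[] {"colored", "VBN"},
                new String[] {"skin", "NN"});
        System.out.println(extractFeatures(tagged));
        System.out.println(normalize("It isn't salt & pepper"));
    }
}
```

Using an insertion-ordered set keeps the feature order stable while removing duplicates, which matters later when the same vocabulary is used to index feature vectors.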
                                                                     We have used machine learning approach with two vari-
2.2     Machine Learning Approaches for Health                    ations to identify whether the given text is relevant or not.
        Information Search                                        The variations are
   Several researchers used machine learning approaches in           1. Approach without feature selection
health information search. Zhang et al. [21] used a machine
learning approach for rating the quality of depression treat-        2. Approach using χ2 feature selection
ment web pages using evidence-based health care guidelines.          The two variations are described in the following sub sec-
They used a Naïve Bayes classifier to rate the web pages.         tions.
Nerkar and Gharde [9] proposed a supervised approach us-
ing support vector machine to classify the semantic relations     3.2    Approach without Feature Selection
between disease and treatment. The best treatment for a dis-         We have used a machine learning approach that extracts
ease is identified by applying a voting algorithm. Automatic      the linguistic features without explicit feature selection to
mapping of concepts from text in clinical reports to a ref-       build a model.
erence terminology is an important task in health information        The set of extracted features along with the class labels,
search systems. Castano et al. [1] presented a machine learn-     namely relevant and irrelevant, from training data are used
ing approach to bio-medical term normalization for which          to build a model using a classifier. We have used a decision
they have used hospital thesaurus database.                       tree based classifier called J48 to build the model. J48 classi-
   Many works have been reported on query construction            fier uses C4.5 algorithm to represent classification rules [10].
and query reformulation to improve the performance of con-        With J48 a model is constructed as tree during the learning
sumer health information search. However, very few works          phase.
have been reported on categorizing the retrieved informa-            The features are extracted for each instance of test data
tion into relevant or irrelevant. Our focus is to categorize      with unknown class label “?”, similar to training data using
the information into relevant or irrelevant for the given query   the features vector of training data. The class label either
using machine learning approach in health care domain.            “relevant” or “irrelevant” is predicted for the test data in-
                                                                  stances using the built model.

3.    PROPOSED APPROACH                                           3.3    Approach using χ2 Feature Selection
  We have implemented a supervised approach for this CHIS           The number of features extracted by this methodology
task. The steps used in our approach are given below.             may be large, and not all of them may be helpful to classify the
                                                                  text as “relevant” or “irrelevant”. We have used a method-
     • Preprocess the given text                                  ology that computes the chi-square value to select fea-
                                                                  tures from the linguistic features. This χ2 method selects the
     • Extract features for training data                         features that have a strong dependency on the categories by
                                                                  using the average or maximum χ2 statistic value.
     • Build a model using a classifier from the features of        Since we have only two categories, we form a 2x2 feature-
       training data                                              category contingency table, called the CHI table, for
every feature fi. This table records the observed co-occurrence
frequency (O) of fi with each of the categories C and ¬C.
Each cell at position (i, j) contains the observed frequency
O(i, j), where i ∈ {fi, ¬fi} and j ∈ {C, ¬C}. Table 1 shows
the 2x2 feature-category contingency table, in which O(fi, C)
denotes the number of instances that contain the feature fi
and belong to category C, O(fi, ¬C) denotes the number of
instances that contain the feature fi but are not in category C,
O(¬fi, C) denotes the number of instances that do not contain
the feature fi but belong to category C, and O(¬fi, ¬C) denotes
the number of instances that neither contain the feature fi nor
belong to category C.

     Table 1: Feature-Category Contingency Table
                       C             ¬C
               fi    O(fi, C)      O(fi, ¬C)
               ¬fi   O(¬fi, C)     O(¬fi, ¬C)

Algorithm 1 χ2 Feature Selection
Input: Training data T, Set of linguistic features F
Output: Set of χ2 features Fchi
 1: Let Chi feature set Fchi = ∅
 2: for (each fi ∈ F) do
 3:     for (each category C ∈ [relevant, irrelevant]) do
 4:         Construct the 2x2 feature-category contingency table
            (CHI table) with the observed co-occurrence frequencies
            (O) of fi and C using T and F
 5:         Calculate the expected frequencies (E) using the CHI table
            E(i, j) = \frac{\left(\sum_{a \in \{f_i, \neg f_i\}} O(a, j)\right) \left(\sum_{b \in \{C, \neg C\}} O(i, b)\right)}{n}
 6:         Calculate the χ2 value of fi for C
            \chi^2_{stat}(f_i) = \sum_{i \in \{f_i, \neg f_i\}} \sum_{j \in \{C, \neg C\}} \frac{(O(i, j) - E(i, j))^2}{E(i, j)}
 7:     end for
 8:     if χ2stat(fi) >= χ2crit(α=0.05, df=1) = 3.841 then
 9:         Add fi to Fchi
10:     end if
11: end for
12: Return feature set Fchi

             Table 2: Data Set for CHIS task
               Query           Training   Test
               Skin Cancer       341       88
               E-Cigarettes      413       64
               Vitamin-C         278       74
               HRT               246       72
               MMR-Vaccine       259       58

  The expected frequencies (E) for every feature fi, under the
assumption that the feature and the category are independent,
can be calculated from the observed frequencies (O). The observed
frequencies are compared with the expected frequencies to measure
the dependency between the feature and the category. The expected
frequency E(i, j) is the product of the row total and the column
total of the CHI table divided by the number of instances:

     E(i, j) = \frac{\left(\sum_{a \in \{f_i, \neg f_i\}} O(a, j)\right) \left(\sum_{b \in \{C, \neg C\}} O(i, b)\right)}{n}        (1)

   where i represents whether the feature fi is present or not,
j represents whether the instance belongs to C or not, and
n is the total number of instances.
   The expected frequencies, namely E(fi, C), E(fi, ¬C),
E(¬fi, C) and E(¬fi, ¬C), are calculated using the above
equation. Then the χ2 statistical value for each feature fi is
calculated using the equation

     \chi^2_{stat}(f_i) = \sum_{i \in \{f_i, \neg f_i\}} \sum_{j \in \{C, \neg C\}} \frac{(O(i, j) - E(i, j))^2}{E(i, j)}        (2)

  The features whose χ2stat value is greater than
χ2crit(α=0.05, df=1) = 3.841 are considered to be significant
features, and those features are selected for building a model
using a classifier. The process to select χ2 features from the
linguistic features is given in Algorithm 1.
   The model Mchi for the classification is built from training
data by considering the selected feature set Fchi instead of
F. The class label, either “relevant” or “irrelevant”, is now
predicted for the test data instances by considering the built
model Mchi.

4.     IMPLEMENTATION
   We have implemented our methodologies in Java for the
Shared Task on Consumer Health Information Search (CHIS):
Task 1. The data set used to evaluate the task consists of
five queries and a set of training data and test data for each
query. The queries, number of training instances and number
of test instances are given in Table 2.

4.1      Approach without Feature Selection
     We have annotated the given sentences using the Stanford
POS tagger², which uses the Penn Treebank tag set. For example,
for the sentence “Skin cancer is more common in people
with light colored skin who have spent a lot of time in the
sunlight.”, the Stanford POS tagger annotates the sentence as
“Skin NN cancer NN is VBZ more RBR common JJ in IN
people NNS with IN light JJ colored VBN skin NN who WP
have VBP spent VBN a DT lot NN of IN time NN in IN
the DT sunlight NN”. All forms of nouns and adjectives are
considered as features. In this example, “skin, cancer, common,
people, light, time, sunlight” are extracted as features.
Then the features are lemmatized. We have used the Stanford
lemmatizer to bring the features to their root form. Likewise,
the features are extracted from all the training instances.
Duplicates are eliminated to obtain a set of features
for building a model. The number of features extracted for
each query by this method is given in Table 4.
  We have used J48 as a classifier to build the model with
the extracted features. To implement the classifier, we have
used the Weka API³. Since Weka reads the feature vectors in
“arff” format, we have prepared the feature vector files in
“arff” format. The model is built by training the classifier
using the training data feature vectors.
  The class labels, “1” for “relevant” or “0” for “irrelevant”,
are predicted using the model for the test instances.

² http://nlp.stanford.edu/software/tagger.shtml
³ http://www.java2s.com/Code/Jar/w/Downloadwekajar.htm

4.2     Approach using χ2 Feature Selection
   In this variation, we have selected the set of features that
significantly contribute to identifying the classes, from the
linguistic features. To select the features, we have used a sta-
tistical approach called χ2 method. We have constructed               4.3     Results
the CHI table for each feature fi. For example, the CHI                We have evaluated the performance of our methodologies
table which shows the observed frequencies for the feature            using the metric accuracy. We have performed the 10-fold
“estrogen”, with respect to the query “HRT” is given in Table         cross validation on training data. The cross validation accu-
3.                                                                    racies given by the methodologies for the queries are sum-
                                                                      marized in Table 6.
Table 3: CHI Table for the feature “Estrogen” with
respect to the query “HRT”
                                                                      Table 6: 10-fold cross validation accuracy (%) for
                      Relevant Irrelevant
                                                                      the queries
           Estrogen      39        14
                                                                           Query         Without Feature χ2 Feature
           ¬Estrogen    167        26                                                        Selection     Selection
                                                                           Skin Cancer         92.96         85.34
  The total number of training instances is 246 for the                    E-Cigarettes        84.26         76.27
query “HRT”. The expected frequencies are calculated from                  Vitamin-C           88.49         82.37
the CHI table values using Equation 1. The expected fre-                   HRT                 93.09          86.9
quencies obtained for the feature “Estrogen” are 44.0, 8.0,                MMR-Vaccine         93.05         80.31
161.0 and 31.0. The χ2stat(Estrogen) is computed using
Equation 2 as 6.098236, which is greater than χ2crit(α=0.05, df=1) =
3.841. Thus, the “Estrogen” feature is selected as a candi-              The performance of both methods on
date feature for building the model using the classifier. The         the test data is shown in Table 7. It is observed from
number of features selected by this statistical method for all        Table 7 that the accuracy obtained after χ2 feature selection
the queries given in the task are shown in Table 4.                   is more than the method without feature selection by 2.23%.
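The χ2 computation for the “Estrogen” feature can be checked with a short Java sketch. The observed counts are those of Table 3; the class and method names are hypothetical, and the code is not the authors' implementation. The reported expected frequencies (44.0, 8.0, 161.0 and 31.0) are whole numbers, which would be consistent with integer division when Equation 1 is applied; that is an inference of this sketch rather than something the text states, so truncation is exposed as a flag.

```java
final class ChiSquareSketch {

    /**
     * Expected frequencies for a 2x2 table: row total * column total / n.
     * observed[r][c]: r = 0 feature present, 1 absent; c = 0 category C, 1 not C.
     * With truncate = true the quotient is cut to an integer (an assumption, see lead-in).
     */
    static double[][] expected(long[][] observed, boolean truncate) {
        long n = 0;
        long[] rowTotal = new long[2];
        long[] colTotal = new long[2];
        for (int r = 0; r < 2; r++) {
            for (int c = 0; c < 2; c++) {
                rowTotal[r] += observed[r][c];
                colTotal[c] += observed[r][c];
                n += observed[r][c];
            }
        }
        double[][] e = new double[2][2];
        for (int r = 0; r < 2; r++) {
            for (int c = 0; c < 2; c++) {
                e[r][c] = truncate
                        ? (double) (rowTotal[r] * colTotal[c] / n)   // integer division
                        : (double) rowTotal[r] * colTotal[c] / n;    // exact quotient
            }
        }
        return e;
    }

    /** Sum over all four cells of (O - E)^2 / E. */
    static double chiSquare(long[][] observed, double[][] expected) {
        double chi = 0.0;
        for (int r = 0; r < 2; r++) {
            for (int c = 0; c < 2; c++) {
                double d = observed[r][c] - expected[r][c];
                chi += d * d / expected[r][c];
            }
        }
        return chi;
    }

    public static void main(String[] args) {
        long[][] estrogen = {{39, 14}, {167, 26}};   // Table 3, query "HRT"
        double critical = 3.841;                     // alpha = 0.05, df = 1
        for (boolean truncate : new boolean[] {true, false}) {
            double chi = chiSquare(estrogen, expected(estrogen, truncate));
            System.out.printf("truncate=%b chi2=%.6f selected=%b%n",
                    truncate, chi, chi > critical);
        }
    }
}
```

With truncated expected frequencies the statistic comes to about 6.098, in line with the reported 6.098236; with exact expected frequencies it is about 5.116. Both values exceed 3.841, so the feature is selected either way.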


    Table 4: Number of features for the queries                            Table 7: Test data accuracy (%) for the queries
     Query        Without Feature χ2 Feature                                Query              Without Feature χ2 Feature
                      Selection      Selection                                                    Selection     Selection
     Skin Cancer         742            31                                  Skin Cancer             86.36         79.54
     E-Cigarettes       1014            36                                  E-Cigarettes            65.25         64.06
     Vitamin-C           715            25                                  Vitamin-C                73.0         78.38
     HRT                 547            10                                  HRT                      87.5          87.5
     MMR-Vaccine         751            12                                  MMR-Vaccine             67.24         81.03
                                                                            Average Accuracy        75.87          78.1
  Further, the feature vectors for the training data are con-
structed as in our first approach in “arff” format and                  We have compared our two approaches using the k-fold paired
the model is built by the J48 classifier using the Weka API.          t-test and McNemar's test to check whether the difference in
  Table 5 shows the size of the tree in terms of the number of nodes  performance is statistically significant. We have applied a 5-
which describe the model created for both variations of our           fold paired t-test (one-tailed, 95% confidence, 5 data sets) on
approach. It is observed from Table 5 that the number of              our two approaches and we have obtained a p-value of
nodes used in the decision tree by the J48 classifier is consider-    0.278 for accuracy. Since this p-value is greater than
ably reduced when the χ2 feature selection method is used.            0.05, we can statistically infer that our approach using χ2
                                                                      feature selection does not reduce the performance of our sys-
               Table 5: Size of the Tree                              tem. When we apply McNemar's test across all data sets, we
      Query          Without Feature χ2 Feature                       obtain a p-value of 0.5186, which is also greater than
                         Selection      Selection                     0.05. These results show that our feature selection approach signif-
      Skin Cancer           57             29                         icantly reduces the size of the model without compromising
      E-Cigarettes          51             27                         the performance.
      Vitamin-C             21              3
      HRT                   23              7                         5.     CONCLUSIONS
      MMR-Vaccine           39              3                            We have presented a system for identifying whether the
                                                                      given text is relevant or irrelevant to a query. We have
   To show that this reduction is statistically significant, we       proposed two variations of our methodology namely an ap-
have applied a t-test on these 2 models. k-Fold paired t-test         proach with all features and an approach with selected fea-
with one-tailed distribution is used to show that the reduc-          tures based on chi-square statistical value. In both the meth-
tion is significant when features are selected using χ2 . The         ods, we have identified the features and feature vectors are
p-value obtained for the size of the tree when applying the paired    constructed from training data. We have used the J48 clas-
t-test (one-tailed, 95% confidence) is 0.001236616, which is          sifier to build a model with these feature vectors and the
less than 0.05. This shows that the reduction in size of the          model is used to predict whether the test instances are “rel-
tree is statistically significant.                                    evant” or “irrelevant” to the query. We have used the data
   The prediction is done for the test data as in our first           set given by the CHIS@FIRE2016 shared task to evaluate our
approach to identify whether the test instances belong to             methodology. We have performed a statistical t-test which
one of the categories “relevant” or “irrelevant”.                     shows our χ2 feature selection method significantly reduces
the size of the model for CHIS@FIRE2016 data set. We              [11] E. Sillence, P. Briggs, L. Fishwick, and P. Harris. Trust
have measured the performance of our approaches using the              and mistrust of online health sites. In Proceedings of
metric accuracy. We have obtained the accuracy of 75.87%               the SIGCHI conference on Human factors in
and 78.1% for the method without feature selection and the             computing systems, pages 663–670. ACM, 2004.
method using χ2 feature selection respectively for the Task 1     [12] M. Sinha, S. Mannarswamy, and S. Roy. CHIS@FIRE:
of CHIS@FIRE2016 shared task. Statistical tests, namely the            overview of the CHIS track on consumer health
k-fold paired t-test and McNemar's test, confirm that feature          information search. In Working notes of FIRE 2016 -
selection has significantly reduced the sizes of the models            Forum for Information Retrieval Evaluation, Kolkata,
without affecting the performance. At present we have used             India, December 7-10, 2016, CEUR Workshop
parts of speech (POS) information and χ2 value to extract              Proceedings. CEUR-WS.org, 2016.
and select the features respectively. Further, the features       [13] L. Soldaini, A. Yates, E. Yom-Tov, O. Frieder, and
may be extracted based on the predicate information of the             N. Goharian. Enhancing web search in the medical
text [15, 16]. The CHIR value [8, 6] may be calculated from            domain via query clarification. Inf. Retr. Journal,
χ2 value to select the features in future.                             19(1-2):149–173, 2016.
                                                                  [14] A. Spink, Y. Yang, J. Jansen, P. Nykanen, D. P.
Acknowledgments                                                        Lorence, S. Ozmutlu, and H. C. Ozmutlu. A study of
We would like to thank the management of SSN Institutions              medical and health queries to web search engines.
for funding the High Performance Computing (HPC) lab                   Health Information & Libraries Journal, 21(1):44–51,
where this work is being carried out.                                  2004.
                                                                  [15] D. Thenmozhi and C. Aravindan. An automatic and
6.   REFERENCES                                                        clause based approach to learn relations for ontologies.
 [1] J. Castano, H. Berinsky, H. Park, D. Pérez, P. Avila,            The Computer Journal, Accepted for Publication,
     L. Gambarte, S. Benítez, D. Luna, F. Campos, and                  DOI: 10.1093/comjnl/bxv071, 2015.
     S. Zanetti. A machine learning approach to clinical          [16] D. Thenmozhi and C. Aravindan. Paraphrase
     terms normalization. ACL 2016, page 1, 2016.                      identification by using clause based similarity features
 [2] R. J. Cline and K. M. Haynes. Consumer health                     and machine translation metrics. The Computer
     information seeking on the internet: the state of the             Journal, Accepted for Publication, DOI:
     art. Health education research, 16(6):671–692, 2001.              10.1093/comjnl/bxv083, 2015.
 [3] A. S. Fiksdal, A. Kumbamu, A. S. Jadhav, C. Cocos,           [17] E. G. Toms and C. Latter. How consumers search for
     L. A. Nelsen, J. Pathak, and J. B. McCormick.                     health information. Health informatics journal,
     Evaluating the process of online health information               13(3):223–235, 2007.
     searching: a qualitative approach to exploring               [18] C. Yunzhi, L. Huijuan, L. Shapiro, R. S. Travillian,
     consumer perspectives. Journal of medical Internet                and L. Lanjuan. An approach to semantic
     research, 16(10):e224, 2014.                                      query expansion system based on hepatitis ontology.
 [4] L. Goeuriot, G. J. Jones, L. Kelly, H. Müller, and               Journal of Biological Research-Thessaloniki, 23(1):11,
     J. Zobel. Medical information retrieval: Introduction             2016.
     to the special issue. Inf. Retr., 19(1-2):1–5, April 2016.   [19] Q. Zeng, S. Kogan, N. Ash, R. Greenes, A. Boxwala,
 [5] Y. Hong, N. de la Cruz, G. Barnas, E. Early, and                  et al. Characteristics of consumer terminology for
     R. Gillis. A query analysis of consumer health                    health information retrieval. Methods of information
     information retrieval. In Proceedings of the AMIA                 in medicine, 41(4):289–298, 2002.
     Symposium, page 1046. American Medical Informatics           [20] Q. T. Zeng et al. Assisting consumer health
     Association, 2002.                                                information retrieval with query recommendations.
 [6] M. Janaki Meena and K. Chandran. Naive bayes text                 Journal of the American Medical Informatics
     classification with positive features selected by                 Association, 13(1):80–90, 2006.
     statistical method. In International Conference on           [21] Y. Zhang, H. Cui, J. Burkell, and R. E. Mercer. A
     Autonomic Computing and Communications, ICAC                      machine learning approach for rating the quality of
     2009, pages 28–33. IEEE, 2009.                                    depression treatment web pages. iConference 2014
 [7] A. Keselman, A. C. Browne, and D. R. Kaufman.                     Proceedings, 2014.
     Consumer health information seeking as hypothesis            [22] Y. Zhang, P. Wang, A. Heaton, and H. Winkler.
     testing. Journal of the American Medical Informatics              Health information searching behavior in medlineplus
     Association, 15(4):484–495, 2008.                                 and the impact of tasks. In Proceedings of the 2nd
 [8] Y. Li, C. Luo, and S. M. Chung. Text clustering                   ACM SIGHIT International Health Informatics
     with feature selection by using statistical data. IEEE            Symposium, pages 641–650. ACM, 2012.
     Transactions on Knowledge and Data Engineering,
     20(5):641–652, 2008.
 [9] B. E. Nerkar and S. S. Gharde. Best treatment
     identification for disease using machine learning
     approach in relation to short text. IOSR Journal of
     Computer Engineering (IOSR-JCE), 16(3):5–12, 2014.
[10] Y. Peng, P. A. Flach, P. Brazdil, and C. Soares.
     Decision tree-based data characterization for
     meta-learning. IDDM-2002, page 111, 2002.