JU_KS_Group@FIRE 2016: Consumer Health Information Search

Kamal Sarkar, Debanjan Das, Indra Banerjee, Mamta Kumari, Prasenjit Biswas
Dept. of Computer Sc. & Engg., Jadavpur University, Kolkata, WB 700032
jukamal2001@yahoo.com, dasdebanjan624@gmail.com, ardnibanerjee@gmail.com, mamta.mk222@gmail.com, p.biswas.ju94@gmail.com

ABSTRACT
In this paper, we describe the methodology used and the results obtained by us for the shared task on Consumer Health Information Search (CHIS) collocated with the Forum for Information Retrieval Evaluation (FIRE) 2016, ISI Kolkata. The shared task consists of two sub-tasks: (1) task 1: given a query and a document/set of documents associated with that query, classify the sentences in the document as relevant to the query or not, and (2) task 2: further classify the relevant sentences as supporting the claim made in the query or opposing it. We participated in both sub-tasks. The percentage accuracy obtained by our system for task 1 was 73.39, the third highest among the 9 teams that participated in the shared task.

Categories and Subject Descriptors
H.1.2 [Information Systems]: User/Machine Systems - human factors, human information processing.

Keywords
Consumer health information search, searching behavior, search tasks, user query, document sentences.

1. INTRODUCTION

1.1 Our Motivation
A large number of websites provide health related information [1][2]. Consumer use of the Internet for seeking health information is rapidly growing [3].
By 1997, nearly half of Internet users in the US had sought health information [4]. Expressed in raw numbers, an estimated 18 million adults in the US sought health information online in 1998. The majority of consumers seek health information for themselves, mostly related to diseases, for consultation with their physicians [5][6]. Information found through search on the web may influence medical decision making and help consumers manage their own care [7]. The most common topics searched on the web are the leading causes of death (heart disease and cancer) and children's health.

Information access mechanisms for factual health information retrieval have matured considerably, with search engines providing fact-checked Health Knowledge Graph results for factual health queries. It is straightforward to get an answer to the query "what are the symptoms of Diabetes" from these search engines [8][9][10]. But most general purpose search engines can hardly find answers to complex health search queries which do not have a single definitive answer and whose answers have multiple perspectives. There may be search queries for which there are a large number of search results reflecting different perspectives and view-points in favor of or against the query.

The term "Consumer Health Information Search" (CHIS) has been used by the organizers of the shared task on Consumer Health Information Search @FIRE 2016 to denote such information retrieval search tasks for which there is no "Single Correct Answer" and, instead, multiple and diverse perspectives/points of view, which very often are contradictory in nature, are available on the web regarding the queried information¹.
¹ https://sites.google.com/site/multiperspectivehealthqa/

1.2 Problem Statement
The shared task on Consumer Health Information Search @FIRE 2016 has the following two sub-tasks:

A) Task 1 - Given a CHIS query and a document/set of documents associated with that query, classify the sentences in the document as relevant to the query or not. Relevant sentences in the document are those which are useful in providing the answer to the query.

B) Task 2 - These relevant sentences have to be further classified as supporting the claim made in the query or opposing it.

1.2.1 Examples
E.g. Query - Are e-cigarettes safer than normal cigarettes?

S1: Because some research has suggested that the levels of most toxicants in vapor are lower than the levels in smoke, e-cigarettes have been deemed to be safer than regular cigarettes.
A) Relevant, B) Support

S2: David Peyton, a chemistry professor at Portland State University who helped conduct the research, says that the type of formaldehyde generated by e-cigarettes could increase the likelihood it would get deposited in the lung, leading to lung cancer.
A) Relevant, B) Oppose

S3: Harvey Simon, MD, Harvard Health Editor, expressed concern that the nicotine amounts in e-cigarettes can vary significantly.
A) Irrelevant, B) Neutral

2. METHODOLOGY

2.1 Description
For both tasks, Task 1 and Task 2, we have used support vector machines (SVM) as the classifier, but the feature sets for task 1 and task 2 were different. We discuss the feature sets used for task 1 and task 2 in sub-sections 2.1.1 and 2.1.3 respectively.

2.1.1 Our Used Features for Task 1
For task 1, the organizers of the shared task gave us a set of Excel files where the heading of each Excel file was a user query. Each Excel file contained a set of training sentences already labeled as 'relevant' or 'irrelevant' to the user query. We took each sentence from each Excel file, paired it with the corresponding query, and calculated the set of five features discussed in this sub-section.

2.1.1.1 Exact Matching: We matched each sentence with the user query, word by word, and calculated the similarity between the user query and the current sentence in the Excel file; e.g.
Let the user query be "Ram is a good boy" and the current sentence be "Shyam is a bad boy". Between the user query and the current sentence there are three words which match exactly, i.e. "is", "a" and "boy". The similarity between these two strings is given as:

Similarity = {2 * (No. of Common Words)} / {(No. of words in the user query) + (No. of words in the current sentence)} --- (i)

where No. of Common Words = number of words common to both the user query and the current sentence.
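As a minimal sketch, equation (i) can be computed as follows; the function name and the whitespace tokenization are our own illustrative assumptions, not details taken from the original system:

    def overlap_similarity(query, sentence):
        # Dice-style word overlap of equation (i):
        # 2 * |common words| / (|query words| + |sentence words|)
        q_words = query.lower().split()
        s_words = sentence.lower().split()
        common = set(q_words) & set(s_words)
        return 2 * len(common) / (len(q_words) + len(s_words))

    # Example from the text: "is", "a" and "boy" match,
    # so the score is 2 * 3 / (5 + 5) = 0.6
    print(overlap_similarity("Ram is a good boy", "Shyam is a bad boy"))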
2.1.1.2 Stemmed Word Matching: We stemmed both the user query and the current sentence using a stemming tool available in the Python programming language. Stemming normalizes a word by cutting off the excess part of the word arising from pluralization or adverb formation; e.g. mangoes → mango, highly → high, etc. After stemming, we again calculated the similarity between the two strings using equation (i).

2.1.1.3 Noun Matching: We found, on a perusal of the initial sample data, that the nouns present in each sentence largely influenced whether a search result was relevant or irrelevant to the user query. So we isolated the nouns present in the user query, checked whether any of these nouns matched any word present in the current sentence, and by this process found the number of nouns in the current sentence that exactly matched nouns in the user query. We calculated the noun matching similarity using the following formula:

Noun Similarity = (No. of nouns that exactly match the nouns in the query) / (No. of nouns present in the user query) --- (ii)

2.1.1.4 Neighborhood Matching: Some words present in the sentences do not match the words of the user query exactly, but are semantically similar to the user query words; e.g.
Let 'skin cancer' be present in the user query and 'melanoma' be present in the current sentence. The two words are spelt differently but their meanings are similar, i.e. they are meaningfully similar. To check whether two words were equivalent, we took each word from the current sentence, looked it up in our self-made Wikipedia Dictionary [11], and extracted the first three sentences describing that word's meaning. We then matched the user query words with the words present in the extracted sentences; if a word is present, we consider it a match, and finally we calculate the similarity between the user query and the current sentence again using equation (i).
We created the Wikipedia Dictionary by saving words along with their meanings, which were extracted from Wikipedia, using a Python script we developed.

2.1.1.5 Cosine Similarity: We represent both the query and a sentence using the bag-of-words model, i.e. each query as well as each sentence is represented as a vector. The component of each vector is the TFIDF weight of a word t, which is calculated as follows:

TF(t) = (Number of times word 't' appears in a sentence) / (Total number of words in the sentence)
IDF(t) = log(N / DF)

where N = total number of sentences and DF = number of sentences containing word 't'.

After calculating the vectors for the query and the sentence, the cosine similarity between the query vector and the sentence vector is calculated. The cosine similarity value is used as one of the feature values for relevance checking.
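A minimal sketch of this TFIDF-plus-cosine computation is given below; the helper names, the whitespace tokenization, and the direct implementation of the stated TF and IDF formulas are our own illustrative assumptions:

    import math
    from collections import Counter

    def build_vocab_df(sentences):
        # Vocabulary and document frequency DF(t) over the sentence collection
        vocab = sorted({w for s in sentences for w in s.lower().split()})
        df = {t: sum(1 for s in sentences if t in s.lower().split()) for t in vocab}
        return vocab, df

    def tfidf_vector(text, vocab, df, n_sentences):
        # TF(t) = count(t) / len(text); IDF(t) = log(N / DF(t)), as in 2.1.1.5
        words = text.lower().split()
        counts = Counter(words)
        return [
            (counts[t] / len(words)) * math.log(n_sentences / df[t])
            if words and df.get(t) else 0.0
            for t in vocab
        ]

    def cosine(u, v):
        # Standard cosine similarity between two equal-length vectors
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0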
2.1.2 Search as Classification
For task 1, we represent each training sentence as a vector of the five feature values mentioned above and label each vector as "relevant" or "not relevant". With this labeled training data, we train the support vector machine (SVM). For the SVM, we used the SVC tool available in Python scikit-learn to generate a model. Since no development set was available, for parameter tuning we split the training data into two parts: (1) the first part containing 60% of the training data and (2) the second part containing the remaining 40%. We train the SVM on the 60% portion and then test the obtained model on the remaining part of the training data. In this way we tune the parameters to obtain the best parameter settings. Finally, we obtained the best results with the cost parameter C set to 107, gamma set to 0.006, and the kernel set to "poly".

Like the training data, we represent the unlabeled test data released by the organizers of the shared task in the same way, using the five features mentioned in sub-section 2.1.1, and then submit it to the trained classifier. The classifier, using its knowledge from the training data, predicts the label for each of the sentences present in the test data.

2.1.3 Our Used Features for Task 2
After relevancy checking (task 1), task 2 is carried out. By task 1, all the sentences in the Excel file have been divided into two classes: (a) relevant and (b) irrelevant. Now the task is to determine whether a relevant sentence supports the user query, opposes it, or is neutral with regard to it. For this task we calculated a set of N+4 features, where N = number of distinct words present in the entire training files. Here the feature set includes the N distinct unigrams present in the training data and four other features discussed in the following sub-sections.

2.1.3.1 Number of Positive Words: We counted the number of positive words present in each sentence of the Excel file. We recognized the positive words in a particular sentence by using the SentiWordNet lexicon available through the Python NLTK package².
² http://www.nltk.org/howto/sentiwordnet.html, http://www.nltk.org/_modules/nltk/corpus/reader/sentiwordnet.html

2.1.3.2 Number of Negative Words: We counted the number of negative words present in each sentence of the Excel file. We recognized the negative words in a particular sentence by using SentiWordNet.

2.1.3.3 Number of Neutral Words: Having already found the positive and negative words of a particular sentence, the words that were neither negative nor positive were classified as neutral words, and their occurrences in the current sentence were counted.
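A minimal sketch of how such positive/negative/neutral counts can be obtained with NLTK's SentiWordNet interface follows; taking the first synset and comparing its scores is our own illustrative choice, not necessarily the exact rule used in the original system:

    import nltk
    from nltk.corpus import sentiwordnet as swn

    # One-time downloads of the required corpora
    nltk.download('sentiwordnet')
    nltk.download('wordnet')

    def sentiment_counts(sentence):
        # Returns (positive, negative, neutral) word counts for a sentence
        pos = neg = neu = 0
        for word in sentence.lower().split():
            synsets = list(swn.senti_synsets(word))
            if not synsets:
                neu += 1
                continue
            s = synsets[0]  # first sense used as an approximation
            if s.pos_score() > s.neg_score():
                pos += 1
            elif s.neg_score() > s.pos_score():
                neg += 1
            else:
                neu += 1
        return pos, neg, neu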
2.1.3.4 Relevant or Irrelevant: In task 1 we have already labeled each sentence as either relevant or irrelevant. We took this label into consideration for this task. It is a binary feature, as the current sentence can only be relevant or irrelevant.

2.1.3.5 'N' Features: We represent each sentence using the bag-of-words model. According to the vector space model, a sentence is represented as an N-dimensional vector, where N is the number of distinct unigrams present in the training data. The weight of a word used as the component of a vector is calculated using the TFIDF formula.

2.1.4 Sentiment Classification
We represent each sentence in the Excel file as a vector using the above-mentioned N+4 features and label each vector with the label of the corresponding training sentence. The label can be one of three types: "Support", "Oppose" and "Neutral". Finally, we submit the labeled vectors to the SVM classifier specified for task 1 and train it using them. The model is generated after training. As in task 1, we split the training data into two parts: (1) the first part, containing 60% of the training data, which is used to develop the initial model, and (2) the remaining 40% of the training data, which is used to test the model while tuning the parameters. After tuning the parameters of the SVC tool available in Python scikit-learn, we obtain the best model with the cost parameter C set to 107, gamma set to 0.005, and the kernel set to "rbf".

We also represent the unlabeled test data released by the organizers for task 2 as vectors using the same feature set consisting of N+4 features and submit them to the trained model, which in turn predicts the label 'supporting'/'opposing'/'neutral' for each sentence present in the test Excel file.
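Putting the task 2 pieces together, the following is a minimal sketch of the classifier stage with scikit-learn's SVC; the build_feature_vectors helper is hypothetical, the random 60/40 split is an assumption (the text does not say how the split was made), and the hyperparameter values are those reported above. The task 1 pipeline is analogous, with the five features of sub-section 2.1.1 and the "poly" kernel.

    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    def train_task2_classifier(X, y):
        # X: list of N+4-dimensional feature vectors
        # y: corresponding labels, "Support" / "Oppose" / "Neutral"
        X_train, X_dev, y_train, y_dev = train_test_split(
            X, y, train_size=0.6, random_state=0)
        clf = SVC(C=107, gamma=0.005, kernel='rbf')
        clf.fit(X_train, y_train)
        print('held-out accuracy:', clf.score(X_dev, y_dev))
        return clf

    # Usage (hypothetical helper):
    # X, y = build_feature_vectors(training_sentences)
    # clf = train_task2_classifier(X, y)
    # predictions = clf.predict(X_test)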
2.2 Architecture
The architectures of our systems for task 1 and task 2 are shown in Figure 1 and Figure 2 respectively. For both systems, the important modules are feature extraction and the classifier. For task 1, we use the 5 features discussed in the earlier sections, and for task 2 we use the N+4 features, also discussed earlier.

For task 1, after feature extraction from each query-sentence pair, each sentence is represented as a vector which is labeled with the label of the corresponding training sentence. The labeled vectors are then given to the classifier to produce a model. Finally, the learned model is used to determine the relevancy of the test sentences given a query. For task 2, we extract features from the sentences and represent the sentences as vectors labeled with one of the categories "oppose", "support" and "neutral". The classifier is trained with the labeled training pattern vectors, and the learned model is used to classify the test sentences into one of the categories "oppose", "support" and "neutral".

[Figure 1. System Architecture for Task 1]
[Figure 2. System Architecture for Task 2]

3. DATA SETS, RESULTS, EVALUATION

3.1 Data Sets
For the training data, we were given five user queries along with a number of sentences per query [12]:

• "does_sun_exposure_cause_skin_cancer" -- 68 sentences
• "e – cigarettes" -- 83 sentences
• "HRT_cause_cancer" -- 61 sentences
• "MMR_vaccine_lead_to_autism" -- 71 sentences
• "vitamin_C_common_cold" -- 65 sentences

A total of 348 sentences were present in the training data set.

For the test data, the queries were the same as for the training data, and the number of unlabeled sentences per query was as follows:

• "does_sun_exposure_cause_skin_cancer" -- 342 sentences
• "e – cigarettes" -- 414 sentences
• "HRT_cause_cancer" -- 260 sentences
• "MMR_vaccine_lead_to_autism" -- 279 sentences
• "vitamin_C_common_cold" -- 247 sentences

A total of 1542 sentences were present in the test data set.

3.2 Results
We developed our systems for both task 1 and task 2 using the training data [12] supplied to us by the organizers of the contest. After the release of the test data by the organizers, we ran our system on the test data and sent the result files, along with the complete system, to the organizers. They evaluated the results using the traditional percentage accuracy and published the results, which were sent to us by e-mail.

We show the officially published results of task 1 and task 2 for the 9 participating teams in Table 1 and Table 2 respectively. The results shown in red bold font are the performances of the top systems for each task.

[Table 1. Performance of the participating systems for Task 1]
[Table 2. Performance of the participating systems for Task 2]

Out of the 9 participants, our system (JU_KS_Group) achieves the third highest average accuracy for task 1, i.e. 73.39257557%. We can also evaluate the results for task 1 from a different angle: it is evident from Table 1 that our system performs better for 3 queries out of 5, whereas the system SSN_NLP, which has the best average accuracy (78.10%), performs better for 2 queries out of 5. The main reason our system gives better results for task 1 is the use of two novel features, noun matching and neighborhood matching.

For task 2, our system achieves an average accuracy of 33.63514278%, which is relatively poor. One of the reasons for the poor performance on task 2 is that we considered the "neutral" class along with the other two classes, "oppose" and "support", while classifying the relevant sentences. It is evident from the training data that only the irrelevant sentences were assigned the "neutral" class. Actually, task 2 was to classify the relevant sentences into two categories, "support" and "oppose", but we mistakenly treated task 2 as a 3-class problem instead of a 2-class problem. We are working to improve our proposed methods so that our systems can perform more accurately for both tasks.

4. CONCLUSION
There has been a dearth of proper search systems for medical queries, and our work on the CHIS tasks puts us on the path to filling this void. The methodology we used can be improved upon and extended to create a novel search method not only for medical queries, but for specific search queries in any field. What we have done, and are continuing to improve on, is a logical way of searching through data that is already available to the public. We sincerely believe that machine learning and natural language processing will shape the future of online search, and we have tried to contribute towards this goal through our work; we expect this to be especially useful in the medical field.

For future work, we would incorporate a word sense disambiguation module to disambiguate the query words. We also hope that our system will give more accurate results for task 2 if we treat the classification of relevant sentences as a 2-class problem ("support" and "oppose") instead of the 3-class problem ("support", "oppose" and "neutral") that we addressed during the contest.

ACKNOWLEDGMENTS
We would like to thank the Forum for Information Retrieval Evaluation (FIRE) 2016, ISI Kolkata, for providing us the tasks and data sets for CHIS.
5. REFERENCES
[1] Cline, R. J. and Haynes, K. M. 2001. Consumer health information seeking on the Internet: the state of the art. Health Education Research, 16(6), 671-692.
[2] Grandinetti, D. A. 2000. Doctors and the Web: help your patients surf the Net safely. Medical Economics, April, 28-34.
[3] Lacroix, E. M., Backus, J. E. and Lyon, B. J. 1994. Service providers and users discover the Internet. Bulletin of the Medical Library Association, 82, 412-418.
[4] Eng, T. R., Maxfield, A., Patrick, K., Deering, M. J., Ratzan, S. C. and Gustafson, D. H. 1998. Access to health information and support: a public highway or a private road? Journal of the American Medical Association, 280, 1371-1375.
[5] Chi-Lum, B. 1999. Friend or foe: consumers using the Internet for medical information. Journal of Medical Practice Management, 14, 196-198.
[6] Boyer, C., Selby, M. and Appel, R. D. 1998. The Health on the Net Code of Conduct for medical and health Web sites. Medinfo, 9 (Part 2), 1163-1166.
[7] Wilkins, A. S. 1999. Expanding Internet access for health care consumers. Health Care Management Review, 24, 30-41.
[8] Hong, Y., Cruz, N., Marnas, G., Early, E. and Gillis, R. 2002. A query analysis of consumer health information retrieval. In Proceedings of AMIA 2002, 1046.
[9] Belkin, N. J., Oddy, R. N. and Brooks, H. M. 1982. ASK for information retrieval: Part I. Background and theory. Journal of Documentation, 38, 61-71.
[10] Keselman, A., Browne, A. C. and Kaufman, D. R. 2008. Consumer health information seeking as hypothesis testing. JAMIA, 15, 484-495.
[11] Efthimiadis, E. N. 2009. How students search for consumer health information on the web. In Proceedings of the 42nd Hawaii International Conference on System Sciences, 1-8.
[12] Sinha, M., Mannarswamy, S. and Roy, S. 2016. CHIS@FIRE: Overview of the CHIS Track on Consumer Health Information Search. Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings, CEUR-WS.org.