CHIS@FIRE: Overview of the Shared Task on Consumer Health Information Search

Manjira Sinha, Sandya Mannarswamy, Shourya Roy
Xerox Research Center India, Bengaluru, India
(manjira.sinha, sandya.mannarswamy, shourya.roy)@xerox.com

ABSTRACT

People are increasingly turning to the World Wide Web to find answers to their health and lifestyle queries. While search engines are effective in answering direct factual questions such as "What are the symptoms of disease X?", they are far less effective at addressing complex consumer health queries that do not have a single definitive answer, such as "Is treatment X effective for disease Y?". Instead, users are presented with a vast number of search results offering often contradictory perspectives and no definitive conclusion. The term "Consumer Health Information Search" (CHIS) denotes such information retrieval tasks, for which there is "No Single Best Correct Answer". The proposed CHIS track aims to investigate complex health information search in scenarios where users search for health information with more than just a single correct answer, and look for multiple perspectives from diverse sources, both from medical research and from real-world patient narratives.

Keywords

information retrieval for clinical texts, multi-perspective health data mining

1. INTRODUCTION

The World Wide Web is increasingly used by consumers as an aid for health decision making and for self-management of chronic illnesses, as evidenced by the fact that one in every 20 searches on Google is about health [5]. Information access mechanisms for factual health information retrieval have matured considerably, with search engines providing fact-checked Health Knowledge Graph results for factual health queries. While the direct informational needs of online health information seekers regarding well-established disease symptoms and remedies are well met by search engines [5], general-purpose search engines do not provide definitive answers to complex consumer health queries that have multiple points of view or perspectives associated with them.

It is straightforward to get an answer to the query "what are the symptoms of diabetes" from a search engine. However, retrieving the multiple relevant perspectives on complex health search queries that lack a single definitive answer remains elusive for most general-purpose search engines. For example, a user health query such as "can metabolic therapy cure brain cancer" causes considerable frustration for the searcher, who needs to wade through hundreds of search results to obtain a balanced view of the diverse perspectives available, both for and against the hypothesis posed in the search query. Subjective health-related queries such as "is treatment X effective for disease Y?" or "can X cause disease Y?" have no single definitive answer on the web; instead, multiple perspectives, very often contradictory in nature, are available. The presence of multiple perspectives with different grades of supporting evidence, which changes dynamically over time as new research and practice evidence arrives, makes such searches all the more challenging for a lay searcher. Figure 1 depicts this scenario.

Figure 1: Contradicting search results for a clinical query.

In our CHIS shared task track at FIRE (https://sites.google.com/site/multiperspectivehealthqa/), we have attempted to encourage the development of innovative computational models that represent the multiple perspectives around a general health search query and thereby assist the self-searcher with better and more meaningful information insights.
2. BACKGROUND

There has recently been considerable interest in stance classification and stance modelling. Stance classification has been applied to different debate settings such as congressional debates [15, 18, 3], company-internal debates [9, 10, 1] and online public forums on social and political topics [13, 14, 17, 4, 16, 2]. More recently, there has been work on stance classification of argumentative political essays [6], online news articles [7] and online news comments [12].

Unlike many of these earlier research settings, which analyzed posts on public debate topics, multi-perspective consumer health information is not typically characterized by strongly emotional or opinion-bearing language, nor does it have strongly delineated supporting/opposing topic words. It typically contains domain-specific technical terms, is sparse in emotional and affective words, and is largely factual in nature. A closely related work [19] studied information-seeking behaviour around the MMR vaccine on internet search engines and developed an automated way to score search queries and web pages for the likelihood of the searcher deciding to vaccinate. Moreover, while socio-political debate stances can often be delineated by well-demarcated topic words (for instance, a pro-abortion stance is often characterized by the topical phrase "right to choose", whereas an anti-abortion stance is characterized by "pro-life"), health-related texts do not typically contain stance-delineating topic words, since the same proposition can be used to support or oppose a given health query depending on the research evidence cited. For instance, consider the following example sentences retrieved in response to the query "Sun exposure causes skin cancer":

• S1: "Many studies have found that skin cancer rates are increasing in indoor workers."

• S2: "Very few studies have demonstrated that skin cancer rates are increasing in indoor workers."

Both sentences contain the topical phrase "skin cancer rates in indoor workers", with sentence S1 providing evidence in support of the query and sentence S2 providing evidence opposing it. This illustrates the difficulty of identifying stance-delineating topic words in health-related text.

The technical language of these queries is another factor that makes stance classification complex. Given the sample query "E-cigarettes are safer than normal cigarettes" and the example sentence "E-cigarettes contain di-acetyl, which has been associated with popcorn lung syndrome", it is not evident at first glance whether the sentence supports or opposes the query. This makes the task more challenging than general-domain stance classification.
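To make the lexical-overlap problem concrete, here is a minimal sketch of our own (it is not part of the task infrastructure; the use of scikit-learn and a tf-idf bag-of-words view is our assumption) showing that the two example sentences above are nearly identical on the surface even though they take opposite stances:

```python
# Minimal sketch (ours, not from the shared task): two sentences with
# opposite stances are nearly identical under a bag-of-words view.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

s1 = "Many studies have found that skin cancer rates are increasing in indoor workers."
s2 = "Very few studies have demonstrated that skin cancer rates are increasing in indoor workers."

vectors = TfidfVectorizer().fit_transform([s1, s2])

# The cosine similarity is high despite the opposite stances, which is why
# surface lexical features alone cannot separate support from opposition.
print(cosine_similarity(vectors[0], vectors[1])[0, 0])
```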
3. TASK DESCRIPTION

Given a CHIS query and a document or set of documents associated with that query, the task is to classify the sentences in the document(s) as relevant to the query or not. The relevant sentences are those from the document which are useful in answering the query. These relevant sentences then need to be further classified as supporting or opposing the claim made in the query.

Example query: "Does daily aspirin therapy prevent heart attack?"

S1: "Many medical experts recommend daily aspirin therapy for preventing heart attacks in people of age fifty and above." [affirmative/Support]

S2: "While aspirin has some role in preventing blood clots, daily aspirin therapy is not for everyone as a primary heart attack prevention method." [disagreement/Oppose]

3.1 Detailed Task Description

There are two tasks:

1. TASK A: Given a CHIS query and a document or set of documents associated with that query, classify the sentences in the document as relevant to the query or not. The relevant sentences are those from the document which are useful in answering the query.

2. TASK B: Further classify each relevant sentence as supporting or opposing the claim made in the query.

Example:

• Query: "Are e-cigarettes safer than normal cigarettes?"

• Retrieved sentence S1: "Because some research has suggested that the levels of most toxicants in vapor are lower than the levels in smoke, e-cigarettes have been deemed to be safer than regular cigarettes." A) Relevant, B) Support

• Retrieved sentence S2: "David Peyton, a chemistry professor at Portland State University who helped conduct the research, says that the type of formaldehyde generated by e-cigarettes could increase the likelihood it would get deposited in the lung, leading to lung cancer." A) Relevant, B) Oppose

• Retrieved sentence S3: "Harvey Simon, MD, Harvard Health Editor, expressed concern that the nicotine amounts in e-cigarettes can vary significantly." A) Irrelevant, B) Neutral

Our task has five consumer health queries. Figures 2 and 3 below present the comprehensive statistics of the CHIS queries released as training and test data, respectively.

Figure 2: Statistics of queries in the training data.

Figure 3: Statistics of queries in the test data.
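To make the two-stage setup concrete, the following is a minimal baseline sketch of our own devising; it is not any participant's system, and the toy data, the scikit-learn pipeline, and all names are illustrative assumptions:

```python
# Hypothetical two-stage baseline for Task A (relevance) and Task B (stance).
# An illustrative sketch under our own assumptions, not any team's system.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

query = "Are e-cigarettes safer than normal cigarettes?"
# Toy training data adapted from the examples above; a real run would use
# the released CHIS training queries instead.
sentences = [
    "levels of most toxicants in vapor are lower than in smoke",
    "formaldehyde generated by e-cigarettes could lead to lung cancer",
    "nicotine amounts in e-cigarettes can vary significantly",
]
relevance = ["relevant", "relevant", "irrelevant"]  # Task A labels
stance = ["support", "oppose"]                      # Task B labels (relevant only)

def fit_stage(texts, labels):
    """One tf-idf + logistic-regression stage; the query is prepended so the
    classifier can condition on it."""
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit([query + " " + t for t in texts], labels)
    return model

task_a = fit_stage(sentences, relevance)   # relevant vs. irrelevant
task_b = fit_stage(sentences[:2], stance)  # support vs. oppose

new = "research suggests vapor contains fewer toxicants than smoke"
if task_a.predict([query + " " + new])[0] == "relevant":
    print(task_b.predict([query + " " + new])[0])
```

In a real run, only sentences the first stage judges relevant would be passed on to the second, stance-classification stage, mirroring the Task A/Task B split.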
4. TASK PARTICIPANTS AND RESULTS

A total of nine teams participated in the task; nine submissions were received for Task A and eight for Task B. Details of the participating teams are shown in Figure 4 below.

Figure 4: Team details.

4.1 Performance of Teams in Task A

Figure 5 below presents the team performance statistics for Task A, i.e., classifying whether a retrieved instance is relevant or irrelevant to a specific query. (This is the final updated result table; individual team working notes may not contain the latest version owing to some late changes.)

Figure 5: Final result table for Task A.

As can be observed, in Task A teams SSN NLP and Fermi secured the top positions with accuracies of 78.10% and 77.04%, respectively. SSN NLP proposed a decision tree model based on sophisticated text features including part-of-speech. They used chi-square feature selection to extract the informative features and reduce the number of spurious ones, and demonstrated that such a feature selection approach can offer a significant gain. Team Fermi used a deep neural network architecture with rectified linear (ReLU) and sigmoid activations over bag-of-phrase features.

Teams JU KS group and Techie-challengers jointly secured the second position in a close call, with 73.39% and 73.03% accuracy, respectively. JU KS group implemented a support vector machine with a polynomial kernel, representing the input with curated text features such as part-of-speech matching and neighborhood matching. Techie-challengers proposed a naive Bayes classifier over an ensemble representation of doc2vec [8] and tf-idf features.

With accuracies of 70.20% and 70.28%, team Amrita Cen and individual participant Jainisha Sankhavara jointly ranked third. Team Amrita Cen used a support vector machine classifier on top of an input representation obtained through word-embedding and keyword generation techniques. Jainisha proposed a classification model based on the BM25 [11] ranking function and a tf-idf based input representation.

Hua Yang approached the task from the perspective of improving understandability in consumer health searches; their information-retrieval-based query expansion module achieved 69.33% accuracy. Team Amrita Fire Cen used a random forest classifier on a distributional semantic representation of the input and obtained 68.12% accuracy; they used non-negative matrix factorization to obtain the distributed representation. Team JNUTH aggregated a range of similarity measures to reach the relevance/irrelevance decision for each input, obtaining 54.84% accuracy.
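As an illustration of the feature-selection idea behind the top Task A entry, here is a minimal sketch in the spirit of the SSN NLP pipeline: a decision tree over chi-square-selected text features. This is our reconstruction under stated assumptions (scikit-learn, simple n-gram counts in place of their richer part-of-speech features, toy data), not the team's actual code:

```python
# Sketch of a decision tree over chi-square-selected text features, in the
# spirit of SSN NLP's Task A system; our reconstruction, not their code.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

texts = [
    "many studies have found sun exposure causes skin cancer",
    "very few studies have demonstrated sun exposure causes skin cancer",
    "aspirin therapy is recommended for preventing heart attacks",
    "the professor expressed concern about varying nicotine amounts",
]
labels = ["relevant", "relevant", "relevant", "irrelevant"]  # toy Task A labels

# chi2 scores each n-gram feature against the labels; SelectKBest keeps only
# the k most informative ones, pruning spurious features before the tree grows.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    SelectKBest(chi2, k=10),
    DecisionTreeClassifier(max_depth=3),
)
model.fit(texts, labels)
print(model.predict(["few studies have found aspirin prevents skin cancer"]))
```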
4.2 Performance of Teams in Task B

Figure 6 below presents the team performances for Task B.

Figure 6: Final result table for Task B.

In Task B, team JNUTH jointly secured the first position with team Fermi. JNUTH used a C-support vector machine classifier with a radial basis function kernel over a tf-idf input representation, followed by max-feature sorting; their model obtained 55.43% accuracy. Team Fermi used a deep neural network architecture with a bag-of-phrase representation to achieve 54.87% accuracy.

With a score of 53.99%, Hua Yang secured the second rank with a model using a naive Bayes classifier and a tf-idf representation. Team Techie-challengers also used a naive Bayes classifier, but on a doc2vec input representation, obtaining 52.47% accuracy and hence the third rank. Team Amrita Fire Cen used a random forest classifier on a distributional semantic representation of the input and obtained 38.53% accuracy. Individual participant Jainisha Sankhavara developed a model based on the BM25 ranking function, obtaining an overall accuracy of 37.96%. Team Amrita Cen modeled the task using a support vector machine classifier with an input representation obtained through word-embedding and keyword generation techniques, obtaining 34.64% accuracy. Team JU KS group modeled the task as a sentiment classification problem; their innovative feature set consists of positive, negative and neutral polarity words along with information from Task A. They achieved an overall accuracy of 33.64%.
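The winning Task B configuration can likewise be sketched: a C-support vector machine with an RBF kernel over tf-idf features. This is a hedged reconstruction of the JNUTH setup as described above; the hyperparameters, the reading of "max-feature sorting" as a cap on vocabulary size, and the toy data are our assumptions, not the team's code:

```python
# Sketch of an RBF-kernel C-SVM over tf-idf features for Task B (support vs.
# oppose); a reconstruction under our own assumptions, not JNUTH's code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

sentences = [
    "toxicant levels in vapor are lower than in cigarette smoke",
    "formaldehyde from e-cigarettes may deposit in the lung",
    "vaping has been deemed safer than regular cigarettes",
    "e-cigarette aerosol could increase the likelihood of lung cancer",
]
stances = ["support", "oppose", "support", "oppose"]

# max_features keeps only the highest-frequency terms, one plausible reading
# of the "max-feature sorting" step; C and gamma are illustrative defaults.
model = make_pipeline(
    TfidfVectorizer(max_features=500),
    SVC(kernel="rbf", C=1.0, gamma="scale"),
)
model.fit(sentences, stances)
print(model.predict(["vapor contains fewer toxicants than smoke"]))
```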
5. CONCLUSION

We thank all the participants for expressing interest in our track. It has been a great experience to witness the innovative models and techniques proposed by the different teams. The CHIS task was surely a challenging one with little preceding literature, and yet, as can be observed from the previous section, both tasks produced close calls among the performances of different teams.

We also express our sincere gratitude to the organizing and program committee of the Forum for Information Retrieval Evaluation (FIRE), 2016, especially Mr. Parth Mehta, for providing us with the opportunity to hold the shared task and to connect with enthusiastic researchers across India and abroad who share the same interests.

In future, we look forward to working again with such expert groups to develop novel solutions to ever more challenging health-care data analytics problems.

6. REFERENCES

[1] R. Agrawal, S. Rajagopalan, R. Srikant, and Y. Xu. Mining newsgroups using networks arising from social behavior. In Proceedings of the 12th International Conference on World Wide Web, WWW '03, pages 529–535, New York, NY, USA, 2003. ACM.
[2] P. Anand, M. Walker, R. Abbott, J. E. F. Tree, R. Bowmani, and M. Minor. Cats rule and dogs drool!: Classifying stance in online debate. In Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, WASSA '11, pages 1–9, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
[3] A. Balahur, Z. Kozareva, and A. Montoyo. Determining the polarity and source of opinions expressed in political debates. In Proceedings of the 10th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing '09, pages 468–480, Berlin, Heidelberg, 2009. Springer-Verlag.
[4] O. Biran and O. Rambow. Identifying justifications in written dialogs. In Proceedings of the 2011 IEEE Fifth International Conference on Semantic Computing, 2011.
[5] Official Google Blog. Google health information knowledge graph, 2015.
[6] A. Faulkner. Automated classification of stance in student essays: An approach using stance target information and the Wikipedia link-based measure. In Proceedings of the Twenty-Seventh International Florida Artificial Intelligence Research Society Conference (FLAIRS), 2014.
[7] W. Ferreira and A. Vlachos. Emergent: a novel data-set for stance classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics, 2016.
[8] Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. In ICML, volume 14, pages 1188–1196, 2014.
[9] T. Mullen and R. Malouf. Taking sides: User classification for informal online political discourse. Internet Research, 18:177–190, 2008.
[10] A. Murakami and R. Raymond. Support or oppose?: Classifying positions in online debates from reply activities and opinion expressions. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, COLING '10, pages 869–875, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
[11] S. Robertson, H. Zaragoza, and M. Taylor. Simple BM25 extension to multiple weighted fields. In Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pages 42–49. ACM, 2004.
[12] P. Sobhani, D. Inkpen, and S. Matwin. From argumentation mining to stance classification. In NAACL HLT 2015, 2015.
[13] S. Somasundaran and J. Wiebe. Recognizing stances in online debates. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, ACL '09, pages 226–234, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.
[14] S. Somasundaran and J. Wiebe. Recognizing stances in ideological on-line debates. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, CAAGET '10, pages 116–124, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
[15] M. Thomas, B. Pang, and L. Lee. Get out the vote: Determining support or opposition from congressional floor-debate transcripts. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP '06, pages 327–335, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics.
[16] M. A. Walker, P. Anand, R. Abbott, and R. Grant. Stance classification using dialogic properties of persuasion. In NAACL HLT '12, pages 592–596, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.
[17] Y.-C. Wang and C. P. Rosé. Making conversational structure explicit: Identification of initiation-response pairs within online discussions. In HLT '10: 2010 North American Chapter of the Association for Computational Linguistics, pages 673–676, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
[18] A. Yessenalina, Y. Yue, and C. Cardie. Multi-level structured models for document-level sentiment classification. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP '10, pages 1046–1056, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
[19] E. Yom-Tov and L. Fernández-Luque. Information is in the eye of the beholder: Seeking information on the MMR vaccine through an internet search engine. In AMIA 2014, American Medical Informatics Association Annual Symposium, Washington, DC, USA, November 15–19, 2014.