CHIS@FIRE: Overview of the Shared Task on Consumer Health Information Search

Manjira Sinha, Sandya Mannarswamy, Shourya Roy
Xerox Research Center India, Bengaluru, India
(manjira.sinha, sandya.mannarswamy, shourya.roy)@xerox.com

ABSTRACT

People are increasingly turning to the World Wide Web to find answers to their health and lifestyle queries. While search engines are effective in answering direct factual questions such as "What are the symptoms of disease X?", they are far less effective at addressing complex consumer health queries that do not have a single definitive answer, such as "Is treatment X effective for disease Y?". Instead, users are presented with a vast number of search results offering often contradictory perspectives and no definitive conclusion. The term "Consumer Health Information Search" (CHIS) denotes such information retrieval tasks, for which there is "No Single Best Correct Answer". The proposed CHIS track aims to investigate complex health information search in scenarios where users search for health information with more than just a single correct answer, and look for multiple perspectives from diverse sources, both from medical research and from real-world patient narratives.

Keywords

information retrieval for clinical texts, multi-perspective health data mining

1. INTRODUCTION

The World Wide Web is increasingly used by consumers as an aid for health decision making and for self-management of chronic illnesses, as evidenced by the fact that one in every 20 searches on Google is about health [5]. Information access mechanisms for factual health information retrieval have matured considerably, with search engines providing fact-checked Health Knowledge Graph results for factual health queries. While the direct informational needs of online health information seekers regarding well-established disease symptoms and remedies are well met by search engines [5], general-purpose search engines do not provide definitive answers to complex consumer health queries that have multiple points of view or perspectives associated with them.

It is straightforward to get an answer to the query "what are the symptoms of diabetes" from a search engine. However, retrieving the multiple relevant perspectives on complex health search queries that lack a single definitive answer remains elusive for most general-purpose search engines. For example, a user health query such as "can metabolic therapy cure brain cancer" causes considerable frustration for the searcher, who needs to wade through hundreds of search results to obtain a balanced view of the diverse perspectives available, both for and against the hypothesis posed in the search query. Subjective health-related queries such as "is treatment X effective for disease Y?" or "can X cause disease Y?" have no single definitive answer on the web; instead, multiple perspectives, very often contradictory in nature, are available. The presence of multiple perspectives with different grades of supporting evidence, which changes dynamically over time as new research and practice evidence arrives, makes such searches all the more challenging for a lay searcher. Figure 1 depicts this scenario.

Figure 1: Contradicting search results for a clinical query.

In our CHIS shared task track at FIRE (https://sites.google.com/site/multiperspectivehealthqa/), we have attempted to encourage the development of innovative computational models that represent the multiple perspectives around a general health search query and thereby assist the self-searcher with better and more meaningful information insights.
2. BACKGROUND

There has recently been considerable interest in stance classification and stance modelling. Stance classification has been applied to different debate settings such as congressional debates [15, 18, 3], company-internal debates [9, 10, 1] and online public forums on social and political topics [13, 14, 17, 4, 16, 2]. More recently, there has been work on stance classification of argumentative political essays [6], online news articles [7] and online news comments [12].

Unlike many of these earlier research settings, which analyzed posts on public debate topics, multi-perspective consumer health information is not typically characterized by strongly emotional or opinion-bearing language, nor does it have strongly delineated supporting/opposing topic words. It typically contains domain-specific technical terms, is sparse in emotional and affective words, and is largely factual in nature. A closely related work [19] studied information-seeking behaviour around the MMR vaccine on internet search engines and developed an automated way to score search queries and web pages for the likelihood of the searcher deciding to vaccinate. Moreover, while socio-political debate stances can often be delineated by well-demarcated topic words (for instance, a pro-abortion stance is often characterized by the topical phrase "right to choose", whereas an anti-abortion stance is characterized by "pro-life"), health-related texts do not typically contain stance-delineating topic words, since the same proposition can be used to support or oppose a given health query depending on the research evidence cited. For instance, consider the following example sentences retrieved in response to the query "Sun exposure causes skin cancer":

• S1: "Many studies have found that skin cancer rates are increasing in indoor workers."

• S2: "Very few studies have demonstrated that skin cancer rates are increasing in indoor workers."

Both sentences contain the topical phrase "skin cancer rates in indoor workers", with sentence S1 providing evidence in support of the query and sentence S2 providing evidence opposing it. This illustrates the difficulty of identifying stance-delineating topic words in health-related text.

The technical language of these queries is another factor that makes stance classification complex. Given the sample query "E-cigarettes are safer than normal cigarettes" and the example sentence "E-cigarettes contain di-acetyl, which has been associated with popcorn lung syndrome", it is not evident at first glance whether the sentence supports or opposes the query. This makes the task more challenging than general-domain stance classification.
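To make the lexical-overlap problem concrete, here is a minimal sketch of our own (it is not part of the task infrastructure; the use of scikit-learn and a tf-idf bag-of-words view is our assumption) showing that the two example sentences above are nearly identical on the surface even though they take opposite stances:

```python
# Minimal sketch (ours, not from the shared task): two sentences with
# opposite stances are nearly identical under a bag-of-words view.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

s1 = "Many studies have found that skin cancer rates are increasing in indoor workers."
s2 = "Very few studies have demonstrated that skin cancer rates are increasing in indoor workers."

vectors = TfidfVectorizer().fit_transform([s1, s2])

# The cosine similarity is high despite the opposite stances, which is why
# surface lexical features alone cannot separate support from opposition.
print(cosine_similarity(vectors[0], vectors[1])[0, 0])
```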
3. TASK DESCRIPTION

Given a CHIS query and a document or set of documents associated with that query, the task is to classify the sentences in the document(s) as relevant to the query or not. The relevant sentences are those from the document which are useful in answering the query. These relevant sentences then need to be further classified as supporting or opposing the claim made in the query.

Example query: "Does daily aspirin therapy prevent heart attack?"

S1: "Many medical experts recommend daily aspirin therapy for preventing heart attacks in people of age fifty and above." [affirmative/Support]

S2: "While aspirin has some role in preventing blood clots, daily aspirin therapy is not for everyone as a primary heart attack prevention method." [disagreement/Oppose]

3.1 Detailed Task Description

There are two tasks:

1. TASK A: Given a CHIS query and a document or set of documents associated with that query, classify the sentences in the document as relevant to the query or not. The relevant sentences are those from the document which are useful in answering the query.

2. TASK B: Further classify each relevant sentence as supporting or opposing the claim made in the query.

Example:

• Query: "Are e-cigarettes safer than normal cigarettes?"

• Retrieved sentence S1: "Because some research has suggested that the levels of most toxicants in vapor are lower than the levels in smoke, e-cigarettes have been deemed to be safer than regular cigarettes." A) Relevant, B) Support

• Retrieved sentence S2: "David Peyton, a chemistry professor at Portland State University who helped conduct the research, says that the type of formaldehyde generated by e-cigarettes could increase the likelihood it would get deposited in the lung, leading to lung cancer." A) Relevant, B) Oppose

• Retrieved sentence S3: "Harvey Simon, MD, Harvard Health Editor, expressed concern that the nicotine amounts in e-cigarettes can vary significantly." A) Irrelevant, B) Neutral

Our task has five consumer health queries. Figures 2 and 3 below present the comprehensive statistics of the CHIS queries released as training and test data, respectively.

Figure 2: Statistics of queries in the training data.

Figure 3: Statistics of queries in the test data.
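To make the two-stage setup concrete, the following is a minimal baseline sketch of our own devising; it is not any participant's system, and the toy data, the scikit-learn pipeline, and all names are illustrative assumptions:

```python
# Hypothetical two-stage baseline for Task A (relevance) and Task B (stance).
# An illustrative sketch under our own assumptions, not any team's system.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

query = "Are e-cigarettes safer than normal cigarettes?"
# Toy training data adapted from the examples above; a real run would use
# the released CHIS training queries instead.
sentences = [
    "levels of most toxicants in vapor are lower than in smoke",
    "formaldehyde generated by e-cigarettes could lead to lung cancer",
    "nicotine amounts in e-cigarettes can vary significantly",
]
relevance = ["relevant", "relevant", "irrelevant"]  # Task A labels
stance = ["support", "oppose"]                      # Task B labels (relevant only)

def fit_stage(texts, labels):
    """One tf-idf + logistic-regression stage; the query is prepended so the
    classifier can condition on it."""
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit([query + " " + t for t in texts], labels)
    return model

task_a = fit_stage(sentences, relevance)   # relevant vs. irrelevant
task_b = fit_stage(sentences[:2], stance)  # support vs. oppose

new = "research suggests vapor contains fewer toxicants than smoke"
if task_a.predict([query + " " + new])[0] == "relevant":
    print(task_b.predict([query + " " + new])[0])
```

In a real run, only sentences the first stage judges relevant would be passed on to the second, stance-classification stage, mirroring the Task A/Task B split.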
4. TASK PARTICIPANTS AND RESULTS

A total of nine teams participated in the task; nine submissions were received for Task A and eight for Task B. Details of the participating teams are shown in Figure 4 below.

Figure 4: Team details.

4.1 Performance of Teams in Task A

Figure 5 below presents the team performance statistics for Task A, i.e., classifying whether a retrieved instance is relevant or irrelevant to a specific query. (This is the final updated result table; individual team working notes may not contain the latest version owing to some late changes.)

Figure 5: Final result table for Task A.

As can be observed, in Task A teams SSN NLP and Fermi secured the top positions with accuracies of 78.10% and 77.04%, respectively. SSN NLP proposed a decision tree model based on sophisticated text features including part-of-speech. They used chi-square feature selection to extract the informative features and reduce the number of spurious ones, and demonstrated that such a feature selection approach can offer a significant gain. Team Fermi used a deep neural network architecture with rectified linear (ReLU) and sigmoid activations over bag-of-phrase features.

Teams JU KS group and Techie-challengers jointly secured the second position in a close call, with 73.39% and 73.03% accuracy, respectively. JU KS group implemented a support vector machine with a polynomial kernel, representing the input with curated text features such as part-of-speech matching and neighborhood matching. Techie-challengers proposed a naive Bayes classifier over an ensemble representation of doc2vec [8] and tf-idf features.

With accuracies of 70.20% and 70.28%, team Amrita Cen and individual participant Jainisha Sankhavara jointly ranked third. Team Amrita Cen used a support vector machine classifier on top of an input representation obtained through word-embedding and keyword generation techniques. Jainisha proposed a classification model based on the BM25 [11] ranking function and a tf-idf based input representation.

Hua Yang approached the task from the perspective of improving understandability in consumer health searches; their information-retrieval-based query expansion module achieved 69.33% accuracy. Team Amrita Fire Cen used a random forest classifier on a distributional semantic representation of the input and obtained 68.12% accuracy; they used non-negative matrix factorization to obtain the distributed representation. Team JNUTH aggregated a range of similarity measures to reach the relevance/irrelevance decision for each input, obtaining 54.84% accuracy.
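As an illustration of the feature-selection idea behind the top Task A entry, here is a minimal sketch in the spirit of the SSN NLP pipeline: a decision tree over chi-square-selected text features. This is our reconstruction under stated assumptions (scikit-learn, simple n-gram counts in place of their richer part-of-speech features, toy data), not the team's actual code:

```python
# Sketch of a decision tree over chi-square-selected text features, in the
# spirit of SSN NLP's Task A system; our reconstruction, not their code.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

texts = [
    "many studies have found sun exposure causes skin cancer",
    "very few studies have demonstrated sun exposure causes skin cancer",
    "aspirin therapy is recommended for preventing heart attacks",
    "the professor expressed concern about varying nicotine amounts",
]
labels = ["relevant", "relevant", "relevant", "irrelevant"]  # toy Task A labels

# chi2 scores each n-gram feature against the labels; SelectKBest keeps only
# the k most informative ones, pruning spurious features before the tree grows.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    SelectKBest(chi2, k=10),
    DecisionTreeClassifier(max_depth=3),
)
model.fit(texts, labels)
print(model.predict(["few studies have found aspirin prevents skin cancer"]))
```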
4.2 Performance of Teams in Task B

Figure 6 below presents the team performances for Task B.

Figure 6: Final result table for Task B.

In Task B, team JNUTH jointly secured the first position with team Fermi. JNUTH used a C-support vector machine classifier with a radial basis function kernel over a tf-idf input representation, followed by max-feature sorting; their model obtained 55.43% accuracy. Team Fermi used a deep neural network architecture with a bag-of-phrase representation to achieve 54.87% accuracy.

With a score of 53.99%, Hua Yang secured the second rank with a model using a naive Bayes classifier and a tf-idf representation. Team Techie-challengers also used a naive Bayes classifier, but on a doc2vec input representation, obtaining 52.47% accuracy and hence the third rank. Team Amrita Fire Cen used a random forest classifier on a distributional semantic representation of the input and obtained 38.53% accuracy. Individual participant Jainisha Sankhavara developed a model based on the BM25 ranking function, obtaining an overall accuracy of 37.96%. Team Amrita Cen modeled the task using a support vector machine classifier with an input representation obtained through word-embedding and keyword generation techniques, obtaining 34.64% accuracy. Team JU KS group modeled the task as a sentiment classification problem; their innovative feature set consists of positive, negative and neutral polarity words along with information from Task A. They achieved an overall accuracy of 33.64%.
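The winning Task B configuration can likewise be sketched: a C-support vector machine with an RBF kernel over tf-idf features. This is a hedged reconstruction of the JNUTH setup as described above; the hyperparameters, the reading of "max-feature sorting" as a cap on vocabulary size, and the toy data are our assumptions, not the team's code:

```python
# Sketch of an RBF-kernel C-SVM over tf-idf features for Task B (support vs.
# oppose); a reconstruction under our own assumptions, not JNUTH's code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

sentences = [
    "toxicant levels in vapor are lower than in cigarette smoke",
    "formaldehyde from e-cigarettes may deposit in the lung",
    "vaping has been deemed safer than regular cigarettes",
    "e-cigarette aerosol could increase the likelihood of lung cancer",
]
stances = ["support", "oppose", "support", "oppose"]

# max_features keeps only the highest-frequency terms, one plausible reading
# of the "max-feature sorting" step; C and gamma are illustrative defaults.
model = make_pipeline(
    TfidfVectorizer(max_features=500),
    SVC(kernel="rbf", C=1.0, gamma="scale"),
)
model.fit(sentences, stances)
print(model.predict(["vapor contains fewer toxicants than smoke"]))
```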
5. CONCLUSION

We thank all the participants for expressing interest in our track. It has been a great experience to witness the innovative models and techniques proposed by the different teams. The CHIS task was surely a challenging one with little preceding literature, and yet, as can be observed from the previous section, both tasks produced close calls among the performances of different teams.

We also express our sincere gratitude to the organizing and program committee of the Forum for Information Retrieval Evaluation (FIRE), 2016, especially Mr. Parth Mehta, for providing us with the opportunity to hold the shared task and to connect with enthusiastic researchers across India and abroad who share the same interests.

In future, we look forward to working again with such expert groups to develop novel solutions to ever more challenging health-care data analytics problems.

6. REFERENCES

[1] R. Agrawal, S. Rajagopalan, R. Srikant, and Y. Xu. Mining newsgroups using networks arising from social behavior. In Proceedings of the 12th International Conference on World Wide Web, WWW '03, pages 529–535, New York, NY, USA, 2003. ACM.
[2] P. Anand, M. Walker, R. Abbott, J. E. F. Tree, R. Bowmani, and M. Minor. Cats rule and dogs drool!: Classifying stance in online debate. In Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, WASSA '11, pages 1–9, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
[3] A. Balahur, Z. Kozareva, and A. Montoyo. Determining the polarity and source of opinions expressed in political debates. In Proceedings of the 10th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing '09, pages 468–480, Berlin, Heidelberg, 2009. Springer-Verlag.
[4] O. Biran and O. Rambow. Identifying justifications in written dialogs. In Proceedings of the 2011 IEEE Fifth International Conference on Semantic Computing, 2011.
[5] Official Google Blog. Google health information knowledge graph, 2015.
[6] A. Faulkner. Automated classification of stance in student essays: An approach using stance target information and the Wikipedia link-based measure. In Proceedings of the Twenty-Seventh International Florida Artificial Intelligence Research Society Conference (FLAIRS), 2014.
[7] W. Ferreira and A. Vlachos. Emergent: a novel data-set for stance classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics, 2016.
[8] Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. In ICML, volume 14, pages 1188–1196, 2014.
[9] T. Mullen and R. Malouf. Taking sides: User classification for informal online political discourse. Internet Research, 18:177–190, 2008.
[10] A. Murakami and R. Raymond. Support or oppose?: Classifying positions in online debates from reply activities and opinion expressions. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, COLING '10, pages 869–875, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
[11] S. Robertson, H. Zaragoza, and M. Taylor. Simple BM25 extension to multiple weighted fields. In Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pages 42–49. ACM, 2004.
[12] P. Sobhani, D. Inkpen, and S. Matwin. From argumentation mining to stance classification. In NAACL HLT 2015, 2015.
[13] S. Somasundaran and J. Wiebe. Recognizing stances in online debates. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, ACL '09, pages 226–234, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.
[14] S. Somasundaran and J. Wiebe. Recognizing stances in ideological on-line debates. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, CAAGET '10, pages 116–124, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
[15] M. Thomas, B. Pang, and L. Lee. Get out the vote: Determining support or opposition from congressional floor-debate transcripts. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP '06, pages 327–335, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics.
[16] M. A. Walker, P. Anand, R. Abbott, and R. Grant. Stance classification using dialogic properties of persuasion. In NAACL HLT '12, pages 592–596, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.
[17] Y.-C. Wang and C. P. Rosé. Making conversational structure explicit: Identification of initiation-response pairs within online discussions. In HLT '10: 2010 North American Chapter of the Association for Computational Linguistics, pages 673–676, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
[18] A. Yessenalina, Y. Yue, and C. Cardie. Multi-level structured models for document-level sentiment classification. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP '10, pages 1046–1056, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
[19] E. Yom-Tov and L. Fernández-Luque. Information is in the eye of the beholder: Seeking information on the MMR vaccine through an internet search engine. In AMIA 2014, American Medical Informatics Association Annual Symposium, Washington, DC, USA, November 15–19, 2014.