<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Raksha Sanjay Jalan</string-name>
          <email>jalan.raksha@research.iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pattisapu Nikhil Priyatam</string-name>
          <email>nikhil.pattisapu@research.iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vasudeva Varma</string-name>
          <email>vv@iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Search and Information Extraction Lab, IIIT Hyderabad</institution>
          ,
          <addr-line>Hyderabad</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>5</lpage>
      <abstract>
        <p>The World Wide Web acts as one of the major sources of information for health related questions. However, there are often multiple conflicting answers to a single question, and it is hard to come up with "a single best correct answer". Therefore, it is highly desirable to identify conflicting perspectives about a particular question (or topic). In this paper, we describe our participation in the Consumer Health Information System (CHIS) task at FIRE 2016. There were two sub-tasks in this contest. The first sub-task deals with identifying whether a particular answer is relevant to a given question. The second sub-task deals with detecting whether a particular answer agrees with or refutes the claim posed in a given question. We pose both these tasks as supervised pair classification tasks. We report our results for various document representations and classification algorithms.</p>
      </abstract>
      <kwd-group>
        <kwd>Pair classification tasks</kwd>
        <kwd>document representations</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>Most of the research developments in the area of Question
Answering (QA), as fostered by TREC, have so far focused
on open-domain QA systems. Recently, however, the field
has witnessed a growing interest in restricted-domain QA.</p>
      <p>The health domain is one of the most information-critical
domains in need of intelligent Question Answering systems
that can effectively aid medical researchers and health care
professionals in their daily information search.</p>
      <p>The proposed CHIS task investigates complex health
information search in scenarios where users search for health
information with more than just a single correct answer, and
look for multiple perspectives from diverse sources both from
medical research and from real world patient narratives.</p>
      <p>Given a CHIS query and a document (or set of documents)
associated with that query, the task is to classify the sentences in
the document as relevant to the query or not. The relevant
sentences are those from the document which are useful in
answering the query. These relevant sentences
need to be further classified as supporting the claim made
in the query, or opposing the claim made in the query.</p>
      <p>We pose both these problems as pair classification tasks:
given a (question, answer) pair, the system has to
judge whether or not the answer is relevant to the query
and, if so, whether or not it supports the claim made in the
query. Consider the following example.
Question: Are e-cigarettes safer than normal cigarettes?
Sentence 1: Because some research has suggested that the
levels of most toxicants in vapor are lower than the levels in
smoke, e-cigarettes have been deemed to be safer than
regular cigarettes.</p>
      <p>Sentence 2: David Peyton, a chemistry professor at
Portland State University who helped conduct the research, says
that the type of formaldehyde generated by e-cigarettes could
increase the likelihood it would get deposited in the lung,
leading to lung cancer.</p>
      <p>Sentence 3: Harvey Simon, MD, Harvard Health Editor,
expressed concern that the nicotine amounts in e-cigarettes
can vary significantly.</p>
      <p>In the above example, Sentence 1 is relevant and supports
the claim made in the question, Sentence 2 is relevant but
refutes the claim made in the question, and Sentence 3 is
irrelevant to the question. For both tasks, we used K-fold
cross validation to evaluate our results.</p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORK</title>
      <p>Our proposed method solves the question answering task as
a classification task. A lot of research work has been done on
text categorization.</p>
      <p>
        Text representation is one of the key factors that affects
the performance of a classifier. The Paragraph Vector
algorithm by Le and Mikolov [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], also termed paragraph2vec, is
a powerful method to find suitable vector representations
for sentences, paragraphs and documents of variable length.
The algorithm tries to find embeddings for separate words
and paragraphs at the same time, through a procedure
similar to word2vec. De Boom, Van Canneyt et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
were the first to come up with a hybrid method for short text
representations that combines the strength of dense
distributed representations with the strength of tf-idf based
methods to automatically reduce the impact of less
informative terms. According to this paper, the combination of word
embeddings and tf-idf information leads to a better model
for semantic content within short text fragments.
Ruiz and Srinivasan [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] presented the
design and evaluation of a text categorization method based
on the Hierarchical Mixture of Experts model. This model
used a divide and conquer principle to define smaller
categorization problems based on a predefined hierarchical
structure. The final classifier was a hierarchical array of
neural networks. They showed that the use of the
hierarchical structure improves text categorization performance
with respect to an equivalent flat model.
      </p>
      <p>
        Dumais et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] experimented with different automatic
learning algorithms for text classification. Each document is
represented as a vector of words, as done in the vector space
representation of information retrieval [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. These vectors are then fed
to different classifiers for text categorization. Their experiments
showed that the linear Support Vector Machine (SVM) was
more promising than other classifiers on their
dataset. For our task, however, Naive Bayes outperformed the other classifiers.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. APPROACH</title>
      <p>In the pair classification task, i.e. categorizing the pair
(q_m, a_n), we create two labeled datasets for each query as
shown below.</p>
      <p>RelevanceDataset_qm = {(a_n, 1) : a_n is relevant to q_m} ∪ {(a_n, 0) : a_n is not relevant to q_m} (1)</p>
      <p>ClaimDataset_qm = {(a_n, 1) : a_n supports the claim made in q_m} ∪ {(a_n, 0) : a_n refutes the claim made in q_m} ∪ {(a_n, 2) : a_n is neutral to the claim made in q_m} (2)</p>
      <p>Note that we could use the above dataset creation
technique only because the number of questions was fixed and
known in advance.</p>
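      <p>A minimal sketch of this dataset construction (the record fields and toy labels here are illustrative, not from the released data):
```python
# Build per-query datasets for the two sub-tasks from annotated
# (question, answer) pairs. Labels are illustrative:
#   relevance: 1 = relevant, 0 = irrelevant
#   claim:     1 = supports, 0 = refutes, 2 = neutral
annotated = [
    {"query": "q1", "answer": "a1", "relevant": 1, "claim": 1},
    {"query": "q1", "answer": "a2", "relevant": 1, "claim": 0},
    {"query": "q1", "answer": "a3", "relevant": 0, "claim": 2},
]

def build_datasets(rows):
    relevance, claim = {}, {}
    for r in rows:
        # every answer enters the relevance dataset for its query
        relevance.setdefault(r["query"], []).append((r["answer"], r["relevant"]))
        # only relevant answers are further labeled for the claim sub-task
        if r["relevant"]:
            claim.setdefault(r["query"], []).append((r["answer"], r["claim"]))
    return relevance, claim

relevance, claim = build_datasets(annotated)
```
</p>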
      <p>We observed that labels were highly imbalanced in both
datasets, with a larger number of positive examples and fewer
negative examples. We used oversampling and undersampling
based techniques to mitigate this problem; for oversampling
we used the Synthetic Minority Over-sampling Technique
(SMOTE). After creating the datasets, we split the data
into train and test sets. We use tf-idf, doc2vec and
ensemble based representations to represent each answer (or
sentence), and train multiple supervised algorithms on each
of the above mentioned datasets.</p>
      <p>3.1 TF-IDF</p>
      <p>The TF-IDF representation is one of the most well established
document representation techniques in the field of text mining.
This kind of representation captures syntactic
similarities, as in the example (Is cancer curable?, Chemotherapy is
often used to cure cancer). However, TF-IDF based
representations are not efficient at capturing the semantic
similarities between sentences, as in the example: Does sun
exposure cause skin cancer?, Exposure to UV rays from the
sun or tanning beds is the most preventable risk factor for
melanoma. Note that melanoma and cancer are highly
similar concepts, but their similarity is not captured in the TF-IDF
representation. We therefore also experiment with
representations that are good at capturing the semantic relations
between texts. We have used the TF-IDF implementation of
scikit-learn.</p>
      <p>3.2 Doc2Vec</p>
      <p>
        Recently, Word2Vec [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] based models have been exploited
heavily for several tasks that require capturing semantic
relatedness between texts. Doc2Vec [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is one such model, which
is trained on huge text corpora for the task of word
prediction. The doc2vec algorithm has two variants: Distributed
Memory (DM) and Distributed Bag of Words (DBoW). For
this work, we use the Distributed Memory (DM) based model
due to its superior performance in previously reported tasks.
The architecture of DM is shown in figure 1.
      </p>
      <p>[Figure 1: Architecture of the Distributed Memory (DM) model: the document vector is concatenated with the vectors of the surrounding context words (input and projection layers) to predict the target word at the output layer.]</p>
      <p>The problem with doc2vec, or any other neural network
based model, is that it requires a huge amount of training
data. The main reason for this is the large number of
parameters that need to be learnt. In the
doc2vec model shown in figure 1, the vector
representations of 4 context words, the document representation, and the neural network
weights all have to be learnt. The number of sentences
available in the CHIS task is too low for such representation learning
schemes. To address this issue, we chose pre-trained word
vectors, which already capture semantic relatedness between
words to a large extent.</p>
      <p>Although Google released word vectors trained on the Google
News corpus using the word2vec algorithm, we did not choose
these vectors, as the number of hits was too low. The main
reason for this is the difference in domain: many words in
the health care domain found in the CHIS dataset were not
present in the Google News dataset. We therefore used the
vectors released by Pyysalo et al., who trained the word2vec
algorithm on the PubMed corpus. We used Gensim's
implementation of Doc2Vec1.</p>
    </sec>
    <sec id="sec-4">
      <title>3.3 Ensemble Representation</title>
      <p>1https://radimrehurek.com/gensim/models/doc2vec.html</p>
      <p>In order to capture both the syntactic and semantic
similarities efficiently, we use an ensemble approach: for
each sentence we obtain its TF-IDF and doc2vec
representations (from the previous sections) and concatenate
them to form an ensemble representation.</p>
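      <p>The concatenation itself is straightforward; a sketch with illustrative dimensions (the 400-d doc2vec size matches the experiments below, the toy TF-IDF vocabulary is made up):
```python
def ensemble_representation(tfidf_vec, doc2vec_vec):
    """Concatenate a TF-IDF vector and a dense doc2vec vector
    into one combined feature vector."""
    return list(tfidf_vec) + list(doc2vec_vec)

tfidf_vec = [0.0, 0.31, 0.0, 0.54]  # toy vocabulary of 4 terms
doc2vec_vec = [0.02] * 400          # 400-d document embedding
combined = ensemble_representation(tfidf_vec, doc2vec_vec)
```
The classifier then sees both the exact-term evidence (TF-IDF dimensions) and the semantic evidence (embedding dimensions) in a single feature space.</p>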
    </sec>
    <sec id="sec-5">
      <title>4. DATASET</title>
      <p>The CHIS dataset consists of 5 health related queries and
5 files containing labeled sentences for the respective queries.
Each sentence has two associated labels:</p>
      <p>Relevance Label (Relevant or Irrelevant)</p>
      <p>Support Variable (Support, Oppose or Neutral)</p>
      <p>The queries are of the following formats, where A and B
represent medical entities.</p>
      <sec id="sec-5-1">
        <title>Does A cause B?</title>
      </sec>
      <sec id="sec-5-2">
        <title>Does A cure B?</title>
      </sec>
      <sec id="sec-5-3">
        <title>Is A better than B?</title>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. EXPERIMENTS</title>
      <p>We used a document embedding size of 400 for all
experiments involving doc2vec; the word embedding size obtained
using word2vec was 200. We used Python's sklearn
library to realize the SVM and Naive Bayes algorithms, and
realized a neural network using the Keras library2 with Theano
as the backend. We used sigmoid as the activation function and
Binary Cross Entropy (BCE) as the loss function. Data is fed to
the network in mini-batches with a mini-batch size of 32.
We use 10-fold cross validation to evaluate all our results.</p>
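      <p>The 10-fold cross validation used for evaluation can be sketched in plain Python (a contiguous split analogous to scikit-learn's KFold, shown here for illustration):
```python
def k_fold_indices(n_samples, k=10):
    """Split range(n_samples) into k near-equal contiguous folds;
    each fold serves once as the test set."""
    base, extra = divmod(n_samples, k)
    fold_sizes = [base + 1] * extra + [base] * (k - extra)
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    # pair each fold (test set) with the union of the other folds (train set)
    splits = []
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, test))
    return splits

splits = k_fold_indices(25, k=10)
```
Every sample appears in exactly one test fold, so the reported accuracy averages over predictions made on held-out data only.</p>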
    </sec>
    <sec id="sec-7">
      <title>6. RESULTS</title>
      <p>In this section we present the results of various document
representations and classi cation algorithms for both the
CHIS subtasks: predicting relevant answers and predicting
whether or not a given answer supports the claim made in
the question.</p>
      <sec id="sec-7-1">
        <title>Query Name Skin Cancer MMR HRT</title>
        <p>E-cigarettes</p>
        <p>Vitamin C
Average Accuracy</p>
        <p>Neural Network
14.62
8.45
10.11
17.79
6.05
11.404
2https://keras.io/keras-deep-learning-library-for-theanoand-tensor ow</p>
      </sec>
      <sec id="sec-7-2">
        <title>Query Name Skin Cancer MMR HRT</title>
        <p>E-cigarettes</p>
        <p>Vitamin C
Average Accuracy</p>
      </sec>
      <sec id="sec-7-3">
        <title>Query Name Skin Cancer MMR HRT</title>
        <p>E-cigarettes</p>
        <p>Vitamin C
Average Accuracy
Neural Network
28.66
12.35
15.92
20.81
19.76
19.5</p>
      </sec>
      <sec id="sec-7-4">
        <title>Query Name Skin Cancer MMR HRT</title>
        <p>E-cigarettes</p>
        <p>Vitamin C
Average Accuracy</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>7. CONCLUSION AND FUTURE WORK</title>
      <p>In this work, we have designed algorithms to detect whether an
answer is relevant to a particular health query and whether
or not it supports the claim made in the query. We pose both
these tasks as classification tasks. We experimented with
a combination of several document representation schemes
and classification algorithms. We note that the Naive Bayes
classifier outperformed the other classification algorithms by
a significant margin. We obtained an average accuracy of 73.03%
in sub-task 1 and 52.46% in sub-task 2. We additionally
note that our model predicted results with the highest
accuracy for the MMR query. The choice of training one classifier per
query also gave superior performance compared to
training one classifier per class. We observed that our model's
performance is highly sensitive to the quality of the
pre-trained word vectors and the choice of classifier.</p>
      <p>
        We wish to further extend this work by obtaining
pre-trained word vectors using other neural network based
algorithms such as GloVe [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], Skip-Thought [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the Deep Structured
Semantic Model (DSSM) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and Convolutional Deep Structured
Semantic Models (CDSSM) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. We also wish to use these
algorithms to obtain richer document representations.
In this work, we have trained one classifier per query, but
such a setting is not feasible for building real applications,
where the queries are not known in advance. In such
scenarios we wish to categorize queries and train a single classifier
per query category.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>De Boom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Van Canneyt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bohez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Demeester</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Dhoedt</surname>
          </string-name>
          .
          <article-title>Learning semantic similarity for very short texts</article-title>
          .
          <source>In 2015 IEEE International Conference on Data Mining Workshop (ICDMW)</source>
          , pages
          <fpage>1229</fpage>
          -
          <lpage>1234</lpage>
          . IEEE,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dumais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Platt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Heckerman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahami</surname>
          </string-name>
          .
          <article-title>Inductive learning algorithms and representations for text categorization</article-title>
          .
          <source>In Proceedings of the seventh international conference on Information and knowledge management</source>
          , pages
          <fpage>148</fpage>
          -
          <lpage>155</lpage>
          . ACM,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.-S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Acero</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Heck</surname>
          </string-name>
          .
          <article-title>Learning deep structured semantic models for web search using clickthrough data</article-title>
          .
          <source>In Proceedings of the 22nd ACM international conference on Conference on information &amp; knowledge management</source>
          , pages
          <fpage>2333</fpage>
          -
          <lpage>2338</lpage>
          . ACM,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kiros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zemel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Urtasun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Fidler</surname>
          </string-name>
          .
          <article-title>Skip-thought vectors</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>3294</fpage>
          -
          <lpage>3302</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <article-title>Distributed representations of sentences and documents</article-title>
          . In ICML, volume
          <volume>14</volume>
          , pages
          <fpage>1188</fpage>
          -
          <lpage>1196</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Corrado</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          . Glove:
          <article-title>Global vectors for word representation</article-title>
          .
          <source>In EMNLP</source>
          , volume
          <volume>14</volume>
          , pages
          <fpage>1532</fpage>
          -
          <lpage>43</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Srinivasan</surname>
          </string-name>
          .
          <article-title>Hierarchical text categorization using neural networks</article-title>
          .
          <source>Information Retrieval</source>
          ,
          <volume>5</volume>
          (
          <issue>1</issue>
          ):
          <fpage>87</fpage>
          -
          <lpage>118</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Salton</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Buckley</surname>
          </string-name>
          .
          <article-title>Term-weighting approaches in automatic text retrieval</article-title>
          .
          <source>Information processing &amp; management</source>
          ,
          <volume>24</volume>
          (
          <issue>5</issue>
          ):
          <fpage>513</fpage>
          -
          <lpage>523</lpage>
          ,
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Deng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Mesnil</surname>
          </string-name>
          .
          <article-title>Learning semantic representations using convolutional neural networks for web search</article-title>
          .
          <source>In Proceedings of the 23rd International Conference on World Wide Web</source>
          , pages
          <fpage>373</fpage>
          -
          <lpage>374</lpage>
          . ACM,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>