Relevance Detection and Argumentation Mining in Medical
                   Vijayasaradhi Indurthi                                             Subba Reddy Oota
                   IIIT Hyderabad, India                                             IIIT Hyderabad, India
              vijaya.saradhi@students.iiit.ac.in                                  oota.subba@students.iiit.ac.in

ABSTRACT
In this paper we describe a method to determine the relevance of a
sentence in a document to a query in the medical domain. We also
describe a method to determine whether the given sentence supports
the query, opposes the query or is neutral with respect to the query.
This work is part of the CHIS shared task at FIRE 2016.

Keywords
Information retrieval, argument mining, relevancy detection

1. INTRODUCTION
The World Wide Web is increasingly being used by consumers as an aid
for health decision making and for self-management of chronic
illnesses, as evidenced by the fact that one in every 20 searches on
Google is about health. Information access mechanisms for factual
health information retrieval have matured considerably, with search
engines providing fact-checked Health Knowledge Graph search results
to factual health queries. It is straightforward to get an answer to
the query “what are the symptoms of Diabetes” from the search
engines. However, retrieval of relevant multiple perspectives for
complex health search queries which do not have a single definitive
answer still remains elusive with most of the general purpose search
engines. The presence of multiple perspectives with different grades
of supporting evidence (which changes dynamically over time due to
the arrival of new research and practice evidence) makes it all the
more challenging for a lay searcher.

     We use the term “Consumer Health Information Search” (CHIS) to
denote such information retrieval search tasks, for which there is
“No Single Best Correct Answer”; instead, multiple and diverse
perspectives/points of view (which very often are contradictory in
nature) are available on the web regarding the queried information.
The goal of the CHIS track is to research and develop techniques to
support users in complex multi-perspective health information
queries.
     Given a CHIS query, and a document/set of documents associated
with that query, the FIRST task is to classify the sentences in the
document as relevant to the query or not [4]. The relevant sentences
are those from that document which are useful in providing the answer
to the query. The SECOND task is to classify these relevant sentences
as supporting the claim made in the query, or opposing the claim made
in the query [4].
Example query: Does daily aspirin therapy prevent heart attack?

Sentence 1: “Many medical experts recommend daily aspirin therapy for
preventing heart attacks in people of age fifty and above.”
[Affirmative/Support]

Sentence 2: “While aspirin has some role in preventing blood clots,
daily aspirin therapy is not for everyone as a primary heart attack
prevention method”. [Disagreement/Oppose]

3. DESCRIPTION
For the shared tasks described above, we adopt a deep learning
approach. Deep learning is a method which allows computers to learn
from experience and understand the world in terms of a hierarchy of
concepts, with each concept defined in terms of its relation to
simpler concepts. By gathering knowledge from experience, this
approach avoids the need for human operators to formally specify all
of the knowledge that the computer needs. The hierarchy of concepts
allows the computer to learn complicated concepts by building them
out of simpler ones. We train a deep neural network on the sentences.

      Figure 1. The architecture of a deep neural network

The problems described above are modeled as a supervised learning
task [1][4]. For a given query, we have been given a document
consisting of a set of sentences. For each sentence we have been
provided with the ground truths, i.e. whether the sentence is
relevant to the query, and whether the sentence supports, opposes or
is neutral to the query. We have trained a deep neural network [2]
for this supervised learning task.

4. FEATURES
We have selected a binary bag-of-phrases [3] representation of the
document. Since not all words in a sentence are relevant, we have
identified the most important features manually and used these
phrases to create the feature matrix. Some of the features included
the presence of supporting words like ‘evidence’, ‘cause’, ‘exhibit’,
‘abnormal’, ‘nonetheless’. Opposing words like ‘oppose’, ‘does not’,
‘least’, ‘less’, ‘nothing’, ‘harmless’ were also used as features, as
these words contribute to determining that the sentence opposes the
given query. If a feature phrase is present in the given text, the
value of that feature is 1. Otherwise, the value of the feature is 0.
All our features are binary. In the preprocessing phase, all text in
upper case was converted to lower case and all numbers were deleted.
Some of the feature words and phrases are listed in Table 1.
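The binary bag-of-phrases encoding described in Section 4 can be
illustrated with a minimal sketch. This is not the authors' code: the
function names, the short phrase list, the example sentence and the
choice of whole-word matching are all assumptions made for
illustration. Only the lower-casing, the deletion of numbers and the
0/1 feature values come from the text.

```python
import re

def preprocess(text):
    """Lower-case the text and delete all digits, as described in Section 4."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)
    # Collapse the whitespace left behind by deleted numbers.
    return re.sub(r"\s+", " ", text).strip()

def phrase_features(sentence, phrases):
    """Return one binary feature per phrase: 1 if present, 0 otherwise.

    Whole-word matching is an assumption; the paper does not say whether
    phrases were matched as substrings or as whole words.
    """
    cleaned = preprocess(sentence)
    features = []
    for phrase in phrases:
        pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
        features.append(1 if re.search(pattern, cleaned) else 0)
    return features

# Hypothetical subset of the phrases in Table 1.
PHRASES = ["evidence", "does not", "harmless", "side effect", "prevent"]

# Hypothetical sentence, not taken from the CHIS data.
sentence = "Daily aspirin DOES NOT prevent 100 heart attacks"
vector = phrase_features(sentence, PHRASES)
print(vector)
```

Stacking one such vector per sentence gives the feature matrix, with
as many columns as there are phrases for the query.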
        Table 1. Some relevant phrases used as features

  Increase        Intense         Evidence        Harmful
  However         Nonetheless     Oppose          Does not
  Safe            Healthier       Harmless        Decreased
  Inversed        Weak            Deadly          Cancer
  Disease         Overdose        Dangerous       Risk
  Adverse         Hazard          Poison          Prohibit
  Overdose        Irritate        How safe        Associated
  Suppress        Side effect     Oppose          Disorder
  Incidence       Deficit         Though          Whereas
  Nonetheless     Shorten         Reduce          Prevent
  Protect         Wards off       Effective       Fewer
  Questionable    Benefit         Disagree        Unsupported
  Not             Inconclusive    Unjustified     Myth
  Viral           Evidence        No increase     Good choice
  Flawed          Counteract      Lessen          Cause pain
  Still high      Effective       Bothersome      No longer
  Inadvisable     Strengthens     Lessens         Fighting
  Unlikely        Still high      Good choice     Alarming

Table 2 shows the number of features used for each dataset.

        Table 2. Features used for each dataset

  Query           Number of Features
  Q1-Skincare             81
  Q2-MMR                  64
  Q3-HRT                 105
  Q4-ECIG                 95
  Q5-Vit C               124

5. ARCHITECTURE
We use a deep neural network for both tasks. The input layer has as
many neurons as there are input features. Task 1 is a binary
classification problem, indicating whether the sentence is relevant
to the query or not. Task 2 is a multi-class classification problem,
which indicates whether the sentence supports, opposes or is neutral
to the query. Table 3 shows the architecture of the neural network
for both of the CHIS tasks [2][5].

        Table 3. Neural Architectures for CHIS tasks 1 and 2

  Task      Hidden Layers   #Neurons in Hidden layer   Activations
  Task 1          2                 120, 8             relu, sigmoid
  Task 2          2                150, 150            tanh, tanh

For Task 1, the classification is a binary classification problem,
trained with a binary cross-entropy loss at the output. For Task 2,
it is a multi-class classification problem, and hence a softmax layer
is used at the output. For training the deep neural network, we used
Keras. Keras is an open source neural network library written in
Python. It is capable of running on top of either TensorFlow or
Theano. Designed to enable fast experimentation with deep neural
networks, it focuses on being minimal, modular and extensible. We
train both neural networks for 150 epochs to reach convergence.

6. RESULTS
The following are the results obtained on the test set. Table 4 shows
the average precision, recall and F1 score of the classifier for
Task 1. Table 5 shows the average precision, recall and F1 score of
the classifier for Task 2.

        Table 4. Task 1 precision, recall and F1-score on test set

  Task            Precision    Recall    F1-score
  Q1-Skincare       0.80        0.78       0.78
  Q2-MMR            0.84        0.79       0.81
  Q3-HRT            0.89        0.89       0.89
  Q4-ECIG           0.79        0.66       0.68
  Q5-Vit C          0.73        0.73       0.71

        Table 5. Task 2 precision, recall and F1-score on test set

  Task            Precision    Recall    F1-score
  Q1-Skincare       0.76        0.74       0.75
  Q2-MMR            0.55        0.45       0.47
  Q3-HRT            0.66        0.54       0.53
  Q4-ECIG           0.54        0.52       0.52
  Q5-Vit C          0.52        0.50       0.49

7. OBSERVATIONS
Predicting the relevance of a sentence and determining whether it
supports the given query is not a trivial problem and needs knowledge
of Natural Language Processing and Information Retrieval techniques.
In this paper we proposed a fast deep learning method to predict both
using a deep neural network. We observe that the average precision
for Task 1 is 77.03% and for Task 2 is 54.86%. Task 2 is a
multi-class problem and is more difficult than Task 1.

8. FUTURE WORK
In this paper, we have used a select set of phrases as features.
Since both the sentences and the query are short text segments,
Natural Language Processing features such as POS tags could be added
to the existing features to improve the precision and recall [6].
Although we have identified the features manually, the features could
have been found by selecting the adjectives and adverbs using any of
the existing NLP toolkits. This would make the solution scalable and
generic, so that it can be applied to other similar datasets.
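As a rough illustration of the layer shapes listed in Table 3, the
following sketch runs a forward pass through both networks in plain
Python. It is not the authors' Keras code: the weights are random and
untrained, all function names are hypothetical, and the single
sigmoid output unit for Task 1 and the three-unit softmax output for
Task 2 are assumptions based on the description of the two tasks. The
input size of 81 is the Q1 feature count from Table 2.

```python
import math
import random

def relu(v):
    return [max(0.0, z) for z in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-z)) for z in v]

def tanh(v):
    return [math.tanh(z) for z in v]

def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in v]
    total = sum(exps)
    return [e / total for e in exps]

def init_network(layer_sizes, rng):
    """Create small random weights and zero biases for each layer."""
    params = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        params.append((weights, biases))
    return params

def forward(x, params, activations):
    """Apply each fully connected layer in turn: activation(W x + b)."""
    for (weights, biases), act in zip(params, activations):
        x = act([sum(w * xi for w, xi in zip(row, x)) + b
                 for row, b in zip(weights, biases)])
    return x

rng = random.Random(0)
n_features = 81  # Q1-Skincare feature count from Table 2

# Task 1: hidden layers of 120 (relu) and 8 (sigmoid) neurons,
# followed by an assumed single sigmoid output unit.
task1_params = init_network([n_features, 120, 8, 1], rng)
task1_acts = [relu, sigmoid, sigmoid]

# Task 2: two hidden layers of 150 tanh neurons, followed by an assumed
# softmax over three classes (support, oppose, neutral).
task2_params = init_network([n_features, 150, 150, 3], rng)
task2_acts = [tanh, tanh, softmax]

x = [rng.randint(0, 1) for _ in range(n_features)]  # random binary features
p_relevant = forward(x, task1_params, task1_acts)[0]
p_classes = forward(x, task2_params, task2_acts)
print(p_relevant, p_classes)
```

With untrained weights the outputs carry no meaning; the sketch only
shows how a binary feature vector flows through the two layer
configurations and what shape each output takes.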
9. CODE
All the code is available at https://github.com/saradhix/chis for
research and academic purposes.

[4] Andrenucci, A., 2008. Automated Question-Answering Techniques
    and the Medical Domain. In HEALTHINF (2) (pp. 207-212).
