[CEUR Workshop Proceedings Vol-1737, paper T5-2 — https://ceur-ws.org/Vol-1737/T5-2.pdf — dblp: https://dblp.org/rec/conf/fire/VGMP16]
     AMRITA_CEN@FIRE 2016: Consumer Health Information
      Search using Keyword and Word Embedding Features

                        Veena P V, Remmiya Devi G, Anand Kumar M, Soman K P
                              Center for Computational Engineering and Networking (CEN)
                                       Amrita School of Engineering, Coimbatore
                                 Amrita Vishwa Vidyapeetham, Amrita University, India


ABSTRACT
This work was submitted to the Consumer Health Information Search (CHIS) Shared Task at the Forum for Information Retrieval Evaluation (FIRE) 2016. Information retrieved from the web should contain content relevant to the user's search; hence the major task is to retrieve only the documents relevant to the user's query. The given task further refines the classification into three categories of relevance: support, oppose and neutral. A user reading a web article needs to know whether its content supports or opposes the title of the article, which is a challenging problem for an automatic system. Our proposed system is based on the combination of keyword-based features and word-embedding-based features. Sentences are classified by a machine-learning classifier, the Support Vector Machine (SVM).

CCS Concepts
•Information systems → Clustering and classification; •Applied computing → Health care information systems;

Keywords
Word embedding, Machine Learning, Support Vector Machine (SVM)

1.   INTRODUCTION
   Natural Language Processing plays a vital role in interpreting human language in a format a system can understand. One application of this is delivering the most relevant information through web search. Nowadays, health information is an essential need for many people, and research and evidence in this field are growing rapidly. If an individual searches the web for a health-related query, a large number of documents will be retrieved, and the quality of the search lies in how relevant the retrieved documents are to the query. The main objective of the proposed system is that a person should benefit from web search irrespective of his or her lack of domain knowledge.
   In past years, many developments have been made in the efficient retrieval of relevant clinical data. A published paper discusses the role of shared tasks in overcoming barriers to NLP in the clinical domain [4]. The Medical Records Track, conducted by the Text Retrieval Conference (TREC) in 2011, compared algorithms for text retrieval in clinical studies [12]. The ShARe/CLEF eHealth evaluation lab conducted a task to analyze the effect of additional information, such as discharge summaries, and of external resources, such as medical ontologies, on information retrieval [5]. PubMed, an archive of biomedical journals, and Google Scholar were compared for retrieval efficiency on clinical searches in [10]. Embedding features can also be used efficiently for entity extraction, which in turn helps in retrieving precise documents. Several entity-extraction methods were proposed in various FIRE tasks, such as Conditional Random Field (CRF) based entity extraction [9], entity extraction for Indian languages using an SVM-based classifier [2], and Named Entity Recognition for Indian languages using rich features [1]. A paper was also published on extracting entities for Malayalam using structured skip-gram based embeddings [8].
   The proposed system finds sentences relevant to a given query and determines whether each one is a supporting, opposing or neutral sentence. An example from the training data for the query 'Skin Cancer' is given in Table 1.
   The given dataset, along with some additionally collected clinical documents from the web, was subjected to unsupervised feature extraction. Two approaches, keyword-based and word-embedding-based information search, were carried out, and the integration of the two feature sets achieved better results. Our proposed system is therefore based on their combination: the embedding vectors, the keyword feature vector and the corresponding Relevant/Irrelevant and Support/Oppose/Neutral labels were given together to train the classifier. A well-known machine-learning classifier, the Support Vector Machine (SVM), is used for the classification task. Section 2 describes the task. Section 3 details the dataset used. Section 4 discusses our proposed methodology. Experiments and results are explained in Section 5, and Section 6 concludes the paper.

2.   TASK DESCRIPTION
   Our system was submitted to Consumer Health Information Search at FIRE 2016 [11]. The task comprises two subtasks. The first is to classify the sentences in a document as relevant or irrelevant to the query (R/IR). The second is to further classify the sentences into Support, Oppose and Neutral (S/O/N) with respect to the query. The dataset contains 5 queries, say Qskin,
                                         Table 1: Example from training data
                                                Sentences                              R/IR     S/O/N
                        Most skin cancers are caused by exposure to the sun.            R         S
                        Skin cancer can look many different ways.                       IR        N
                        Evidence shows that the Sun Protects You from Melanoma.         R         O


                             Table 2: Number of sentences in Train Data and Test Data
                                               Train Data
                               Relevant                         Irrelevant
              Query        Support Oppose Neutral Total   Support Oppose Neutral Total   Test Data
          Skin Cancer        104     76     13     193       0      2     146     148       88
          E-cigarette         93    165     35     293       0      0     120     120       64
          MMR-Vaccine         72     92     44     208       0      0      51      51       58
          Vitamin-C          111     68     29     208       0      0      70      70       74
          Women-HRT           41    132     31     204       1      4      37      42       72
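
As a rough sketch of the two-stage labelling above: each sentence carries a label L1 ∈ {R, IR} and a label L2 ∈ {S, O, N}, and (as described later in Sections 4 and 5) the R/IR label predicted in task 1 is folded into the feature vector used for task 2. The function and variable names below are illustrative, not from the paper, and the classifier itself (an SVM in the paper) is omitted:

```python
# Sketch of the stage-2 feature assembly: sentence embedding vector +
# S/O/N keyword presence vector + the R/IR label predicted in stage 1.
# All names are hypothetical; the paper trains one SVM model per query.

def stage2_features(embedding_vec, keyword_vec, predicted_r_ir):
    """Concatenate the feature groups; encode R as 1 and IR as 0."""
    return embedding_vec + keyword_vec + [1 if predicted_r_ir == "R" else 0]

# Toy sentence for Q_skin: a short embedding (the paper uses 100 dims),
# a keyword presence vector, and the stage-1 prediction.
emb = [0.12, -0.40, 0.33, 0.05]
kw = [1, 0, 0, 1, 0, 0]
features = stage2_features(emb, kw, "R")
# This vector, with the gold S/O/N label, would be given to the SVM.
```

The point of the design is that stage 2 can condition on stage 1's decision without re-running retrieval: an irrelevant sentence is, per the paper's own analysis, most probably neutral.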


Qecig, Qmmr, Qvitc and Qhrt. Each query contains 200 to 400 sentences. For a query, say Qskin, each sentence in that particular document is classified for relevant/irrelevant and support/oppose/neutral (the two given tasks) with labels L1 and L2 such that L1 ∈ { R, IR } and L2 ∈ { S, O, N }. Thus, each sentence qi in the document of Q has two labels: l1 denoting relevant or irrelevant and l2 denoting support, oppose or neutral. Relevant sentences are useful in providing an answer for the given query. Classification in task 2 is carried out with the help of the label predicted in task 1.

3.   DATASET DESCRIPTION
   The training data given for this task holds 5 queries Q1, Q2, Q3, Q4 and Q5, corresponding to Skin Cancer, E-cigarette, MMR-Vaccine, Vitamin-C and Women-HRT. Each query contains sentences under two categories, Relevant and Irrelevant (R/IR tag), and the number of relevant and irrelevant sentences differs for each query. The sentences are further categorized into Support, Oppose and Neutral (S/O/N tag). The individual counts of R/IR and S/O/N sentences for each query are tabulated in Table 2. An additional dataset related to these 5 queries was collected from online resources; its sentence count for each query is given in Table 3. Due to time constraints, we limited the additional collection to around 1000 sentences per query.
   Analysis of the given training data showed that if a sentence is irrelevant to a query, it is most probably a neutral sentence with respect to that query. When generating embedding features, the training data and the additional dataset are both used to train the word embedding model.

        Table 3: Additional Dataset Collected
               Query         Additional dataset
           Skin Cancer              1044
           E-cigarette              1003
           MMR-Vaccine              1084
           Vitamin-C                1199
           Women-HRT                1469

4.   METHODOLOGY
   Our proposed system is based on the combination of word-embedding and keyword-based features. Word embedding features are word vectors obtained using the word2vec tool [7]. In this case the input is the sentences of each query, so we need embedding features for sentences, which are derived from the word2vec features. The input to word2vec includes the training data and the additional dataset collected from online resources. The vector size is set to 100, and the skip-gram model is chosen to train word2vec. The embedding features resulting from word2vec are used to generate an embedding feature for each sentence as in Eq. (1) [6]:

               y = a + T h(D, w_{t-k}, ..., w_{t+k}; W)       (1)

   where a and T are the softmax parameters, h is the combination of word and sentence embedding features, W stands for the word vectors and D for the sentence vectors. These embedding features constitute the feature set for the word-embedding approach.
   The second feature set in our methodology is keyword features. Keywords are extracted from the dataset given for training and testing. For the task of classifying sentences as relevant or irrelevant, keywords are extracted based on their frequency of occurrence in the relevant and irrelevant sentences. The frequency threshold is set to 7 for task 1 and 6 for task 2.
   The list of keywords extracted for task 1 and task 2 is given in Table 4. For n keywords, a vector of length n is defined which indicates the presence or absence of each keyword as 1 or 0 respectively; this vector of length n is the keyword feature in our system. The word-embedding model and the keyword-based model were also evaluated separately using the SVM classifier.
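
The sentence-embedding step can be illustrated with a much-simplified stand-in: averaging the word2vec vectors of the words in a sentence. Note this is only an approximation for illustration; the paper's Eq. (1), following [6], learns a sentence vector D jointly with the word vectors W rather than averaging. The dictionary and dimensions below are toy values, not the paper's:

```python
# Simplified stand-in for the sentence-embedding step: mean of the
# word2vec vectors of the words in the sentence. The actual system
# follows Eq. (1) (paragraph-vector style, Le & Mikolov [6]).

def sentence_embedding(sentence, word_vectors, dim=100):
    """Mean of the available word vectors; zero vector if none are known."""
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

# Toy 3-dimensional vectors (the paper uses 100-dimensional skip-gram vectors).
wv = {"skin": [1.0, 0.0, 0.0], "cancer": [0.0, 1.0, 0.0]}
print(sentence_embedding("Skin cancer", wv, dim=3))  # [0.5, 0.5, 0.0]
```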
                                  Table 4: List of keywords extracted for Task 1 and Task 2
   QUERY         Task 1 KEYWORDS                                                    Task 2 KEYWORDS
   Vitamin-C     Vitamin, Prevent, Symptoms, Severe, Incidence, Dose, Cold          Prevent, Reduce, Severe, Benefit, Risk, No
   E-Cigarette   Smokers, E-Cigarette, Cigarette, Tobacco, Cancer, Quit, Cessation  Safe, Less, Harm, Damage, Risk, No
   MMR-Vaccine   Vaccine, MMR, Autism, Children, Disorder, Thimerosal, Measles      No, Evidence, Cause, Possible, Risk, Develop
   Skin Cancer   UV, Melanoma, Exposure, Cancer, Sun, Skin, Radiation               Increase, Cause, Work, Rate, Exposure, Not
   Women-HRT     Menopause, HRT, Hormone, Ovarian, Breast, Estrogen, Oestrogen      Increase, Effect, Severe, High, Risk, Symptom
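
The keyword features described above can be sketched in a few lines: keywords are the words whose frequency in the labelled sentences reaches the threshold, and each sentence becomes a binary presence vector over that list. The tokenization details below are assumptions, not stated in the paper, and the toy threshold is lowered from the paper's 7:

```python
from collections import Counter

# Sketch of frequency-thresholded keyword extraction (threshold 7 for
# task 1, 6 for task 2 in the paper) and the binary presence vector.
# Tokenization by lowercased whitespace split is an assumption.

def extract_keywords(sentences, threshold):
    """Words whose corpus frequency reaches the threshold, sorted."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    return sorted(w for w, c in counts.items() if c >= threshold)

def keyword_vector(sentence, keywords):
    """Length-n binary vector: 1 if the keyword occurs in the sentence."""
    words = set(sentence.lower().split())
    return [1 if k in words else 0 for k in keywords]

# Toy example with threshold 2 instead of the paper's 7.
sents = ["skin cancer sun", "sun exposure", "melanoma sun"]
keys = extract_keywords(sents, threshold=2)    # ['sun']
print(keyword_vector("too much sun", keys))    # [1]
```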


 Figure 1: Methodology of the Proposed System. (Flowchart: for each query, sentence embedding and automatic keyword generation produce embedding vectors and keyword lists for R/IR and S/O/N; the embedding vector and the R/IR keyword vector are fed to an SVM classifier, which predicts the R/IR label; the embedding vector, the S/O/N keyword vector and the predicted R/IR label are then fed to a second SVM classifier, which predicts the S/O/N label.)


             Table 5: Cross-Validation Accuracy for classifying Relevant/Irrelevant tags
     Query         Embedding   Keyword   Embedding & Keyword
 Skin Cancer         65.79      66.57          66.28
 E-cigarette         69.49      70.94          71.43
 MMR-Vaccine         79.98      80.31          84.94
 Vitamin-C           80.58      75.18          81.65
 Women-HRT           83.74      83.74          82.93


             Table 6: Cross-Validation Accuracy for classifying Support/Oppose/Neutral tags
     Query         Embedding   Keyword   Embedding & Keyword
 Skin Cancer         44.57      54.55          56.6
 E-cigarette         47.22      51.33          58.94
 MMR-Vaccine         55.3       57.53          62.55
 Vitamin-C           55.04      51.8           52.52
 Women-HRT           54.88      55.28          52.78


5.   EXPERIMENTS AND RESULTS
   As mentioned above, our system is developed based on the combination of word-embedding and keyword features. The methodology of the proposed system is illustrated in Figure 1. The sentence vectors and keyword features of each sentence of a particular query in the training data are combined, and the Relevant/Irrelevant label is taken from the training data. A machine-learning SVM classifier is used for training the system [3], with the combined feature set and the label set given as input. During training, each query gets its own model of word-embedding and keyword features; hence there are 5 models (to classify the sentences into relevant or irrelevant) for the 5 queries. These 5 models are used to predict the R/IR (relevant/irrelevant) label for the test data using SVM, and the predicted labels for the 5 queries are used in the further classification of sentences. The system is subjected to 10-fold cross-validation while training. The cross-validation accuracy obtained for this task using the three approaches (Embedding, Keyword, and Embedding combined with Keyword) is tabulated in Table 5.
   For the second task, the keyword features differ, because the keywords contributing to the R/IR label are different from those for the S/O/N label. Keywords for the further classification are therefore selected based on their frequency of occurrence in the support, oppose and neutral statements of the training data. To classify the sentences into Supporting, Opposing or Neutral, a combined feature set is used for SVM training: the embedding features, the keyword features (S/O/N), the support/oppose/neutral labels of the training data, and the R/IR labels predicted in task 1. The system is again subjected to 10-fold cross-validation while training, and training results in 5 models for the 5 queries, which are used for S/O/N (Support/Oppose/Neutral) classification. SVM predicts the S/O/N label for the test data. Table 6 tabulates the cross-validation accuracy obtained for the second task using the three approaches (Embedding, Keyword, and Embedding combined with Keyword). From the cross-validation results, it is evident that combining keyword features and word-embedding features is an acceptable method.
   The results given by the CHIS task organizers for our proposed system are tabulated in Table 7.


             Table 7: Accuracy obtained for Task 1 and Task 2 (in %)
        Query              Task 1    Task 2
   Skin Cancer            48.8636   23.8636
   E-cigarette            76.5625   39.0625
   MMR-Vaccine            88.8889   34.7222
   Vitamin-C              60.8108   32.4324
   Women-HRT              75.8621   43.1034
   Overall Accuracy       70.1976   34.6368


   The accuracy given by the CHIS organizers for the submissions of the top 6 teams for task 1 is tabulated in Table 8. The results given by the organizers for the submissions of the top 7 teams for task 2 are tabulated in Table 9.


         Table 8: Task 1 results by organizers
           Team Name            Accuracy   Position
        SSN NLP                   78.10       I
        Fermi                     77.04       I
        JU KS Group               73.39       II
        Techie Challangers        73.03       II
        Jainisha Sankhavara       70.28       III
        Amrita CEN                70.19       III


         Table 9: Task 2 results by organizers
           Team Name            Accuracy   Position
        JNTUH                     55.43       I
        Fermi                     54.87       I
        Hua Yang                  53.98       II
        Techie Challangers        52.46       III
        Amrita fire CEN           38.53       IV
        Jainisha Sankhavara       37.95       V
        Amrita CEN                34.63       VI


6.   CONCLUSIONS
   In this paper, we have proposed a methodology based on the combination of keyword and word-embedding features. These features contribute to the effective retrieval of relevant information. Keyword features for any set of documents can be extracted based on frequency of occurrence. The proposed system is helpful in extracting the most relevant documents for a query from a large pool of web documents. Irrespective of the position we acquired in task 1, our accuracy is comparable to that of the other teams. The second task is more challenging due to the further classification; accuracy could be increased by considering sentiment features for sentences.

7.   ACKNOWLEDGMENTS
   We would like to express our sincere gratitude to the organizers of the Forum for Information Retrieval Evaluation 2016 for organizing a task with great scope for research. We would also like to thank Xerox Research Centre for organizing the CHIS task.

8.   REFERENCES
 [1] N. Abinaya, N. John, H. Barathi Ganesh, M. Anand Kumar, and K. P. Soman. AMRITA-CEN@FIRE-2014: Named entity recognition for Indian languages using rich features. ACM International Conference Proceeding Series, 05-07-Dec-2014:103–111, 2014.
 [2] M. Anand Kumar, S. Shriya, and K. P. Soman. AMRITA-CEN@FIRE 2015: Extracting entities for social media texts in Indian languages. CEUR Workshop Proceedings, 1587:85–88, 2015.
 [3] M. Arun Kumar and M. Gopal. A comparison study on multiple binary-class SVM methods for unilabel text categorization. Pattern Recognition Letters, 31(11):1437–1444, 2010.
 [4] W. W. Chapman, P. M. Nadkarni, L. Hirschman, L. W. D'Avolio, G. K. Savova, and O. Uzuner. Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions. Journal of the American Medical Informatics Association, 18(5):540–543, 2011.
 [5] L. Goeuriot, G. J. Jones, L. Kelly, J. Leveling, A. Hanbury, H. Müller, S. Salantera, H. Suominen, and G. Zuccon. ShARe/CLEF eHealth Evaluation Lab 2013, Task 3: Information Retrieval to Address Patients' Questions when Reading Clinical Reports. CLEF 2013 Online Working Notes, 8138, 2013.
 [6] Q. Le and T. Mikolov. Distributed representations of sentences and documents. In International Conference on Machine Learning, volume 4, pages 2931–2939, 2014.
 [7] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
 [8] G. Remmiya Devi, P. V. Veena, M. Anand Kumar, and K. P. Soman. Entity Extraction for Malayalam Social Media Text using Structured Skip-gram based Embedding Features from Unlabeled Data. Procedia Computer Science, 93:547–553, 2016.
 [9] S. Sanjay, M. Anand Kumar, and K. P. Soman. AMRITA-CEN-NLP@FIRE 2015: CRF based named entity extraction for Twitter microposts. CEUR Workshop Proceedings, 1587:96–99, 2015.
[10] S. Z. Shariff, S. A. Bejaimal, J. M. Sontrop, A. V. Iansavichus, R. B. Haynes, M. A. Weir, and A. X. Garg. Retrieving clinical evidence: a comparison of PubMed and Google Scholar for quick clinical searches. Journal of Medical Internet Research, 15(8):e164, 2013.
[11] M. Sinha, S. Mannarswamy, and S. Roy. CHIS@FIRE: Overview of the CHIS Track on Consumer Health Information Search. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[12] N. Tracy Edinger, A. M. Cohen, S. Bedrick, K. Ambert, and W. Hersh. Barriers to retrieving patient information from electronic health record data: failure analysis from the TREC medical records track. In AMIA Annual Symposium Proceedings, pages 180–188, 2012.