AMRITA_CEN@FIRE 2016: Consumer Health Information Search using Keyword and Word Embedding Features

Veena P V, Remmiya Devi G, Anand Kumar M, Soman K P
Center for Computational Engineering and Networking (CEN)
Amrita School of Engineering, Coimbatore
Amrita Vishwa Vidyapeetham, Amrita University, India

ABSTRACT

This work is submitted to the Consumer Health Information Search (CHIS) Shared Task at the Forum for Information Retrieval Evaluation (FIRE) 2016. Information retrieved from any part of the web should contain informative content relevant to the user's search. Hence the major task is to retrieve only those documents that are relevant to the user's query. The given task further refines the classification into three categories of relevance: support, oppose and neutral. A user reading an article from the web must know whether the content of that article supports or opposes the title of the article, which is a considerable challenge for the system. Our proposed system is based on the combination of keyword based features and word embedding based features. Sentences are classified by a machine learning based classifier, the Support Vector Machine (SVM).

CCS Concepts

• Information systems → Clustering and classification; • Applied computing → Health care information systems;

Keywords

Word embedding, Machine Learning, Support Vector Machine (SVM)

1. INTRODUCTION

Natural Language Processing plays a vital role in interpreting human language in a form that a system can process. One application of this role is delivering the most relevant information through web search. Information regarding health issues is now among the essential needs of people, and the volume of research and evidence in this field is growing rapidly. So, if an individual searches the web for a health related query, a large number of documents will be retrieved. The efficiency of the search depends on whether the retrieved documents are relevant to the query. The main objective of the proposed system is that a person should benefit from web search even without domain knowledge.

In past years, many developments were made in the efficient retrieval of relevant clinical data. The role of shared tasks in overcoming barriers to NLP in the clinical domain is discussed in [4]. The Medical Records Track was a task conducted by the Text Retrieval Conference (TREC) in 2011 for comparing algorithms used for text retrieval for clinical studies [12]. The ShARe/CLEF eHealth evaluation lab conducted a task to analyze the effect on information retrieval of using additional information such as discharge summaries and external resources such as medical ontologies [5]. PubMed, an archive of biomedical journals, and Google Scholar were compared for retrieval efficiency in clinical searches in [10]. Embedding features can also be used efficiently for entity extraction, which in turn helps in retrieving precise documents. Several methods were proposed in various FIRE tasks to perform entity extraction, such as Conditional Random Field (CRF) based entity extraction [9], entity extraction for Indian languages using an SVM based classifier [2] and Named Entity Recognition for Indian languages using rich features [1]. A paper was also published on extracting entities for the Malayalam language using structured skip-gram based embedding [8].

The proposed system is useful in finding sentences relevant to the given query and in determining whether each is a supporting, opposing or neutral sentence. An example of training data for the query 'Skin Cancer' is given in Table 1.

The given dataset, along with some additionally collected clinical documents from the web, was subjected to unsupervised feature extraction. Keyword based and word embedding based approaches to information search were both carried out, and the integration of these two feature types achieved better results. Our proposed system is therefore based on the combination of keyword and word embedding features. The embedding vectors, the keyword feature vectors and the corresponding Relevant/Irrelevant and Support/Oppose/Neutral labels were together given to train the classifier. A well-known machine learning based classifier, the Support Vector Machine (SVM), is used for the classification task.

Section 2 gives the task description. The details of the dataset used are given in Section 3. Section 4 discusses our proposed methodology. Experiments and results are explained in Section 5. The conclusion of the paper is given in Section 6.

2. TASK DESCRIPTION

Our system is submitted to Consumer Health Information Search in FIRE 2016 [11]. The task includes two subtasks. The first task is to classify the sentences in the document as relevant to the query or not (R/IR). The second task is to further classify the Relevant and Irrelevant sentences into Support, Oppose and Neutral (S/O/N) sentences with respect to the query. The dataset contains 5 queries, say $Q_{skin}$, $Q_{ecig}$, $Q_{mmr}$, $Q_{vitc}$ and $Q_{hrt}$. Each query contains 200 to 400 sentences. For a query, say $Q_{skin}$, each sentence in that particular document is classified for relevant/irrelevant and for support/oppose/neutral (the two given tasks) with labels $L_1$ and $L_2$ such that $L_1 \in \{R, IR\}$ and $L_2 \in \{S, O, N\}$. Thus, each sentence $q_i$ in the document of $Q$ has two labels: $l_1$ denoting relevant or irrelevant and $l_2$ denoting support, oppose or neutral. Relevant sentences are useful in providing an answer to the given query. The classification in task 2 is carried out with the help of the label predicted in task 1.

Table 1: Example from training data

| Sentences | R/IR | S/O/N |
|---|---|---|
| Most skin cancers are caused by exposure to the sun. | R | S |
| Skin cancer can look many different ways. | IR | N |
| Evidence shows that the Sun Protects You from Melanoma. | R | O |

3. DATASET DESCRIPTION

The training data given for this task holds 5 queries $Q_1$, $Q_2$, $Q_3$, $Q_4$ and $Q_5$, corresponding to Skin Cancer, E-cigarette, MMR-Vaccine, Vitamin-C and Women-HRT. Each query contains sentences under two categories, Relevant and Irrelevant (R/IR tag), and the number of relevant and irrelevant sentences differs from query to query. The sentences are further categorized into Support, Oppose and Neutral (S/O/N tag). The individual counts of R/IR sentences and S/O/N sentences for each query are tabulated in Table 2. An additional dataset related to these 5 queries was collected from online resources; its sentence count for each query is given in Table 3. Owing to time constraints, we limited the collection of the additional dataset to around 1000 sentences per query. Analysis of the given training data shows that if a sentence is irrelevant to a query, it is most probably a neutral sentence with respect to that query. When generating embedding features, both the training data and the additional dataset are used to train the word embedding model.

Table 2: Number of sentences in Train Data and Test Data (columns 2 to 9 are Train Data counts; S = Support, O = Oppose, N = Neutral)

| Query | Relevant S | Relevant O | Relevant N | Relevant Total | Irrelevant S | Irrelevant O | Irrelevant N | Irrelevant Total | Test Data |
|---|---|---|---|---|---|---|---|---|---|
| Skin Cancer | 104 | 76 | 13 | 193 | 0 | 2 | 146 | 148 | 88 |
| E-cigarette | 93 | 165 | 35 | 293 | 0 | 0 | 120 | 120 | 64 |
| MMR-Vaccine | 72 | 92 | 44 | 208 | 0 | 0 | 51 | 51 | 58 |
| Vitamin-C | 111 | 68 | 29 | 208 | 0 | 0 | 70 | 70 | 74 |
| Women-HRT | 41 | 132 | 31 | 204 | 1 | 4 | 37 | 42 | 72 |

Table 3: Additional Dataset Collected

| Query | Additional dataset |
|---|---|
| Skin Cancer | 1044 |
| E-cigarette | 1003 |
| MMR-Vaccine | 1084 |
| Vitamin-C | 1199 |
| Women-HRT | 1469 |

4. METHODOLOGY

Our proposed system is based on the combination of word embedding features and keyword based features. Word embedding features are generally word vectors obtained using the word2vec tool [7]. In this case, the input consists of the sentences of each query, so embedding features are needed for sentences rather than for words. These sentence embedding features are derived from the word2vec features. The input to word2vec includes the training data and the additional dataset collected from online resources. Word2vec is trained to obtain embedding features for the training data. The vector size is set to 100 and the skip-gram model is chosen for training. The embedding features resulting from word2vec are used to generate the embedding feature for a sentence as in Eq. (1) [6].

$$ y = a + T\,h(D, w_{t-k}, \ldots, w_{t+k}; W) \tag{1} $$

where $a$ and $T$ are the softmax parameters, $h$ is the combination of word and sentence embedding features, $w_{t-k}, \ldots, w_{t+k}$ are the context words, $W$ stands for the word vectors and $D$ stands for the sentence vectors. These embedding features form the feature set for the approach using the word embedding model.

The second feature set in our methodology is the keyword features. Keywords are extracted from the dataset given for training and testing. For the task of classifying sentences as relevant or irrelevant, keywords are extracted based on their frequency of occurrence in the relevant and irrelevant sentences. The threshold value for determining the frequency of words is set to 7 for task 1 and 6 for task 2.

The list of keywords extracted for task 1 and task 2 is given in Table 4. For n keywords, a vector of length n is defined in which each element indicates the presence or absence of the corresponding keyword as 1 or 0 respectively. This vector of length n is the keyword feature in our system. The word embedding model and the keyword based model were also evaluated separately using the SVM classifier.
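The keyword feature and its combination with the sentence embedding, as described in Section 4, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`tokenize`, `extract_keywords`, `keyword_vector`, `combine_features`) are hypothetical, and several details are assumptions because the paper does not specify them: the threshold is read as a minimum raw count, matching is done on lowercased exact tokens, the sentence embedding is taken as an already computed list of floats, and the predicted R/IR label is encoded as 1.0 or 0.0.

```python
import re
from collections import Counter

# Assumed tokenization: lowercase alphanumeric tokens, hyphenated words kept whole.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return TOKEN_RE.findall(sentence.lower())


def extract_keywords(sentences, min_count):
    """Return words that occur at least `min_count` times, most frequent first.

    Assumption: the frequency threshold is a minimum raw count over all
    sentences; the paper does not state how the threshold is applied.
    """
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    frequent = [w for w, c in counts.items() if c >= min_count]
    return sorted(frequent, key=lambda w: (-counts[w], w))


def keyword_vector(sentence, keywords):
    """Binary vector of length n: 1 if the keyword is present, else 0."""
    present = set(tokenize(sentence))
    return [1 if kw.lower() in present else 0 for kw in keywords]


def combine_features(embedding, keyword_vec, predicted_relevance=None):
    """Concatenate a sentence embedding with a keyword vector.

    For task 2 the R/IR label predicted in task 1 can be appended;
    encoding it as 1.0 (R) or 0.0 (IR) is an assumption of this sketch.
    """
    features = list(embedding) + [float(v) for v in keyword_vec]
    if predicted_relevance is not None:
        features.append(1.0 if predicted_relevance == "R" else 0.0)
    return features


# Task 1 keywords for the Skin Cancer query, taken from Table 4.
task1_keywords = ["UV", "Melanoma", "Exposure", "Cancer", "Sun", "Skin", "Radiation"]
sentence = "Most skin cancers are caused by exposure to the sun."

key_vec = keyword_vector(sentence, task1_keywords)
print(key_vec)  # [0, 0, 1, 0, 1, 1, 0]

# Placeholder for a 100-dimensional sentence embedding (values are dummies).
embedding = [0.0] * 100
print(len(combine_features(embedding, key_vec)))  # 107
```

Because matching is on exact tokens, "cancers" does not match the keyword "Cancer" in this sketch; stemming or substring matching would change the vector, and the paper does not say which was used. The resulting feature lists would then be passed, together with the R/IR or S/O/N labels, to an SVM implementation of the reader's choice.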
Table 4: List of keywords extracted for Task 1 and Task 2

| Query | Keywords for Task 1 | Keywords for Task 2 |
|---|---|---|
| Vitamin-C | Vitamin, Prevent, Symptoms, Severe, Incidence, Dose, Cold | Prevent, Reduce, Severe, Benefit, Risk, No |
| E-Cigarette | Smokers, E-Cigarette, Cigarette, Tobacco, Cancer, Quit, Cessation | Safe, Less, Harm, Damage, Risk, No |
| MMR-Vaccine | Vaccine, MMR, Autism, Children, Disorder, Thimerosal, Measles | No, Evidence, Cause, Possible, Risk, Develop |
| Skin Cancer | UV, Melanoma, Exposure, Cancer, Sun, Skin, Radiation | Increase, Cause, Work, Rate, Exposure, Not |
| Women-HRT | Menopause, HRT, Hormone, Ovarian, Breast, Estrogen, Oestrogen | Increase, Effect, Severe, High, Risk, Symptom |

[Figure 1: Methodology of the Proposed System. Flowchart: Query → Automatic Keyword Generation (keywords for R/IR, keywords for SON) and Sentence Embedding (embedding vectors) → SVM classifier on the embedding vector and keyword vector → predicted R/IR label → SVM classifier on the embedding vector, keyword vector and predicted R/IR label → predicted SON label.]

5. EXPERIMENTS AND RESULTS

As mentioned above, our system is based on the combination of word embedding and keyword features. The methodology of the proposed system is illustrated in Figure 1. The sentence vector and the keyword features of each sentence of a particular query in the training data are combined, and the Relevant/Irrelevant label is taken from the training data. A machine learning based SVM classifier is used for training the system [3]. The combined feature set and the label set are given as input. During training, each query has its own model built on word embedding and keyword features. Hence there are 5 models, one per query, for classifying sentences as relevant or irrelevant. These 5 models are used to predict the R/IR (relevant/irrelevant) label for the test data. The predicted labels for the 5 queries are used in the further classification of sentences. The system is subjected to 10-fold cross validation while training. The cross validation accuracy obtained for this task using three different approaches (word embedding, keyword, and keyword combined with word embedding) is tabulated in Table 5.

Table 5: Cross-Validation Accuracy for classifying Relevant/Irrelevant tags

| Query | Embedding | Keyword | Embedding & Keyword |
|---|---|---|---|
| Skin Cancer | 65.79 | 66.57 | 66.28 |
| E-cigarette | 69.49 | 70.94 | 71.43 |
| MMR-Vaccine | 79.98 | 80.31 | 84.94 |
| Vitamin-C | 80.58 | 75.18 | 81.65 |
| Women-HRT | 83.74 | 83.74 | 82.93 |

For the second task the keyword features differ, because the keywords that contribute to the R/IR label are different from those that contribute to the S/O/N label. So, keywords for this further classification are selected based on their frequency of occurrence in the support, oppose and neutral statements of the training data. To classify the sentences as Supporting, Opposing or Neutral, the SVM is trained on a combined feature set comprising the embedding features, the S/O/N keyword features and the R/IR labels predicted in task 1, with the Support/Oppose/Neutral labels of the training data as targets. The system is again subjected to 10-fold cross validation while training. Training results in 5 models, one per query, that are used for S/O/N (Support/Oppose/Neutral) classification, and the SVM predicts the S/O/N label for the test data. Table 6 tabulates the cross-validation accuracy obtained for the second task using the three approaches (word embedding, keyword, and keyword combined with word embedding). In the cross validation results, the combination of keyword and word embedding features gives the highest accuracy for three of the five queries in each task, which indicates that the combined method is acceptable.

Table 6: Cross-Validation Accuracy for classifying Support/Oppose/Neutral tags

| Query | Embedding | Keyword | Embedding & Keyword |
|---|---|---|---|
| Skin Cancer | 44.57 | 54.55 | 56.6 |
| E-cigarette | 47.22 | 51.33 | 58.94 |
| MMR-Vaccine | 55.3 | 57.53 | 62.55 |
| Vitamin-C | 55.04 | 51.8 | 52.52 |
| Women-HRT | 54.88 | 55.28 | 52.78 |

The results reported by the CHIS task organizers for our proposed system are tabulated in Table 7. The overall accuracy in Table 7 equals the unweighted mean of the five per-query accuracies. The accuracy given by the CHIS organizers for the submissions of the top 6 teams for task 1 is tabulated in Table 8. The organizers' results for the submissions of the top 7 teams for task 2 are tabulated in Table 9.

Table 7: Accuracy obtained for Task 1 and Task 2 (in %)

| Query | Task 1 | Task 2 |
|---|---|---|
| Skin Cancer | 48.8636 | 23.8636 |
| E-cigarette | 76.5625 | 39.0625 |
| MMR-Vaccine | 88.8889 | 34.7222 |
| Vitamin-C | 60.8108 | 32.4324 |
| Women-HRT | 75.8621 | 43.1034 |
| Overall Accuracy | 70.1976 | 34.6368 |

Table 8: Task 1 results by organizers

| Team Name | Accuracy | Position |
|---|---|---|
| SSN NLP | 78.10 | I |
| Fermi | 77.04 | I |
| JU KS Group | 73.39 | II |
| Techie Challangers | 73.03 | II |
| Jainisha Sankhavara | 70.28 | III |
| Amrita CEN | 70.19 | III |

Table 9: Task 2 results by organizers

| Team Name | Accuracy | Position |
|---|---|---|
| JNTUH | 55.43 | I |
| Fermi | 54.87 | I |
| Hua Yang | 53.98 | II |
| Techie Challangers | 52.46 | III |
| Amrita fire CEN | 38.53 | IV |
| Jainisha Sankhavara | 37.95 | V |
| Amrita CEN | 34.63 | VI |

6. CONCLUSIONS

In this paper, we have proposed a methodology based on the combination of keyword and word embedding features. These features contribute to the effective retrieval of relevant information. Keyword features for any set of documents can be extracted based on frequency of occurrence. The proposed system will be helpful in extracting the most relevant document for a query from a large pool of documents on the web. Although our system was placed in position III in task 1, its accuracy (70.19%) is comparable to that of the other teams. The second task is more challenging because of the further classification it requires. Accuracy can be increased by considering sentiment features for sentences.

7. ACKNOWLEDGMENTS

We would like to express our sincere gratitude to the organizers of the Forum for Information Retrieval Evaluation 2016 for organizing a task with great scope for research. We would also like to thank Xerox Research Centre for organizing the CHIS task.

8. REFERENCES

[1] N. Abinaya, N. John, H. Barathi Ganesh, M. Anand Kumar, and K. P. Soman. AMRITA-CEN@FIRE-2014: Named entity recognition for Indian languages using rich features. ACM International Conference Proceeding Series, 05-07-Dec-2014:103–111, 2014.
[2] M. Anand Kumar, S. Shriya, and K. P. Soman. AMRITA-CEN@FIRE 2015: Extracting entities for social media texts in Indian languages. CEUR Workshop Proceedings, 1587:85–88, 2015.
[3] M. Arun Kumar and M. Gopal. A comparison study on multiple binary-class svm methods for unilabel text categorization. Pattern Recognition Letters, 31(11):1437–1444, 2010.
[4] W. W. Chapman, P. M. Nadkarni, L. Hirschman, L. W. D'Avolio, G. K. Savova, and O. Uzuner. Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions. Journal of the American Medical Informatics Association, 18(5):540–543, 2011.
[5] L. Goeuriot, G. J. Jones, L. Kelly, J. Leveling, A. Hanbury, H. Müller, S. Salantera, H. Suominen, and G. Zuccon. ShARe/CLEF eHealth Evaluation Lab 2013, Task 3: Information Retrieval to Address Patients' Questions when Reading Clinical Reports. CLEF 2013 Online Working Notes, 8138, 2013.
[6] Q. Le and T. Mikolov. Distributed representations of sentences and documents. volume 4, pages 2931–2939, 2014.
[7] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
[8] G. Remmiya Devi, P. V. Veena, M. Anand Kumar, and K. P. Soman. Entity Extraction for Malayalam Social Media Text using Structured Skip-gram based Embedding Features from Unlabeled Data. Procedia Computer Science, 93:547–553, 2016.
[9] S. Sanjay, M. Anand Kumar, and K. P. Soman. AMRITA-CEN-NLP@FIRE 2015: CRF based named entity extraction for Twitter microposts. CEUR Workshop Proceedings, 1587:96–99, 2015.
[10] S. Z. Shariff, S. A. Bejaimal, J. M. Sontrop, A. V. Iansavichus, R. B. Haynes, M. A. Weir, and A. X. Garg. Retrieving clinical evidence: a comparison of Pubmed and Google Scholar for quick clinical searches. Journal of medical Internet research, 15(8):e164, 2013.
[11] M. Sinha, S. Mannarswamy, and S. Roy. CHIS@FIRE: Overview of the CHIS Track on Consumer Health Information Search. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[12] N. Tracy Edinger, A. M. Cohen, S. Bedrick, K. Ambert, and W. Hersh. Barriers to retrieving patient information from electronic health record data: failure analysis from the TREC medical records track. pages 180–188, 2012.