Biomedical Information Retrieval

Jainisha Sankhavara
Information Retrieval and Language Processing Lab,
Dhirubhai Ambani Institute of Information and Communication Technology
Gandhinagar, Gujarat
jainishasankhavara@gmail.com

Prasenjit Majumder
Information Retrieval and Language Processing Lab,
Dhirubhai Ambani Institute of Information and Communication Technology
Gandhinagar, Gujarat
prasenjit.majumder@gmail.com

ABSTRACT
Retrieving relevant information from biomedical text data is a challenging area of research. Thousands of articles are added to the biomedical literature each year, and this large collection of publications offers an excellent opportunity for discovering hidden biomedical knowledge by applying information retrieval (IR) and natural language processing (NLP) technologies. Biomedical text processing differs from general text processing: it requires special handling because of its complex medical terminologies. Medical entity identification and normalization is itself a research problem, and the relationships among medical entities affect any system built on them. Clinical Decision Support systems aim to assist decision-making tasks in the biomedical domain, and such medical knowledge has the potential to considerably improve the quality of care provided by clinicians. The medical field has various types of queries: short questions, medical case reports, medical case narratives, verbose medical queries, community questions, semi-structured queries, etc. This diverse nature of medical data demands special attention from IR and NLP.

CCS CONCEPTS
• Information systems → Query reformulation; Document filtering; Clustering and classification;

KEYWORDS
Biomedical Text Processing, Query Expansion, Clinical Decision Support

1 MOTIVATION AND CHALLENGES
Recent statistics show that 70% of all web search queries fall into the medical and healthcare category. Biomedical Information Retrieval (BIR) is a special type of information retrieval. The major challenges in biomedical information retrieval lie in handling complex, ambiguous, and inconsistent medical terms and their ad-hoc abbreviations.
• There are many complex terms, like 'nuclear factor kappa-light-chain-enhancer of activated B cells' and 'NF-kB DNA binding with electromobility shift assay'. The average length of biomedical entities is much higher than that of general entities. Identifying such medical entities is a preliminary subtask.
• Physicians use ad-hoc abbreviations very frequently, and these are ambiguous: 'PSA' can be 'prostate specific antigen', 'psoriasis arthritis', or 'poultry science association'.
• The rapid change in terminologies makes them inconsistent. For instance, 'H1N1 influenza', 'H1N1 Virus', 'swine influenza', 'SI', 'Pig Flu', and 'Swine-Origin Influenza A H1N1 Virus' all refer to the same entity. Such different representations of the same entity should be normalized to a single representation; this problem is known as entity normalization. Problems of this type, like acronym disambiguation, lead to poor system performance.
Also, healthcare-related search systems have two types of users: experts (clinicians) and laymen (non-clinicians). The query formulations of the two user groups differ for the same information need. For example, general people use the words 'heart attack', 'irregular heartbeat', and 'mouth ulcer', while medical practitioners/experts use 'myocardial infarction', 'cardiac arrhythmia', and 'mucosal ulcer' respectively. This leads to the problem of vocabulary mismatch, where different people name the same thing or concept differently. As an effect of these characteristics of biomedical terminologies and user-dependent query formulations, the problem of vocabulary mismatch between queries and (relevant) documents arises in Biomedical Information Retrieval. Missing synonyms cause low retrieval recall, i.e. out of all relevant documents in the collection, very few get retrieved. Ambiguous terms cause low precision, i.e. out of all retrieved documents, very few are relevant. We need to construct biomedical search engines that can address these issues.

2 DATA AND RESOURCES
Widely-used text collections in the biomedical domain are MEDLINE/PubMed, OHSUMED, and GENIA.
• The MEDLINE/PubMed database contains bibliographic references to journal articles in the life sciences with a concentration on biomedicine, and it is maintained by the U.S. National Library of Medicine (NLM). The MEDLINE/PubMed records can be downloaded for research.
• The OHSUMED [14] dataset contains all MEDLINE citations of 270 medical journals published over a five-year period (1987-1991).
• The TREC Genomics Track data [8] contains ten years of MEDLINE citations (1994-2003).
• The TREC Clinical Decision Support data [12, 13, 19] is a collection of 733,138 full-text articles from PubMed Central.
• The GENIA corpus [9] contains 1,999 MEDLINE abstracts retrieved using MeSH terms. It is annotated for part-of-speech, syntax, coreference, biomedical concepts and events, cellular localization, disease-gene associations, and pathways.
• The Unified Medical Language System (UMLS) [3], a compendium of controlled vocabularies maintained by NLM, is the most comprehensive resource, unifying over 100 dictionaries, terminologies, and ontologies in its Metathesaurus. It also provides a semantic network that represents relations between Metathesaurus entries, and a lexicon that contains lexicographic information about biomedical terms and common English words.

3 LITERATURE SURVEY
'Information Retrieval: A Health and Biomedical Perspective' [6] provides basic theory, implementation, and evaluation of IR systems in health and biomedicine. The tasks of named entity recognition, relation and event extraction, summarization, question answering, and literature-based discovery are outlined in 'Biomedical text mining: a survey of recent progress' [18]. The original conception of literature-based discovery [20] was facilitated by the use of Medical Subject Headings (MeSH), which are controlled vocabulary terms added to bibliographic citations during the process of MEDLINE indexing.
PubMed is a biomedical search engine which primarily accesses the MEDLINE database of abstracts and references on biomedical topics and life sciences; it is maintained by the United States National Library of Medicine (NLM) at the National Institutes of Health (NIH). PubMed does binary matching [15] and is useful for short queries only. On the contrary, medical and healthcare-related queries are longer than general queries, since people tend to describe symptoms, tests, and ongoing treatments. For such verbose, longer queries, biomedical IR systems should deal properly with ambiguous, complex, and inconsistent biomedical terminologies, which is difficult to handle.
Automatic processing of biomedical text suffers from lexical ambiguity (homonymy and polysemy) and synonymy. Automatic query expansion (AQE) [11], [4], which has a long history in information retrieval, can be useful for dealing with such problems. For instance, medical queries were expanded with related terms from RxNorm, a drug dictionary, to improve the representation of a query for relevance estimation [5].
The emergence of medical domain-specific knowledge resources like UMLS can help a retrieval system gain more understanding of biomedical documents and queries. Various approaches to information retrieval with the UMLS Metathesaurus have been reported: some with a decline in results [7] and some with a gain in results [2]. In [2], pseudo-relevance feedback was used for query expansion, a technique where the top retrieved documents are assumed to be relevant and used as feedback to the query, and retrieval is performed using the expanded query.

4 BIOMEDICAL DOCUMENT RETRIEVAL

4.1 Preliminary Experiments
Query expansion which uses the top retrieved relevant documents is known as Relevance Feedback (RF), since it uses human judgements to identify relevancy. The Pseudo Relevance Feedback (PRF) technique instead assumes the top retrieved documents to be relevant and uses them as feedback documents. Query-expansion-based approaches give better results in the biomedical domain [16, 17].
Table 1 shows the results of standard retrieval, PRF-based query expansion, and RF-based query expansion with the BM25 [1] and In_expC2 [1] retrieval models. The Terrier tool has been used for all these experiments. MAP and infNDCG are used as evaluation metrics [10]; the higher the value of an evaluation measure, the better the retrieval result of the system. The results improve when query expansion is used: both PRF-based and RF-based query expansion give statistically significant improvements (p < 0.05) compared to no expansion. A schematic code sketch of the PRF expansion loop follows Table 1.

Table 1: Results of Standard Query Expansion

              CDS 2014                          CDS 2015                          CDS 2016
              MAP              infNDCG          MAP              infNDCG          MAP              infNDCG
BM25          0.1012           0.1779           0.1039           0.2036           0.0371           0.1250
BM25+PRF      0.1448 (+43.1%)  0.2231 (+25.4%)  0.1650 (+58.8%)  0.2725 (+33.8%)  0.0401 (+8.1%)   0.1367 (+9.3%)
BM25+RF       0.2043 (+101%)   0.3127 (+75.7%)  0.1834 (+76.5%)  0.3034 (+49.0%)  0.0561 (+51.2%)  0.1887 (+50.9%)
In_expC2      0.1167           0.1920           0.1118           0.2147           0.0445           0.1401
In_expC2+PRF  0.1483 (+27.1%)  0.2404 (+25.2%)  0.1634 (+46.1%)  0.2689 (+25.2%)  0.0606 (+36.1%)  0.1752 (+25.0%)
In_expC2+RF   0.2070 (+77.3%)  0.3431 (+78.6%)  0.1857 (+66.1%)  0.3145 (+46.4%)  0.0713 (+60.2%)  0.2118 (+51.1%)
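To make the feedback loop concrete, the following is a minimal sketch of PRF-based query expansion, assuming a placeholder retrieve() function and TF-IDF term weighting. The actual experiments used Terrier's BM25 and In_expC2 models with Terrier's own expansion machinery, so the function and parameter names here (retrieve, num_feedback_docs, num_terms) are illustrative assumptions, not the system that was run.

    # A minimal sketch of pseudo-relevance-feedback (PRF) query expansion.
    # Assumptions (not from the paper): retrieve() stands in for any ranked
    # retrieval function, and expansion terms are picked by summed TF-IDF.
    from sklearn.feature_extraction.text import TfidfVectorizer

    def expand_query(query, retrieve, num_feedback_docs=10, num_terms=20):
        # Step 1: assume the top retrieved documents are relevant (PRF).
        feedback_docs = retrieve(query, k=num_feedback_docs)

        # Step 2: weight terms in the feedback documents by TF-IDF.
        vectorizer = TfidfVectorizer(stop_words="english")
        tfidf = vectorizer.fit_transform(feedback_docs)

        # Step 3: rank terms by their summed weight across feedback docs.
        weights = tfidf.sum(axis=0).A1
        terms = vectorizer.get_feature_names_out()
        ranked = sorted(zip(weights, terms), reverse=True)
        top_terms = [term for _, term in ranked[:num_terms]]

        # Step 4: append the expansion terms and retrieve again.
        return query + " " + " ".join(top_terms)

For RF rather than PRF, step 1 would use the documents a human judged relevant instead of blindly trusting the top of the ranking.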
4.2 Feedback Documents Discovery
Query expansion methods rely largely on feedback documents and feedback terms. Automatic query expansion methods based on pseudo relevance feedback use the top retrieved documents as feedback documents [10], [4]. Those feedback documents might not all be relevant: the feedback document set might contain non-relevant documents along with truly relevant ones. The retrieval system is harmed by these non-relevant documents in the feedback set; they act like noise in the feedback system.
One attempt is to learn the truly relevant documents for feedback using minimal human intervention. The approach uses human judgements for a small set of feedback documents and then tries to learn to identify the truly relevant documents among the rest. The documents identified as relevant are then used for feedback, and query expansion is performed. Two approaches for this learning, based on classification and clustering, are presented here.

First Algorithm. The first proposed algorithm is based on classification. If human judgements are available for some of the feedback documents, they can serve as training data for classification. The documents are represented as bags of words: the TF-IDF scores of the words are the features, and the human relevance scores provide the classes. Using this training data, we want to predict the relevance of the other top retrieved feedback documents. A code sketch follows the listing below.

Algo1: classification
For each query Q:
(1) D_N - set of N top retrieved documents {d_1, d_2, ..., d_N}
(2) D_k - set of k top retrieved documents for which human judgements are available {d_1, d_2, ..., d_k}
(3) D_l - set of l = N-k top retrieved documents for which human judgements are not available {d_{k+1}, d_{k+2}, ..., d_N}
(4) D_F - set of feedback documents
(5) D_F = {d_i ; relevance of d_i > 0, d_i ∈ D_k}
(6) Train a classifier C on D_k using relevance as a class label and generate model M_c
(7) For each document d_j in D_l, k+1 ≤ j ≤ N
(8)     Predict the relevance r_j of d_j using trained model M_c
(9)     If r_j > 0, then D_F = D_F ∪ {d_j}
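Since the classifier names in Table 2 correspond to scikit-learn's standard estimators, Algo1 can plausibly be sketched in Python as follows. The TF-IDF bag-of-words representation follows the description above; the function name, the choice of KNeighborsClassifier, and the binarization of relevance labels are illustrative assumptions.

    # A sketch of Algo1: learn to identify truly relevant feedback documents.
    # docs_judged / relevance: the top-k documents with human judgements (D_k).
    # docs_unjudged: the remaining l = N - k top-retrieved documents (D_l).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import KNeighborsClassifier

    def select_feedback_docs(docs_judged, relevance, docs_unjudged):
        # Bag-of-words TF-IDF features over all N top-retrieved documents.
        vectorizer = TfidfVectorizer()
        X_all = vectorizer.fit_transform(docs_judged + docs_unjudged)
        X_train = X_all[:len(docs_judged)]
        X_test = X_all[len(docs_judged):]

        # Step 5: judged documents with relevance > 0 seed the set D_F.
        feedback = [d for d, r in zip(docs_judged, relevance) if r > 0]

        # Step 6: train a classifier on the judged documents (Nearest-
        # Neighbors here; the paper also reports SVMs, Neural-Net, etc.).
        labels = [int(r > 0) for r in relevance]
        clf = KNeighborsClassifier().fit(X_train, labels)

        # Steps 7-9: predict relevance for the unjudged documents and add
        # the predicted-relevant ones to D_F.
        preds = clf.predict(X_test)
        feedback += [d for d, r in zip(docs_unjudged, preds) if r > 0]
        return feedback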
Second Algorithm. The second algorithm is an extension of the first. The analysis of the first algorithm's results shows that the feedback document set still contains some non-relevant documents, and this is responsible for the insignificant improvement. The second approach therefore further removes non-relevant documents from the relevant class identified by the classification step. The idea is to perform clustering on the documents identified as relevant, with two clusters: one of actually relevant documents and a second of non-relevant documents. K-means clustering is used with k=2. Since the convergence of K-means clustering depends on the initial choice of cluster centroids, the initial centroids are chosen as the average of the relevant documents' vectors and the average of the non-relevant documents' vectors from the training data. A code sketch follows the listing below.

Algo2: classification + clustering
For each query Q:
(1) D_N - set of N top retrieved documents {d_1, d_2, ..., d_N}
(2) D_k - set of k top retrieved documents for which human judgements are available {d_1, d_2, ..., d_k}
(3) D_l - set of l = N-k top retrieved documents for which human judgements are not available {d_{k+1}, d_{k+2}, ..., d_N}
(4) D_F - set of feedback documents
(5) D_F = {d_i ; relevance of d_i > 0, d_i ∈ D_k}
(6) Train a classifier C on D_k using relevance as a class label and generate model M_c
(7) D_R = ∅, D_NR = ∅
(8) For each document d_j in D_l, k+1 ≤ j ≤ N
(9)     Predict the relevance r_j of d_j using trained model M_c
(10)    If r_j > 0 then D_R = D_R ∪ {d_j}
(11)    else D_NR = D_NR ∪ {d_j}
        \\ D_R contains predicted relevant documents from D_l
(12) Perform K-means clustering on D_R with k=2 (relevant docs and non-relevant docs)
(13) D_F = D_F ∪ {documents from the relevant-docs cluster}
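A possible rendering of the clustering refinement (steps 12-13) is sketched below, assuming dense document vectors. The function signature and the use of scikit-learn's KMeans with an explicit init array are illustrative assumptions; the centroid initialization itself follows the paper's description.

    # A sketch of the Algo2 refinement: cluster the documents the classifier
    # predicted as relevant (D_R) into two groups, seeding the centroids
    # from the judged training data.
    import numpy as np
    from sklearn.cluster import KMeans

    def refine_feedback(X_train, train_relevance, X_pred_relevant):
        # X_train: dense TF-IDF vectors of the k judged documents (D_k).
        # train_relevance: their human relevance scores.
        # X_pred_relevant: dense vectors of the documents in D_R.
        X_train = np.asarray(X_train)
        rel = np.asarray(train_relevance) > 0

        # Initial centroids: average of the relevant training vectors and
        # average of the non-relevant training vectors, as the paper states.
        init = np.vstack([X_train[rel].mean(axis=0),
                          X_train[~rel].mean(axis=0)])

        # K-means with k=2; n_init=1 keeps the explicit centroids.
        km = KMeans(n_clusters=2, init=init, n_init=1).fit(X_pred_relevant)

        # Cluster 0 was seeded from the relevant centroid, so its members
        # stay in D_F; return their indices within D_R.
        return np.where(km.labels_ == 0)[0]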
The query expansion considers the top N retrieved documents for feedback. Here, we have considered the top 250 documents, of which the top 50 are used for training, i.e. human judgements for the top 50 documents are used in training, and the remaining 200 documents are taken as test data. Relevance is predicted for those 200 documents, and only the documents predicted relevant are then used for feedback. The result of relevance feedback using the top 50 documents is the baseline; all computed results are compared with this baseline.
The experiments are performed using nine different classifiers for the classification in the first algorithm. Table 2 shows the results in terms of MAP score for the CDS 2014 dataset. Neural-Net gives the best result among all nine classifiers, and the result of classification with Nearest-Neighbors is comparable to the baseline.
The classification results are not significantly different from the baseline results. We investigated the matter and found that the documents placed in the relevance class are not all actually relevant: the feedback document set also contains some irrelevant documents (misclassifications). For all 30 queries of CDS 2014, classification with Nearest-Neighbors classified 625 documents as relevant out of all 200 × 30 documents. Of the 625 documents used for feedback, 244 were actually relevant while the other 381 were wrongly classified as relevant. These 381 irrelevant documents are noise to the system. The second approach takes this into consideration and further refines the feedback document set by performing two-cluster clustering on the 625 documents. Manually removing the 381 irrelevant documents from the feedback document set shows a significant improvement over the baseline. The results of manually removing falsely classified documents from the feedback set and of the automatic clustering approach are also shown in Table 2.

Table 2: Results of different classifiers on CDS 2014 dataset (MAP)

                   classification   classification + manual      classification +
                                    removal of false relevant    clustering
                                    docs
Baseline (RF_50)   0.2768           0.2768                       0.2768
Nearest-Neighbors  0.2761           0.2815 (p = 0.048)           0.2794 (p = 0.305)
Linear-SVM         0.2736           0.2760                       0.2750
RBF-SVM            0.2736           0.2760                       0.2750
Gaussian-Process   0.2736           0.2762                       0.2753
Decision-Tree      0.2496           0.2788                       0.2725
Random-Forest      0.2733           0.2760                       0.2747
Neural-Net         0.2790           0.2808                       0.2790
AdaBoost           0.2618           0.2806                       0.2741
Naive-Bayes        0.2614           0.2792                       0.2661

The same experiments are performed on the CDS 2015 and CDS 2016 datasets. The results of both algorithms using six different classifiers are shown in Table 3. For the CDS 2015 dataset, the second algorithm performs better than the baseline, but the difference is not significant. For the CDS 2016 dataset, both algorithms perform similarly to the baseline.

Table 3: Results of different classifiers on CDS 2015 and CDS 2016 datasets (MAP)

                   CDS 2015                                      CDS 2016
                   classification   classification+clustering    classification   classification+clustering
Baseline (RF_50)   0.2283           0.2283                       0.1456           0.1456
Nearest-Neighbors  0.2234           0.2324 (p = 0.115)           0.1456           0.1459 (p = 0.895)
Decision-Tree      0.2065           0.2218                       0.1138           0.1370
Random-Forest      0.2130           0.2281                       0.1450           0.1458
Neural-Net         0.2295           0.2299                       0.1460           0.1466
AdaBoost           0.2092           0.2213                       0.1255           0.1345
Naive-Bayes        0.2172           0.2269                       0.1436           0.1468
REFERENCES
[1] Gianni Amati and Cornelis Joost Van Rijsbergen. 2003. Probabilistic models for information retrieval based on divergence from randomness.
[2] Alan R Aronson and Thomas C Rindflesch. 1997. Query expansion using the UMLS Metathesaurus. In Proceedings of the AMIA Annual Fall Symposium. American Medical Informatics Association, 485.
[3] Olivier Bodenreider. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research 32, suppl 1 (2004), D267-D270.
[4] Claudio Carpineto and Giovanni Romano. 2012. A survey of automatic query expansion in information retrieval. ACM Computing Surveys (CSUR) 44, 1 (2012), 1.
[5] Dina Demner-Fushman, Swapna Abhyankar, Antonio Jimeno-Yepes, Russell F Loane, Bastien Rance, François-Michel Lang, Nicholas C Ide, Emilia Apostolova, and Alan R Aronson. 2011. A knowledge-based approach to medical records retrieval. In TREC.
[6] William Hersh. 2008. Information Retrieval: A Health and Biomedical Perspective. Springer Science & Business Media.
[7] William Hersh, Susan Price, and Larry Donohoe. 2000. Assessing thesaurus-based query expansion using the UMLS Metathesaurus. In Proceedings of the AMIA Symposium. American Medical Informatics Association, 344.
[8] William R Hersh and Ravi Teja Bhupatiraju. 2003. TREC Genomics Track overview. In TREC, Vol. 2003, 14-23.
[9] Jin-Dong Kim, Tomoko Ohta, Sampo Pyysalo, Yoshinobu Kano, and Jun'ichi Tsujii. 2009. Overview of BioNLP'09 shared task on event extraction. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task. Association for Computational Linguistics, 1-9.
[10] Christopher D Manning, Prabhakar Raghavan, Hinrich Schütze, et al. 2008. Introduction to Information Retrieval. Vol. 1. Cambridge University Press.
[11] Melvin Earl Maron and John L Kuhns. 1960. On relevance, probabilistic indexing and information retrieval. Journal of the ACM (JACM) 7, 3 (1960), 216-244.
[12] Kirk Roberts, Dina Demner-Fushman, Ellen M. Voorhees, and William R. Hersh. 2016. Overview of the TREC 2016 Clinical Decision Support Track. In Proceedings of The Twenty-Fifth Text REtrieval Conference, TREC 2016, Gaithersburg, Maryland, USA, November 15-18, 2016.
[13] Kirk Roberts, Matthew S Simpson, Ellen M Voorhees, and William R Hersh. 2015. Overview of the TREC 2015 Clinical Decision Support Track. In TREC.
[14] Stephen E Robertson and David A Hull. 2000. The TREC-9 Filtering Track final report. In TREC, 25-40.
[15] Stephen E Robertson and K Sparck Jones. 1976. Relevance weighting of search terms. Journal of the Association for Information Science and Technology 27, 3 (1976), 129-146.
[16] Jainisha Sankhavara and Prasenjit Majumder. 2016. Team DA IICT at Clinical Decision Support Track in TREC 2016: topic modeling for query expansion. In TREC.
[17] Jainisha Sankhavara, Fenny Thakrar, Prasenjit Majumder, and Shamayeeta Sarkar. 2014. Fusing manual and machine feedback in biomedical domain. In Proceedings of The Twenty-Third Text REtrieval Conference, TREC 2014, Gaithersburg, Maryland, USA, November 19-21, 2014.
[18] Matthew S Simpson and Dina Demner-Fushman. 2012. Biomedical text mining: a survey of recent progress. In Mining Text Data. Springer, 465-517.
[19] Matthew S Simpson, Ellen M Voorhees, and William Hersh. 2014. Overview of the TREC 2014 Clinical Decision Support Track. Technical Report. Lister Hill National Center for Biomedical Communications, Bethesda, MD.
[20] Don R Swanson. 1986. Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine 30, 1 (1986), 7-18.