=Paper=
{{Paper
|id=Vol-2456/paper62
|storemode=property
|title=LongLife: a Platform for Personalized Search for Health and Life Sciences
|pdfUrl=https://ceur-ws.org/Vol-2456/paper62.pdf
|volume=Vol-2456
|authors=Patrick Ernst,Erisa Terolli,Gerhard Weikum
|dblpUrl=https://dblp.org/rec/conf/semweb/ErnstTW19
}}
==LongLife: a Platform for Personalized Search for Health and Life Sciences==
Patrick Ernst, Erisa Terolli, and Gerhard Weikum

Max Planck Institute for Informatics, Campus E1 4, 66123 Saarbrücken, Germany

{pernst, eterolli, weikum}@mpi-inf.mpg.de

Abstract. This work demonstrates LongLife: a system for semantically enhanced, personalized search of information about health issues and life-science topics. The system supports user-friendly access to entities, categories and free-text phrases in a corpus of 21 million documents, comprising scientific publications, clinical trials, encyclopedic articles, biomedical news and health forum posts. Search results can be personalized for two kinds of users: patients can provide descriptions of their health history, symptoms and therapies in layperson terms (as in health discussion forums), and doctors or researchers can target specific entities and categories (for disorders, symptoms, risk factors, drugs, etc.), e.g., when searching on behalf of a patient.

===1 Introduction===
Motivation: Although individual health and precision medicine are of great importance to society, search engines offer little support for the information needs of patients and doctors. PubMed search over biomedical publications supports filters on fields and MeSH tags, but this is still far from what semantic search can do in other domains such as business or travel, where text is enriched with entity markup and background knowledge graphs. The Semantic Web community has worked on creating Linked-Data resources for genes, diseases and drugs (e.g., Bio2RDF, DrugBank, DisGeNET), including work on SPARQL querying (e.g., [4]), but there is no linkage with the textual content that doctors and patients provide across the Internet. Moreover, search over online health communities (e.g., ehealthforum.com/health/health_forums.html), where patients and doctors discuss personal experiences with disorders, symptoms and therapies, is very basic.
IR research for health has largely focused on clinical data (see, e.g., [5] and references there).

As an example, consider a patient, or a doctor searching on the patient's behalf, querying about “pancreatic cysts and abdominal pain”. Search engines over clinical articles or health forums merely return all kinds of pancreas-related posts.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Contribution: LongLife provides access to entities, categories and free-text phrases in a corpus of 21 million documents, comprising scientific publications, clinical trials, encyclopedic articles, biomedical news and health forum posts. The semantic layer of entities and other annotations is automatically generated by named entity recognition [2] and by linking entities to the DeepLife biomedical knowledge base [3], which encompasses a variety of LOD datasets (Bio2RDF, DrugBank, etc.) and the UMLS taxonomy. In contrast to most prior work on biomedical entities, our method goes beyond major types like genes, proteins, diseases and drugs by capturing a much wider range of entities, such as symptoms/syndromes, therapies and nutrition- or lifestyle-related risk factors.

On top of this semantically enriched corpus, LongLife offers personalized search by incorporating individual user information on a per-query basis. Lay users such as patients typically pose keyword queries, but can add free-text self-descriptions of their case histories (e.g., like posts in health forums). LongLife automatically detects health-related entities in such texts, infers relevant biomedical categories and expands the user query into a semantic-search request. This way, it can return answers that are of specific relevance to the user, e.g., experiences of similar patients.
As a second use case, when doctors search on behalf of patients, entities and categories may be added manually and further patient properties can be specified (e.g., blood pressure and other vital signs). Again, LongLife automatically synthesizes the final query from these inputs and computes personalized rankings of answers.

===2 System Overview===
Data and Indexing: LongLife currently indexes 21,036,802 documents crawled from a diverse set of sources covering the full spectrum of biomedical information on the web: 19,884,225 scientific publications, 111,139 encyclopedic articles, 76,554 news articles, 164,756 clinical trials and 1,048,428 health forum posts. LongLife stores the data in ElasticSearch v.1.7.6. We index the following parts: title, full text, topical domain (e.g., cancer, diabetes, etc.) and all biomedical entities, using the UMLS thesaurus as entity repository. For entity recognition, we use the method of [2], which is based on min-hash sketches for matching candidate phrases to entity names. We disambiguate between multiple entity candidates by considering only the most specific entities according to the UMLS type system and picking the highest-ranked one. Every detected entity is linked to the LOD Cloud by leveraging a mapping between UMLS and Bio2RDF.

Query Processing: LongLife has a form-based search interface with auto-completion suggestions for each field. Input can take the form of keywords or multi-word phrases, entities and/or categories, where the latter two are identified by having the user choose from auto-completion suggestions. As in health forum posts, users are asked to pose a question composed of a short post title and a post body containing a description of the individual case. This input is then processed as follows:
• The user question is cast into a keyword query.
• The query is expanded with informative entities and their semantic categories identified in the full text of the case description (see below).
• The expanded query is issued to ElasticSearch.
• The result ranking is computed by LongLife’s customized scoring function that considers the personalized query expansion (see below).

Fig. 1: LongLife Search Interface

Personalized Query Expansion: We expand the initial keyword query with biomedical entities extracted from the medical case description. Since UMLS covers a broad spectrum of entities, we constrain, by default, the entity set to symptoms, diseases, medical findings and pharmacological substances. Each entity is assigned a weight computed as its squared Pointwise Mutual Information (<math>\mathrm{PMI}^2</math>) with the document’s domain. <math>\mathrm{PMI}^2</math> between entities a and b is <math>\log \frac{p(a,b)^2}{p(a)\,p(b)}</math> [1]. The domain is the health topic that the document belongs to (e.g., cancer, diabetes, etc.). It is mostly derived from document meta-data, e.g., the keywords field of PubMed articles or the names of sub-forums in health communities.

Optionally, we further expand the query with the semantic types/categories of entities obtained from DeepLife [3]. The selected categories not only encode typing information derived from UMLS, but also reflect relational facts harvested from a large text collection. For example, for Ibuprofen we retrieve the categories anti-inflammatory agent (type) and treatment of fever (fact), among others.

Answer Scoring: LongLife uses a linear combination of TF-IDF-style scores. We define a query <math>Q = (T, E, C)</math>, where T is the set of keywords in the user’s question, E is the set of entities extracted from the case description and C is the set of semantic categories for E. For a document <math>D = (D_t, D_e, D_c)</math>, the score is

<math>\mathrm{score}(D,Q) = \lambda_T \sum_{t \in T} \mathrm{idf}(t)\,\frac{\mathrm{tf}(t,D_t)}{\sqrt{D_T}} + \lambda_E \sum_{e \in E} \mathrm{PMI}^2(d,e)\,\mathrm{idf}(e)\,\frac{\mathrm{tf}(e,D_e)}{\sqrt{D_E}} + \lambda_C \sum_{c \in C} \mathrm{idf}(c)\,\frac{\mathrm{tf}(c,D_c)}{\sqrt{D_C}}</math>

where d is the domain and <math>\sqrt{D_{\{T,E,C\}}}</math> are normalization factors. We tuned <math>\lambda_T = 1.0</math>, <math>\lambda_E = 0.6</math>, <math>\lambda_C = 0.1</math> via grid search with relevance labels from crowdsourcing.
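As an illustrative sketch only (not the system's actual implementation), the scoring scheme described above can be written in a few lines of Python. The sketch assumes the reading of the formula in which each query item contributes idf · tf / √(field length), takes the normalization factors to be the token counts of the respective document fields, and expects idf values and per-entity weights to be precomputed. The function names, dictionary keys and toy identifiers are hypothetical.

```python
import math
from collections import Counter

# Combination weights reported in the text (tuned via grid search).
LAMBDA_T, LAMBDA_E, LAMBDA_C = 1.0, 0.6, 0.1


def pmi2(p_ab: float, p_a: float, p_b: float) -> float:
    """Squared PMI of a and b: log(p(a,b)^2 / (p(a) * p(b)))."""
    return math.log(p_ab ** 2 / (p_a * p_b))


def field_score(query_items, field_tokens, idf, weights=None):
    """Sum of weight * idf * tf / sqrt(field length) over the query items.

    Assumption: the normalization factor is the token count of the field.
    Items without an idf entry (or without a weight, if weights are given)
    contribute nothing.
    """
    if not field_tokens:
        return 0.0
    tf = Counter(field_tokens)
    norm = math.sqrt(len(field_tokens))
    total = 0.0
    for item in query_items:
        w = 1.0 if weights is None else weights.get(item, 0.0)
        total += w * idf.get(item, 0.0) * tf[item] / norm
    return total


def score(doc, query, idf, entity_weights):
    """Linear combination over the text, entity and category fields."""
    return (
        LAMBDA_T * field_score(query["T"], doc["text"], idf)
        + LAMBDA_E * field_score(query["E"], doc["entities"], idf, entity_weights)
        + LAMBDA_C * field_score(query["C"], doc["categories"], idf)
    )


# Toy data; all identifiers and numbers below are made up for illustration.
doc = {
    "text": ["pancreatic", "cyst", "pain", "cyst"],
    "entities": ["ent:pancreatic_cyst"],
    "categories": ["cat:digestive_disorder"],
}
query = {
    "T": ["cyst", "pain"],
    "E": ["ent:pancreatic_cyst"],
    "C": ["cat:digestive_disorder"],
}
idf = {
    "cyst": 2.0,
    "pain": 1.0,
    "ent:pancreatic_cyst": 3.0,
    "cat:digestive_disorder": 1.0,
}
entity_weights = {"ent:pancreatic_cyst": 0.5}  # stands in for a PMI^2-based weight

print(score(doc, query, idf, entity_weights))
```

Note that the raw value returned by `pmi2` is never positive, because p(a,b) cannot exceed either p(a) or p(b). The text does not say how these values are turned into ranking weights, so the sketch simply takes the per-entity weights as given.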
===3 Demo Scenarios===
LongLife helps both lay users and professionals discover relevant documents for their specific queries within the entire corpus or a sub-corpus of their choice (e.g., scientific articles only or forum posts only). Figure 1 shows a screenshot of the input functionality of our system. We illustrate the benefits of LongLife with the following two use-case scenarios.

Lay User Scenario: Consider the patient with the case in Figure 1, searching health forums for other users with similar experiences. All she has to do is pose the question and provide the description. LongLife automatically converts these inputs into a well-crafted query by inferring entities and categories and expanding the query. The top result for this example search is shown in Figure 2.

Fig. 2: Health Forum Top Result

Professional Scenario: Doctors and researchers are interested in clinical trials and publications. LongLife provides an advanced search box for such experts, where users can specify entities and categories of interest via convenient auto-completion. Another important feature is the ability to specify vital parameters and lab values of a patient, such as height, weight, age, heart rate and blood pressure. These measurements are automatically mapped to medical entities such as obesity, hypo/hypertension, tachycardia, etc., and harnessed for result ranking. The top results from scientific articles for the search example of Figure 1 are shown in Figure 3.

Fig. 3: Top Two Results from Scientific Articles

===References===
1. F. Role et al.: Handling the impact of low frequency events on co-occurrence based measures of word similarity. KDIR 2011
2. A. Siu et al.: Fast entity recognition in biomedical text. Workshop on Data Mining for Healthcare at KDD 2013
3. P. Ernst et al.: DeepLife: An Entity-aware Search, Analytics and Exploration Platform for Health and Life Sciences. ACL 2016
4. A. Hasnain et al.: BioFed: Federated Query Processing over Life Sciences Linked Open Data. Journal of Biomedical Semantics 2017
5. G. Zuccon, B. Koopman: Tutorial on Health Search: From Consumers to Clinicians. WSDM 2019. https://github.com/ielab/health-search-tutorial/tree/wsdm2019