<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LongLife: a Platform for Personalized Search for Health and Life Sciences</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Patrick Ernst</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erisa Terolli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gerhard Weikum</string-name>
          <email>weikumg@mpi-inf.mpg.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Max Planck Institute for Informatics</institution>
          ,
          <addr-line>Campus E1 4, 66123, Saarbrücken</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This work demonstrates LongLife: a system for semantically enhanced, personalized search of information about health issues and life-science topics. The system supports user-friendly access to entities, categories and free-text phrases in a corpus of 21 million documents, comprising scientific publications, clinical trials, encyclopedic articles, biomedical news and health forum posts. Search results can be personalized for two kinds of users: patients can provide descriptions of their health history, symptoms and therapies in layperson terms (as in health discussion forums), and doctors or researchers can target specific entities and categories (for disorders, symptoms, risk factors, drugs etc.), for example when searching on behalf of a patient.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Motivation: Although individual health and precision medicine are of great
importance to society, search engines hardly support the information needs of patients
or doctors. PubMed search over biomedical publications supports filters on fields
and MeSH tags, but this is still far from what semantic search can do in other
domains such as business or travel, where text is enriched with entity markup
and background knowledge graphs. The Semantic Web community has worked
on creating Linked-Data resources for genes, diseases and drugs (e.g., Bio2RDF,
DrugBank, DisGeNET), including work on SPARQL querying (e.g., [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]), but there is
no linkage with the textual content that doctors and patients provide across
the Internet. Moreover, search over online health communities (e.g.,
ehealthforum.com/health/health_forums.html), where patients and doctors discuss personal
experiences with disorders, symptoms and therapies, is very basic. IR research
for health has largely focused on clinical data (see, e.g., [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and references there).
      </p>
      <p>As an example, consider a patient, or a doctor searching on behalf of a patient,
querying about "pancreatic cysts and abdominal pain". Search engines over clinical
articles or health forums merely return all kinds of pancreas-related posts.</p>
      <p>Contribution: LongLife provides access to entities, categories and free-text
phrases in a corpus of 21 million documents, comprising scientific publications,
clinical trials, encyclopedic articles, biomedical news and health forum posts.</p>
      <p>Copyright © 2019 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        The semantic layer of entities and other annotations is automatically generated
by named entity recognition [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and linking entities to the DeepLife biomedical
knowledge base [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which encompasses a variety of LOD datasets (Bio2RDF,
DrugBank etc.) and the UMLS taxonomy. In contrast to most prior work on
biomedical entities, our method goes beyond major types like genes, proteins,
diseases and drugs by capturing a much wider range of entities such as
symptoms/syndromes, therapies and nutrition- or lifestyle-related risk factors.
      </p>
      <p>On top of this semantically enriched corpus, LongLife offers personalized
search by incorporating individual user information on a per-query basis. Lay
users like patients typically pose keyword queries, but can add free-text
self-descriptions of their case histories (e.g., like posts in health forums). LongLife
automatically detects health-related entities in such texts, infers relevant
biomedical categories and expands the user query into a semantic-search request. This
way, it can return answers that are of specific relevance to the user, e.g.,
experiences of similar patients. As a second use case, when doctors search on behalf
of patients, entities and categories may be added manually and further patient
properties can be specified (e.g., blood pressure and other vital signs). Again,
LongLife automatically synthesizes the final query from these inputs and
computes personalized rankings of answers.</p>
    </sec>
    <sec id="sec-2">
      <title>System Overview</title>
      <p>
        Data and Indexing: LongLife has currently indexed 21,036,802 documents
crawled from a diverse corpus that covers the full spectrum of biomedical
information on the web: 19,884,225 scientific publications, 111,139 encyclopedic
articles, 76,554 news articles, 164,756 clinical trials and 1,048,428 health forum
posts. LongLife stores the data in ElasticSearch v.1.7.6. We index the
following parts: title, full text, topical domain (e.g., cancer, diabetes etc.) and
all biomedical entities, using the UMLS thesaurus as the entity repository. For entity
recognition, we use the method of [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which is based on min-hash sketches for matching
candidate phrases to entity names. We disambiguate between multiple entity
candidates by considering only the most specific entity according to the UMLS
type system and picking the highest-ranked entity. Every detected entity is linked
to the LOD Cloud by leveraging a mapping between UMLS and Bio2RDF.
      </p>
      <p>
        Query Processing: LongLife has a form-based search interface with
auto-completion suggestions for each field. Input can take the form of keywords or
multi-word phrases, entities and/or categories, where the latter two are identified
by having the user choose from auto-completion suggestions. Similar to health
forum posts, users are asked to pose a question composed of a short post title
and a post body containing a description of the individual case. This input is
then processed as follows:
      </p>
      <list list-type="order">
        <list-item><p>The user question is cast into a keyword query.</p></list-item>
        <list-item><p>The query is expanded with informative entities and their semantic categories
identified in the full text of the case description (see below).</p></list-item>
        <list-item><p>The expanded query is issued to ElasticSearch.</p></list-item>
        <list-item><p>The result ranking is computed by LongLife's customized scoring function
that considers the personalized query expansion (see below).</p></list-item>
      </list>
      <p>Fig. 1: LongLife Search Interface</p>
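      <p>The first two processing steps can be illustrated with a minimal Python sketch. It is not LongLife's implementation: the toy lexicon, the <monospace>ExpandedQuery</monospace> class and the <monospace>expand_query</monospace> function are hypothetical names introduced for illustration, and naive substring matching stands in for the actual entity recognizer and knowledge base lookup.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass, field

# Hypothetical toy lexicon: surface phrase -> (entity name, semantic categories).
# The entries are illustrative only; they are not taken from UMLS or DeepLife.
TOY_LEXICON = {
    "abdominal pain": ("Abdominal Pain", ["symptom"]),
    "pancreatic cyst": ("Pancreatic Cyst", ["disease"]),
    "ibuprofen": ("Ibuprofen", ["anti-inflammatory agent"]),
}


@dataclass
class ExpandedQuery:
    keywords: list
    entities: list = field(default_factory=list)
    categories: list = field(default_factory=list)


def expand_query(title, body, lexicon=TOY_LEXICON):
    """Build a keyword query from the title and expand it from the body."""
    keywords = title.lower().split()
    text = body.lower()
    entities, categories = [], []
    for phrase, (entity, cats) in lexicon.items():
        # Naive substring match, an assumed stand-in for real entity recognition.
        if phrase in text:
            entities.append(entity)
            for cat in cats:
                if cat not in categories:
                    categories.append(cat)
    return ExpandedQuery(keywords, entities, categories)


query = expand_query(
    "Pancreatic cysts",
    "I have had abdominal pain for weeks and take ibuprofen daily.",
)
print(query)
```
]]></preformat>
      <p>In this sketch the title contributes only keywords, while entities and categories come from the case description, mirroring the split between the short post title and the post body.</p>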
      <p>
        Personalized Query Expansion: We expand the initial keyword query with
biomedical entities extracted from the medical case description. Since UMLS
covers a broad spectrum of entities, we constrain, by default, the entity set to
symptoms, diseases, medical findings and pharmacological substances. Each
entity is assigned a weight computed as the squared pointwise mutual information
(PMI<sup>2</sup>) with the document's domain. PMI<sup>2</sup> between entities a and b is
<inline-formula><tex-math><![CDATA[\mathrm{PMI}^2(a, b) = \log \frac{p(a, b)^2}{p(a)\,p(b)}]]></tex-math></inline-formula> [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
The domain is the health topic that the document belongs to (e.g., cancer,
diabetes, etc.). It is mostly derived from document metadata, e.g., the keywords field
of PubMed articles or the names of sub-forums in health communities.
      </p>
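      <p>As a worked illustration of the squared PMI weight, the following Python sketch estimates the probabilities from document counts. Estimating p(a), p(b) and p(a, b) as relative document frequencies is an assumption made here for illustration, and the function name <monospace>pmi2</monospace> and its parameters are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def pmi2(count_ab, count_a, count_b, total):
    """Squared PMI: log(p(a, b)^2 / (p(a) * p(b))), estimated from raw counts.

    count_ab: documents containing both a and b (e.g., entity and domain)
    count_a, count_b: documents containing a, respectively b
    total: number of documents in the collection
    """
    if count_ab == 0:
        return float("-inf")  # a and b never co-occur
    p_ab = count_ab / total
    p_a = count_a / total
    p_b = count_b / total
    return math.log(p_ab ** 2 / (p_a * p_b))


# Hypothetical counts: 1000 documents, a in 100, b in 200, both in 50.
print(pmi2(50, 100, 200, 1000))
```
]]></preformat>
      <p>Under this definition the value is at most 0, which is reached when a and b always occur together.</p>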
      <p>
        Optionally, we further expand the query with the semantic types/categories
of entities obtained from DeepLife [
        <xref ref-type="bibr" rid="ref3">3</xref>
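        ]. The selected categories not only encode
typing information derived from UMLS, but also reflect relational facts harvested
from a large text collection. For example, for Ibuprofen we retrieve the categories
anti-inflammatory agent (type) and treatment of fever (fact), among others.
      </p>
      <p>
        Answer Scoring: LongLife uses a linear combination of TF-IDF-style scores.
We define a query Q = (T, E, C), where T is the set of the user's question keywords,
E is the set of entities extracted from the case description and C is the set of
semantic categories for E. For a document D = (D<sub>t</sub>, D<sub>e</sub>, D<sub>c</sub>),
        <disp-formula><tex-math><![CDATA[\mathrm{score}(D, Q) = \alpha_T \sum_{t \in T} \mathrm{idf}(t)\,\frac{\mathrm{tf}(t, D_t)}{p_{D_T}} + \alpha_E \sum_{e \in E} \mathrm{PMI}^2(d, e)\,\mathrm{idf}(e)\,\frac{\mathrm{tf}(e, D_e)}{p_{D_E}} + \alpha_C \sum_{c \in C} \mathrm{idf}(c)\,\frac{\mathrm{tf}(c, D_c)}{p_{D_C}}]]></tex-math></disp-formula>
where d is the domain and <inline-formula><tex-math><![CDATA[p_{D_T}, p_{D_E}, p_{D_C}]]></tex-math></inline-formula> are normalization factors. We tuned
<inline-formula><tex-math><![CDATA[\alpha_T = 1.0,\ \alpha_E = 0.6,\ \alpha_C = 0.1]]></tex-math></inline-formula> via grid search with relevance labels from crowdsourcing.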
      </p>
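      <p>The linear combination of keyword, entity and category scores can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the system's scoring code: the smoothed idf variant, the dictionary-based document layout, the shared document-frequency table and all function names are hypothetical, and the per-entity weights are passed in as precomputed values.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math

# Field weights for keywords (T), entities (E) and categories (C) as reported.
WEIGHTS = {"T": 1.0, "E": 0.6, "C": 0.1}


def idf(term, doc_freq, num_docs):
    """Smoothed inverse document frequency (one common variant, assumed)."""
    return math.log(1.0 + num_docs / (1.0 + doc_freq.get(term, 0)))


def field_score(terms, field_tf, doc_freq, num_docs, norm, weight=None):
    """Sum of idf * tf / norm over terms, optionally scaled per term."""
    total = 0.0
    for term in terms:
        w = 1.0 if weight is None else weight.get(term, 0.0)
        total += w * idf(term, doc_freq, num_docs) * field_tf.get(term, 0) / norm
    return total


def score(doc, query, doc_freq, num_docs, entity_weight):
    """Weighted sum of the keyword, entity and category field scores."""
    s_t = field_score(query["T"], doc["t"], doc_freq, num_docs, doc["norm_t"])
    s_e = field_score(query["E"], doc["e"], doc_freq, num_docs, doc["norm_e"],
                      entity_weight)
    s_c = field_score(query["C"], doc["c"], doc_freq, num_docs, doc["norm_c"])
    return WEIGHTS["T"] * s_t + WEIGHTS["E"] * s_e + WEIGHTS["C"] * s_c


# Hypothetical toy data; term frequencies, norms and weights are made up.
doc = {"t": {"pain": 2}, "e": {"Abdominal Pain": 1}, "c": {"symptom": 1},
       "norm_t": 10.0, "norm_e": 2.0, "norm_c": 1.0}
query = {"T": ["pain"], "E": ["Abdominal Pain"], "C": ["symptom"]}
doc_freq = {"pain": 99, "Abdominal Pain": 9, "symptom": 999}
entity_weight = {"Abdominal Pain": 0.5}  # placeholder for a domain weight

print(score(doc, query, doc_freq, 1000, entity_weight))
```
]]></preformat>
      <p>Keeping the three field scores separate makes it straightforward to re-tune the weights, for example with a grid search over labelled query and document pairs.</p>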
    </sec>
    <sec id="sec-3">
      <title>Demo Scenarios</title>
      <p>LongLife helps both lay users and professionals discover relevant
documents for their specific queries within the entire corpus or a sub-corpus of
their choice (e.g., scientific articles only or forum posts only). Figure 1 shows a
screenshot of the input functionality of our system. We illustrate the benefits of
LongLife with the following two use-case scenarios.</p>
      <p>Lay User Scenario: Consider the patient with the case in Figure 1 searching
health forums for other users with similar experiences. All she has to do is pose
the question and provide the description. LongLife automatically converts these
inputs into a well-crafted query by inferring entities and categories and expanding
the query. The top results for this example search are shown in Figure 2.</p>
      <p>Professional Scenario: Doctors and researchers are interested in clinical trials
and publications. LongLife provides an advanced search box for such experts,
where users can specify entities and categories of interest via convenient
auto-completion. Another important feature is the ability to specify vital parameters and lab
values of a patient, such as height, weight, age, heart rate and blood pressure.
These measurements are automatically mapped into medical entities such as
obesity, hypo/hypertension, tachycardia etc., and harnessed for result ranking.
Top results of scientific articles for the search example of Figure 1 are shown in
Figure 3.</p>
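      <p>The mapping from vital parameters to medical entities might look roughly like the following Python sketch. The function name <monospace>vitals_to_entities</monospace> is hypothetical, and every cut-off value is an illustrative assumption rather than the rule set used by LongLife.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def vitals_to_entities(height_m, weight_kg, heart_rate_bpm, systolic, diastolic):
    """Map raw measurements to coarse medical entities.

    All thresholds are illustrative assumptions, not the system's rules.
    """
    entities = []
    bmi = weight_kg / (height_m ** 2)  # body mass index in kg/m^2
    if bmi >= 30.0:
        entities.append("obesity")
    if systolic >= 140 or diastolic >= 90:
        entities.append("hypertension")
    elif systolic < 90 or diastolic < 60:
        entities.append("hypotension")
    if heart_rate_bpm > 100:
        entities.append("tachycardia")
    return entities


# Hypothetical patient: 1.70 m, 95 kg, heart rate 110 bpm, blood pressure 150/95.
print(vitals_to_entities(1.70, 95.0, 110, 150, 95))
```
]]></preformat>
      <p>The derived entities can then be added to the entity set of the query in the same way as manually selected ones.</p>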
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>F.</given-names>
            <surname>Role</surname>
          </string-name>
          et al.:
          <article-title>Handling the impact of low frequency events on co-occurrence based measures of word similarity</article-title>
          .
          <source>KDIR</source>
          <year>2011</year>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A.</given-names>
            <surname>Siu</surname>
          </string-name>
          et al.:
          <article-title>Fast entity recognition in biomedical text</article-title>
          .
          <source>Workshop on Data Mining for Healthcare at KDD</source>
          <year>2013</year>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>P.</given-names>
            <surname>Ernst</surname>
          </string-name>
          et al.:
          <article-title>DeepLife: An Entity-aware Search, Analytics and Exploration Platform for Health and Life Sciences</article-title>
          .
          <source>ACL</source>
          <year>2016</year>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>A.</given-names>
            <surname>Hasnain</surname>
          </string-name>
          et al.:
          <article-title>BioFed: Federated Query Processing over Life Sciences Linked Open Data</article-title>
          .
          <source>Journal of Biomedical Semantics</source>
          <year>2017</year>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>G.</given-names>
            <surname>Zuccon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Koopman</surname>
          </string-name>
          :
          <article-title>Tutorial on Health Search: From Consumers to Clinicians</article-title>
          .
          <source>WSDM</source>
          <year>2019</year>
          . https://github.com/ielab/health-search-tutorial/tree/wsdm2019
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>