<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Integrated Information Extraction and Facetted Search Applications in Nephrology</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Danilo Schmidt</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hans-Jurgen Pro tlich</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Sonntag</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>German Research Center for Artificial Intelligence (DFKI), 66123 Saarbrücken</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Nephrology Department, Charité - Universitätsmedizin Berlin, 10117 Berlin</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This work focusses on our first steps towards integrating complex and partly unstructured medical data into a clinical research database. Our main application is an integrated facetted search tool in nephrology based on automatic information extraction results from textual documents. We describe the details of our technical architecture, which is based on open-source tools and can be replicated at other universities, research institutes, or hospitals.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>As medical records may cover a very long history of diseases (up to 30 years) and
include a vast number of diagnoses, symptoms, results, medications, and
laboratory values, we could benefit greatly from advanced search capabilities in clinical
information systems that allow for the retrieval of relevant data. However, medical
information systems often lack good search capabilities for data with
many unstructured text parts. Therefore, concepts for implementing knowledge-based
systems that build on textual information extraction in medicine are the focus
of many recent research initiatives [11].</p>
      <p>
        In this paper, we propose a three-stage process: (1) offline textual
information extraction from medical records in transplant medicine; (2) the generation
of facetted search capabilities on the results of the previous stage;
(3) the combination of the information extraction results with structured
laboratory values (ongoing work). Such a facetted search application uses techniques
for accessing information organised according to a facetted medical classification
system, allowing users to explore a collection of diagnoses, symptoms, results,
medications, and laboratory values by applying multiple filters. Thus, facetted
search allows clinicians to analyse complex data sets along a medical and
cognitive (reflective) chain of decision-making; in particular, facetted search
applications allow physicians to identify groups of patients with similar attributes.
This can provide valuable decision support when physicians are confronted with
situations where rare or complex diseases require a high degree of specialist
knowledge to filter and interpret (unstructured) medical data.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Background and Related Work</title>
      <p>The facetted search application is based on the nephrology database TBase®.
The web-based electronic patient record TBase® has been implemented in a
German kidney transplantation programme as a cooperation between the
Nephrology Department of Charité Universitätsmedizin Berlin and the AI Lab of the Institute of
Computer Sciences of the Humboldt University of Berlin [3,10]. Currently, TBase®
automatically integrates essential laboratory data (9.9 million values), clinical
pharmacology (237,000 prescribed medications), diagnostic findings from
radiology, pathology and virology (146,000 findings), and administrative data from
the SAP system of the Charité (70,000 diagnoses, 25,000 hospitalisations). Two
groups of use cases for facetted search in the medical field,
and in nephrology in particular, can be identified: first, use in clinical research, and second,
use in individual treatment as a decision support system in
clinical routine.</p>
      <p>Sacco [7] describes an approach to a guided interactive diagnostic system
based on dynamic taxonomies. Biron et al. [2] describe an information retrieval
system for computerised patient records. We extend these approaches with a special
multi-facet functionality. Our approach has the following main advantages:</p>
      <list list-type="bullet">
        <list-item><p>In our facetted search application, the user may remove any restriction he
or she has made in previous steps. This allows for much better
navigation through the search space, whereas related systems only allow
successive thinning [8, 9].</p></list-item>
        <list-item><p>The ranking of facet values by cardinality supports the survey of the remaining
subsets.</p></list-item>
        <list-item><p>We base automatically generated facets (e.g., disease/symptom relationships
and negations) on multi-term extraction and relation extraction by
employing state-of-the-art, high-precision textual information extraction modules.</p></list-item>
      </list>
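      <p>The first two advantages can be illustrated with a minimal sketch. It is not the system's implementation: the records, facet names, and helper functions below are hypothetical and only show how a restriction can be dropped regardless of the order in which it was added, and how facet values can be ranked by cardinality.</p>
      <preformat>
```python
from collections import Counter

# Hypothetical toy records: each document carries values for a few facets.
DOCS = [
    {"id": 1, "diagnosis": "hypertension", "medication": "tacrolimus"},
    {"id": 2, "diagnosis": "hypertension", "medication": "ciclosporin"},
    {"id": 3, "diagnosis": "diabetes", "medication": "tacrolimus"},
    {"id": 4, "diagnosis": "hypertension", "medication": "tacrolimus"},
]


def matches(doc, restrictions):
    """True if the document satisfies every active facet restriction."""
    return all(doc.get(facet) == value for facet, value in restrictions.items())


def search(docs, restrictions):
    """Return the documents left after applying all restrictions."""
    return [doc for doc in docs if matches(doc, restrictions)]


def facet_counts(docs, facet):
    """Rank the values of one facet by cardinality (most frequent first)."""
    counts = Counter(doc[facet] for doc in docs if facet in doc)
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def without(restrictions, facet):
    """Drop one restriction, whatever order it was added in."""
    return {f: v for f, v in restrictions.items() if f != facet}


restrictions = {"diagnosis": "hypertension", "medication": "tacrolimus"}
hits = search(DOCS, restrictions)
# Remove the earlier diagnosis restriction while keeping the later one.
relaxed = search(DOCS, without(restrictions, "diagnosis"))
```
      </preformat>
      <p>Because the restrictions are kept as an unordered set rather than as a path through a taxonomy, removing the diagnosis filter widens the result set again without resetting the medication filter.</p>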
      <p>Only recently, new text mining approaches to Web-based medical literature
have been proposed, for example for extracting adverse drug events from text [6] or
for automatic symptom extraction from texts on rare diseases [4]. However,
clinical information extraction from patient records is still underrepresented and
underdeveloped in clinical settings. Earlier work includes evaluating context
features for medical relation mining on medical abstracts; the identification of
semantic relations, such as substance A treats disease B, remains a non-trivial
task [13]. Recent work and comparative baseline experiments include temporal
information extraction [5]. A special trend is becoming apparent: the need for
ontology modelling of medical terminology and corresponding information extraction
results [12]. Because of enormous annotation costs, mainly unsupervised methods
are being used [1]. In industry and in the context of reliable clinical relevance,
however, very detailed (and labour-intensive) supervised rule-based approaches
represent the state of the art.<sup>3</sup>
        <fn><label>3</label><p>Here, we use our research project partner's solution (Averbis), which is based on
shallow text parsing, see <uri>https://averbis.com/en/research/</uri></p></fn></p>
    </sec>
    <sec id="sec-3">
      <title>System Architecture</title>
      <p>The annotated texts are transferred in XMI format<sup>4</sup> and stored in a local
database at DFKI (see figure 1). The important components are the Solr search
platform, the information extraction module, and the facetted search and
presentation user interface modules. Solr<sup>5</sup> is an open-source enterprise search platform
that is used in many large websites and applications and is one of the most popular
enterprise search engines.<sup>6</sup> Solr runs as a standalone full-text search server and uses
the Lucene Java search library at its core for full-text indexing and (facetted)
search. We chose Solr mainly for its facetted navigation,
a query language that supports both structured and textual
search, the option of automatic result clustering based on Carrot2<sup>7</sup>, its
scalability and extensibility through plug-ins, and its various APIs for input (text,
XML, JSON, etc.) and output (JSON, XML, PHP, Python, etc.).</p>
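      <p>To make the facetted navigation concrete, the following sketch assembles a Solr select request with standard faceting parameters (fq, facet, facet.field, facet.mincount, facet.sort). It only builds the URL and does not contact a server. The host, the core name "records", and the field names are assumptions made for illustration and are not taken from the system described here.</p>
      <preformat>
```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumed Solr base URL and core name; both are placeholders.
SOLR_SELECT = "http://localhost:8983/solr/records/select"


def build_facet_query(text, filters, facet_fields):
    """Build a Solr select URL with filter queries and field facets."""
    params = [
        ("q", text),
        ("wt", "json"),
        ("rows", "10"),
        ("facet", "true"),
        ("facet.mincount", "1"),   # hide facet values with no hits
        ("facet.sort", "count"),   # rank facet values by cardinality
    ]
    # Each restriction becomes its own fq parameter, so it can be dropped
    # independently of the others.
    for field, value in filters.items():
        params.append(("fq", '{}:"{}"'.format(field, value)))
    for field in facet_fields:
        params.append(("facet.field", field))
    return SOLR_SELECT + "?" + urlencode(params)


url = build_facet_query(
    "Transplantation",
    {"diagnosis": "Hypertonie"},      # hypothetical field and value
    ["diagnosis", "medication"],      # hypothetical facet fields
)
# Decode the query string again to inspect the parameters that were sent.
decoded = parse_qsl(urlsplit(url).query)
```
      </preformat>
      <p>Keeping every restriction in a separate fq parameter is one simple way to let a client remove a single filter and reissue the request.</p>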
      <p>The first step in our process is offline information extraction. The text data
for our system originate from the TBase® database of Charité Berlin, which
contains medical information about nephrology patients. In the first phase we only
used about 5000 unstructured free texts (no metadata or structured
patient data) of four types: "Befunde" (findings), "Untersuchungen" (visits),
"Entlassungsbriefe" (clinical reports), and "Verläufe" (progress reports). These free
texts are processed by the project partner Averbis, which anonymises the texts
and adds annotations based on several medical reference systems and dictionaries
(LOINC<sup>8</sup>, ICD-10<sup>9</sup>, ABDAMED<sup>10</sup>).
        <fn><label>4</label><p><uri>http://www.omg.org/spec/XMI/</uri></p></fn>
        <fn><label>5</label><p><uri>http://lucene.apache.org/solr/</uri></p></fn>
        <fn><label>6</label><p><uri>http://db-engines.com/en/ranking/search+engine</uri></p></fn>
        <fn><label>7</label><p><uri>http://project.carrot2.org/</uri></p></fn></p>
      <p>
        A software module extracts the relevant medical tags and features and stores
them in a database structure similar to the i2b2 star structure (in order to
simplify updates of the target system i2b2<sup>11</sup>). The user interface for searching
and exploring the annotated text database with facets is built as a web
service based on the Solr extension "solarium" for PHP systems.<sup>12</sup> This extension
provides an API for specifying all parameters necessary to create complex Solr
requests. The presentation/validation user interface of the system (see figure 2)
consists of two parts: the upper part shows the original text with highlighted
annotations, and the lower part contains tabs listing the different relevant annotations.
Clicking on an item in the lower part scrolls the text above to the corresponding
position. The original XMI content, which represents the complete original
annotation information, is shown in a pop-up window when a highlighted annotation in
the text is clicked. Accordingly, this page serves two different purposes: (1) the
presentation of the original text snippet found by the facetted search, and (2)
the validation of the annotations.
        <fn><label>8</label><p><uri>https://loinc.org/</uri></p></fn>
        <fn><label>9</label><p><uri>http://www.icd-code.de/</uri></p></fn>
        <fn><label>10</label><p><uri>http://www.wuv-gmbh.de/abdata-pharma-daten-service/datenangebot/abdamed/</uri></p></fn>
        <fn><label>11</label><p><uri>https://www.i2b2.org/about/intro.html</uri></p></fn>
        <fn><label>12</label><p><uri>http://www.solarium-project.org/</uri></p></fn>
      </p>
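      <p>The idea of storing extracted annotations in a star-like structure can be sketched as follows. This is only an illustration under stated assumptions: the table and column names are hypothetical and are neither the project's schema nor the actual i2b2 schema. A central fact table references small dimension tables, which keeps facet counts a matter of one join and one aggregation.</p>
      <preformat>
```python
import sqlite3

# In-memory database with a hypothetical star-like layout:
# one fact table for annotations and two small dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE concept_dimension (
        concept_cd TEXT PRIMARY KEY,   -- code from a reference system
        name       TEXT NOT NULL,
        source     TEXT NOT NULL       -- e.g. ICD-10 or LOINC
    );
    CREATE TABLE document_dimension (
        document_id INTEGER PRIMARY KEY,
        doc_type    TEXT NOT NULL      -- e.g. finding or clinical report
    );
    CREATE TABLE annotation_fact (
        document_id INTEGER REFERENCES document_dimension(document_id),
        concept_cd  TEXT REFERENCES concept_dimension(concept_cd),
        begin_pos   INTEGER,           -- character offsets in the text
        end_pos     INTEGER,
        negated     INTEGER DEFAULT 0  -- 1 if the mention is negated
    );
    """
)

conn.executemany(
    "INSERT INTO concept_dimension VALUES (?, ?, ?)",
    [("I10", "Essential hypertension", "ICD-10"),
     ("N18", "Chronic kidney disease", "ICD-10")],
)
conn.executemany(
    "INSERT INTO document_dimension VALUES (?, ?)",
    [(1, "finding"), (2, "clinical report")],
)
conn.executemany(
    "INSERT INTO annotation_fact VALUES (?, ?, ?, ?, ?)",
    [(1, "I10", 10, 21, 0), (2, "I10", 5, 16, 0), (2, "N18", 40, 62, 1)],
)

# Facet counts: number of documents per non-negated concept.
rows = conn.execute(
    """
    SELECT c.name, COUNT(DISTINCT f.document_id) AS n
    FROM annotation_fact f
    JOIN concept_dimension c ON c.concept_cd = f.concept_cd
    WHERE f.negated = 0
    GROUP BY c.concept_cd
    ORDER BY n DESC, c.name
    """
).fetchall()
```
      </preformat>
      <p>The stored character offsets are what a presentation view would need in order to highlight an annotation in the original text and scroll to it.</p>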
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Outlook</title>
      <p>
        We demonstrated that new facetted search applications for the use case of
transplant medicine in nephrology, based on open-source software tools and
exchangeable information extraction modules, are feasible and a very suitable
decision-support tool for the doctor: this type of knowledge-based system provides
physicians with a practicable tool for the analysis of medical data and decision
support for cohort selection. We developed a user interface for facetted search
which is based on the Solr engine. In the next project phase, we will extend
the capabilities of the facetted search application, mainly in the
following respects: (1) integration of existing structured information about patients
and treatments which includes numerical values, in particular laboratory values
and medications; (2) extension of the user interface with visual
search and presentation techniques like "foamtree"<sup>13</sup> to further facilitate the
user's exploration of the search space; (3) integration of facetted search into
special use cases moving towards individualised medicine [11].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
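      <p>This research is part of the project "clinical data intelligence" (KDI), which is
funded by the Federal Ministry for Economic Affairs and Energy (BMWi).</p>
      <fn-group>
        <fn><label>13</label><p><uri>https://carrotsearch.com/foamtree-overview</uri></p></fn>
      </fn-group>
      <ref-list>
        <ref id="ref5">
          <mixed-citation>5. Mkrtchyan, T., Sonntag, D.: <article-title>Deep parsing at the CLEF2014 IE task</article-title>. In: <source>Working Notes for CLEF 2014 Conference, Sheffield, UK, September 15-18, 2014</source>, pp. <fpage>138</fpage>–<lpage>146</lpage> (<year>2014</year>)</mixed-citation>
        </ref>
        <ref id="ref6">
          <mixed-citation>6. Odom, P., Bangera, V., Khot, T., Page, D., Natarajan, S.: <article-title>Extracting adverse drug events from text using human advice</article-title>. In: <source>Artificial Intelligence in Medicine - 15th Conference on Artificial Intelligence in Medicine, AIME 2015, Pavia, Italy, June 17-20, 2015. Proceedings</source>, pp. <fpage>195</fpage>–<lpage>204</lpage> (<year>2015</year>)</mixed-citation>
        </ref>
        <ref id="ref7">
          <mixed-citation>7. Sacco, G.: <article-title>Guided interactive diagnostic systems</article-title>. In: <source>Computer-Based Medical Systems</source>, pp. <fpage>117</fpage>–<lpage>122</lpage> (<year>2005</year>)</mixed-citation>
        </ref>
        <ref id="ref8">
          <mixed-citation>8. Sacco, G.: <article-title>Dynamic taxonomies and guided searches</article-title>. <source>Journal of the American Society for Information Science and Technology</source> <volume>57</volume>(<issue>6</issue>), <fpage>792</fpage>–<lpage>796</lpage> (<year>2006</year>)</mixed-citation>
        </ref>
        <ref id="ref9">
          <mixed-citation>9. Sacco, G.: <article-title>Dynamic taxonomies for intelligent information access</article-title>. In: Khosrow-Pour, M. (ed.) <source>Encyclopedia of Information Science and Technology</source>, pp. <fpage>3883</fpage>–<lpage>3892</lpage>. 3rd edn. (<year>2014</year>)</mixed-citation>
        </ref>
        <ref id="ref10">
          <mixed-citation>10. Schröter, K.: <article-title>Tbase2, a web-based electronic patient record</article-title>. <source>Fundamenta Informaticae</source> <volume>43</volume>(<issue>1-4</issue>), <fpage>343</fpage>–<lpage>353</lpage> (<year>2000</year>)</mixed-citation>
        </ref>
        <ref id="ref11">
          <mixed-citation>11. Sonntag, D., Tresp, V., Zillner, S., Cavallaro, A., Hammon, M., Reis, A., Fasching, A.P., Sedlmayr, M., Ganslandt, T., Prokosch, H.U., Budde, K., Schmidt, D., Hinrichs, C., Wittenberg, T., Daumke, P., Oppelt, G.P.: <article-title>The clinical data intelligence project</article-title>. <source>Informatik-Spektrum Journal</source>, pp. <fpage>1</fpage>–<lpage>11</lpage> (<year>2015</year>)</mixed-citation>
        </ref>
        <ref id="ref12">
          <mixed-citation>12. Sonntag, D., Wennerberg, P., Buitelaar, P., Zillner, S.: <article-title>Pillars of ontology treatment in the medical domain</article-title>. <source>J. Cases on Inf. Techn.</source> <volume>11</volume>(<issue>4</issue>), <fpage>47</fpage>–<lpage>73</lpage> (<year>2009</year>)</mixed-citation>
        </ref>
        <ref id="ref13">
          <mixed-citation>13. Vintar, S., Todorovski, L., Sonntag, D., Buitelaar, P.: <article-title>Evaluating context features for medical relation mining</article-title>. In: <source>Proceedings of the ECML/PKDD Workshop on Data Mining and Text Mining for Bioinformatics</source> (<year>2003</year>)</mixed-citation>
        </ref>
      </ref-list>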
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Alicante</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Unsupervised entity and relation extraction from clinical records in Italian</article-title>
          .
          <source>Computers in Biology and Medicine</source>
          <volume>72</volume>
          (
          <issue>1</issue>
          ),
          <fpage>263</fpage>
          –
          <lpage>275</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Biron</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Metzger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pezet</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sebban</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barthuet</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Durand</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>An Information Retrieval System for Computerized Patient Records in the Context of a Daily Hospital Practice: the Example of the Léon Bérard Cancer Center (France)</article-title>
          .
          <source>Applied Clinical Informatics</source>
          <volume>5</volume>
          (
          <issue>1</issue>
          ),
          <fpage>191</fpage>
          –
          <lpage>205</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Lindemann</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>A web-based patient record for hospitals - the design of tbase2</article-title>
          . In:
          <string-name>
            <surname>Bruch</surname>
            ,
            <given-names>H.P.</given-names>
          </string-name>
          (ed.)
          <source>New Aspects of High Technology in Medicine: Hannover (Germany)</source>
          , pp.
          <fpage>409</fpage>
          –
          <lpage>414</lpage>
          .
          <publisher-name>Monduzzi Editore, International Proceedings Division</publisher-name>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Métivier</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Serrano</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Charnois</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cuissart</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , Widlocher, A.:
          <article-title>Automatic symptom extraction from texts to enhance knowledge discovery on rare diseases</article-title>
          .
          <source>In: Arti cial Intelligence in Medicine - 15th Conference on Arti cial Intelligence in Medicine, AIME</source>
          <year>2015</year>
          , Pavia, Italy, June 17-20,
          <year>2015</year>
          . Proceedings. pp.
          <volume>249</volume>
          {
          <issue>254</issue>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>