=Paper= {{Paper |id=Vol-1692/paperD |storemode=property |title=Querying standardized EHRs by a Search Ontology XML extension (SOX) |pdfUrl=https://ceur-ws.org/Vol-1692/paperD.pdf |volume=Vol-1692 |authors=Stefan Kropf,Alexandr Uciteli,Peter Krücken,Kerstin Denecke,Heinrich Herre |dblpUrl=https://dblp.org/rec/conf/odls/KropfUKDH16 }} ==Querying standardized EHRs by a Search Ontology XML extension (SOX)== https://ceur-ws.org/Vol-1692/paperD.pdf
Querying standardized EHRs by a Search Ontology XML
extension (SOX)
Stefan Kropf1*, Alexandr Uciteli1*, Peter Krücken2, Kerstin Denecke3, Heinrich Herre1
1 Institute for Medical Informatics, Statistics and Epidemiology (IMISE), Leipzig University
2 Institute of Pathology, Leipzig University
3 Institute for Medical Informatics, Bern University of Applied Sciences
* Contributed equally




ABSTRACT

Motivation: The previously developed Search Ontology (SO) allows domain experts to formally specify domain concepts, search terms associated to a domain, and rules describing domain concepts. So far, Lucene search queries can be generated from information contained in the SO and can be used for querying literature databases or PubMed. This is still insufficient, since these queries do not follow the structure of XML documents and are therefore not well suited for querying them. In the medical domain, however, many information items are coded in XML. Thus, querying structured XML documents is crucial for retrieving similar cases or for identifying potential study participants. For example, information items of patients with a similar tumor classification documented in a certain section of the respective pathology report need to be retrieved. This requires a precise definition of queries. In this paper, we introduce a concept for the generation of such queries using a Search Ontology XML extension to enable semantic searches on structured data.

Results: For a gain of precision, the paragraph of a document in which a specific information item expressed in a query is expected to appear needs to be specified. The Search Ontology XML Extension (SOX) connects search terms to certain sections in XML documents. The extension consists of a class which represents the XML structure and a relation between search terms and this XML structure. This enables an automatic generation of XPath expressions, which makes an efficient and precise search of structured pathology reports in XML databases possible. The combination of standardized Electronic Health Records with an ontology based query method promises a gain of precision, a high degree of interoperability and long term durability of both XML documents and queries on XML documents.

* Contact: skropf@imise.uni-leipzig.de auciteli@imise.uni-leipzig.de

1 INTRODUCTION

Since untagged information in health information systems (HISs) is common, information access supported by automatic methods is difficult. It is still an open question how to accelerate the access to information captured in these systems or in Electronic Health Records (EHRs). On the one hand, content must be structured by automatic recognition processes. On the other hand, the structured data has to be queried in a structured way.

This paper focuses on the query side by introducing a new solution for semantically meaningful queries on structured XML documents, defined by the Search Ontology (SO) XML Extension. The SO (Uciteli et al., 2014) has been developed to support full text search on unstructured documents. It allows a user to formally specify domain concepts, search terms associated to the domain, and rules describing domain concepts. In this way, it simplifies the definition of search rules. The SO can be used for information retrieval in any domain by extending it by the corresponding domain ontology.

In this work, we introduce an extension of the SO that enables the definition of queries on structured XML documents. Assuming that we have structured and standardized XML documents, we can query certain parts of the XML document by XPath expressions. The development of such XPaths is time consuming for domain experts, but also for computer scientists. We suggest using ontologies to support domain experts in modelling XML queries.

Out of the ontology based query models, XPaths can be generated automatically, which in turn can be applied to document corpora on XML database systems for searching similar cases or for the identification of potential study participants. Even though the approach is inherently independent of the underlying XML structure, we will demonstrate it on an example of querying standardized Electronic Health Records (EHRs) in the pathology domain.

To address the problem of creating structured queries for retrieving documents, previous work considered the unification of different XML structures on the conceptual level, on the one hand by the introduction of new query languages, e.g. CXPath (Camillo et al., 2003) or XSEarch (Cohen et al., 2003), or on the other hand by introducing conceptual ontologies (Cruz et al., 2004; Erdmann et al., 1999). In contrast to these unification approaches, the SOX approach introduced in this paper is strongly bound to the used XML structure. Indeed, this strong binding to a structure is only meaningful when standardized XML based EHRs are used.

2 METHODS

2.1 Overview

Figure 1 gives an overview of the basic approach presented in this paper.

Fig. 1. (1) The domain expert models the queries by the usage of SOX in Protégé. (2) Using an extended version of the OntoQueryBuilder Plugin, Protégé generates XPath expressions out of the ontology. (3) The expert applies the generated XPath expressions to an XML database, which (4) returns the relevant documents.






A domain expert is in the middle of the query formulation and retrieval process. He uses Protégé, the ontology editor of Stanford University (Musen 2015), for modeling a query using the Search Ontology (section 2.2) and SOX (section 3.1) as shown on the left side in figure 1. By an adaptation of the OntoQueryBuilder Plugin it will be possible to generate XPath expressions. Additionally, the domain expert interacts with the XML database as shown on the right hand side of figure 1. After recognizing section boundaries, unstructured documents are stored on an XML database (section 2.3). Using the XPaths (sections 2.4, 3.2), the domain expert can retrieve relevant XML documents.

2.2 Search Ontology

The formulation of structured queries can be very time consuming, especially in safety-relevant domains like post market surveillance. A concept can be described in different ways, on the one hand by synonyms, on the other hand by complex phrases, which in turn consist of multiple terms. Because of that we distinguish Simple_Terms from Composite_Terms.

Fig. 2. Overview Search Ontology

Composite_Terms are made up of Simple_Terms, related by the Object Property has_part, and are constrained by the additional Data Property max_distance, which defines the word distance between two Simple_Terms, where max_distance=0 represents that one word immediately follows another word. Writing variations, synonyms or abbreviations of the Simple_Terms can be handled by the assignment of multiple labels to the concrete individual of a Simple_Term.

For instance, the complication of a medical device (e.g. occluder) is a reusable Search_Concept, which can be described_by several Search_Terms.

Such descriptions include, among other things, adjective-noun phrases like incomplete closure. Instead of the adjective, other terms with the same semantic meaning could be used; the noun could be replaced by any term which represents the meaning of closure. Out of this definition, a query can be generated, which is in this example the disjunction of all combinations of adjectives and nouns (cf. second disjunction in Listing 1).

Listing 1. Lucene Query for an occlusion device complication; the expression was generated by the plugin OntoQueryBuilder.

    ("occlusion device" OR occluder) AND (("insufficient sealing"~2 OR "insufficient closure"~2 OR "incomplete sealing"~2 OR "incomplete closure"~2 OR "inadequate sealing"~2 OR "inadequate closure"~2))

The latter example indicates that the formulation of a query can become a complex task; the cross-product of only 10 adjectives with 10 nouns results in 100 adjective-noun combinations. Hence, we have to manage Concepts and Terms by an appropriate ontology, especially if we want to reuse concepts or if we want to generate cross-products of certain term combinations.

The SO is used in practice in the OntoVigilance project (OntoVigilance Homepage 2016), where semantic searches have to be managed within post market surveillance queries of medical devices. In brief, domain experts can manage their domain search ontology (DSO). By the usage of the developed plugin OntoQueryBuilder a Lucene query can be generated.

2.3 Standardized XML-based EHRs

In this paper, we will concentrate on the special domain of pathology, where a lot of semi-structured information occurs in terms of pathology reports. We consider this information semi-structured, because the pathologists structure their information by headers and keywords, but this structure is usually not technically implemented. In fact, pathology reports are based on certain section patterns and section-introducing keywords, like material, macroscopy or microscopy. We verified manually that, in documents originating from the Institute of Pathology of Leipzig, section-introducing keywords like Material, Makroskopie or Mikroskopie were consistently used for section tagging. Therefore, the reports can be structured in sections by section boundary detection, which is not the main focus of this article. Consequently, legacy data can be transformed into a structured format. Suitable standards for long term persistence are EN 14822 a.k.a. HL7 RIM (EN 14822, 2006) and EN 13606 a.k.a. openEHR (EN 13606, 2012). Both standards are representable in XML, EN 14822 by the usage of CDA (Dolin et al., 2001) and EN 13606 by the usage of openEHR modeling tools (Kropf et al., 2015), which results in a standardized XML schema based on the openEHR XML schemas (Beale 2015).

In this work, we will use pathology reports, mapped to standardized EHRs by the usage of the openEHR archetypes openEHR-EHR-OBSERVATION.lab_test-histopathology.v1 for structuring pathology data and openEHR-EHR-CLUSTER.tnm_staging_7th.v1 for structuring the TNM classification (Sobin et al., 2011) data. Both archetypes are available at the Clinical Knowledge Manager (CKM) of openEHR [http://www.openehr.org/ckm/]. Consider the following snippet of an XML based pathology EHR (cf. Listing 2) where we demonstrate the challenges of querying the content. They are mainly due to the linguistic variability of natural language.
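Before turning to that snippet, the keyword-based section boundary detection mentioned in this section can be illustrated with a short Python sketch. It is not the procedure used for the Leipzig reports. The colon after each keyword, the element names report and section, and the sample report text are assumptions made only for this illustration; the target format in this work is the openEHR-based XML, not this simplified layout.

```python
import re
import xml.etree.ElementTree as ET

# Section-introducing keywords named in the text. That they start a line and
# are followed by a colon is an assumption about the report layout.
SECTION_KEYWORDS = ("Material", "Makroskopie", "Mikroskopie")
HEADER = re.compile(r"^(%s):\s*" % "|".join(SECTION_KEYWORDS), re.MULTILINE)


def split_sections(report_text):
    """Return (keyword, text) pairs, one per detected section."""
    matches = list(HEADER.finditer(report_text))
    sections = []
    for i, match in enumerate(matches):
        # A section ends where the next section header starts.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report_text)
        sections.append((match.group(1), report_text[match.end():end].strip()))
    return sections


def to_xml(report_text):
    """Wrap the detected sections in a hypothetical, non-standard XML layout."""
    root = ET.Element("report")
    for keyword, text in split_sections(report_text):
        section = ET.SubElement(root, "section", name=keyword)
        section.text = text
    return root


# Made-up sample report; only the Makroskopie sentence follows Listing 2.
report = (
    "Material: Hautexzidat\n"
    "Makroskopie: Ein ff. einfach fadenmarkiertes keilförmiges Hautexzidat.\n"
    "Mikroskopie: [...]\n"
)

for keyword, text in split_sections(report):
    print(keyword, "->", text)
print(ET.tostring(to_xml(report), encoding="unicode"))
```

Once section boundaries are explicit in the markup, a query can be restricted to one section, which is the precondition for the XPath queries of section 2.4.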








Listing 2. Simplified XML based pathology EHR snippet, containing a macroscopy and a TNM classification part. The snippet was cut to the necessary elements, which we want to address in the query in this paper, marked by a grey background: the macroscopy section (part of openEHR-EHR-OBSERVATION.lab_test-histopathology.v1) and the primary tumor classification (part of openEHR-EHR-CLUSTER.tnm_staging_7th.v1). The doubling of the value tag is a result of the EN 13606 reference model; in practice the two value tags have different namespace declarations.

    Makroskopisch
        Makroskopisch
        Ein ff. einfach fadenmarkiertes keilförmiges Hautexzidat von 0,8 x 0,6 x 0,3 cm.[…]
    Tumour - TNM Cancer staging 7th Edition
        Primärtumor T
        pT2
    […]
    […]

The TNM structured classification string in the XML snippet is “pT2”. The sentence which introduces the overall macroscopy description contains a noun Hautexzidat (HE) (en: excised skin material), marked by a bold font. Due to linguistic variability, this noun can vary, i.e. synonyms or abbreviations such as H.E. are used in practice. In front of the noun there is an underlined adjective keilförmig (en: cuneiform) for specifying the shape. Again, semantic and linguistic variants of the term exist (e.g. rundlich (en: roundish)). Furthermore, the order of the adjectives in the phrase could change: the order “Ein ff. rundliches fadenmarkiertes H.E.” was found and is also valid.

2.4 XPath Queries

When EHRs are stored in structured XML, another query language is more suitable than classical free text retrieval methods such as Lucene (McCandless et al., 2010) or SOLR (Grainger et al., 2014). XPath expressions follow the structure of the EHRs and are a W3C standardized method for addressing parts in XML documents (XML Path Language (XPath), 2015). An example XPath Query is shown in Listing 3 for querying T2 and phrases of HE from EHR documents similar to those in Listing 2.

Listing 3. Required XPath expressions for a search of EHRs which contain T2 as primary tumor classification in the first part and defined phrases of HE in the second part.

    /Pathology/Tumour__-__TNM_Cancer_staging_7th_Edition/
    Primary_tumour__openBrkt_T_closeBrkt_/
    value[contains(value,'T2')]

    /Pathology/Macroscopic_findings/Overall_macroscopic_description/
    value[matches(value,'keilförmig(\w)* ([\w]*\s){0,2}Hautexzidat')]
    or
    /Pathology/Macroscopic_findings/Overall_macroscopic_description/
    value[matches(value,'rundlich(\w)* ([\w]*\s){0,2}H.E.')]
    or
    /Pathology/Macroscopic_findings/Overall_macroscopic_description/
    value[matches(value,'keilförmig(\w)* ([\w]*\s){0,2}H.E.')]
    or
    /Pathology/Macroscopic_findings/Overall_macroscopic_description/
    value[matches(value,'rundlich(\w)* ([\w]*\s){0,2}Hautexzidat')]

The first part of Listing 3 matches documents where the tumor class is T2. In the second part, each disjunct represents one adjective-noun phrase. It consists of an adjective and a regular expression that reflects possible declension variations, followed by the noun phrase representing Hautexzidat (HE). The expression ([\w]*\s){0,2} implies that a maximum of two words are allowed between the adjective and the noun for the pattern to match.

3 RESULTS

3.1 Extension of the Search Ontology

The output of the SO are Lucene queries, but they do not follow the structure of XML documents; thus, they are not applicable to XML documents. However, the SO already delivers a reusable framework which only has to be extended for enabling structured queries in XML. By extending the SO with the SOX, queries which can be executed on XML documents are automatically producible out of the ontology. For this purpose, two elements were added to the SO: on the top level of the ontology the class XML_Structure, and the Object Property in.

Fig. 3. The Search Ontology XML Extension introduces the top level class XML_Structure and the relation in (dashed arrow).
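To make the role of these two additions concrete, the following Python sketch derives XPath expressions of the kind shown in Listing 3 from a small hand-written term model. It is only an illustration under stated assumptions and not the OntoQueryBuilder plugin: the dictionary XML_STRUCTURE, the function names and the max_words parameter are hypothetical, while the tag names, term labels and patterns are taken from Listing 3. The final check uses Python's re module, whose regular expression dialect is not identical to the one used by the XPath matches function.

```python
import re
from itertools import product

# Hypothetical stand-in for the XML_Structure subclass chain: each list holds
# the class labels (tag names) from the document root down to the location.
XML_STRUCTURE = {
    "Primary_Tumor": [
        "Pathology",
        "Tumour__-__TNM_Cancer_staging_7th_Edition",
        "Primary_tumour__openBrkt_T_closeBrkt_",
    ],
    "Overall_macroscopic_description": [
        "Pathology",
        "Macroscopic_findings",
        "Overall_macroscopic_description",
    ],
}


def location_path(section):
    """Build the XPath location path for the section a term is 'in'."""
    return "/" + "/".join(XML_STRUCTURE[section]) + "/"


def simple_term_xpath(labels, section):
    """One contains() test per label of a simple term, joined by 'or'."""
    tests = " or ".join("contains(value,'%s')" % label for label in labels)
    return location_path(section) + "value[%s]" % tests


def phrase_pattern(adjective, noun, max_words=2):
    """Adjective stem plus declension suffix, up to max_words words, noun."""
    return r"%s(\w)* ([\w]*\s){0,%d}%s" % (adjective, max_words, noun)


def composite_term_xpaths(adjectives, nouns, section):
    """Cross product of adjective labels and noun labels, one XPath each."""
    path = location_path(section)
    return [
        path + "value[matches(value,'%s')]" % phrase_pattern(adj, noun)
        for adj, noun in product(adjectives, nouns)
    ]


# Labels of the individuals T2_Term, HE_Form and HE_Term as used in the paper.
t2_xpath = simple_term_xpath(["T2"], "Primary_Tumor")
he_xpaths = composite_term_xpaths(
    ["keilförmig", "rundlich"],
    ["Hautexzidat", "H.E."],
    "Overall_macroscopic_description",
)

print(t2_xpath)
print("\nor\n".join(he_xpaths))

# Plausibility check of the generated patterns against the example phrases.
sentence = "Ein ff. einfach fadenmarkiertes keilförmiges Hautexzidat von 0,8 x 0,6 x 0,3 cm."
assert re.search(phrase_pattern("keilförmig", "Hautexzidat"), sentence)
assert re.search(phrase_pattern("rundlich", "H.E."), "Ein ff. rundliches fadenmarkiertes H.E.")
```

The sketch emits the four disjuncts in cross-product order, which differs from the order in Listing 3; since the parts are combined by or, the order does not affect the result.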
Figure 3 shows that Search_Concepts are described_by Search_Terms, which belong to certain parts in the XML_Structure; in more detail, Search_Terms are linked to XML parts by the in relation. The subclass structure of XML_Structure represents the XML document structure. Namespaces and tag names of the XML document are defined by XML_Structure class labels. Figure 5 illustrates the SO for the described use case, where the documents follow the structure of Listing 2 and XPaths of Listing 3 have to be generated as output.

The modelling of the SO (illustrated in fig. 5) has to be done manually by the domain expert. For querying HEs with different kinds of shapes the Search_Concept HE_Shapes was defined, described_by HE_Phrase, which consists of two Simple_Terms (HE_Form and HE_Term). In the example of this paper, individuals of HE_Form can be adjectives like rundlich or keilförmig; HE_Term has only one individual, and the different writing variations (Hautexzidat, H.E.) can be handled by multiple label assignments. Figure 5 also illustrates the usage of the in relation for the specification of the position inside the XML document; for instance, the term T2_Term is expected in the associated section Primary_Tumor. In a similar way, HE_Shapes are bound to the XML_Structure. The following figure 4 shows the class definition of HE_Shapes in Protégé.

Fig. 4. Class definition of HE_Shapes

In summary, out of the DSO XML extension, it is possible to generate the required XPath expressions automatically.

3.2 Automatic XPath generation

For each Search_Concept one XPath can be generated automatically. This can be done by a planned adaptation of the Lucene export plugin, the OntoQueryBuilder, which is already developed as part of the OntoVigilance (OntoVigilance Homepage 2016) project. The first part of Listing 3 can be generated out of the Search_Concept T2, in whose description the term T2_Term is bound to the appropriate search location by the in relation. The second part of Listing 3 is producible out of the Search_Concept HE_Shapes, which is described_by the Composite_Term HE_Phrase and is expected in the Overall_macroscopic_description. This Composite_Term yields a disjunction expression of all combinations of the labels of the HE_Form individuals with the labels of HE_Term individuals, which is in essence a kind of cross product.

The generated XPaths can be used for structured queries and for the integration in other XML techniques (XSLT or XQuery).

Fig. 5. Search Ontology for querying the concept primary tumor T2 and the concept HE_Shapes (manually defined)

4 DISCUSSION

With an example on querying structured EHRs, we introduced an extension of the Search Ontology to support querying structured XML documents. In practice, the SOX approach can simplify the management of a big pool of XPath expressions in one overarching DSO.

4.1 Standardized queries on standardized EHRs

Indeed, SPARQL queries on OWL based patient data would be more powerful than XPath expressions on XML, but a comprehensive and long-term persistent storage of pathology data within semantic web technologies is only partially solved and still an open research question. Therefore, as long as there is no standardized domain ontology available, queries on standardized XML will be more stable and long term durable. To put it in brief, the first requirement is a layer of standardized EHRs and tools which work at this layer, like the introduced SOX. After that step a more powerful ontology layer is desirable.

We used the EN 13606 standardized XML in this work. However, another option would be EN 14822 or even any other proprietary XML format. When the community comes to an agreement on which EHR standard will be used in German Health Information Systems in future, not only would the EHRs be interoperable; the usage of a standardized query language implies that queries could be interoperable too. Provided that standardized EHRs are used, it is imaginable that queries on such EHRs can be interoperable and therefore used in different hospitals. When openEHR is used, the corresponding SOX or the resulting queries could be stored in a repository and linked to the associated archetypes. Consequently, the query can be reused like the archetype itself; this would save development time and improve quality.

4.2 Recognition methods vs. querying methods

Research in Natural Language Processing (NLP) delivers methods (Hahn et al., 2002; Friedman et al., 1999) for the recognition of clinical information in medical documents. Nevertheless, it is unclear when a reliable NLP system will automatically recognize and annotate free text pathology reports or any other free textual clinical documents in daily practice. It is important to think about practical solutions on the recognition side, but also on the query side. The SOX delivers a practical solution on the query side by the connection of Search Terms to parts in XML documents.

Listing 3 already respects declension forms of adjectives which have a stable word stem. But for the development of a minimal SO, NLP methods like stemming are necessary too. Because of that, we have to think about the integration of such methods into ontological considerations about querying structured information in the near future.

4.3 Future ontological work

XML elements are more than symbolic structures; they have to be considered in detail and, last but not least, they should be bound to a top level ontology like the General Formal Ontology (GFO) (Herre 2010). In the SOX, the XML document structure was realized by is_a relations, because we wanted to model queries in tree structures in standard Protégé. Of course has_part relations would be semantically better. For this reason, we plan to develop a proper plugin for the modeling of has_part relations in tree structures in Protégé. In addition, an automatic conversion of XML documents into a SOX XML_Structure tree is desirable; this would accelerate the query development in Protégé. X2OWL can generate an OWL ontology from an XML data source (Ghawi et al., 2009) and is a good starting point.

5 CONCLUSION

When EHRs are persisted in standardized XML, it is possible to query them in a structured way. The introduced Search Ontology XML extension connects search terms to certain parts in XML documents and enables an ontology based definition of semantic searches. Out of this, XPath expressions can be generated for querying XML database systems. Our solution supports the reuse oriented specification of complex and powerful XPath expressions without deep syntactic knowledge about XPath. The approach is open for additional extensions; parts of the ontology can be reused and adapted easily for other use cases.

ACKNOWLEDGEMENT

Thanks to Claire Chalopin, Wolf Müller, Katrin Schierle, Lars Voitel, Christian Wittekind for their support, to the reviewers of ODLS for their constructive feedback, especially Dagmar Waltemath, and the organizers, among which we mention Frank Loebe and Daniel Schober, for arranging ODLS. This work was conducted using the Protégé resource, which is supported by grant GM10331601 from the National Institute of General Medical Sciences of the United States National Institutes of Health.

REFERENCES

Beale, T. (2015) openEHR reference-models XSDs release 1.0.2 https://github.com/openEHR/reference-models/tree/master/models/openEHR/Release-1.0.2/XSD [cited 2016-08-30]
Camillo, S. D., Carlos A. H. and dos Santos Mello, R. (2003) Querying heterogeneous XML sources through a conceptual schema. International Conference on Conceptual Modeling, pp. 186-199.
Cohen, S. et al. (2003) XSEarch: A semantic search engine for XML. Proceedings of the 29th international conference on Very large data bases. Volume 29, pp. 45-56, VLDB Endowment.
Cruz, I. R., Xiao, H. and Hsu, F. (2004) An ontology-based framework for XML semantic integration. Database Engineering and Applications Symposium. IDEAS'04. Proceedings. International, pp. 217-226, IEEE.
Dolin, R. H. et al. (2001) The HL7 Clinical Document Architecture. Journal of the American Medical Informatics Association 8(6), pp. 552-569.
EN 13606. (2012) Health informatics - Electronic health record communication [Norm].
EN 14822. (2006) Health informatics - General purpose information components [Norm].
Erdmann, M. and Studer, R. (1999) Ontologies as conceptual models for XML documents. Proceedings of the 12th International Workshop on Knowledge Acquisition, Modelling and Management (KAW’99), Banff, Canada.
Friedman, C. and Hripcsak, G. (1999) Natural language processing and its future in medicine. Academic Medicine 74(8), pp. 890-5.
Ghawi, R. and Cullot N. (2009) Building Ontologies from XML Data Sources. DEXA Workshops, pp. 480-4.
Grainger, T., Potter, T. and Seeley Y. (2014) Solr in action. Manning.
Hahn, U., Romacker M. and Schulz S. (2002) MEDSYNDIKATE - a natural language system for the extraction of medical information from findings reports. International journal of medical informatics 67(1), pp. 63-74.
Herre, H. (2010) General Formal Ontology (GFO): A foundational ontology for conceptual modelling. Theory and applications of ontology: computer applications. Springer Netherlands, pp. 297-345.
Kropf, S., Chalopin, C. and Denecke, K. (2015) Template and Model Driven Development of Standardized Electronic Health Records. Studies in health technology and informatics 216, pp. 30-4.
McCandless, M., Hatcher, E. and Gospodnetic, O. (2010) Lucene in Action: Covers Apache Lucene 3.0. Manning Publications Co.
Musen, M. A. (2015) The Protégé project: A look back and a look forward. AI Matters. Association of Computing Machinery Specific Interest Group in Artificial Intelligence, 1(4), pp. 4-12.
OntoVigilance Homepage (2016) http://www.ontovigilance.org/ [cited 2016-06-16]
Sobin, L. H., Gospodarowicz, M. K. and Wittekind C., eds. (2011) TNM classification of malignant tumours. John Wiley & Sons.
Uciteli, A. et al. (2014) Search Ontology, a new approach towards Semantic Search. GI-Jahrestagung, pp. 667-672.
XML Path Language (XPath) (2015) Version 1.0. W3C Recommendation. https://www.w3.org/TR/xpath/ [cited 2016-06-09]


