=Paper=
{{Paper
|id=Vol-410/paper-2
|storemode=property
|title=Exploiting Fast Classification of SNOMED CT for Query and Integration of Health Data
|pdfUrl=https://ceur-ws.org/Vol-410/Paper02.pdf
|volume=Vol-410
|dblpUrl=https://dblp.org/rec/conf/krmed/Lawley08
}}
==Exploiting Fast Classification of SNOMED CT for Query and Integration of Health Data==
Representing and sharing knowledge using SNOMED
Proceedings of the 3rd international conference on Knowledge Representation in Medicine (KR-MED 2008)
R. Cornet, K.A. Spackman (Eds)
Exploiting Fast Classification of SNOMED CT for Query and
Integration of Health Data
Michael J. Lawley
Queensland University of Technology, Faculty of Information Technology, Brisbane, (Queensland), Australia
E-Health Research Centre, CSIRO ICT Centre, (Queensland), Australia
Abstract common in the health domain where terms often
By constructing local extensions to SNOMED we involve an implicit context of usage (e.g., lobe in
aim to enrich existing medical and related data the context of lung cancer) or implicit references to
stores, simplify the expression of complex queries, anatomical structures (e.g., colorectal cancer) or
and establish a foundation for semantic integra- related classes of diseases, injuries, or procedures.
tion of data from multiple sources. Accurately and consistently encoding these rela-
Specifically, a local extension can be constructed tionships in queries relies on the person formulat-
from the controlled vocabulary(ies) used in the ing the queries to understand them, thus creating
medical data. In combination with SNOMED, many opportunities for errors, omissions, and in-
this local extension makes explicit the implicit se- consistencies to occur. When multiple people are
mantics of the terms in the controlled vocabulary. constructing queries these risks are further exac-
By using SNOMED as a base ontology we can erbated.
exploit the existing knowledge encoded in it and By constructing the vocabularies so as to explicitly
simplify the task of reifying the implicit seman- represent the relationships between terms, queries
tics of the controlled vocabulary. Queries can now can directly and consistently exploit the relation-
be formulated using the relationships encoded in ships. Using an ad-hoc explicit representation of
the extended SNOMED rather than embedding these relationships helps, but may introduce new
them ad-hoc into the query itself. Additionally, problems in terms of consistency of usage and how
SNOMED can then act as a common point of in- the relationships are interpreted (see, for example,
tegration, providing a shared set of concepts for the Radiological Electronic Atlas of Malformation
querying across multiple data sets. Syndromes and Skeletal Dysplasias (REAMS) [2]).
Key to practical construction of a local extension Instead, using a well-understood formal mecha-
to SNOMED is appropriate tool support including nism for representing the relationships, such as
the ability to compute subsumption relationships Description Logic, can avoid these problems.
very quickly. Our implementation of the polyno- However we still have two problems to solve:
mial algorithm for EL+ in Java is able to classify 1. how do we deal with all the existing data sets
SNOMED in under 1 minute. that do not do this; and
INTRODUCTION 2. how do we mitigate the, potentially quite high1 ,
cost of explicitly representing all the relation-
Experience with integrating medical and related
ships?
data [1] shows that the use of controlled vocabu-
laries successfully modulates the amount of noise We can deal with both these problems by ex-
in the data. However, when querying the col- tending (as needed) an existing standard ontol-
lected data, any semantic relationships between ogy, such as the Systematized Nomenclature of
the terms that are relevant to the query (for ex- Medicine (SNOMED) [3], that already embodies
ample, specialisation/generalisation or part-of re- 1
lationships) need to be explicitly encoded in the Getting the modelling right, from scratch, requires
not only an excellent understanding of the concepts in-
query and/or accounted for in the interpretation volved as well as their relationships, but also an under-
of the query results. standing of how best to represent them in a particular
These kinds of implicit relationships are especially Description Logic formalism.
8
Representing and sharing knowledge using SNOMED
Proceedings of the 3rd international conference on Knowledge Representation in Medicine (KR-MED 2008)
R. Cornet, K.A. Spackman (Eds)
many of the relationships we need. However, one important problem, and one identified in our work
of the main difficulties with this approach is that with skeletal dysplasias [2], is how to cope with er-
building an extension to SNOMED is not dissim- rors in the shared terminology.
ilar to maintaining and developing SNOMED it- Wade and Rosenbloom [9] report on the man-
self. That is, the sheer size of SNOMED has ual construction of what is almost a local exten-
meant that, until recently, very few tools could sion to SNOMED (they conceived the task as a
compute all of its subsumption relationships, and semi-formal mapping). In this work 2002 terms
even those that could would reportedly take sev- were mapped to combination of single and post-
eral hours. coordinated concepts of which about 75% were
Fortunately, recent work by Baader et al. [4, 5] equivalencies (20% of these were to single con-
on the tractable family of description logics EL cepts) and only 1% (26) were, in their words, “not
has shown that polynomial time classification al- mappable”. It is unclear why these terms were cat-
gorithms exist and are practical. Moreover despite egorized as such since they include, for example,
their relatively low expressive power, the EL fam- presyncope which could reasonably be related to
ily of description logics is suitable for represent- 3006004|disturbance of consciousness|, but it may
ing such real-world ontologies as SNOMED and be that the context of use of the terms was unavail-
offer additional expressiveness suitable for prop- able in order to properly discern their meaning.
erly representing partOf relationships and suffi- However, their work does demonstrate that the
cient conditions.2 Their implementation of this goal of producing a local extension to SNOMED
algorithm in Lisp is able to classify SNOMED in is feasible.
1,782 seconds [5] (approx. 30 minutes) which sug-
gests an optimised implementation in a lower-level PROBLEM DESCRIPTION
language may be fast enough for near real-time The problem of embedding domain semantics such
feedback in an editing tool. as specialisation/generalisation or part-of relation-
Thus, our goal is to provide tool support for defin- ships into queries is illustrated in the following.
ing a local extension to an existing standard formal For example, a query to find all performed proce-
ontology; a mapping from an existing set of terms dures involving a colectomy might enumerate all
that characterise an informal ontology to concepts such procedures:
in the formal ontology. In doing so we effectively
SELECT S.*
realise latent semantics in the existing medical
FROM Surgery S
data via the standard ontology. This should facil-
WHERE S.procedure = ’32003-00’
itate simpler and more robust queries and in turn
OR S.procedure = ’32003-01’
aid data integration, a special-case application of
OR S.procedure = ’32012-00’
querying where related medical data sets use se-
...
mantically overlapping, but distinct term sets.
which has the potential to accidently omit certain
RELATED WORK codes and will require updating if the terminology
There is a great deal of published work on using is updated with additional forms of colectomy.
ontologies for data integration (see Wache et al. [6] Alternatively, some kind of heuristic query could
for an overview), but it is mostly focussed on their be used:
use at the meta-data level; ontologies are used SELECT S.*
to describe, reason about and integrate database FROM Surgery S, ProcedureCodes C
schemas. While related to our goals, we are ad- WHERE S.procedure = C.code
dressing the more specific problem of semantic AND C.text LIKE ’%colectomy%’;
data integration or semantic translation. Stucken- which has the potential to miss a term that doesn’t
schmidt et al. [7] discuss an approach to this prob- follow the expected naming pattern (e.g., epiploec-
lem in the context of their Ontology Interchange tomy) or provide false matches where a compound
Language (OIL) [8]. In particular they raise the or composite name does not reflect a valid special-
question of whether it is feasible to find or cre- isation.
ate a sufficient shared terminology. In our domain If, however, the terms were encoded as concepts
of medical data we believe that SNOMED repre- in an ontology, the query is simple3 :
sents such a shared terminology. A possibly more 3
We envisage that the complete set of subsumption
2
See also http://webont.org/owl/1.1/ relationships would be stored in a database table to
tractable.html#2 support fast subsumption-based queries using only two
9
Representing and sharing knowledge using SNOMED
Proceedings of the 3rd international conference on Knowledge Representation in Medicine (KR-MED 2008)
R. Cornet, K.A. Spackman (Eds)
SELECT S.* cancer, shown in Figure 1. We can map these, one-
FROM Surgery S, Ontology O to-one, to a set of concepts for a local ontology.
WHERE O.ancestor = 23968004
AND S.procedure = O.descendant;
Note also that SNOMED, unlike classification Procedure Code Meaning
schemes such as ICD-9 and ICD-10, support a (ICD-10-AM)
multi-parented generalisation hierarchy. 32000-00 Sig colectomy with stoma
formation
CONSTRUCTING LOCAL 32003-00 Sig colectomy with anasto-
EXTENSIONS mosis
In order to construct an ontology from an exist- 32003-01 Right hemicolectomy
ing terminology (or collection of terminologies) we 32005-00 Subtotal colectomy
take a multi-step approach: 32005-01 Ext right hemicolectomy
32006-00 Left hemicolectomy
1. Map each term from the controlled vocabulary 32012-00 Total colectomy
to a concept, factoring out any synonyms, to
32024-00 High anterior resection
produce P.
32025-00 Low anterior resection ex-
This is often a simple one-to-one mapping, but traperitoneal
it may be necessary to extend the mapping to 32026-00 Low anterior resection
include disambiguating data values when the coloanal anastomosis
same term is used to mean different things in 32028-00 Ultra low anterior resection
different contexts. 32030-00 Hartmann’s procedure
32039-00 Abdomino-perineal excision
2. Make any simple implicit relationships explicit,
adding them to P. 32051-00 Total proctocolectomy with
ileo-anal anastomosis
For example, generalisation, partOf, or hasLoca-
tion relationships. It may be necessary to intro- Figure 1: A Term-Set of Colorectal Cancer Proce-
duce new concepts to act as the generalisation dures
of two or more sibling concepts.
3. Specify relationships between these (local) con-
cepts and those in the chosen standard ontology The next step is to make any simple relationship
Q, adding them to P. explicit. In our case there are none that can be ex-
pressed using just the concepts we have currently
To be able to answer queries involving our new identified.
ontology we first need to classify Q ∪ P to identify
all the subsumption relationships it entails. Figure 2 describes the identified relationships be-
Note that, we should be careful that Q ∪ P repre- tween these terms and selected SNOMED con-
sents a conservative extension [10] of Q. That is, cepts as per step 3. Note that several concepts (for
Q ∪ P produces the same consequences over the example, 32028-00|ultra low anterior resection|),
set of concepts in Q as Q does by itself. We also have no exact equivalent in SNOMED, and that
need to ensure various integrity constraints (such one, 32051|total proctocolectomy with ileo-anal
as disjointness) are preserved in Q ∪ P. Thus we anastomosis| implies a composite of concepts.
would like to be able to interactively edit P while
Figure 3 shows a visualisation of the results of
exploiting the consequences of Q ∪ P in live feed-
classifying SNOMED augmented with the ontol-
back through the mapping tool. These kinds of
ogy from Figure 2. As can be seen, unifying
checks can be performed by classification of Q ∪ P
generalisation concepts such as 84604002|sigmoid
but this may not be viable if Q ∪ P is large, as is
colectomy| have been identified, and thus provide
the case when Q is SNOMED.
a strong foundation for constructing queries that
Colorectal Cancer Example span the various procedures. Additionally, since
SNOMED includes detailed anatomical concepts,
In this section we consider a sample set of ICD-10-
queries can now be composed in terms of anatom-
AM [11] terms for procedures relating to colorectal
ical features even though they did not exist in the
joins. original terminology.
10
Representing and sharing knowledge using SNOMED
Proceedings of the 3rd international conference on Knowledge Representation in Medicine (KR-MED 2008)
R. Cornet, K.A. Spackman (Eds)
Procedure Relation SNOMED To support this kind of problem with reasonable
32000-00 ≡ 315327002 generality and decent query speed, we need to
32003-00 ≡ 315326006 generate a new column containing codes that are
32003-01 ≡ 235326000 mapped to the set of compound concepts that
32005-00 ≡ 43075005 correspond to the contextualised meaning of each
32005-01 ≡ 174071004 database row. Hence, as shown in Figure 5, the ta-
32006-00 ≡ 82619000 ble from Figure 4 would be extended with a Code
32012-00 ≡ 26390003 foreign-key column, and an additional table con-
32024-00 ≡ 400988008 taining the SNOMED expressions of the form4 :
32025-00 314592008 ∃ associatedProcedure.P
32026-00 314592008 ∃ laterality.L
32028-00 314592008 ∃ procedureContext.S
32030-00 ≡ 16564004 which gives us another ontology extension that we
32039-00 ≡ 265414003 can add to SNOMED.
32051-00 174059005 70172002 Finally, in order to be able to pose a subsumption-
based complex query involving composite concepts
Figure 2: Identified Relationships with SNOMED and have it evaluated at database join speeds, we
Concepts can employ the same strategy: extend the ontol-
ogy with a new fully-defined concept correspond-
ing to our query expression, re-classify, and per-
COMPLEX QUERIES AND form a join-based query using the new concept.
CONTEXT The need to construct compound expressions that
So far we have only considered simple query explicitly represent the context associated with a
scenarios where a single database column record in a database occurs any time the data
represents the concept we wish to query needs to be queried outside its original context.
(e.g., Surgery.procedure) and there already ex- This may happen in as trivial a case as when one
ists a concept that characterises the bound of the table in a database is joined with another, but
query (e.g., 2396804). the more general scenario occurs when integrating
data from multiple data sources.
Consider instead a table, as shown in Figure 4,
that stores both scheduled and performed proce-
dures while using another column to distinguish RESULTS
them, and which encodes laterality, if any, of the Classifying SNOMED
procedure in yet another column. Now imagine
we wish to query for all patients who have had an The practicality of creating local extensions of
amputation including the left hand. SNOMED is dependent on sufficient tool support
and, as mentioned previously, a cornerstone of this
Patient Date Status is fast classification. Indeed we believe that near
... ... ... real-time feedback in an editing environment, be
it an IDE for programming or a 3D architectural
Procedure Laterality modelling tool, can have a transformational effect
... ... on the authoring and editing process.
To this end, we have implemented snorocket, using
Figure 4: Table storing records with contextual a slightly altered form of the algorithm in [5] writ-
information split across columns ten in Java. We use several optimised Map and
Set data-structures tailored for ontologies with
roughly the same number of concepts and roles as
Patient Date ... Laterality Code SNOMED. This implementation is able to classify
... ... ...
4
Note that considerable experience with SNOMED
Code Equivalent SNOMED Expression and all its documentation may be required to construct
... ... suitable valid post-coordinated expressions like those
above. Tool support for this is clearly an important
issue and recent work in the IHTSDO Concept Model
Figure 5: Augmented table for representing con- SIG on producing a Machine Readable Concept Model
textualised concepts will be valuable for this.
11
Representing and sharing knowledge using SNOMED
Proceedings of the 3rd international conference on Knowledge Representation in Medicine (KR-MED 2008)
R. Cornet, K.A. Spackman (Eds)
Figure 3: Visualisation of part of an extended SNOMED ontology
SNOMED in 54 seconds on a modern 2.4GHz In- Currently this work is in a preliminary state and
tel Core 2 Duo running Windows XP and Sun’s the correspondence with the variant described
Java 1.6.0 03. in [12] is unknown. However the performance
For a fairer comparison with CEL, which only of this incremental algorithm is very promising.
runs under Linux, we ran both snorocket and With P consisting of the 14 new concepts as de-
CEL on an older four-CPU Xeon 3.6GHz ma- fined as in Figure 2, incremental classification
chine running RedHat Linux 2.6.9 and Sun’s Java takes around 0.9s using our un-optimised imple-
1.6.0 04. The results, for several of the ontologies mentation.
available from http://lat.inf.tu-dresden.de/
~meng/toyont.html, are in Table 1. DISCUSSION
Clearly, being able to classify SNOMED in close Ideally, as a term set is developed, it would be ex-
to a minute is a substantial improvement over plicitly constructed as an ontology and, to avoid
roughly 23 minutes and brings us much closer to re-invention and promote interoperability, could
the near real-time feedback we are seeking. be developed as an extension of an existing stan-
dard ontology such as SNOMED. These exten-
Incremental Classification sion ontologies could then be shared and evolved
In our mapping scenario we observe that within their specialist community while still being
SNOMED (Q) is unchanging while the local ex- useful and usable in more general communities.
tension (P) is modified. If we can classify Q once One such example is an ontology for skeletal dys-
and record the result C(Q) then, due to the mono- plasias extracted from REAMS [13].
tonicity of the description logic, the classification It is thus useful to be able to represent these on-
of Q ∪ P, C(Q ∪ P), is a superset of C(Q). The tologies in a standard format such as OWL so
goal is then to derive C(Q ∪ P) given C(Q) (and, that they can be shared or manipulated using ex-
of course, Q and P) which should be much faster isting toolsets. Currently we use the OWL 1.1
than deriving C(Q ∪ P) from scratch. proposal [14] rather than OWL 1.0 since it sup-
Suntisrivaraporn [12] calls this Duo-Ontology ports the expression of the role axioms (to de-
Classification and presents a variation of the al- scribe role transitivity and right-identity). The
gorithm in [5] to do just this. We have indepen- particular subset we use is characterised by the
dently derived our own variant of this algorithm description logic EL+⊥ . OWL 1.1 is supported
along similar lines; the queue-processing core is by, for example, the latest development-release of
essentially unchanged but the initialisation of the Protégé (4.0 alpha).
queues is different to account for the work that has Unfortunately, OWL is not practical for represent-
already been done. ing large ontologies like SNOMED where an OWL
12
Representing and sharing knowledge using SNOMED
Proceedings of the 3rd international conference on Knowledge Representation in Medicine (KR-MED 2008)
R. Cornet, K.A. Spackman (Eds)
SNOMED FULL-GALEN NOT-GALEN NCI
CEL 1391.9 368.9 5.4 1.8
snorocket 72.8 15.1 0.4 0.4
Table 1: Comparison of classification time for snorocket and CEL running on the same hardware.
XML representation is approximately 240MB [15], Address for Correspondence
about eight times the size of the equivalent KRSS Michael J. Lawley, Faculty of Information Technology,
representation. Moreover, due to the complexities University of Queensland, 126 Margaret Street Bris-
bane Qld 4000, Australia
inherent in parsing XML, it is much slower to load m.lawley@qut.edu.au
and parse than a simpler format such as KRSS.
One work-around for this, and something that References
would greatly benefit the e-health community, [1] D. Hansen, C. Daly, K. Harrop, M. O’Dwyer,
C. Pang, and J. Ryan-Brown. HDI: Research
would be for the International Health Terminology Software To Commercial Product. ASWEC 2005
Standards Development Organisation, the newly Industry Experience Papers, 2005.
formed governing body of SNOMED, to formally [2] I. Jakobsen, M.J. Lawley, A. Zankl, and
publish URIs for the concepts in SNOMED. This D. Hansen. Ontologies for Skeletal Dysplasias.
would allow tool vendors to “bake in” SNOMED MedInfo 2007 Workshop: MedSemWeb 2007,
2007.
to their tools, while still allowing other OWL-
based ontologies to reference SNOMED concepts [3] Snomed Clinical Terms. College of American
in a consistent and interoperable manner in order Pathologists, 2006. http://www.snomed.org.
to describe extensions to SNOMED. [4] F. Baader, C. Lutz, and B. Suntisrivaraporn.
CEL—a polynomial-time reasoner for life science
ontologies. In U. Furbach and N. Shankar,
CONCLUSION editors, Proceedings of the 3rd International
Joint Conference on Automated Reasoning (IJ-
Our preliminary work on producing local exten- CAR’06), volume 4130 of Lecture Notes in Artifi-
sions to SNOMED for semantic data integra- cial Intelligence, pages 287–291. Springer-Verlag,
2006.
tion is promising as is the performance of our
classifier. The current implementation is single- [5] F. Baader, C. Lutz, and B. Suntisrivara-
porn. Efficient Reasoning in EL+ . In
threaded and we anticipate a further speed in- Proceedings of the 2006 International Work-
crease from a multi-threaded implementation run- shop on Description Logics (DL2006), CEUR-
ning on a multi-core CPU. WS, 2006. http://lat.inf.tu-dresden.de/
research/papers/2006/BaaLutSun-DL-06.pdf.
We are currently integrating snorocket with a 3rd-
party SNOMED editing tool which requires spe- [6] H. Wache, T. Vögele, U. Visser, H. Stuck-
cific support for SNOMED’s use of role grouping enschmidt, G. Schuster, H. Neumann, and
S. Hübner. Ontology-based integration of
and the ability to distinguish between stated and information-a survey of existing approaches.
inferred relationships in the output of the clas- IJCAI-01 Workshop: Ontologies and Information
Sharing, 2001:108–117, 2001.
sifier, although this adds little overhead to the
classification time. In addition, we are prototyp- [7] H. Stuckenschmidt. Catalogue Integration: A
ing mapping tools specifically targeting the task of Case Study in Ontology-based Semantic Trans-
lation. Vrije Universiteit, Faculteit der Exacte
constructing local extensions of SNOMED from Wetenschappen, Divisie Wiskunde & Informat-
existing data. ica, 2000.
Finally, we are continuing work on our incremental [8] Dieter Fensel, Ian Horrocks, Frank van Harme-
form of the algorithm but have not yet tuned or len, Deborah L. McGuinness, and Peter F. Patel-
Schneider. OIL: An Ontology Infrastructure for
verified the implementation. Preliminary results the Semantic Web. IEEE Intelligent Systems,
indicate that this approach should be very useable 16(2), 2001.
when integrated with our mapping tool.
[9] G. Wade and S.T. Rosenbloom. Experiences
Mapping a Legacy Interface Terminology to
Acknowledgements SNOMED CT. In Proceedings of the SMCS
2006 - Semantic Mining Conference on SNOMED
CT, 2006. http://www.hiww.org/smcs2006/
The work described in this paper was carried out while proceedings/9WadeSMCS2006final.pdf.
on secondment to the CSIRO’s E-Health Research
Centre and the author would like to gratefully ac-
knowledge the support of David Hansen and the other [10] S. Ghilardi, C. Lutz, and F. Wolter. Did
members of the Health Data Integration team. I Damage my Ontology? A Case for
13
Representing and sharing knowledge using SNOMED
Proceedings of the 3rd international conference on Knowledge Representation in Medicine (KR-MED 2008)
R. Cornet, K.A. Spackman (Eds)
Conservative Extensions in Description Log-
ics. In Patrick Doherty, John Mylopoulos,
and Christopher Welty, editors, Proceedings of
the Tenth International Conference on Prin-
ciples of Knowledge Representation and Rea-
soning (KR’06), pages 187–197. AAAI Press,
2006. http://lat.inf.tu-dresden.de/~clu/
papers/archive/kr06a.pdf.
[11] International Statistical Classification of Diseases
and Related Health Problems, Tenth Revision,
Australian Modification (ICD-10-AM). National
Centre for Classification in Health, 5th edition,
2006. http://www3.fhs.usyd.edu.au/ncch/4.
2.1.1.htm.
[12] Boontawee Suntisrivaraporn. Module extraction
and incremental classification: A pragmatic ap-
proach for EL+ ontologies. In Sean Bechhofer,
Manfred Hauswirth, Joerg Hoffmann, and Mano-
lis Koubarakis, editors, Proceedings of the 5th
European Semantic Web Conference (ESWC’08),
Lecture Notes in Computer Science. Springer-
Verlag, 2008. To appear.
[13] C. Hall and J. Washbrook. Radiological Atlas of
Malformation Syndromes and Skeletal Dysplasias
(REAMS) [software]. Oxford University Press,
CD-ROM, 1999.
[14] B. Motik, P.F. Patel-Schneider, and I. Horrocks.
OWL 1.1 Web Ontology Language. World Wide
Web Consortium, W3C Member Submission,
2006. http://www.w3.org/Submission/2006/
SUBM-owl11-owl_specification-20061219/.
[15] K. Spackman. An Examination of OWL and
the Requirements of a Large Health Care Ter-
minology. In Proceedings of the OWL: Ex-
periences and Directions Third International
Workshop (OWLED 2007), CEURWS, June
2007. http://owled2007.iut-velizy.uvsq.fr/
PapersPDF/submission_26.pdf.
14