Representing and sharing knowledge using SNOMED Proceedings of the 3rd international conference on Knowledge Representation in Medicine (KR-MED 2008) R. Cornet, K.A. Spackman (Eds) Exploiting Fast Classification of SNOMED CT for Query and Integration of Health Data Michael J. Lawley Queensland University of Technology, Faculty of Information Technology, Brisbane, (Queensland), Australia E-Health Research Centre, CSIRO ICT Centre, (Queensland), Australia Abstract common in the health domain where terms often By constructing local extensions to SNOMED we involve an implicit context of usage (e.g., lobe in aim to enrich existing medical and related data the context of lung cancer) or implicit references to stores, simplify the expression of complex queries, anatomical structures (e.g., colorectal cancer) or and establish a foundation for semantic integra- related classes of diseases, injuries, or procedures. tion of data from multiple sources. Accurately and consistently encoding these rela- Specifically, a local extension can be constructed tionships in queries relies on the person formulat- from the controlled vocabulary(ies) used in the ing the queries to understand them, thus creating medical data. In combination with SNOMED, many opportunities for errors, omissions, and in- this local extension makes explicit the implicit se- consistencies to occur. When multiple people are mantics of the terms in the controlled vocabulary. constructing queries these risks are further exac- By using SNOMED as a base ontology we can erbated. exploit the existing knowledge encoded in it and By constructing the vocabularies so as to explicitly simplify the task of reifying the implicit seman- represent the relationships between terms, queries tics of the controlled vocabulary. Queries can now can directly and consistently exploit the relation- be formulated using the relationships encoded in ships. Using an ad-hoc explicit representation of the extended SNOMED rather than embedding these relationships helps, but may introduce new them ad-hoc into the query itself. Additionally, problems in terms of consistency of usage and how SNOMED can then act as a common point of in- the relationships are interpreted (see, for example, tegration, providing a shared set of concepts for the Radiological Electronic Atlas of Malformation querying across multiple data sets. Syndromes and Skeletal Dysplasias (REAMS) [2]). Key to practical construction of a local extension Instead, using a well-understood formal mecha- to SNOMED is appropriate tool support including nism for representing the relationships, such as the ability to compute subsumption relationships Description Logic, can avoid these problems. very quickly. Our implementation of the polyno- However we still have two problems to solve: mial algorithm for EL+ in Java is able to classify 1. how do we deal with all the existing data sets SNOMED in under 1 minute. that do not do this; and INTRODUCTION 2. how do we mitigate the, potentially quite high1 , cost of explicitly representing all the relation- Experience with integrating medical and related ships? data [1] shows that the use of controlled vocabu- laries successfully modulates the amount of noise We can deal with both these problems by ex- in the data. However, when querying the col- tending (as needed) an existing standard ontol- lected data, any semantic relationships between ogy, such as the Systematized Nomenclature of the terms that are relevant to the query (for ex- Medicine (SNOMED) [3], that already embodies ample, specialisation/generalisation or part-of re- 1 lationships) need to be explicitly encoded in the Getting the modelling right, from scratch, requires not only an excellent understanding of the concepts in- query and/or accounted for in the interpretation volved as well as their relationships, but also an under- of the query results. standing of how best to represent them in a particular These kinds of implicit relationships are especially Description Logic formalism. 8 Representing and sharing knowledge using SNOMED Proceedings of the 3rd international conference on Knowledge Representation in Medicine (KR-MED 2008) R. Cornet, K.A. Spackman (Eds) many of the relationships we need. However, one important problem, and one identified in our work of the main difficulties with this approach is that with skeletal dysplasias [2], is how to cope with er- building an extension to SNOMED is not dissim- rors in the shared terminology. ilar to maintaining and developing SNOMED it- Wade and Rosenbloom [9] report on the man- self. That is, the sheer size of SNOMED has ual construction of what is almost a local exten- meant that, until recently, very few tools could sion to SNOMED (they conceived the task as a compute all of its subsumption relationships, and semi-formal mapping). In this work 2002 terms even those that could would reportedly take sev- were mapped to combination of single and post- eral hours. coordinated concepts of which about 75% were Fortunately, recent work by Baader et al. [4, 5] equivalencies (20% of these were to single con- on the tractable family of description logics EL cepts) and only 1% (26) were, in their words, “not has shown that polynomial time classification al- mappable”. It is unclear why these terms were cat- gorithms exist and are practical. Moreover despite egorized as such since they include, for example, their relatively low expressive power, the EL fam- presyncope which could reasonably be related to ily of description logics is suitable for represent- 3006004|disturbance of consciousness|, but it may ing such real-world ontologies as SNOMED and be that the context of use of the terms was unavail- offer additional expressiveness suitable for prop- able in order to properly discern their meaning. erly representing partOf relationships and suffi- However, their work does demonstrate that the cient conditions.2 Their implementation of this goal of producing a local extension to SNOMED algorithm in Lisp is able to classify SNOMED in is feasible. 1,782 seconds [5] (approx. 30 minutes) which sug- gests an optimised implementation in a lower-level PROBLEM DESCRIPTION language may be fast enough for near real-time The problem of embedding domain semantics such feedback in an editing tool. as specialisation/generalisation or part-of relation- Thus, our goal is to provide tool support for defin- ships into queries is illustrated in the following. ing a local extension to an existing standard formal For example, a query to find all performed proce- ontology; a mapping from an existing set of terms dures involving a colectomy might enumerate all that characterise an informal ontology to concepts such procedures: in the formal ontology. In doing so we effectively SELECT S.* realise latent semantics in the existing medical FROM Surgery S data via the standard ontology. This should facil- WHERE S.procedure = ’32003-00’ itate simpler and more robust queries and in turn OR S.procedure = ’32003-01’ aid data integration, a special-case application of OR S.procedure = ’32012-00’ querying where related medical data sets use se- ... mantically overlapping, but distinct term sets. which has the potential to accidently omit certain RELATED WORK codes and will require updating if the terminology There is a great deal of published work on using is updated with additional forms of colectomy. ontologies for data integration (see Wache et al. [6] Alternatively, some kind of heuristic query could for an overview), but it is mostly focussed on their be used: use at the meta-data level; ontologies are used SELECT S.* to describe, reason about and integrate database FROM Surgery S, ProcedureCodes C schemas. While related to our goals, we are ad- WHERE S.procedure = C.code dressing the more specific problem of semantic AND C.text LIKE ’%colectomy%’; data integration or semantic translation. Stucken- which has the potential to miss a term that doesn’t schmidt et al. [7] discuss an approach to this prob- follow the expected naming pattern (e.g., epiploec- lem in the context of their Ontology Interchange tomy) or provide false matches where a compound Language (OIL) [8]. In particular they raise the or composite name does not reflect a valid special- question of whether it is feasible to find or cre- isation. ate a sufficient shared terminology. In our domain If, however, the terms were encoded as concepts of medical data we believe that SNOMED repre- in an ontology, the query is simple3 : sents such a shared terminology. A possibly more 3 We envisage that the complete set of subsumption 2 See also http://webont.org/owl/1.1/ relationships would be stored in a database table to tractable.html#2 support fast subsumption-based queries using only two 9 Representing and sharing knowledge using SNOMED Proceedings of the 3rd international conference on Knowledge Representation in Medicine (KR-MED 2008) R. Cornet, K.A. Spackman (Eds) SELECT S.* cancer, shown in Figure 1. We can map these, one- FROM Surgery S, Ontology O to-one, to a set of concepts for a local ontology. WHERE O.ancestor = 23968004 AND S.procedure = O.descendant; Note also that SNOMED, unlike classification Procedure Code Meaning schemes such as ICD-9 and ICD-10, support a (ICD-10-AM) multi-parented generalisation hierarchy. 32000-00 Sig colectomy with stoma formation CONSTRUCTING LOCAL 32003-00 Sig colectomy with anasto- EXTENSIONS mosis In order to construct an ontology from an exist- 32003-01 Right hemicolectomy ing terminology (or collection of terminologies) we 32005-00 Subtotal colectomy take a multi-step approach: 32005-01 Ext right hemicolectomy 32006-00 Left hemicolectomy 1. Map each term from the controlled vocabulary 32012-00 Total colectomy to a concept, factoring out any synonyms, to 32024-00 High anterior resection produce P. 32025-00 Low anterior resection ex- This is often a simple one-to-one mapping, but traperitoneal it may be necessary to extend the mapping to 32026-00 Low anterior resection include disambiguating data values when the coloanal anastomosis same term is used to mean different things in 32028-00 Ultra low anterior resection different contexts. 32030-00 Hartmann’s procedure 32039-00 Abdomino-perineal excision 2. Make any simple implicit relationships explicit, adding them to P. 32051-00 Total proctocolectomy with ileo-anal anastomosis For example, generalisation, partOf, or hasLoca- tion relationships. It may be necessary to intro- Figure 1: A Term-Set of Colorectal Cancer Proce- duce new concepts to act as the generalisation dures of two or more sibling concepts. 3. Specify relationships between these (local) con- cepts and those in the chosen standard ontology The next step is to make any simple relationship Q, adding them to P. explicit. In our case there are none that can be ex- pressed using just the concepts we have currently To be able to answer queries involving our new identified. ontology we first need to classify Q ∪ P to identify all the subsumption relationships it entails. Figure 2 describes the identified relationships be- Note that, we should be careful that Q ∪ P repre- tween these terms and selected SNOMED con- sents a conservative extension [10] of Q. That is, cepts as per step 3. Note that several concepts (for Q ∪ P produces the same consequences over the example, 32028-00|ultra low anterior resection|), set of concepts in Q as Q does by itself. We also have no exact equivalent in SNOMED, and that need to ensure various integrity constraints (such one, 32051|total proctocolectomy with ileo-anal as disjointness) are preserved in Q ∪ P. Thus we anastomosis| implies a composite of concepts. would like to be able to interactively edit P while Figure 3 shows a visualisation of the results of exploiting the consequences of Q ∪ P in live feed- classifying SNOMED augmented with the ontol- back through the mapping tool. These kinds of ogy from Figure 2. As can be seen, unifying checks can be performed by classification of Q ∪ P generalisation concepts such as 84604002|sigmoid but this may not be viable if Q ∪ P is large, as is colectomy| have been identified, and thus provide the case when Q is SNOMED. a strong foundation for constructing queries that Colorectal Cancer Example span the various procedures. Additionally, since SNOMED includes detailed anatomical concepts, In this section we consider a sample set of ICD-10- queries can now be composed in terms of anatom- AM [11] terms for procedures relating to colorectal ical features even though they did not exist in the joins. original terminology. 10 Representing and sharing knowledge using SNOMED Proceedings of the 3rd international conference on Knowledge Representation in Medicine (KR-MED 2008) R. Cornet, K.A. Spackman (Eds) Procedure Relation SNOMED To support this kind of problem with reasonable 32000-00 ≡ 315327002 generality and decent query speed, we need to 32003-00 ≡ 315326006 generate a new column containing codes that are 32003-01 ≡ 235326000 mapped to the set of compound concepts that 32005-00 ≡ 43075005 correspond to the contextualised meaning of each 32005-01 ≡ 174071004 database row. Hence, as shown in Figure 5, the ta- 32006-00 ≡ 82619000 ble from Figure 4 would be extended with a Code 32012-00 ≡ 26390003 foreign-key column, and an additional table con- 32024-00 ≡ 400988008 taining the SNOMED expressions of the form4 : 32025-00  314592008 ∃ associatedProcedure.P  32026-00  314592008 ∃ laterality.L  32028-00  314592008 ∃ procedureContext.S 32030-00 ≡ 16564004 which gives us another ontology extension that we 32039-00 ≡ 265414003 can add to SNOMED. 32051-00  174059005  70172002 Finally, in order to be able to pose a subsumption- based complex query involving composite concepts Figure 2: Identified Relationships with SNOMED and have it evaluated at database join speeds, we Concepts can employ the same strategy: extend the ontol- ogy with a new fully-defined concept correspond- ing to our query expression, re-classify, and per- COMPLEX QUERIES AND form a join-based query using the new concept. CONTEXT The need to construct compound expressions that So far we have only considered simple query explicitly represent the context associated with a scenarios where a single database column record in a database occurs any time the data represents the concept we wish to query needs to be queried outside its original context. (e.g., Surgery.procedure) and there already ex- This may happen in as trivial a case as when one ists a concept that characterises the bound of the table in a database is joined with another, but query (e.g., 2396804). the more general scenario occurs when integrating data from multiple data sources. Consider instead a table, as shown in Figure 4, that stores both scheduled and performed proce- dures while using another column to distinguish RESULTS them, and which encodes laterality, if any, of the Classifying SNOMED procedure in yet another column. Now imagine we wish to query for all patients who have had an The practicality of creating local extensions of amputation including the left hand. SNOMED is dependent on sufficient tool support and, as mentioned previously, a cornerstone of this Patient Date Status is fast classification. Indeed we believe that near ... ... ... real-time feedback in an editing environment, be it an IDE for programming or a 3D architectural Procedure Laterality modelling tool, can have a transformational effect ... ... on the authoring and editing process. To this end, we have implemented snorocket, using Figure 4: Table storing records with contextual a slightly altered form of the algorithm in [5] writ- information split across columns ten in Java. We use several optimised Map and Set data-structures tailored for ontologies with roughly the same number of concepts and roles as Patient Date ... Laterality Code SNOMED. This implementation is able to classify ... ... ... 4 Note that considerable experience with SNOMED Code Equivalent SNOMED Expression and all its documentation may be required to construct ... ... suitable valid post-coordinated expressions like those above. Tool support for this is clearly an important issue and recent work in the IHTSDO Concept Model Figure 5: Augmented table for representing con- SIG on producing a Machine Readable Concept Model textualised concepts will be valuable for this. 11 Representing and sharing knowledge using SNOMED Proceedings of the 3rd international conference on Knowledge Representation in Medicine (KR-MED 2008) R. Cornet, K.A. Spackman (Eds) Figure 3: Visualisation of part of an extended SNOMED ontology SNOMED in 54 seconds on a modern 2.4GHz In- Currently this work is in a preliminary state and tel Core 2 Duo running Windows XP and Sun’s the correspondence with the variant described Java 1.6.0 03. in [12] is unknown. However the performance For a fairer comparison with CEL, which only of this incremental algorithm is very promising. runs under Linux, we ran both snorocket and With P consisting of the 14 new concepts as de- CEL on an older four-CPU Xeon 3.6GHz ma- fined as in Figure 2, incremental classification chine running RedHat Linux 2.6.9 and Sun’s Java takes around 0.9s using our un-optimised imple- 1.6.0 04. The results, for several of the ontologies mentation. available from http://lat.inf.tu-dresden.de/ ~meng/toyont.html, are in Table 1. DISCUSSION Clearly, being able to classify SNOMED in close Ideally, as a term set is developed, it would be ex- to a minute is a substantial improvement over plicitly constructed as an ontology and, to avoid roughly 23 minutes and brings us much closer to re-invention and promote interoperability, could the near real-time feedback we are seeking. be developed as an extension of an existing stan- dard ontology such as SNOMED. These exten- Incremental Classification sion ontologies could then be shared and evolved In our mapping scenario we observe that within their specialist community while still being SNOMED (Q) is unchanging while the local ex- useful and usable in more general communities. tension (P) is modified. If we can classify Q once One such example is an ontology for skeletal dys- and record the result C(Q) then, due to the mono- plasias extracted from REAMS [13]. tonicity of the description logic, the classification It is thus useful to be able to represent these on- of Q ∪ P, C(Q ∪ P), is a superset of C(Q). The tologies in a standard format such as OWL so goal is then to derive C(Q ∪ P) given C(Q) (and, that they can be shared or manipulated using ex- of course, Q and P) which should be much faster isting toolsets. Currently we use the OWL 1.1 than deriving C(Q ∪ P) from scratch. proposal [14] rather than OWL 1.0 since it sup- Suntisrivaraporn [12] calls this Duo-Ontology ports the expression of the role axioms (to de- Classification and presents a variation of the al- scribe role transitivity and right-identity). The gorithm in [5] to do just this. We have indepen- particular subset we use is characterised by the dently derived our own variant of this algorithm description logic EL+⊥ . OWL 1.1 is supported along similar lines; the queue-processing core is by, for example, the latest development-release of essentially unchanged but the initialisation of the Protégé (4.0 alpha). queues is different to account for the work that has Unfortunately, OWL is not practical for represent- already been done. ing large ontologies like SNOMED where an OWL 12 Representing and sharing knowledge using SNOMED Proceedings of the 3rd international conference on Knowledge Representation in Medicine (KR-MED 2008) R. Cornet, K.A. Spackman (Eds) SNOMED FULL-GALEN NOT-GALEN NCI CEL 1391.9 368.9 5.4 1.8 snorocket 72.8 15.1 0.4 0.4 Table 1: Comparison of classification time for snorocket and CEL running on the same hardware. XML representation is approximately 240MB [15], Address for Correspondence about eight times the size of the equivalent KRSS Michael J. Lawley, Faculty of Information Technology, representation. Moreover, due to the complexities University of Queensland, 126 Margaret Street Bris- bane Qld 4000, Australia inherent in parsing XML, it is much slower to load m.lawley@qut.edu.au and parse than a simpler format such as KRSS. One work-around for this, and something that References would greatly benefit the e-health community, [1] D. Hansen, C. Daly, K. Harrop, M. O’Dwyer, C. Pang, and J. Ryan-Brown. HDI: Research would be for the International Health Terminology Software To Commercial Product. ASWEC 2005 Standards Development Organisation, the newly Industry Experience Papers, 2005. formed governing body of SNOMED, to formally [2] I. Jakobsen, M.J. Lawley, A. Zankl, and publish URIs for the concepts in SNOMED. This D. Hansen. Ontologies for Skeletal Dysplasias. would allow tool vendors to “bake in” SNOMED MedInfo 2007 Workshop: MedSemWeb 2007, 2007. to their tools, while still allowing other OWL- based ontologies to reference SNOMED concepts [3] Snomed Clinical Terms. College of American in a consistent and interoperable manner in order Pathologists, 2006. http://www.snomed.org. to describe extensions to SNOMED. [4] F. Baader, C. Lutz, and B. Suntisrivaraporn. CEL—a polynomial-time reasoner for life science ontologies. In U. Furbach and N. Shankar, CONCLUSION editors, Proceedings of the 3rd International Joint Conference on Automated Reasoning (IJ- Our preliminary work on producing local exten- CAR’06), volume 4130 of Lecture Notes in Artifi- sions to SNOMED for semantic data integra- cial Intelligence, pages 287–291. Springer-Verlag, 2006. tion is promising as is the performance of our classifier. The current implementation is single- [5] F. Baader, C. Lutz, and B. Suntisrivara- porn. Efficient Reasoning in EL+ . In threaded and we anticipate a further speed in- Proceedings of the 2006 International Work- crease from a multi-threaded implementation run- shop on Description Logics (DL2006), CEUR- ning on a multi-core CPU. WS, 2006. http://lat.inf.tu-dresden.de/ research/papers/2006/BaaLutSun-DL-06.pdf. We are currently integrating snorocket with a 3rd- party SNOMED editing tool which requires spe- [6] H. Wache, T. Vögele, U. Visser, H. Stuck- cific support for SNOMED’s use of role grouping enschmidt, G. Schuster, H. Neumann, and S. Hübner. Ontology-based integration of and the ability to distinguish between stated and information-a survey of existing approaches. inferred relationships in the output of the clas- IJCAI-01 Workshop: Ontologies and Information Sharing, 2001:108–117, 2001. sifier, although this adds little overhead to the classification time. In addition, we are prototyp- [7] H. Stuckenschmidt. Catalogue Integration: A ing mapping tools specifically targeting the task of Case Study in Ontology-based Semantic Trans- lation. Vrije Universiteit, Faculteit der Exacte constructing local extensions of SNOMED from Wetenschappen, Divisie Wiskunde & Informat- existing data. ica, 2000. Finally, we are continuing work on our incremental [8] Dieter Fensel, Ian Horrocks, Frank van Harme- form of the algorithm but have not yet tuned or len, Deborah L. McGuinness, and Peter F. Patel- Schneider. OIL: An Ontology Infrastructure for verified the implementation. Preliminary results the Semantic Web. IEEE Intelligent Systems, indicate that this approach should be very useable 16(2), 2001. when integrated with our mapping tool. [9] G. Wade and S.T. Rosenbloom. Experiences Mapping a Legacy Interface Terminology to Acknowledgements SNOMED CT. In Proceedings of the SMCS 2006 - Semantic Mining Conference on SNOMED CT, 2006. http://www.hiww.org/smcs2006/ The work described in this paper was carried out while proceedings/9WadeSMCS2006final.pdf. on secondment to the CSIRO’s E-Health Research Centre and the author would like to gratefully ac- knowledge the support of David Hansen and the other [10] S. Ghilardi, C. Lutz, and F. Wolter. Did members of the Health Data Integration team. I Damage my Ontology? A Case for 13 Representing and sharing knowledge using SNOMED Proceedings of the 3rd international conference on Knowledge Representation in Medicine (KR-MED 2008) R. Cornet, K.A. Spackman (Eds) Conservative Extensions in Description Log- ics. In Patrick Doherty, John Mylopoulos, and Christopher Welty, editors, Proceedings of the Tenth International Conference on Prin- ciples of Knowledge Representation and Rea- soning (KR’06), pages 187–197. AAAI Press, 2006. http://lat.inf.tu-dresden.de/~clu/ papers/archive/kr06a.pdf. [11] International Statistical Classification of Diseases and Related Health Problems, Tenth Revision, Australian Modification (ICD-10-AM). National Centre for Classification in Health, 5th edition, 2006. http://www3.fhs.usyd.edu.au/ncch/4. 2.1.1.htm. [12] Boontawee Suntisrivaraporn. Module extraction and incremental classification: A pragmatic ap- proach for EL+ ontologies. In Sean Bechhofer, Manfred Hauswirth, Joerg Hoffmann, and Mano- lis Koubarakis, editors, Proceedings of the 5th European Semantic Web Conference (ESWC’08), Lecture Notes in Computer Science. Springer- Verlag, 2008. To appear. [13] C. Hall and J. Washbrook. Radiological Atlas of Malformation Syndromes and Skeletal Dysplasias (REAMS) [software]. Oxford University Press, CD-ROM, 1999. [14] B. Motik, P.F. Patel-Schneider, and I. Horrocks. OWL 1.1 Web Ontology Language. World Wide Web Consortium, W3C Member Submission, 2006. http://www.w3.org/Submission/2006/ SUBM-owl11-owl_specification-20061219/. [15] K. Spackman. An Examination of OWL and the Requirements of a Large Health Care Ter- minology. In Proceedings of the OWL: Ex- periences and Directions Third International Workshop (OWLED 2007), CEURWS, June 2007. http://owled2007.iut-velizy.uvsq.fr/ PapersPDF/submission_26.pdf. 14