=Paper=
{{Paper
|id=Vol-1515/regular7
|storemode=property
|title=Can SNOMED CT be squeezed without losing its shape?
|pdfUrl=https://ceur-ws.org/Vol-1515/regular7.pdf
|volume=Vol-1515
|dblpUrl=https://dblp.org/rec/conf/icbo/Lopez-GarciaS15
}}
==Can SNOMED CT be squeezed without losing its shape?==
Can SNOMED CT be Squeezed Without Losing its Shape?
Pablo López-Garcı́a∗, Stefan Schulz
Institute for Medical Informatics, Statistics and Documentation – Medical University of Graz
Auenbruggerplatz 2, 8036 Graz, Austria
ABSTRACT (2004); Seidenberg and Rector (2006)) and logic-based techniques
In biomedical applications where the size and complexity of (Cuenca Grau et al. (2008); Grau et al. (2009)).
SNOMED CT are challenging, using a more compact subset that Often, these modules are not balanced when it comes to
can act as a reasonable substitute is often preferred (e.g., in problem representing the original distribution or shape of sub-hierarchies
lists, using the CORE problem list subset of SNOMED CT, covering shown by the original ontology or terminology. For example, in the
95% of usage in less than 1% its size). Ontology modularization CORE subset of SNOMED CT, most concepts belong to the Clinical
is the area of research that studies how to extract such subsets, Finding, Procedure, Situation with Explicit Context, and Event sub-
also called modules or segments. In a special class of use cases hierarchies2 . The opposite case is also possible: in a previous study,
including ontology-based quality assurance, scaling experiments for we found out that especially when using graph-traversal techniques
real-time performance, and developing scalable testbeds for software resulting modules can excessively and uncontrollably grow and
tools, it is essential that modules are representative of SNOMED CT’s spread across sub-hierarchies (López-Garcı́a et al. (2012)).
sub-hierarchies in terms of concept distribution, therefore preserving These results are not surprising, because most prior work
the original shape of SNOMED CT. How to extract such balanced on ontology modularization has not focused on preserving the
modules remains unclear, as most previous work on ontology representativity of the sub-hierarchies of the original ontology, so
modularization has focused on the opposite problem: on extracting the shape of the original ontology is inevitably lost in the modules.
a representative module for a specific domain. In this study, we There is a special class of use cases, however, where it is essential
investigate to what extent extracting balanced modules that preserve that modules are representative of the sub-hierarchies of the original
the original shape of SNOMED CT is possible by presenting and ontology and therefore show a similar shape, such as:
evaluating an iterative algorithm. • In ontology-based quality assurance, where small but
representative samples of a huge ontology are to be inspected
1 INTRODUCTION (Agrawal et al. (2012));
The size and complexity of SNOMED CT1 constitute a problem • for obtaining a demonstration version that is understandable for
in many biomedical applications (Pathak et al. (2009)). Studies users or facilitates visualization;
have shown that it is often enough to use a subset of interest • for alignment with a highly constrained upper level ontology,
instead of the whole SNOMED CT. This is the case of problem such as the Basic Formal Ontology (BFO) (Smith et al.
lists, where the 16 874 terms of CORE2 have been shown to cover (2005)), especially the upcoming BFO 2.0 OWL version,
over 95% of usage (Fung et al. (2010)), when tagging medical which includes relations, DOLCE (Gangemi et al. (2002)) or
images (Wennerberg et al. (2011)), or when annotating texts from BioTopLite (Schulz and Boeker (2013)), where reasoning has
cardiology (López-Garcı́a et al. (2012)). to be tested on small subsets and in iterative debugging steps;
How to extract such subsets is studied by the area of research of
• for performing scaling experiments for real-time performance
ontology modularization (Stuckenschmidt et al. (2009)). Ontology
of a large OWL DL ontology;
modularization techniques are generally focused on obtaining a
minimal subset (also called module or segment) that maximally • for the description logics community, who welcomes scalable
covers a specific domain or that is representative for a particular testbeds for developing tools like editors and reasoners.
application. This is the case of the problem list or annotation cases To the knowledge of the authors, little research on ontology
mentioned above, or the study by Seidenberg and Rector (2006), modularization has focused on extracting balanced modules for such
where they described how they extracted a representative segment applications, where keeping the original shape of a large ontology
of the GALEN ontology (Rogers and Rector (1996)) for cardiology such as SNOMED CT regarding sub-hierarchies is a requirement.
using the seed concept ‘Heart’ as a signature. In this paper, we study the concept distribution of SNOMED CT’s
A signature is an initial set of concepts (called seeds) sub-hierarchies and we propose an evaluate an iterative algorithm
that bootstraps the modularization process, on which many for extracting balanced modules. Our main goal is to investigate to
ontology modularization techniques rely, including graph-traversal what extent it is possible to obtain modules that preserve the original
(Doran et al. (2007); d’Aquin et al. (2007); Noy and Musen shape of SNOMED CT in order to be used in our identified class of
use cases.
∗ Correspondence should be addressed to: pablo.lopez@medunigraz.at
1 International Health Terminology Standards Development Organization -
http://www.ihtsdo.org/snomed-ct/ (accessed 27 Feb 2015)
2 The CORE Problem List Subset from SNOMED CT - http://www.
nlm.nih.gov/research/umls/Snomed/core_subset.html
(accessed 27 Feb 2015).
Copyright c 2015 for this paper by its authors. Copying permitted for private and academic purposes 1
2 SUB-HIERARCHIES OVERVIEW 3 EXTRACTION OF BALANCED MODULES
Table 1 shows the main 18 sub-hierarchies of SNOMED CT and As remarked by d’Aquin et al. (2009), the process of extracting
their concept distribution. As can be seen, there are four sub- ontology modules should be guided by each domain or application.
hierarchies that each contain over 10% of SNOMED CT concepts In this section we present our definition of ontology modules, and
(Clinical Finding, Procedure, Organism, and Body Structure), the methodology followed to obtain them.
adding up to over 70% of the concepts. We used the July 2014
International Release of SNOMED CT, and we omitted the metadata
3.1 Balanced SNOMED CT Modules
concepts sub-hierarchy (SNOMED CT model). As input, we used the OWL-EL version of SNOMED CT obtained
using the Perl script included in the distribution as input (SCT ). For
our purposes, presented in the introduction, we define a balanced
Subhierarchy (Abbreviation) Concepts Distribution SNOMED CT module (M ) as a minimal collection of classes from
Clinical Finding (CF) 100 893 33.57% SCT that conform to the following requirements:
Procedure (PR) 53 914 17.94%
Organism (OR) 33 273 11.07% (a) All classes in M are hierarchically connected to SNOMED CT’s
Body Structure (BS) 30 685 10.21% root concept in the same way as in SCT .
Substance (SU) 24 021 7.99% (b) All classes in M share the same axiomatical class definition as
Pharmaceutical / Biologic Product 16 881 5.62% in SCT .
Qualifier Value (QV) 9 055 3.01% (c) Sub-hierarchies in M are distributed (approximately) in the
Observable Entity (OE) 8 307 2.76% same proportion as in SCT . In practical terms, when visualized
Social Context (SO) 4 703 1.56%
using a treemap, M should look similar to the treemap of
Physical Object (PO) 4 522 1.50%
SNOMED CT shown in Figure 1.
Situation with Explicit Context (SI) 3 695 1.23%
Event (EV) 3 673 1.22% (d) Our model is restricted to classes. SNOMED CT metadata
Environment or Geogr. Location (EG) 1 814 0.60% concepts are not subject to modularization.
Specimen (SN) 1 447 0.48%
Staging and Scales (ST) 1 309 0.44%
3.2 Module Construction from Seeds
Special concept (SP) 649 0.44% To create our module M , we followed a similar approach to
Record Artifact (RA) 227 0.22% Seidenberg and Rector (2006). Using their terminology, concepts
Physical Force (PF) 171 0.08% (in our case, classes) are represented as nodes in a graph, and
Table 1. Main sub-hierarchies of SNOMED CT. The metadata seed concepts are called target nodes. The strategy consists in
concepts sub-hierarchy (SNOMED CT model) was not considered. iteratively adding classes appearing in the right-hand expressions of
their definitions, starting from seeds in a initial signature. Figure 2
shows an example of a resulting module, where it can be seen that
(a) all classes are hierarchically connected to the root concept in the
As a useful way of visualizing concept distribution and for same way as in the original ontology (Figure 3), and (b) all classes
comparative purposes (see Section 4), the same information is share the same axiomatical class definition as the original ontology.
displayed in form of a treemap in Figure 1. The treemap represents
SNOMED CT’s hierarchical information as a set of rectangles,
where the area of each rectangle is proportional to the number of 1 Target Node
concepts in the sub-hierarchy. B C
Is-a link
(A is a B)
A Attribute link
9
10 15
16
17
Fig. 2: Strategy followed to build our module M , starting from the
seed concept (target node) 10. Figure 3 shows the original ontology
Fig. 1: SNOMED CT’s shape represented with a treemap. Sub- from which it was extracted.
hierarchies containing less than 10% of SNOMED CT concepts are
shown in acronyms (see Table 1).
2 Copyright c 2015 for this paper by its authors. Copying permitted for private and academic purposes
Target Node
B C
1 Is-a link
(A is a B)
A Attribute link
2 9
3 4 10 15 23
5 6 11 12 16 22 24 25
7 8 13 14 17 18
19 20 21
Fig. 3: Sample ontology, starting with a signature containing the seed node (target node) 10.
3.3 Seed Adjustment: An Iterative Algorithm The algorithm, at each iteration i is the following:
The strategy to build a module using seeds presented in the previous 1. A random signature SIGNi consisting of 2000 classes from
section guarantees requirements (a) and (b) from our definition of SCT is selected, following the same class sub-hierarchy
M , but does not guarantee requirement (c), i.e., that sub-hierarchies distribution as SCT , and ensuring at all sub-hierarchies in the
in M will be distributed (approximately) in the same proportion as signature contains at least one class.
in SCT . The reason is that there is no control over classes from 2. A module Mi is computed following the principles described in
other sub-hierarchies that are added in the process when following Subsection 3.2. Its sub-hierarchy distribution is calculated.
the right-hand expressions of the seeds. 3. Convergence is checked. If RSS >= 1, Steps 1 to 3 are repeated
Therefore, in order not to conflict with requirements (a) and after adjusting the scaling factor for the sub-hierarchy distribution
(b) when creating M , the only possibility is to carefully select of the signatures in the next iteration i + 1:
the initial signature that bootstraps the modularization algorithm. f (SCTSH )
f (SIGNi+1SHk ) = f (SIGNiSHk ) × f (Mi k) with
For that purpose, we investigated an iterative algorithm that SHk
dynamically adjusts the distribution of classes used as seeds in the f (MiSHk ) being the relative frequency of sub-hierarchy SHk
initial signature. Before presenting the algorithm, we introduce the measured in the resulting module in iteration i, Mi .
following notation:
4 RESULTS
• As introduced before, SCT represents the OWL EL version of In our experiments, the algorithm converged after 7 iterations,
SNOMED CT used as input. Sub-hierarchies are termed SHk . extracting a module M with 10 834 classes. Figure 4 (Page 4) shows
• M represents, the output module, whose sub-hierarchy the error after each iteration for sub-hierarchies with more than 1%
distribution (Table 1) should match SCT ’s as much as error, as well as the residual sum of squares.
possible. As can be seen in the table below the graph, the sub-hierarchies
Clinical Finding, Procedure, and Organism were under-represented
• SIGN , is the input signature, consisting of classes from SCT , in M , while Body Structure and Substance were over-represented.
that is used to boostrap the modularization process described in The same results can be confirmed graphically in the treemaps
Subsection 3.2. shown in Figure 5, at iterations 1, 3, and 7.
• Error(SHk ) = Size(MSHk ) − Size(SCTSHk ) indicates
the error on a per sub-hierarchy basis. Errors are calculated in
percentage terms (see distribution in Table 1).
1
P18 2
• RSS = 18 k=1 Error(SHk ) , where RSS represents
the residual sum of squares. Convergence of the algorithm is
defined when RSS < 1.
Copyright c 2015 for this paper by its authors. Copying permitted for private and academic purposes 3
Fig. 4: Execution of the algorithm, showing convergence in iteration 7.
(a) Module Shape - Iteration 1 (b) Module Shape - Iteration 3
(c) Module Shape - Iteration 7 (convergence) (d) Full SNOMED CT Shape (target)
Fig. 5: Visual comparison of the shape between the modules and SNOMED CT (d) in iterations 1 (a), 3 (b), and 7 (convergence, c). Clinical
Finding, Procedure, and Organism were under-represented, while Body Structure and Substance were over-represented.
4 Copyright c 2015 for this paper by its authors. Copying permitted for private and academic purposes
5 DISCUSSION ACKNOWLEDGMENTS
Our results suggest that it is difficult for ontology modules to meet The authors acknowledge ICBO reviewers for their elaborate
all of our modularization criteria without relaxing the constraints feedback and suggestions.
of how concepts in the modules are distributed by sub-hierarchies,
because modularization criteria are conflicting. In our experiments, REFERENCES
all obtained modules over-represented or under-represented some of Agrawal, A., Perl, Y., and Elhanan, G. (2012). Identifying problematic concepts in
SNOMED CT’s sub-hierarchies in different degrees. These results snomed ct using a lexical approach. Studies in health technology and informatics,
were partly expected, due to the nature of the modularization 192, 773–777.
approach that uncontrollably adds class definitions to preserve Cuenca Grau, B., Horrocks, I., Kazakov, Y., and Sattler, U. (2008). Modular reuse of
ontologies: Theory and practice. Journal of Artificial Intelligence Research, pages
SNOMED CT’s hierarchy and class definitions. 273–318.
The error figures that we obtained after convergence, however, Doran, P., Tamma, V., and Iannone, L. (2007). Ontology module extraction for ontology
never reached 8% for any sub-hierarchy and all our modules reuse: an ontology engineering perspective. In Proceedings of the sixteenth ACM
contained a fair representation of all of them. Furthermore, conference on Conference on information and knowledge management, pages 61–
70. ACM.
convergence was reached after only 7 iterations. Such modules
d’Aquin, M., Schlicht, A., Stuckenschmidt, H., and Sabou, M. (2007). Ontology
might be sufficient in many of the use cases that motivated their modularization for knowledge selection: Experiments and evaluations. In Database
creation, i.e., extracting modules that show an (approximately) and Expert Systems Applications, pages 874–883. Springer.
concept distribution to the one shown in SNOMED CT. d’Aquin, M., Schlicht, A., Stuckenschmidt, H., and Sabou, M. (2009). Criteria and
evaluation for ontology modularization techniques. In Modular ontologies, pages
67–89. Springer.
Fung, K. W., McDonald, C., and Srinivasan, S. (2010). The UMLS-CORE project: a
study of the problem list terminologies used in large healthcare institutions. Journal
6 CONCLUSIONS AND FUTURE WORK of the American Medical Informatics Association, 17(6), 675–680.
In this study, we have studied SNOMED CT sub-hierarchies Gangemi, A., Guarino, N., Masolo, C., Oltramari, A., and Schneider, L. (2002).
and proposed and evaluated an iterative algorithm for extracting Sweetening ontologies with dolce. In Knowledge engineering and knowledge
management: Ontologies and the semantic Web, pages 166–181. Springer.
compact modules that preserve the shape of SNOMED CT that we
Grau, B. C., Horrocks, I., Kazakov, Y., and Sattler, U. (2009). Extracting modules
termed balanced modules. Extracting such modules has generally from ontologies: A logic-based approach. In Modular Ontologies, pages 159–186.
been neglected by work on ontology modularization, even though Springer.
there are many use cases where balanced modules constitute López-Garcı́a, P., Boeker, M., Illarramendi, A., and Schulz, S. (2012). Usability-driven
an extremely valuable tool, such as in ontology-based quality pruning of large ontologies: the case of snomed ct. Journal of the American Medical
Informatics Association, pages amiajnl–2011.
assurance, scaling experiments for real-time performance, or Noy, N. F. and Musen, M. A. (2004). Specifying ontology views by traversal. In The
developing scalable testbeds for software tools. Our proposed Semantic Web–ISWC 2004, pages 713–725. Springer.
algorithm and our resulting modules show that graph-traversal Pathak, J., Johnson, T. M., and Chute, C. G. (2009). Survey of modular ontology
ontology modularization techniques can effectively be used to create techniques and their applications in the biomedical domain. Integrated computer-
aided engineering, 16(3), 225–242.
balanced modules, if the concept distribution of the input signature
Rogers, J. and Rector, A. (1996). The galen ontology. Medical Informatics Europe
is dynamically and iteratively adjusted. (MIE 96), pages 174–178.
It is important to note that our algorithm and experiments are still Schulz, S. and Boeker, M. (2013). Biotoplite: An upper level ontology for the life
at an initial stage and some aspects need to be further explored and sciencesevolution, design and application. In GI-Jahrestagung, pages 1889–1899.
more carefully evaluated. As future work, we plan to further (a) Seidenberg, J. and Rector, A. (2006). Web ontology segmentation: analysis,
classification and use. In Proceedings of the 15th international conference on World
analyze how to select a minimal signature, (b) study how signature Wide Web, pages 13–22. ACM.
size influences the final size of the modules, and (c) improve the Smith, B., Kumar, A., and Bittner, T. (2005). Basic formal ontology for bioinformatics.
randomization process of the signature selection, e.g., by stratifying Journal of Information Systems, pages 1–16.
the randomization by node depth. Stuckenschmidt, H., Parent, C., and Spaccapietra, S. (2009). Modular Ontologies:
Concepts, Theories and Techniques for Knowledge Modularization. Springer-
Our current results, however, show that SNOMED CT can indeed
Verlag.
be squeezed without losing its shape, provided that we accept a Wennerberg, P., Schulz, K., and Buitelaar, P. (2011). Ontology modularization to
moderate (up to 8%) under- and over-representation of some of its improve semantic medical image annotation. Journal of biomedical informatics,
hierarchies. 44(1), 155–162.
Copyright c 2015 for this paper by its authors. Copying permitted for private and academic purposes 5