Can SNOMED CT be Squeezed Without Losing its Shape?
                                            Pablo López-Garcı́a∗, Stefan Schulz
               Institute for Medical Informatics, Statistics and Documentation – Medical University of Graz
                                          Auenbruggerplatz 2, 8036 Graz, Austria


ABSTRACT                                                                  (2004); Seidenberg and Rector (2006)) and logic-based techniques
   In biomedical applications where the size and complexity of            (Cuenca Grau et al. (2008); Grau et al. (2009)).
SNOMED CT are challenging, using a more compact subset that                 Often, these modules are not balanced when it comes to
can act as a reasonable substitute is often preferred (e.g., in problem   representing the original distribution or shape of sub-hierarchies
lists, using the CORE problem list subset of SNOMED CT, covering          shown by the original ontology or terminology. For example, in the
95% of usage in less than 1% its size). Ontology modularization           CORE subset of SNOMED CT, most concepts belong to the Clinical
is the area of research that studies how to extract such subsets,         Finding, Procedure, Situation with Explicit Context, and Event sub-
also called modules or segments. In a special class of use cases          hierarchies2 . The opposite case is also possible: in a previous study,
including ontology-based quality assurance, scaling experiments for       we found out that especially when using graph-traversal techniques
real-time performance, and developing scalable testbeds for software      resulting modules can excessively and uncontrollably grow and
tools, it is essential that modules are representative of SNOMED CT’s     spread across sub-hierarchies (López-Garcı́a et al. (2012)).
sub-hierarchies in terms of concept distribution, therefore preserving      These results are not surprising, because most prior work
the original shape of SNOMED CT. How to extract such balanced             on ontology modularization has not focused on preserving the
modules remains unclear, as most previous work on ontology                representativity of the sub-hierarchies of the original ontology, so
modularization has focused on the opposite problem: on extracting         the shape of the original ontology is inevitably lost in the modules.
a representative module for a specific domain. In this study, we            There is a special class of use cases, however, where it is essential
investigate to what extent extracting balanced modules that preserve      that modules are representative of the sub-hierarchies of the original
the original shape of SNOMED CT is possible by presenting and             ontology and therefore show a similar shape, such as:
evaluating an iterative algorithm.                                          • In ontology-based quality assurance, where small but
                                                                               representative samples of a huge ontology are to be inspected
1     INTRODUCTION                                                             (Agrawal et al. (2012));
The size and complexity of SNOMED CT1 constitute a problem                  • for obtaining a demonstration version that is understandable for
in many biomedical applications (Pathak et al. (2009)). Studies               users or facilitates visualization;
have shown that it is often enough to use a subset of interest              • for alignment with a highly constrained upper level ontology,
instead of the whole SNOMED CT. This is the case of problem                   such as the Basic Formal Ontology (BFO) (Smith et al.
lists, where the 16 874 terms of CORE2 have been shown to cover               (2005)), especially the upcoming BFO 2.0 OWL version,
over 95% of usage (Fung et al. (2010)), when tagging medical                  which includes relations, DOLCE (Gangemi et al. (2002)) or
images (Wennerberg et al. (2011)), or when annotating texts from              BioTopLite (Schulz and Boeker (2013)), where reasoning has
cardiology (López-Garcı́a et al. (2012)).                                    to be tested on small subsets and in iterative debugging steps;
   How to extract such subsets is studied by the area of research of
                                                                            • for performing scaling experiments for real-time performance
ontology modularization (Stuckenschmidt et al. (2009)). Ontology
                                                                              of a large OWL DL ontology;
modularization techniques are generally focused on obtaining a
minimal subset (also called module or segment) that maximally               • for the description logics community, who welcomes scalable
covers a specific domain or that is representative for a particular            testbeds for developing tools like editors and reasoners.
application. This is the case of the problem list or annotation cases       To the knowledge of the authors, little research on ontology
mentioned above, or the study by Seidenberg and Rector (2006),            modularization has focused on extracting balanced modules for such
where they described how they extracted a representative segment          applications, where keeping the original shape of a large ontology
of the GALEN ontology (Rogers and Rector (1996)) for cardiology           such as SNOMED CT regarding sub-hierarchies is a requirement.
using the seed concept ‘Heart’ as a signature.                              In this paper, we study the concept distribution of SNOMED CT’s
   A signature is an initial set of concepts (called seeds)               sub-hierarchies and we propose an evaluate an iterative algorithm
that bootstraps the modularization process, on which many                 for extracting balanced modules. Our main goal is to investigate to
ontology modularization techniques rely, including graph-traversal        what extent it is possible to obtain modules that preserve the original
(Doran et al. (2007); d’Aquin et al. (2007); Noy and Musen                shape of SNOMED CT in order to be used in our identified class of
                                                                          use cases.
∗ Correspondence should be addressed to: pablo.lopez@medunigraz.at
1 International Health Terminology Standards Development Organization -
http://www.ihtsdo.org/snomed-ct/ (accessed 27 Feb 2015)
2 The CORE Problem List Subset from SNOMED CT - http://www.

nlm.nih.gov/research/umls/Snomed/core_subset.html
(accessed 27 Feb 2015).


    Copyright c 2015 for this paper by its authors. Copying permitted for private and academic purposes                                        1
2   SUB-HIERARCHIES OVERVIEW                                         3       EXTRACTION OF BALANCED MODULES
Table 1 shows the main 18 sub-hierarchies of SNOMED CT and           As remarked by d’Aquin et al. (2009), the process of extracting
their concept distribution. As can be seen, there are four sub-      ontology modules should be guided by each domain or application.
hierarchies that each contain over 10% of SNOMED CT concepts         In this section we present our definition of ontology modules, and
(Clinical Finding, Procedure, Organism, and Body Structure),         the methodology followed to obtain them.
adding up to over 70% of the concepts. We used the July 2014
International Release of SNOMED CT, and we omitted the metadata
                                                                     3.1      Balanced SNOMED CT Modules
concepts sub-hierarchy (SNOMED CT model).                            As input, we used the OWL-EL version of SNOMED CT obtained
                                                                     using the Perl script included in the distribution as input (SCT ). For
                                                                     our purposes, presented in the introduction, we define a balanced
    Subhierarchy (Abbreviation)           Concepts Distribution      SNOMED CT module (M ) as a minimal collection of classes from
    Clinical Finding (CF)                  100 893       33.57%      SCT that conform to the following requirements:
    Procedure (PR)                          53 914       17.94%
    Organism (OR)                           33 273       11.07%      (a) All classes in M are hierarchically connected to SNOMED CT’s
    Body Structure (BS)                     30 685       10.21%         root concept in the same way as in SCT .
    Substance (SU)                          24 021        7.99%      (b) All classes in M share the same axiomatical class definition as
    Pharmaceutical / Biologic Product       16 881        5.62%         in SCT .
    Qualifier Value (QV)                     9 055        3.01%      (c) Sub-hierarchies in M are distributed (approximately) in the
    Observable Entity (OE)                   8 307        2.76%         same proportion as in SCT . In practical terms, when visualized
    Social Context (SO)                      4 703        1.56%
                                                                        using a treemap, M should look similar to the treemap of
    Physical Object (PO)                     4 522        1.50%
                                                                        SNOMED CT shown in Figure 1.
    Situation with Explicit Context (SI)     3 695        1.23%
    Event (EV)                               3 673        1.22%      (d) Our model is restricted to classes. SNOMED CT metadata
    Environment or Geogr. Location (EG)      1 814        0.60%         concepts are not subject to modularization.
    Specimen (SN)                            1 447        0.48%
    Staging and Scales (ST)                  1 309        0.44%
                                                                     3.2      Module Construction from Seeds
    Special concept (SP)                       649        0.44%      To create our module M , we followed a similar approach to
    Record Artifact (RA)                       227        0.22%      Seidenberg and Rector (2006). Using their terminology, concepts
    Physical Force (PF)                        171        0.08%      (in our case, classes) are represented as nodes in a graph, and
      Table 1. Main sub-hierarchies of SNOMED CT. The metadata       seed concepts are called target nodes. The strategy consists in
    concepts sub-hierarchy (SNOMED CT model) was not considered.     iteratively adding classes appearing in the right-hand expressions of
                                                                     their definitions, starting from seeds in a initial signature. Figure 2
                                                                     shows an example of a resulting module, where it can be seen that
                                                                     (a) all classes are hierarchically connected to the root concept in the
  As a useful way of visualizing concept distribution and for        same way as in the original ontology (Figure 3), and (b) all classes
comparative purposes (see Section 4), the same information is        share the same axiomatical class definition as the original ontology.
displayed in form of a treemap in Figure 1. The treemap represents
SNOMED CT’s hierarchical information as a set of rectangles,
where the area of each rectangle is proportional to the number of        1                                                     Target Node
concepts in the sub-hierarchy.                                                                                                     B                    C
                                                                                                                   Is-a link
                                                                                                                  (A is a B)

                                                                                                              A                        Attribute link
                                                                                                    9


                                                                               10                  15


                                                                                              16


                                                                                         17


                                                                     Fig. 2: Strategy followed to build our module M , starting from the
                                                                     seed concept (target node) 10. Figure 3 shows the original ontology
Fig. 1: SNOMED CT’s shape represented with a treemap. Sub-           from which it was extracted.
hierarchies containing less than 10% of SNOMED CT concepts are
shown in acronyms (see Table 1).


2                          Copyright c 2015 for this paper by its authors. Copying permitted for private and academic purposes
                                                                                                                                    Target Node

                                                                                                                                        B                    C
                                                       1                                                                Is-a link
                                                                                                                       (A is a B)

                                                                                                                  A                         Attribute link


                              2                                                               9


                     3                 4                        10                            15                  23


                               5           6               11        12              16            22        24         25


                         7         8              13            14             17             18


                                                                          19        20        21


                             Fig. 3: Sample ontology, starting with a signature containing the seed node (target node) 10.


3.3   Seed Adjustment: An Iterative Algorithm                                                 The algorithm, at each iteration i is the following:
The strategy to build a module using seeds presented in the previous                      1. A random signature SIGNi consisting of 2000 classes from
section guarantees requirements (a) and (b) from our definition of                          SCT is selected, following the same class sub-hierarchy
M , but does not guarantee requirement (c), i.e., that sub-hierarchies                      distribution as SCT , and ensuring at all sub-hierarchies in the
in M will be distributed (approximately) in the same proportion as                          signature contains at least one class.
in SCT . The reason is that there is no control over classes from                         2. A module Mi is computed following the principles described in
other sub-hierarchies that are added in the process when following                          Subsection 3.2. Its sub-hierarchy distribution is calculated.
the right-hand expressions of the seeds.                                                  3. Convergence is checked. If RSS >= 1, Steps 1 to 3 are repeated
   Therefore, in order not to conflict with requirements (a) and                            after adjusting the scaling factor for the sub-hierarchy distribution
(b) when creating M , the only possibility is to carefully select                           of the signatures in the next iteration i + 1:
the initial signature that bootstraps the modularization algorithm.                                                                            f (SCTSH )
                                                                                            f (SIGNi+1SHk ) = f (SIGNiSHk ) × f (Mi k) with
For that purpose, we investigated an iterative algorithm that                                                                                                    SHk

dynamically adjusts the distribution of classes used as seeds in the                          f (MiSHk ) being the relative frequency of sub-hierarchy SHk
initial signature. Before presenting the algorithm, we introduce the                          measured in the resulting module in iteration i, Mi .
following notation:
                                                                                          4       RESULTS
  • As introduced before, SCT represents the OWL EL version of                            In our experiments, the algorithm converged after 7 iterations,
    SNOMED CT used as input. Sub-hierarchies are termed SHk .                             extracting a module M with 10 834 classes. Figure 4 (Page 4) shows
  • M represents, the output module, whose sub-hierarchy                                  the error after each iteration for sub-hierarchies with more than 1%
    distribution (Table 1) should match SCT ’s as much as                                 error, as well as the residual sum of squares.
    possible.                                                                                As can be seen in the table below the graph, the sub-hierarchies
                                                                                          Clinical Finding, Procedure, and Organism were under-represented
  • SIGN , is the input signature, consisting of classes from SCT ,                       in M , while Body Structure and Substance were over-represented.
    that is used to boostrap the modularization process described in                      The same results can be confirmed graphically in the treemaps
    Subsection 3.2.                                                                       shown in Figure 5, at iterations 1, 3, and 7.
  • Error(SHk ) = Size(MSHk ) − Size(SCTSHk ) indicates
    the error on a per sub-hierarchy basis. Errors are calculated in
    percentage terms (see distribution in Table 1).
                1
                   P18                  2
  • RSS = 18         k=1 Error(SHk ) , where RSS represents
    the residual sum of squares. Convergence of the algorithm is
    defined when RSS < 1.


 Copyright c 2015 for this paper by its authors. Copying permitted for private and academic purposes                                                                   3
                                    Fig. 4: Execution of the algorithm, showing convergence in iteration 7.


                    (a) Module Shape - Iteration 1                                      (b) Module Shape - Iteration 3


             (c) Module Shape - Iteration 7 (convergence)                            (d) Full SNOMED CT Shape (target)

Fig. 5: Visual comparison of the shape between the modules and SNOMED CT (d) in iterations 1 (a), 3 (b), and 7 (convergence, c). Clinical
Finding, Procedure, and Organism were under-represented, while Body Structure and Substance were over-represented.


4                           Copyright c 2015 for this paper by its authors. Copying permitted for private and academic purposes
5   DISCUSSION                                                            ACKNOWLEDGMENTS
Our results suggest that it is difficult for ontology modules to meet     The authors acknowledge ICBO reviewers for their elaborate
all of our modularization criteria without relaxing the constraints       feedback and suggestions.
of how concepts in the modules are distributed by sub-hierarchies,
because modularization criteria are conflicting. In our experiments,      REFERENCES
all obtained modules over-represented or under-represented some of        Agrawal, A., Perl, Y., and Elhanan, G. (2012). Identifying problematic concepts in
SNOMED CT’s sub-hierarchies in different degrees. These results              snomed ct using a lexical approach. Studies in health technology and informatics,
were partly expected, due to the nature of the modularization                192, 773–777.
approach that uncontrollably adds class definitions to preserve           Cuenca Grau, B., Horrocks, I., Kazakov, Y., and Sattler, U. (2008). Modular reuse of
                                                                             ontologies: Theory and practice. Journal of Artificial Intelligence Research, pages
SNOMED CT’s hierarchy and class definitions.                                 273–318.
   The error figures that we obtained after convergence, however,         Doran, P., Tamma, V., and Iannone, L. (2007). Ontology module extraction for ontology
never reached 8% for any sub-hierarchy and all our modules                   reuse: an ontology engineering perspective. In Proceedings of the sixteenth ACM
contained a fair representation of all of them. Furthermore,                 conference on Conference on information and knowledge management, pages 61–
                                                                             70. ACM.
convergence was reached after only 7 iterations. Such modules
                                                                          d’Aquin, M., Schlicht, A., Stuckenschmidt, H., and Sabou, M. (2007). Ontology
might be sufficient in many of the use cases that motivated their            modularization for knowledge selection: Experiments and evaluations. In Database
creation, i.e., extracting modules that show an (approximately)              and Expert Systems Applications, pages 874–883. Springer.
concept distribution to the one shown in SNOMED CT.                       d’Aquin, M., Schlicht, A., Stuckenschmidt, H., and Sabou, M. (2009). Criteria and
                                                                             evaluation for ontology modularization techniques. In Modular ontologies, pages
                                                                             67–89. Springer.
                                                                          Fung, K. W., McDonald, C., and Srinivasan, S. (2010). The UMLS-CORE project: a
                                                                             study of the problem list terminologies used in large healthcare institutions. Journal
6   CONCLUSIONS AND FUTURE WORK                                              of the American Medical Informatics Association, 17(6), 675–680.
In this study, we have studied SNOMED CT sub-hierarchies                  Gangemi, A., Guarino, N., Masolo, C., Oltramari, A., and Schneider, L. (2002).
and proposed and evaluated an iterative algorithm for extracting             Sweetening ontologies with dolce. In Knowledge engineering and knowledge
                                                                             management: Ontologies and the semantic Web, pages 166–181. Springer.
compact modules that preserve the shape of SNOMED CT that we
                                                                          Grau, B. C., Horrocks, I., Kazakov, Y., and Sattler, U. (2009). Extracting modules
termed balanced modules. Extracting such modules has generally               from ontologies: A logic-based approach. In Modular Ontologies, pages 159–186.
been neglected by work on ontology modularization, even though               Springer.
there are many use cases where balanced modules constitute                López-Garcı́a, P., Boeker, M., Illarramendi, A., and Schulz, S. (2012). Usability-driven
an extremely valuable tool, such as in ontology-based quality                pruning of large ontologies: the case of snomed ct. Journal of the American Medical
                                                                             Informatics Association, pages amiajnl–2011.
assurance, scaling experiments for real-time performance, or              Noy, N. F. and Musen, M. A. (2004). Specifying ontology views by traversal. In The
developing scalable testbeds for software tools. Our proposed                Semantic Web–ISWC 2004, pages 713–725. Springer.
algorithm and our resulting modules show that graph-traversal             Pathak, J., Johnson, T. M., and Chute, C. G. (2009). Survey of modular ontology
ontology modularization techniques can effectively be used to create         techniques and their applications in the biomedical domain. Integrated computer-
                                                                             aided engineering, 16(3), 225–242.
balanced modules, if the concept distribution of the input signature
                                                                          Rogers, J. and Rector, A. (1996). The galen ontology. Medical Informatics Europe
is dynamically and iteratively adjusted.                                     (MIE 96), pages 174–178.
   It is important to note that our algorithm and experiments are still   Schulz, S. and Boeker, M. (2013). Biotoplite: An upper level ontology for the life
at an initial stage and some aspects need to be further explored and         sciencesevolution, design and application. In GI-Jahrestagung, pages 1889–1899.
more carefully evaluated. As future work, we plan to further (a)          Seidenberg, J. and Rector, A. (2006). Web ontology segmentation: analysis,
                                                                             classification and use. In Proceedings of the 15th international conference on World
analyze how to select a minimal signature, (b) study how signature           Wide Web, pages 13–22. ACM.
size influences the final size of the modules, and (c) improve the        Smith, B., Kumar, A., and Bittner, T. (2005). Basic formal ontology for bioinformatics.
randomization process of the signature selection, e.g., by stratifying       Journal of Information Systems, pages 1–16.
the randomization by node depth.                                          Stuckenschmidt, H., Parent, C., and Spaccapietra, S. (2009). Modular Ontologies:
                                                                             Concepts, Theories and Techniques for Knowledge Modularization. Springer-
   Our current results, however, show that SNOMED CT can indeed
                                                                             Verlag.
be squeezed without losing its shape, provided that we accept a           Wennerberg, P., Schulz, K., and Buitelaar, P. (2011). Ontology modularization to
moderate (up to 8%) under- and over-representation of some of its            improve semantic medical image annotation. Journal of biomedical informatics,
hierarchies.                                                                 44(1), 155–162.


 Copyright c 2015 for this paper by its authors. Copying permitted for private and academic purposes                                                             5