Can SNOMED CT be Squeezed Without Losing its Shape? Pablo López-Garcı́a∗, Stefan Schulz Institute for Medical Informatics, Statistics and Documentation – Medical University of Graz Auenbruggerplatz 2, 8036 Graz, Austria ABSTRACT (2004); Seidenberg and Rector (2006)) and logic-based techniques In biomedical applications where the size and complexity of (Cuenca Grau et al. (2008); Grau et al. (2009)). SNOMED CT are challenging, using a more compact subset that Often, these modules are not balanced when it comes to can act as a reasonable substitute is often preferred (e.g., in problem representing the original distribution or shape of sub-hierarchies lists, using the CORE problem list subset of SNOMED CT, covering shown by the original ontology or terminology. For example, in the 95% of usage in less than 1% its size). Ontology modularization CORE subset of SNOMED CT, most concepts belong to the Clinical is the area of research that studies how to extract such subsets, Finding, Procedure, Situation with Explicit Context, and Event sub- also called modules or segments. In a special class of use cases hierarchies2 . The opposite case is also possible: in a previous study, including ontology-based quality assurance, scaling experiments for we found out that especially when using graph-traversal techniques real-time performance, and developing scalable testbeds for software resulting modules can excessively and uncontrollably grow and tools, it is essential that modules are representative of SNOMED CT’s spread across sub-hierarchies (López-Garcı́a et al. (2012)). sub-hierarchies in terms of concept distribution, therefore preserving These results are not surprising, because most prior work the original shape of SNOMED CT. How to extract such balanced on ontology modularization has not focused on preserving the modules remains unclear, as most previous work on ontology representativity of the sub-hierarchies of the original ontology, so modularization has focused on the opposite problem: on extracting the shape of the original ontology is inevitably lost in the modules. a representative module for a specific domain. In this study, we There is a special class of use cases, however, where it is essential investigate to what extent extracting balanced modules that preserve that modules are representative of the sub-hierarchies of the original the original shape of SNOMED CT is possible by presenting and ontology and therefore show a similar shape, such as: evaluating an iterative algorithm. • In ontology-based quality assurance, where small but representative samples of a huge ontology are to be inspected 1 INTRODUCTION (Agrawal et al. (2012)); The size and complexity of SNOMED CT1 constitute a problem • for obtaining a demonstration version that is understandable for in many biomedical applications (Pathak et al. (2009)). Studies users or facilitates visualization; have shown that it is often enough to use a subset of interest • for alignment with a highly constrained upper level ontology, instead of the whole SNOMED CT. This is the case of problem such as the Basic Formal Ontology (BFO) (Smith et al. lists, where the 16 874 terms of CORE2 have been shown to cover (2005)), especially the upcoming BFO 2.0 OWL version, over 95% of usage (Fung et al. (2010)), when tagging medical which includes relations, DOLCE (Gangemi et al. (2002)) or images (Wennerberg et al. (2011)), or when annotating texts from BioTopLite (Schulz and Boeker (2013)), where reasoning has cardiology (López-Garcı́a et al. (2012)). to be tested on small subsets and in iterative debugging steps; How to extract such subsets is studied by the area of research of • for performing scaling experiments for real-time performance ontology modularization (Stuckenschmidt et al. (2009)). Ontology of a large OWL DL ontology; modularization techniques are generally focused on obtaining a minimal subset (also called module or segment) that maximally • for the description logics community, who welcomes scalable covers a specific domain or that is representative for a particular testbeds for developing tools like editors and reasoners. application. This is the case of the problem list or annotation cases To the knowledge of the authors, little research on ontology mentioned above, or the study by Seidenberg and Rector (2006), modularization has focused on extracting balanced modules for such where they described how they extracted a representative segment applications, where keeping the original shape of a large ontology of the GALEN ontology (Rogers and Rector (1996)) for cardiology such as SNOMED CT regarding sub-hierarchies is a requirement. using the seed concept ‘Heart’ as a signature. In this paper, we study the concept distribution of SNOMED CT’s A signature is an initial set of concepts (called seeds) sub-hierarchies and we propose an evaluate an iterative algorithm that bootstraps the modularization process, on which many for extracting balanced modules. Our main goal is to investigate to ontology modularization techniques rely, including graph-traversal what extent it is possible to obtain modules that preserve the original (Doran et al. (2007); d’Aquin et al. (2007); Noy and Musen shape of SNOMED CT in order to be used in our identified class of use cases. ∗ Correspondence should be addressed to: pablo.lopez@medunigraz.at 1 International Health Terminology Standards Development Organization - http://www.ihtsdo.org/snomed-ct/ (accessed 27 Feb 2015) 2 The CORE Problem List Subset from SNOMED CT - http://www. nlm.nih.gov/research/umls/Snomed/core_subset.html (accessed 27 Feb 2015). Copyright c 2015 for this paper by its authors. Copying permitted for private and academic purposes 1 2 SUB-HIERARCHIES OVERVIEW 3 EXTRACTION OF BALANCED MODULES Table 1 shows the main 18 sub-hierarchies of SNOMED CT and As remarked by d’Aquin et al. (2009), the process of extracting their concept distribution. As can be seen, there are four sub- ontology modules should be guided by each domain or application. hierarchies that each contain over 10% of SNOMED CT concepts In this section we present our definition of ontology modules, and (Clinical Finding, Procedure, Organism, and Body Structure), the methodology followed to obtain them. adding up to over 70% of the concepts. We used the July 2014 International Release of SNOMED CT, and we omitted the metadata 3.1 Balanced SNOMED CT Modules concepts sub-hierarchy (SNOMED CT model). As input, we used the OWL-EL version of SNOMED CT obtained using the Perl script included in the distribution as input (SCT ). For our purposes, presented in the introduction, we define a balanced Subhierarchy (Abbreviation) Concepts Distribution SNOMED CT module (M ) as a minimal collection of classes from Clinical Finding (CF) 100 893 33.57% SCT that conform to the following requirements: Procedure (PR) 53 914 17.94% Organism (OR) 33 273 11.07% (a) All classes in M are hierarchically connected to SNOMED CT’s Body Structure (BS) 30 685 10.21% root concept in the same way as in SCT . Substance (SU) 24 021 7.99% (b) All classes in M share the same axiomatical class definition as Pharmaceutical / Biologic Product 16 881 5.62% in SCT . Qualifier Value (QV) 9 055 3.01% (c) Sub-hierarchies in M are distributed (approximately) in the Observable Entity (OE) 8 307 2.76% same proportion as in SCT . In practical terms, when visualized Social Context (SO) 4 703 1.56% using a treemap, M should look similar to the treemap of Physical Object (PO) 4 522 1.50% SNOMED CT shown in Figure 1. Situation with Explicit Context (SI) 3 695 1.23% Event (EV) 3 673 1.22% (d) Our model is restricted to classes. SNOMED CT metadata Environment or Geogr. Location (EG) 1 814 0.60% concepts are not subject to modularization. Specimen (SN) 1 447 0.48% Staging and Scales (ST) 1 309 0.44% 3.2 Module Construction from Seeds Special concept (SP) 649 0.44% To create our module M , we followed a similar approach to Record Artifact (RA) 227 0.22% Seidenberg and Rector (2006). Using their terminology, concepts Physical Force (PF) 171 0.08% (in our case, classes) are represented as nodes in a graph, and Table 1. Main sub-hierarchies of SNOMED CT. The metadata seed concepts are called target nodes. The strategy consists in concepts sub-hierarchy (SNOMED CT model) was not considered. iteratively adding classes appearing in the right-hand expressions of their definitions, starting from seeds in a initial signature. Figure 2 shows an example of a resulting module, where it can be seen that (a) all classes are hierarchically connected to the root concept in the As a useful way of visualizing concept distribution and for same way as in the original ontology (Figure 3), and (b) all classes comparative purposes (see Section 4), the same information is share the same axiomatical class definition as the original ontology. displayed in form of a treemap in Figure 1. The treemap represents SNOMED CT’s hierarchical information as a set of rectangles, where the area of each rectangle is proportional to the number of 1 Target Node concepts in the sub-hierarchy. B C Is-a link (A is a B) A Attribute link 9 10 15 16 17 Fig. 2: Strategy followed to build our module M , starting from the seed concept (target node) 10. Figure 3 shows the original ontology Fig. 1: SNOMED CT’s shape represented with a treemap. Sub- from which it was extracted. hierarchies containing less than 10% of SNOMED CT concepts are shown in acronyms (see Table 1). 2 Copyright c 2015 for this paper by its authors. Copying permitted for private and academic purposes Target Node B C 1 Is-a link (A is a B) A Attribute link 2 9 3 4 10 15 23 5 6 11 12 16 22 24 25 7 8 13 14 17 18 19 20 21 Fig. 3: Sample ontology, starting with a signature containing the seed node (target node) 10. 3.3 Seed Adjustment: An Iterative Algorithm The algorithm, at each iteration i is the following: The strategy to build a module using seeds presented in the previous 1. A random signature SIGNi consisting of 2000 classes from section guarantees requirements (a) and (b) from our definition of SCT is selected, following the same class sub-hierarchy M , but does not guarantee requirement (c), i.e., that sub-hierarchies distribution as SCT , and ensuring at all sub-hierarchies in the in M will be distributed (approximately) in the same proportion as signature contains at least one class. in SCT . The reason is that there is no control over classes from 2. A module Mi is computed following the principles described in other sub-hierarchies that are added in the process when following Subsection 3.2. Its sub-hierarchy distribution is calculated. the right-hand expressions of the seeds. 3. Convergence is checked. If RSS >= 1, Steps 1 to 3 are repeated Therefore, in order not to conflict with requirements (a) and after adjusting the scaling factor for the sub-hierarchy distribution (b) when creating M , the only possibility is to carefully select of the signatures in the next iteration i + 1: the initial signature that bootstraps the modularization algorithm. f (SCTSH ) f (SIGNi+1SHk ) = f (SIGNiSHk ) × f (Mi k) with For that purpose, we investigated an iterative algorithm that SHk dynamically adjusts the distribution of classes used as seeds in the f (MiSHk ) being the relative frequency of sub-hierarchy SHk initial signature. Before presenting the algorithm, we introduce the measured in the resulting module in iteration i, Mi . following notation: 4 RESULTS • As introduced before, SCT represents the OWL EL version of In our experiments, the algorithm converged after 7 iterations, SNOMED CT used as input. Sub-hierarchies are termed SHk . extracting a module M with 10 834 classes. Figure 4 (Page 4) shows • M represents, the output module, whose sub-hierarchy the error after each iteration for sub-hierarchies with more than 1% distribution (Table 1) should match SCT ’s as much as error, as well as the residual sum of squares. possible. As can be seen in the table below the graph, the sub-hierarchies Clinical Finding, Procedure, and Organism were under-represented • SIGN , is the input signature, consisting of classes from SCT , in M , while Body Structure and Substance were over-represented. that is used to boostrap the modularization process described in The same results can be confirmed graphically in the treemaps Subsection 3.2. shown in Figure 5, at iterations 1, 3, and 7. • Error(SHk ) = Size(MSHk ) − Size(SCTSHk ) indicates the error on a per sub-hierarchy basis. Errors are calculated in percentage terms (see distribution in Table 1). 1 P18 2 • RSS = 18 k=1 Error(SHk ) , where RSS represents the residual sum of squares. Convergence of the algorithm is defined when RSS < 1. Copyright c 2015 for this paper by its authors. Copying permitted for private and academic purposes 3 Fig. 4: Execution of the algorithm, showing convergence in iteration 7. (a) Module Shape - Iteration 1 (b) Module Shape - Iteration 3 (c) Module Shape - Iteration 7 (convergence) (d) Full SNOMED CT Shape (target) Fig. 5: Visual comparison of the shape between the modules and SNOMED CT (d) in iterations 1 (a), 3 (b), and 7 (convergence, c). Clinical Finding, Procedure, and Organism were under-represented, while Body Structure and Substance were over-represented. 4 Copyright c 2015 for this paper by its authors. Copying permitted for private and academic purposes 5 DISCUSSION ACKNOWLEDGMENTS Our results suggest that it is difficult for ontology modules to meet The authors acknowledge ICBO reviewers for their elaborate all of our modularization criteria without relaxing the constraints feedback and suggestions. of how concepts in the modules are distributed by sub-hierarchies, because modularization criteria are conflicting. In our experiments, REFERENCES all obtained modules over-represented or under-represented some of Agrawal, A., Perl, Y., and Elhanan, G. (2012). Identifying problematic concepts in SNOMED CT’s sub-hierarchies in different degrees. These results snomed ct using a lexical approach. Studies in health technology and informatics, were partly expected, due to the nature of the modularization 192, 773–777. approach that uncontrollably adds class definitions to preserve Cuenca Grau, B., Horrocks, I., Kazakov, Y., and Sattler, U. (2008). Modular reuse of ontologies: Theory and practice. Journal of Artificial Intelligence Research, pages SNOMED CT’s hierarchy and class definitions. 273–318. The error figures that we obtained after convergence, however, Doran, P., Tamma, V., and Iannone, L. (2007). Ontology module extraction for ontology never reached 8% for any sub-hierarchy and all our modules reuse: an ontology engineering perspective. In Proceedings of the sixteenth ACM contained a fair representation of all of them. Furthermore, conference on Conference on information and knowledge management, pages 61– 70. ACM. convergence was reached after only 7 iterations. Such modules d’Aquin, M., Schlicht, A., Stuckenschmidt, H., and Sabou, M. (2007). Ontology might be sufficient in many of the use cases that motivated their modularization for knowledge selection: Experiments and evaluations. In Database creation, i.e., extracting modules that show an (approximately) and Expert Systems Applications, pages 874–883. Springer. concept distribution to the one shown in SNOMED CT. d’Aquin, M., Schlicht, A., Stuckenschmidt, H., and Sabou, M. (2009). Criteria and evaluation for ontology modularization techniques. In Modular ontologies, pages 67–89. Springer. Fung, K. W., McDonald, C., and Srinivasan, S. (2010). The UMLS-CORE project: a study of the problem list terminologies used in large healthcare institutions. Journal 6 CONCLUSIONS AND FUTURE WORK of the American Medical Informatics Association, 17(6), 675–680. In this study, we have studied SNOMED CT sub-hierarchies Gangemi, A., Guarino, N., Masolo, C., Oltramari, A., and Schneider, L. (2002). and proposed and evaluated an iterative algorithm for extracting Sweetening ontologies with dolce. In Knowledge engineering and knowledge management: Ontologies and the semantic Web, pages 166–181. Springer. compact modules that preserve the shape of SNOMED CT that we Grau, B. C., Horrocks, I., Kazakov, Y., and Sattler, U. (2009). Extracting modules termed balanced modules. Extracting such modules has generally from ontologies: A logic-based approach. In Modular Ontologies, pages 159–186. been neglected by work on ontology modularization, even though Springer. there are many use cases where balanced modules constitute López-Garcı́a, P., Boeker, M., Illarramendi, A., and Schulz, S. (2012). Usability-driven an extremely valuable tool, such as in ontology-based quality pruning of large ontologies: the case of snomed ct. Journal of the American Medical Informatics Association, pages amiajnl–2011. assurance, scaling experiments for real-time performance, or Noy, N. F. and Musen, M. A. (2004). Specifying ontology views by traversal. In The developing scalable testbeds for software tools. Our proposed Semantic Web–ISWC 2004, pages 713–725. Springer. algorithm and our resulting modules show that graph-traversal Pathak, J., Johnson, T. M., and Chute, C. G. (2009). Survey of modular ontology ontology modularization techniques can effectively be used to create techniques and their applications in the biomedical domain. Integrated computer- aided engineering, 16(3), 225–242. balanced modules, if the concept distribution of the input signature Rogers, J. and Rector, A. (1996). The galen ontology. Medical Informatics Europe is dynamically and iteratively adjusted. (MIE 96), pages 174–178. It is important to note that our algorithm and experiments are still Schulz, S. and Boeker, M. (2013). Biotoplite: An upper level ontology for the life at an initial stage and some aspects need to be further explored and sciencesevolution, design and application. In GI-Jahrestagung, pages 1889–1899. more carefully evaluated. As future work, we plan to further (a) Seidenberg, J. and Rector, A. (2006). Web ontology segmentation: analysis, classification and use. In Proceedings of the 15th international conference on World analyze how to select a minimal signature, (b) study how signature Wide Web, pages 13–22. ACM. size influences the final size of the modules, and (c) improve the Smith, B., Kumar, A., and Bittner, T. (2005). Basic formal ontology for bioinformatics. randomization process of the signature selection, e.g., by stratifying Journal of Information Systems, pages 1–16. the randomization by node depth. Stuckenschmidt, H., Parent, C., and Spaccapietra, S. (2009). Modular Ontologies: Concepts, Theories and Techniques for Knowledge Modularization. Springer- Our current results, however, show that SNOMED CT can indeed Verlag. be squeezed without losing its shape, provided that we accept a Wennerberg, P., Schulz, K., and Buitelaar, P. (2011). Ontology modularization to moderate (up to 8%) under- and over-representation of some of its improve semantic medical image annotation. Journal of biomedical informatics, hierarchies. 44(1), 155–162. Copyright c 2015 for this paper by its authors. Copying permitted for private and academic purposes 5