=Paper= {{Paper |id=Vol-410/paper-7 |storemode=property |title=Comparing SNOMED CT and the NCI Thesaurus through Semantic Web Technologies |pdfUrl=https://ceur-ws.org/Vol-410/Paper07.pdf |volume=Vol-410 |dblpUrl=https://dblp.org/rec/conf/krmed/Bodenreider08 }} ==Comparing SNOMED CT and the NCI Thesaurus through Semantic Web Technologies== https://ceur-ws.org/Vol-410/Paper07.pdf
Representing and sharing knowledge using SNOMED
Proceedings of the 3rd international conference on Knowledge Representation in Medicine (KR-MED 2008)
R. Cornet, K.A. Spackman (Eds)




                                  Comparing SNOMED CT and the NCI Thesaurus
                                       through Semantic Web Technologies
                                                    Olivier Bodenreider
                             U.S. National Library of Medicine, NIH, Bethesda, Maryland, USA
                                                 olivier@nlm.nih.gov
             Objective: The objective of this study is to compare            ent purposes: the NCI Thesaurus (NCIt), used for the
             two large biomedical terminologies, SNOMED CT                   annotation of cancer research data, and SNOMED
             and the National Cancer Institute (NCI) Thesaurus,              CT, the largest clinical terminology used in electronic
             through Semantic Web technologies. Methods: The                 patient records. We take advantage of the fact that
             two terminologies are converted into the Resource               both ontologies were developed using Description
             Description Framework (RDF) and loaded into a                   Logic-based systems. Although most classes are not
             common triple store. The Unified Medical Language               defined with a set of necessary and sufficient condi-
             System (UMLS) is used to identify correspondences               tions, the set of relations in which a given concept is
             between concepts across terminologies. Concepts                 involved still provides a formal definition for this
             common to both terminologies are compared based                 concept, which can be used to compare it to other
             on shared relations to other concepts. Results: A               concepts. We also take advantage of the fact that both
             total of 20,369 pairs of equivalent SNOMED CT and               ontologies are represented in the Unified Medical
             NCI Thesaurus concepts were identified through the              Language System (UMLS), which asserts the equiva-
             UMLS. The highest proportion of shared relata is for            lence between concepts across biomedical ontologies.
             the superclasses traversed recursively (75% of the              Finally, we exploit Semantic Web technologies, such
             concepts share at least one superclass). Slightly more          as the Resource Description Framework (RDF), to
             than half of the concepts studied share at least one            carry out the comparison between these two ontolo-
             associative relation (direct relation or inherited from         gies.
             some ancestor). Conclusions: Overall, SNOMED CT                 The objective of this study is to compare the formal
             and NCI Thesaurus concepts exhibit a relatively                 definitions of SNOMED CT and NCIt concepts,
             small proportion of shared relata. Semantic Web                 using Semantic Web technologies. The assumption
             technologies, including RDF and triple stores, are              underlying this study is that two concepts, one from
             suitable for comparing large biomedical ontologies,             SNOMED CT and one from NCIt, when identified as
             at least from a quantitative perspective.                       equivalent in the UMLS, should have similar formal
                                                                             definitions. In other words, our hypothesis is that
                              INTRODUCTION                                   equivalent concepts from SNOMED CT and NCIt
                                                                             should have related concepts that are also equivalent.
             In the era of translational medicine, i.e., the applica-        To our knowledge, this is the first study to compare
             tion of the discoveries of basic research (made at the          biomedical ontologies on a large scale using RDF.
             bench) to clinical medicine (the patient’s bedside)
             and the refinement of research hypotheses based on
                                                                                               BACKGROUND
             clinical findings, basic researchers and healthcare
             practitioners need to exchange information back and             The general framework of this study is that of quality
             forth. In order to be processed efficiently, both re-           assurance in biomedical terminologies and ontolo-
             search data and clinical data must be annotated to              gies, which is known to be a difficult task [1]. Sev-
             some reference terminology or ontology. Although                eral approaches to auditing terminologies have been
             some research ontologies and clinical ontologies have           proposed, including semantic methods [2], structural
             a significant degree of overlap, there has typically            methods [3] and linguistic and formal ontological
             been little coordination between the groups develop-            approaches [4]. Methods based on description logics
             ing them. As a consequence, the definitions – textual           have also been proposed, but have generally been
             or formal – provided in research ontologies and clini-          restricted to subsets of large medical ontologies [5].
             cal ontologies for the same biomedical entity may               Various methods have been applied to SNOMED CT
             vary significantly, which constitutes a hindrance to            [3, 4] and to the NCIt [6]. In contrast to these ap-
             the effective integration of data from basic research           proaches, we propose to evaluate SNOMED CT and
             and clinical practice.                                          the NCIt simultaneously and against each other. In
             The evaluation of biomedical terminologies for com-             other words, we want to cross-validate the definitions
             pleteness and accuracy remains largely an open re-              or assertions provided in one ontology for a given
             search question. In this paper, we propose to compare           entity with the definitions or assertions provided in
             two large biomedical ontologies developed for differ-           the other ontology for the same entity.




             The Semantic Web provides a common framework                     Ontology Language (OWL-DL) for its representation
             that enables the integration, sharing and reuse of data          [12]. Version 07.05e of the NCIt contains 58,869
             from multiple sources. Recent research in Semantic               active classes, 123 associative relationships and
             Web technologies has delivered promising results to              124,775 relations (subsumption and equivalence
             enable information integration across heterogeneous              relations, as well as restrictions in the OWL file). The
             knowledge sources, particularly in the biomedical                OWL file for the NCIt was downloaded from the
             domain [7]. Semantic Web technologies are a collec-              caCORE FTP site (ftp://ftp1.nci.nih.gov/pub/cacore/),
             tion of formalisms, languages and tools created to               under EVS.
             support the Semantic Web. Among them, the Re-
             source Description Framework (RDF) is a W3C-                     Unified Medical Language System
             recommended framework for representing data in a                 The Unified Medical Language System (UMLS) is a
             common format that captures the logical structure of             terminology integration system developed at the U.S.
             the data [8]. The RDF representational model uses a              National Library of Medicine [13]. The UMLS Meta-
             single schema in contrast to multiple heterogeneous              thesaurus is a repository of integrated biomedical
             schemas or Document Type Definitions (DTDs) used to              terms drawn from 143 biomedical vocabularies and
             represent data in XML by different sources. In con-              ontologies. Terms referring to the same entity in sev-
             junction with a single Uniform Resource Identifier               eral vocabularies are clustered together and given the
             (URI), all data represented in RDF form a single                 same concept unique identifier (CUI). Both
             knowledge repository that may be queried as one                  SNOMED CT (July 31, 2007) and NCIt (07.05e) are
             knowledge resource. An RDF repository consists of a              integrated in version 2007AC of the Metathesaurus,
             set of assertions or triples. Each triple comprises three        which provides a convenient way of identifying equi-
             entities, namely subject, predicate and object. A col-           valences between terms from these two ontologies.
             lection of triples forms a graph and can be stored in a          The UMLS is available for download from the UMLS
             specialized database called a triple store.                      Knowledge Source Server
                                                                              (http://umlsks.nlm.nih.gov/). (A free license is required.)
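The triple model described above (subject, predicate, object, with a collection of triples forming a graph) can be illustrated with a minimal Python sketch. It is not part of the study. The namespace `http://example.org/`, the node names `S0` and `S1`, the link `sr1` and the helper names are hypothetical; the predicate name `hasName` is borrowed from the Methods section. Only backslashes and double quotes are escaped, which is a simplification of the N-Triples escaping rules.

```python
BASE = "http://example.org/"  # hypothetical namespace, not taken from the paper


def uri(local_name):
    """Wrap a local name as an N-Triples URI reference."""
    return f"<{BASE}{local_name}>"


def literal(text):
    """Quote a string as a plain literal (escapes only backslash and double quote)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def serialize(triples):
    """Turn (subject, predicate, object) tuples into N-Triples lines."""
    return [f"{s} {p} {o} ." for (s, p, o) in triples]


def match(triples, s=None, p=None, o=None):
    """Return the triples that agree with every term that is not None."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# A toy graph: one literal-valued triple and one concept-to-concept triple.
triples = [
    (uri("S0"), uri("hasName"), literal("Example concept")),
    (uri("S0"), uri("sr1"), uri("S1")),
    (uri("S1"), uri("hasName"), literal("Another concept")),
]

for line in serialize(triples):
    print(line)

# All statements whose subject is S0.
about_s0 = match(triples, s=uri("S0"))
```

A triple store adds indexing and a query language on top of this model; the linear scan in `match` is only meant to show what a triple pattern is.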
                                 MATERIALS
                                                                                                    METHODS
             SNOMED CT                                                        The method developed for comparing concepts from
             SNOMED CT is a concept system and an associated                  SNOMED CT and NCIt can be summarized as fol-
             terminology for healthcare [9]. It is managed by the             lows. The formal definition of concepts is extracted
             International Health Terminology Standards Devel-                from SNOMED CT and NCIt and converted to RDF
             opment Organisation (IHTSDO), a not-for-profit                   triples. Equivalence relations between SNOMED CT
             international standards body with nine member coun-              and NCIt concepts are extracted from the UMLS. All
             tries. Although its development is based on the De-              triples are loaded into a triple store. Additional triples
             scription Logic system KRSS, SNOMED CT is pro-                   are generated from inference rules applied to the
             vided as a set of relational tables corresponding to an          original knowledge base. The triple store is then que-
             “inferred view”, i.e., the set of non-redundant defin-           ried to compare the representation of concepts in
             ing relations for each concept. The July 2007 interna-           SNOMED CT and NCIt.
             tional release contains 310,311 active elements
             (309,175 concepts and 1,136 relationships, of which              Acquiring RDF triples
             only 61 are actually used to relate concepts) and                For each concept and relationship from SNOMED
             1,218,983 relations (pairs of semantically-related               CT and NCIt, we extract the following information:
             concepts). The source files for SNOMED CT                        original identifier, preferred name, source (SNOMED
             (sct_concepts and sct_relationships) were down-                  CT or NCIt), type (concept or relationship). RDF
             loaded from the UMLS Knowledge Source Server                     triples are created to represent this information, in
             (http://umlsks.nlm.nih.gov/).                                    which the subject is the concept itself. The predicates
                                                                              corresponding to the properties listed above are hasID,
             NCI Thesaurus                                                    hasName, hasSource and hasType, respectively. The
             The National Cancer Institute Thesaurus (NCIt) is a              object of these triples is a literal corresponding to, for
             “terminology based on current science that helps                 example, the concept name for the predicate hasName.
             individuals and software applications connect and                Triples are also created for representing the relations
             organize the results of cancer research” [10]. The               of each concept to other concepts from the same
             NCIt is produced by the National Cancer Institute,               source. The relationship indicated in the source is
             and is a key element of the cancer common ontologic              used as predicate for these triples, whose objects are
             representation environment (caCORE) [11]. The                    concepts. Similarly, triples are created for
             NCIt uses the description logic flavor of the Web                representing relations among relationships (e.g., sub-




             PropertyOf). Finally, we create triples to represent the        this triple store and ended up not using it. (The lack
             mapping of concepts to the UMLS Metathesaurus.                  of generalized transitive closure in the triple store was
             For each concept from SNOMED CT and NCIt, we                    compensated for by graph traversal functions in the
             create one triple with the predicate hasCUI and the             queries.)
             corresponding UMLS CUI as object literal.                       In practice, the only rule we created and applied to
             SNOMED CT. The fields ‘CONCEPTID’ and                           the store makes a concept from SNOMED CT
             ‘FULLYSPECIFIEDNAME’ from the table                             equivalent to a concept from NCIt when both con-
             sct_concepts were used to instantiate the properties            cepts are mapped to the same UMLS concept (i.e.,
             hasID and hasName, respectively. All nodes were as-             share the same UMLS CUI). This relation was im-
             signed the value ‘concept’ for the property hasType,            plemented by creating an owl:sameAs relationship be-
             except for the elements of the table sct_concepts ac-           tween the two concepts, bidirectionally.
             tually corresponding to relationships, namely, Lin-
             kage concept (linkage concept) and its descendants,
             to which the value ‘relationship’ was assigned. All             [Figure 1 graph. SNOMED CT side: nodes S0 to S5,
             nodes were assigned the value ‘SNOMEDCT’ for the                links sr1 to sr5. NCI Thesaurus side: nodes N0 to
             property hasSource.                                             N4, links nr1 to nr4.]
             NCI Thesaurus. The elements ‘code’ and ‘Pre-
             ferred_Name’ from the ‘’ sections of the                        Legend:
             OWL file were used to instantiate the properties hasID          - Equivalent concepts according to the UMLS
             and hasName, respectively. All nodes were assigned              - Relationship between 2 concepts
             the value ‘concept’ for the property hasType. Analo-            - Shared relata of S and N
             gously, information extracted from the
             ‘’ sections of the OWL file was                                 Figure 1. Graph formed by the related concepts of one
             used to create the corresponding triples for properties                   pair of equivalent concepts (S0, N0)
             (i.e., predicates). These nodes were assigned the
             value ‘relationship’ for the property hasType. All
             nodes were assigned the value ‘NCI’ for the property
             hasSource.

             UMLS Metathesaurus. The table MRCONSO.RRF
             from the UMLS distribution was used for acquiring
             the mapping between terms from SNOMED CT and
             the UMLS concepts, as well as between terms from
             the NCIt and the UMLS concepts. We used the
             source abbreviation (SAB) to identify strings contri-           Querying the triple store
             buted by SNOMED CT (SAB = ‘SNOMEDCT’) or                        A set of queries was developed to explore the relata
             NCIt (SAB = ‘NCI’). We extracted the concept iden-              of those concepts that are equivalent between
             tifier in the source (SCUI) and UMLS concept unique             SNOMED CT and NCIt according to the UMLS.
             identifier (CUI) and created triples of the form (con-          More specifically, these queries explore the set of
             cept, hasCUI, CUI) for each pair (SCUI, CUI).                   relata of the SNOMED CT concept and that of the
                                                                             NCIt concept, and select from the two sets the relata
             Creating the triple store                                       identified as equivalent in the UMLS. For example,
             These triples generated from SNOMED CT, NCIt                    as illustrated in Figure 1, the concepts S0 from
             and the UMLS were represented in N-Triples format               SNOMED CT and N0 from NCIt are equivalent ac-
             and loaded into the open source triple store Mulga-             cording to the UMLS. Among the relata of S0 (S1 to
             ra™ (http://mulgara.org/) in a Linux environment.               S5) and N0 (N1 to N4), the pairs {S1, N1} and {S5, N3}
             Mulgara automatically indexes the triples, as well as           denote equivalent concepts and constitute the set of
             the subject, predicate and object elements of each              shared relata of {S0, N0}.
             triple.                                                         Each relation between two concepts (e.g., (S0, sr4,
                                                                             S4)) is represented as a triple in the RDF store and the
             Inference rules                                                 set of all relations forms a graph. Comparing the set
             Inference rules are typically added to a triple store in        of relata of two concepts can thus be expressed as a
             order to infer new RDF statements (i.e., triples) from          set of constraints on the graph. For example, {S1, N1}
             existing RDF statements. Mulgara provides a series              are shared relata of {S0, N0}, because there is a path
             of rules, which implement RDF Schema (RDFS)                     between S0 and N0, constituted of any link from S0 to
             entailment, including rules for the transitivity of the         S1, any link from N0 to N1, and a “UMLS equiva-
             relationships rdfs:subClassOf and rdfs:subPropertyOf. We        lence” link between S1 and N1.
             found the set of rules for RDFS impractical to use on
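The equivalence rule and the notion of shared relata described on this page can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the iTQL implementation used in the study. The CUI values (`CUI_0`, `CUI_1`, ...) are hypothetical, as are the function names, and the toy relations assume that each link sr_i goes from S0 to S_i and each link nr_i from N0 to N_i, which the figure suggests but the text states only for (S0, sr4, S4).

```python
from collections import defaultdict


def equivalent_pairs(cui_of_snomed, cui_of_ncit):
    """Pair each SNOMED CT concept with each NCIt concept mapped to the same CUI."""
    ncit_by_cui = defaultdict(set)
    for concept, cui in cui_of_ncit.items():
        ncit_by_cui[cui].add(concept)
    return {
        (s, n)
        for s, cui in cui_of_snomed.items()
        for n in ncit_by_cui.get(cui, ())
    }


def same_as_triples(pairs):
    """Assert owl:sameAs in both directions, i.e., two triples per pair."""
    triples = []
    for s, n in sorted(pairs):
        triples.append((s, "owl:sameAs", n))
        triples.append((n, "owl:sameAs", s))
    return triples


def shared_relata(s0, n0, s_relations, n_relations, pairs):
    """Pairs (s, n) where s is a relatum of s0, n a relatum of n0, and s is equivalent to n."""
    s_relata = {obj for (subj, _pred, obj) in s_relations if subj == s0}
    n_relata = {obj for (subj, _pred, obj) in n_relations if subj == n0}
    return {(s, n) for (s, n) in pairs if s in s_relata and n in n_relata}


# Toy data modeled on Figure 1 (CUIs are made up for the example).
cui_of_snomed = {"S0": "CUI_0", "S1": "CUI_1", "S2": "CUI_2", "S5": "CUI_5"}
cui_of_ncit = {"N0": "CUI_0", "N1": "CUI_1", "N2": "CUI_9", "N3": "CUI_5"}
s_relations = [("S0", f"sr{i}", f"S{i}") for i in range(1, 6)]
n_relations = [("N0", f"nr{i}", f"N{i}") for i in range(1, 5)]

pairs = equivalent_pairs(cui_of_snomed, cui_of_ncit)
sameas = same_as_triples(pairs)
shared = shared_relata("S0", "N0", s_relations, n_relations, pairs)
print(sorted(shared))
```

On this toy input the shared relata of (S0, N0) are the pairs (S1, N1) and (S5, N3), matching the example given for Figure 1. Generating two owl:sameAs triples per pair mirrors the bidirectional assertion described for the rule.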




             The set of relata is not necessarily limited to direct                                              In practice, starting from the list of pairs of equivalent
             relata. Some relations can be traversed recursively in                                              concepts, we generated one query per pair for each
             order to explore, for example, the set of common                                                    type of relationship to be explored. The relata in
             ancestors (as opposed to common direct subclasses).                                                 common were recorded for each pair of equivalent
             Depending on the constraints put on the graph, vari-                                                concepts for each type of relationship explored. Fig-
             ous kinds of relationships can be explored, together                                                ure 2 shows a typical query used to explore (recur-
             or independently.                                                                                   sively) the common superclasses of two concepts.
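The difference between direct and recursive exploration can be made concrete with a short Python sketch. It is a plain-Python stand-in for illustration, not the recursive traversal performed by the triple store in this study. The parent dictionaries, the concept names and the equivalence pairs are hypothetical.

```python
def ancestors(concept, parents):
    """All concepts reachable by following parent links upward (start node excluded)."""
    seen = set()
    stack = list(parents.get(concept, ()))
    while stack:
        node = stack.pop()
        if node not in seen:  # the seen set also guards against cycles
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def shared_superclasses(s0, n0, s_parents, n_parents, pairs, recursive=True):
    """Equivalent pairs (s, n) found among the superclasses of s0 and n0."""
    if recursive:
        s_sup = ancestors(s0, s_parents)
        n_sup = ancestors(n0, n_parents)
    else:
        s_sup = set(s_parents.get(s0, ()))
        n_sup = set(n_parents.get(n0, ()))
    return {(s, n) for (s, n) in pairs if s in s_sup and n in n_sup}


# Toy hierarchies: S0 < S1 < S2 < S3 and N0 < N1 < N2.
s_parents = {"S0": ["S1"], "S1": ["S2"], "S2": ["S3"]}
n_parents = {"N0": ["N1"], "N1": ["N2"]}
# Hypothetical equivalences; note that the direct parents S1 and N1 are not equivalent.
pairs = {("S0", "N0"), ("S2", "N1"), ("S3", "N2")}

direct = shared_superclasses("S0", "N0", s_parents, n_parents, pairs, recursive=False)
indirect = shared_superclasses("S0", "N0", s_parents, n_parents, pairs)
print(sorted(direct), sorted(indirect))
```

In this toy case the two concepts share no direct parent but share two ancestors, which is the kind of situation the recursive traversal is meant to capture when the two hierarchies differ in granularity.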
             One of the major query languages for RDF stores is                                                  Figure 3 displays the output of this query, showing
             SPARQL. Mulgara currently provides no support for                                                   the 7 ancestors in common.
             SPARQL. Instead, it provides iTQL™ (Interactive
             Tucana Query Language™), which is functionally                                                      Data analysis
             equivalent to SPARQL for most purposes.                                                             We analyzed the lists of shared relata resulting from
                                                                                                                 the queries from a quantitative perspective, in order
               select $n_sub $n_rel $n_obj $s_sub $s_rel $s_obj
                                                                                                                 to examine the distribution of the number of common
               from                                                     relata for the various kinds of relationships under
               where
               (
                                                                                                                 investigation.
                   # ---------- NCIT side ----------
                   walk(  $n_obj
                             and $n_sub_tmp  $n_obj)
                                                                                                                                       RESULTS
                   and $n_rel  
                   and $n_sub  
               )
               and
                                                                                                                 Triple store
               (                                                                                                 A total of 3,194,215 triples were created, 2,770,477
                   # ---------- SNCT side ----------
                   walk(  $s_obj
                                                                                                                 for SNOMED CT and 423,738 for NCIt. It took
                        and $s_sub_tmp  $s_obj)                                                  about 20 minutes to load these N-triples into Mulga-
                   and $s_rel  
                   and $s_sub  
                                                                                                                 ra, including the creation of indexes.
               )                                                                                                 The rule asserting the equivalence of SNOMED CT
               and $n_obj  $s_obj
               in 
                                                                                                                 and NCIt concepts when they share the same UMLS
               ;                                                                                                 CUI generated 40,738 additional triples (representing
                                                                                                                 the owl:sameAs relations bidirectionally). It took about
               Figure 2. iTQL query used to explore the common su-
                 perclasses of the concepts C2986 from NCIt and                                                  5 minutes to apply this rule to the triple store.
                           46635009 from SNOMED CT                                                               Queries were executed in batches, one batch for each
                                                                                                                 set of equivalent concepts for a given kind of rela-
              [ ncit:C2986, rdfs:subClassOf, ncit:C2991, snct:46635009, snct:116680003, snct:64572001 ]          tionship. Executing a batch of queries took anywhere
              [ ncit:C2986, rdfs:subClassOf, ncit:C3009, snct:46635009, snct:116680003, snct:362969004 ]
              [ ncit:C2986, rdfs:subClassOf, ncit:C2985, snct:46635009, snct:116680003, snct:73211009 ]          between several minutes (for direct relations) to sev-
              [ ncit:C2986, rdfs:subClassOf, ncit:C27067, snct:46635009, snct:116680003, snct:17346000 ]
              [ ncit:C2986, rdfs:subClassOf, ncit:C53655, snct:46635009, snct:116680003, snct:126877002 ]
                                                                                                                 eral hours (when relations are allowed to be traversed
              [ ncit:C2986, rdfs:subClassOf, ncit:C2990, snct:46635009, snct:116680003, snct:53619000 ]
              [ ncit:C2986, rdfs:subClassOf, ncit:C26842, snct:46635009, snct:116680003, snct:3855007 ]
                                                                                                                 recursively).

               Figure 3. Results of the query in Figure 2 (aliases are                                           Overlap between SNOMED CT and NCIt con-
                           used in lieu of the full URIs)                                                        cepts
                                                                                                                 Of the 309,175 SNOMED CT concepts, 19,506
             Comparing the shared relata of concepts                                                             (6.3%) mapped to the same UMLS concept as some
             In order to compare the formal definitions of a con-                                                NCIt concept. Analogously, 14,054 (23.9%) of the
             cept S0 from SNOMED CT and N0 from NCIt, we                                                         58,869 NCIt concepts mapped to the same UMLS
             prepared queries to explore the following sets of                                                   concept as some SNOMED CT concept. A total of
             shared relata: all shared relata (including through                                                 20,369 pairs of SNOMED CT and NCIt concepts
             associative relations), shared superclasses, shared                                                 were identified in which the two concepts are deemed
             wholes (of which the entity is a part), shared sub-                                                 equivalent based on their mapping to the UMLS.
             classes and shared parts. More precisely, these kinds
             of relations were first explored directly to extract the                                            Quantitative results
             set of relata in direct relation to the original concepts,                                          The distribution of the number of relata for several
             and indirectly, allowing the recursive traversal of isa                                             types of relationships investigated is summarized in
             and part_of relationships. Finally, in order to account                                             Table 1. The first column (N) shows the total number
             for the inheritance of properties from a superclass to                                              of pairs of concepts for which both concepts have at
             its subclasses, we also explored the concepts in asso-                                              least one related concept for this relation. This num-
             ciative relation to any of the superclasses of the origi-                                           ber is used as the denominator for computing the
             nal concepts.                                                                                       percentage of pairs of equivalent concepts having a
                                                                                                                 given number of related concepts in common. The




minimum, maximum and median number of shared relata are presented in the last three columns. For
example, the row “Dir. Superclass” corresponds to the shared direct parent classes (traversing isa
in SNOMED CT and subClassOf in NCIt). N = 20,360 indicates that almost all concepts have at least
one ancestor. 18.4% of the pairs of equivalent concepts studied share a parent class and only 1.3%
share two. Over 80% of the pairs do not share any direct parents. The row “Ind. Superclass”
corresponds to the shared ancestors (traversing isa or subClassOf recursively). Only 25% of the
pairs of equivalent concepts studied do not have any ancestors in common. The largest number of
ancestors in common is 22.

Details about shared relata for other kinds of relationships are provided in the other rows of
Table 1, including direct parent and child classes for the taxonomic relation (super/subclass) and
for the meronomic relation (whole/part). The identification of indirect relata involves the
recursive traversal of taxonomic and meronomic relations and the combination of subClassOf and
associative relations.

EXTENDED EXAMPLE

In order to illustrate our approach to comparing ontologies, we explore how Type 1 diabetes
mellitus is represented in SNOMED CT and NCIt. As shown in Figure 4, this concept has many relata
both in SNOMED CT and in NCIt, of which a large number are shared, including 7 shared ancestors
(e.g., Disorder of pancreas) and 4 shared concepts in associative relation (e.g., Gastrointestinal
System). Dotted lines represent indirect isa relations through concepts that are not shown. The
equivalence between concepts in SNOMED CT and NCIt assessed through the UMLS is shown with grey
links. Of note, two distinct concepts in one ontology can be equivalent to one concept in the
other (e.g., Endocrine Pancreas and Islet of Langerhans in NCIt vs. Endocrine pancreatic structure
in SNOMED CT).

DISCUSSION

SNOMED CT and NCIt
Overall, the two ontologies under investigation in this study were found to have a relatively
small proportion of relata in common, even when the properties (e.g., associative relations) are
explored in the ancestors to simulate the inheritance of properties along isa hierarchies. The
highest proportion of shared relata is for the superclasses traversed recursively (75% of the
concepts share at least one superclass). Slightly more than half of the concepts studied share at
least one associative relation (direct relation or inherited from some ancestor).

Further research is needed to distinguish among primitive concepts in both ontologies (e.g.,
Aneurismal bone cyst), concepts for which a relatively rich description is provided, but only in
one ontology (e.g., the description provided for many cancers in NCIt is typically richer than in
SNOMED CT), and concepts defined in both ontologies, but with minimal overlap in their relata. We
did not complete the comparison of shared descendants, but, even in the absence of a rich
description, a large proportion of shared descendants can be a good indicator of consistency
between ontologies (e.g., Sulfonamide agents share 18 descendants).

Semantic Web technologies
We found RDF to be suitable for comparing terminological ontologies, especially when the two
ontologies are large and are not both available in OWL. While OWL classifiers are useful for
consistency checking purposes, they tend to be limited in the number of classes they can handle.
Moreover, the queries presented in this study arguably allow more flexibility than OWL DL
classifiers.

The triple store approach also offers clear advantages over relational databases, as SQL provides
no support for performing transitive closures (i.e., for performing join operations recursively).
While ad hoc programs (or stored procedures) embedding SQL queries can be written against the
database, we showed that simple queries against the RDF store were sufficient to carry out this
study. Because it supports the seamless traversal of complex graphs (recursive traversal of one
relationship and traversal of selected combinations of relationships), RDF is an effective
approach to comparing terminologies.

The comparison of large ontologies remains nonetheless difficult. The inference engine of Mulgara
could not apply the set of rules defined for RDFS, including the transitivity of subClassOf, to
large, heavily hierarchical structures. However, the graph traversal functions supported by the
query language partially compensated for the absence of precomputed transitive closures.

Limitations and future work
This approach essentially provides a quantitative comparison between two ontologies and is
insufficient for fine-grained comparisons. Although we did not study whether pairs of related
concepts in both ontologies were linked by similar relations, the information could be easily
extracted from the triple store. We also would like to test the structural consistency of the
combined ontologies (e.g., by testing the presence of cycles in isa relations in the RDF store
containing both SNOMED CT and NCIt). The advantage of using the UMLS perspective on concept
equivalence outweighs the potential bias it introduces with its “concept view”.

Acknowledgements

This research was supported by the Intramural Research Program of the National Institutes of
Health (NIH), National Library of Medicine (NLM). Our thanks go to Ramez Ghazzaoui who helped
create the triple store and Lee Peters who processed SNOMED CT.

References

1.  Rogers JE. Quality assurance of medical ontologies. Methods Inf Med 2006;45(3):267-74
2.  Cimino JJ. Auditing the Unified Medical Language System with semantic methods. J Am Med
    Inform Assoc 1998;5(1):41-51
3.  Wang Y, Halper M, Min H, Perl Y, Chen Y, Spackman KA. Structural methodologies for auditing
    SNOMED. J Biomed Inform 2007;40(5):561-81
4.  Ceusters W, Smith B, Kumar A, Dhaen C. Ontology-based error detection in SNOMED-CT. Medinfo
    2004;11(Pt 1):482-6
5.  Cornet R, Abu-Hanna A. Auditing description-logic-based medical terminological systems by
    detecting equivalent concept definitions. Int J Med Inform 2007
6.  Ceusters W, Smith B, Goldberg L. A terminological and ontological analysis of the NCI
    Thesaurus. Methods Inf Med 2005;44(4):498-507
7.  Ruttenberg A, Clark T, Bug W, Samwald M, Bodenreider O, Chen H, et al. Advancing
    translational research with the Semantic Web. BMC Bioinformatics 2007;8 Suppl 3:S2
8.  RDF: http://www.w3.org/RDF/
9.  SNOMED CT: http://www.ihtsdo.org/
10. de Coronado S, Haber MW, Sioutos N, Tuttle MS, Wright LW. NCI Thesaurus: using science-based
    terminology to integrate cancer research results. Medinfo 2004;11(Pt 1):33-7
11. Phillips J, Chilukuri R, Fragoso G, Warzel D, Covitz PA. The caCORE Software Development Kit:
    streamlining construction of interoperable biomedical information services. BMC Med Inform
    Decis Mak 2006;6:2
12. Golbeck J, Fragoso G, Hartel F, Hendler J, Oberthaler J, Parsia B. The National Cancer
    Institute's Thesaurus and Ontology. Web Semantics: Science, Services and Agents on the World
    Wide Web 2003;1(1):75-80
13. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical
    terminology. Nucleic Acids Res 2004;32(Database issue):D267-70




Table 1. Distribution of the number of related concepts shared by pairs of equivalent concepts (N) for various kinds of
relationships (Dir.: direct relations; Ind.: indirect relations, including recursive traversal and combination of subClassOf
and associative relations)

                                       Number of related concepts
Type  Relationship        N       0       1       2       3      4      5     >5   min   max   median
Dir.  Any            20,363   66.8%   21.1%    5.9%    2.9%   1.3%   0.7%   1.3%     0    47        0
Dir.  Superclass     20,360   80.3%   18.4%    1.3%    0.0%   0.0%   0.0%   0.0%     0     4        0
Dir.  Whole           1,004   96.2%    3.8%    0.0%    0.0%   0.0%   0.0%   0.0%     0     1        0
Dir.  Subclass        3,699   48.9%   21.9%   15.2%    6.4%   2.8%   1.8%   2.9%     0    19        1
Dir.  Part               76   57.9%   34.2%    7.9%    0.0%   0.0%   0.0%   0.0%     0     2        0
Ind.  Superclass     20,360   25.0%   28.5%   18.7%   11.1%   5.5%   3.6%   7.7%     0    22        1
Ind.  Whole           1,004   93.3%    6.1%    0.6%    0.0%   0.0%   0.0%   0.0%     0     2        0
Ind.  Associative     6,548   46.3%   18.6%   11.3%   10.6%   6.8%   2.4%   4.1%     0    11        1
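
As an illustration of how a row such as “Ind. Superclass” in Table 1 could be derived, the following is a minimal sketch in Python rather than the triple-store queries used in the study. It assumes that each terminology is available as a mapping from a concept to its direct parents, and that equivalence is decided by mapping both sides to a shared identifier (standing in for a UMLS concept identifier). All concept names, identifiers and function names below are hypothetical, and counting shared relata by shared identifier is an assumption, since the way many-to-one equivalences are counted is not specified here.

```python
from collections import Counter
from statistics import median


def ancestors(concept, parents):
    """Return every superclass reachable by following parent links recursively."""
    seen, stack = set(), [concept]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def to_shared_ids(concepts, cui):
    """Map source concepts to cross-terminology identifiers, dropping unmapped ones."""
    return {cui[c] for c in concepts if c in cui}


def shared_counts(pairs, parents_a, parents_b, cui):
    """Count shared ancestors for each equivalent pair where both sides have ancestors."""
    counts = []
    for a, b in pairs:
        anc_a, anc_b = ancestors(a, parents_a), ancestors(b, parents_b)
        if anc_a and anc_b:  # N only includes pairs with relata on both sides
            counts.append(len(to_shared_ids(anc_a, cui) & to_shared_ids(anc_b, cui)))
    return counts


def summarize(counts):
    """Build one Table 1-style row: N, percentage per bucket, min, max, median."""
    if not counts:
        raise ValueError("no pair has relata on both sides")
    n = len(counts)
    hist = Counter(min(c, 6) for c in counts)  # bucket 6 stands for ">5"
    labels = ["0", "1", "2", "3", "4", "5", ">5"]
    pct = {label: 100.0 * hist[k] / n for k, label in enumerate(labels)}
    return {"N": n, "pct": pct, "min": min(counts), "max": max(counts),
            "median": median(counts)}


# Hypothetical toy hierarchies (not actual SNOMED CT or NCIt content).
parents_s = {
    "s_t1dm": ["s_dm"],
    "s_dm": ["s_pancreas_dis", "s_glucose"],
    "s_pancreas_dis": ["s_disease"],
    "s_glucose": ["s_disease"],
}
parents_n = {
    "n_t1dm": ["n_dm"],
    "n_dm": ["n_endocrine"],
    "n_endocrine": ["n_disease"],
}
# Hypothetical shared identifiers playing the role of UMLS concepts.
cui = {
    "s_t1dm": "U0", "n_t1dm": "U0",
    "s_dm": "U1", "n_dm": "U1",
    "s_disease": "U2", "n_disease": "U2",
    "s_pancreas_dis": "U3", "s_glucose": "U4", "n_endocrine": "U5",
}
pairs = [("s_t1dm", "n_t1dm"), ("s_dm", "n_dm"), ("s_disease", "n_disease")]

counts = shared_counts(pairs, parents_s, parents_n, cui)
row = summarize(counts)
print(row)
```

In this sketch the pair of root concepts is excluded from N because neither side has any ancestor, which mirrors the definition of N given for Table 1. A direct-relation row would use the immediate parents instead of the result of `ancestors`.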




[Figure 4 diagram: concept graph not reproduced in this text version. Node and edge labels
recovered from the figure are listed below.

Legend: equivalent concepts (UMLS); isa relationship; associative relationship.

NCI Thesaurus concepts: Disease or Disorder (C2991), Gastrointestinal Disorder (C2990),
Pancreatic Disorder (C26842), Endocrine Pancreas Disorder (C27067), Endocrine Disorder (C3009),
Glucose Metabolism Disorder (C53655), Diabetes Mellitus (C2985), Insulin Dependent Diabetes
Mellitus (C2986), Autoimmune Disease, Gastrointestinal System (C12378), Pancreas (C12393),
Endocrine Pancreas (C32509), Islet of Langerhans (C12608).
NCI Thesaurus associative relations: disease has associated anatomic site, disease has primary
anatomic site, disease has normal tissue origin.

SNOMED CT concepts: Disease (64572001), Disorder of digestive system (53619000), Disorder of
pancreas (3855007), Disorder of endocrine pancreas (17346000), Disorder of endocrine system
(362969004), Disorder of glucose metabolism (126877002), Diabetes mellitus (73211009), Diabetes
mellitus type 1 (46635009), Cell-mediated cytotoxic disorder, Allergic disorder of digestive
system, Immune hypersensitivity reaction, Structure of digestive system (86762007), Pancreatic
structure (15776009), Endocrine pancreatic structure (78696007).
SNOMED CT associative relations: finding site, due to.]

     Figure 4. Representation of Type 1 diabetes mellitus in SNOMED CT and NCIt, showing shared relata for ancestors and