=Paper=
{{Paper
|id=Vol-1747/IT601_ICBO2016
|storemode=property
|title=Identifying Missing Hierarchical Relations in SNOMED CT from Logical Definitions Based on the Lexical Features of Concept Names
|pdfUrl=https://ceur-ws.org/Vol-1747/IT601_ICBO2016.pdf
|volume=Vol-1747
|authors=Olivier Bodenreider
|dblpUrl=https://dblp.org/rec/conf/icbo/Bodenreider16
}}
==Identifying Missing Hierarchical Relations in SNOMED CT from Logical Definitions Based on the Lexical Features of Concept Names ==
Olivier Bodenreider

U.S. National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA

olivier.bodenreider@nih.gov

Abstract: Objectives. To identify missing hierarchical relations in SNOMED CT from logical definitions based on the lexical features of concept names. Methods. We first create logical definitions from the lexical features of concept names, which we represent in OWL EL. We infer hierarchical (subClassOf) relations among these concepts using the ELK reasoner. Finally, we compare the hierarchy obtained from lexical features to the original SNOMED CT hierarchy. We review the differences manually for evaluation purposes. Results. Applied to 15,833 disorder and procedure concepts, our approach identified 559 potentially missing hierarchical relations, of which 78% were deemed valid. Conclusions. This lexical approach to quality assurance is easy to implement, efficient and scalable.

Keywords: description logics; SNOMED CT; quality assurance; lexical features.

===I. INTRODUCTION===
Quality assurance of large biomedical terminologies remains an active area of research [1]. For example, recent investigations of SNOMED CT have highlighted issues in its hierarchical structure and demonstrated their detrimental consequences (e.g., [2]).

Both lexical features and logical definitions have been used for quality assurance purposes. Approaches based on lexical features generally exploit the presence of specific words in SNOMED CT terms or contrast sets of words for terms across concepts to suggest relations among concepts (e.g., [3-6]). For example, the concepts Asthma and Acute asthma can be represented by the sets of words {asthma} and {acute, asthma}, respectively. Since {asthma} is a proper subset of {acute, asthma}, the principles of lexical semantics suggest that Acute asthma is more specific than Asthma [7]. Approaches based on logical definitions often rely on a description logics reasoner for analyzing the facts in the ontology (e.g., [8]). The logical definitions found in SNOMED CT are sets of axioms (facts), i.e., logical statements relating concepts through “roles” (relationships), representing biomedical knowledge. For example, the axiom “Acute asthma, Clinical Course, Sudden onset AND/OR short duration” is part of the logical definition of Acute asthma and provides a formal representation of the acute aspect of the disease. Although logical definitions generally rely on knowledge associated with concepts, we exploit the fact that such definitions can also be created from lexical features.

The objective of this investigation is to identify missing hierarchical relations in SNOMED CT from logical definitions based on the lexical features of concept names. More specifically, we propose to leverage description logics for representing the lexical features of concept names and infer hierarchical relations based on these lexical features with a reasoner. The hierarchical relations inferred from lexical features but not present in SNOMED CT are candidates for missing relations.

===II. BACKGROUND===
====A. SNOMED CT====
Developed by the International Health Terminology Standards Development Organisation (IHTSDO), SNOMED CT is the world’s largest clinical terminology. With 320,000 active concepts, it provides broad coverage of clinical medicine, including findings, diseases, and procedures for use in electronic medical records [9].

SNOMED CT provides a preferred name and synonyms for each concept (“descriptions” in SNOMED CT parlance). The “fully specified name” is guaranteed to be unique for each concept and consists of the preferred term followed by a semantic tag (e.g., Blepharorrhaphy (procedure) (388008)). In addition to names, all concepts have a logical definition, based on definitional characteristics of the concept (not on the lexical features of the concept names). For example,

 Class: Blepharorrhaphy
     EquivalentTo:
         Suture of eyelid
         and (Method some Closure - action)
         and (Procedure site - Direct some Structure of palpebral fissure)
         and (Using device some Surgical suture, device)

In SNOMED CT, the logical definitions are processed with a description logic reasoner for consistency validation and to generate the hierarchical structure by inferring subClassOf relations among the concepts.

The version of SNOMED CT used in this work is the U.S. edition dated March 2016.

====B. Description logics====
Description logics (DL) are a family of knowledge representation languages often used as ontology languages, and defined as a trade-off between expressivity and tractability [10]. Reasoners are computer programs that can check the consistency of the facts asserted in the ontology and infer relations among ontology classes based on these facts (i.e., infer hierarchical (subClassOf) relations).

Among the various flavors of DL languages available, the EL family offers sufficient expressivity for the simple definitions resulting from lexical features, as well as scalability to a large number of classes [11]. The reasoners developed for EL (e.g., ELK [12]) offer impressive performance.

As illustrated above, SNOMED CT relies on DL for representing the logical definitions it provides for its concepts. It also makes use of a reasoner for testing the consistency of these definitions across the whole ontology, as well as for inferring the hierarchy of concepts. In this work, we apply the reasoner not to the logical definitions provided by SNOMED CT to represent biomedical knowledge, but rather to the definitions we generate from the lexical features of the terms of SNOMED CT concepts.

====C. Quality assurance of biomedical ontologies====
Approaches to quality assurance in biomedical ontologies can be classified into lexical, structural and semantic approaches [13]. Lexical approaches rely on the lexical features of terms; structural approaches analyze the hierarchical structure of ontologies; and semantic approaches exploit the relations among concepts (including logical definitions). Examples of lexical and semantic approaches applied to quality assurance in SNOMED CT were presented earlier in the introduction. (Structural approaches are less relevant to this work and will not be discussed here.)

Of note, while DL techniques are generally used in the context of semantic approaches, in this work, we leverage a DL reasoner for the implementation of a lexical approach to QA, since our logical definitions are created on the basis of lexical features.

The compositionality of terms in biomedical ontologies is well documented and has been exploited for quality assurance purposes (e.g., [14, 15]). However, Mungall used ad hoc programming (in Prolog) rather than a DL reasoner to infer relations among terms. Our approach is also much simpler in that it only relies on sets of words and only attempts to elicit hierarchical relations.

====D. Specific contribution====
The specific contribution of this work is not in leveraging the compositionality of biomedical terms for suggesting relations, but rather in proposing a description logics approach to doing so. While ad hoc programming is usually necessary for comparing bags of words, our work demonstrates it can also be supported effectively by a DL reasoner. To our knowledge, this is the first attempt to generate logical definitions based on the lexical features of concept names in SNOMED CT for quality assurance purposes.

===III. METHODS===
Our method for identifying missing hierarchical relations from SNOMED CT can be summarized as follows. We first create logical definitions from the lexical features of concept names, which we represent in the web ontology language, OWL. We infer hierarchical (subClassOf) relations among these concepts using a reasoner. Finally, we compare the hierarchy obtained from lexical features to the original SNOMED CT hierarchy. We review the differences manually for evaluation purposes. In this preliminary investigation, we applied this approach to a significant subset of the Clinical Finding hierarchy rooted with the concept Disorder of head (disorder) (118934005) and a smaller subset of the Procedure hierarchy rooted with the concept Operative procedure on head (procedure) (89901005).

====A. Creating logical definitions based on the lexical features of concept names====
For each concept under investigation, we extract the fully specified name, which consists of the preferred term (e.g., “Disorder of head”) followed by a semantic tag in parentheses (e.g., “disorder”). For each concept C with fully specified name “w1 w2 … wn (T)”, where {w1, w2, … wn} is the set of words in the preferred term and where T is the semantic tag, we create a logical definition of the following form (expressed in the simplified OWL syntax known as Manchester syntax [16]):

 Class: C
     EquivalentTo:
         T
         and (has_word some w1)
         and (has_word some w2)
         …
         and (has_word some wn)

For example, the class definition for the concept Complete ablepharon (disorder) (708541009) is shown in Fig. 1.

Fig. 1. Class definition for the concept Complete ablepharon (disorder)

In practice, we use a simple script to create an OWL file that contains the class definitions for all the concepts under investigation. The words “the” and “of”, present in a large proportion of terms, are omitted when generating the class definitions.

Of note, the OWL constructs used in these definitions (namely class equivalence and existential quantification to a class expression) are compatible with the OWL 2 EL profile [11].

====B. Inferring subClassOf relations from lexical features====
We load this OWL file in the Protégé ontology editor (5.0 beta), in which we have installed the plugin for the ELK reasoner [12], specially optimized for classifying OWL 2 EL ontologies. Prior to running the reasoner, the SNOMED CT concepts imported into Protégé appear as a flat list (i.e., with no hierarchical structure) under the classes created for the semantic tags (Fig. 2). After ELK has run, inferred subClassOf axioms among the SNOMED CT concepts have been added to the ontology and the concepts are no longer displayed as a flat list (Fig. 3). For example, the three concepts Ablepharon (disorder) (13401001), Complete ablepharon (disorder) (708541009), and Partial ablepharon (disorder) (45484000) are listed under disorder in the asserted hierarchy (Fig. 2), but Complete ablepharon (disorder) and Partial ablepharon (disorder) are subclasses of Ablepharon (disorder) in the inferred hierarchy (Fig. 3).

Fig. 2. Asserted hierarchy: Ablepharon (disorder) prior to running the reasoner (no inferred subclasses)

Fig. 3. Inferred hierarchy: Ablepharon (disorder) after the reasoner has run (two inferred subclasses: Complete ablepharon (disorder) and Partial ablepharon (disorder))

Since the subClassOf relations are inferred from lexical features, we need to filter out complex terms with prepositional phrases to avoid generating wrong subClassOf relations. For example, for Dementia due to Parkinson's disease (disorder) (101421000119107), a subClassOf relation is inferred to both Dementia (disorder) (52448006) and Parkinson's disease (disorder) (49049000). Similarly, for Goniopuncture without goniotomy (procedure) (202727004), a subClassOf relation is inferred to both Goniopuncture (procedure) (265293008) and Goniotomy (procedure) (265292003). While this behavior is expected from the reasoner, it is not desirable, because Dementia due to Parkinson's disease (disorder) is not a kind of Parkinson's disease (disorder), as indicated by the prepositional expression “due to”. Similarly, Goniopuncture without goniotomy (procedure) specifically excludes Goniotomy (procedure). In practice, to avoid generating such wrong subClassOf relations, we filter out the relations generated when the name of the most specific (“child”) concept contains any of the following words: “and”, “or”, “and/or”, “with”, “without”, “from”, “due to”, “secondary to”, “except”, “by”, “after”, “revision” and “ligation for”.

====C. Comparing the hierarchy inferred from lexical features to the original hierarchy====
To analyze which relations from the inferred hierarchy are not already in the original SNOMED CT hierarchy (i.e., the hierarchy found in the SNOMED CT distribution), we need to generate these two sets of hierarchical relations and compute the difference between them. Using Protégé, we export the inferred subClassOf axioms to a file in RDF format for comparison to the original hierarchical relations in SNOMED CT. Using a simple script, we write the original hierarchical relations in SNOMED CT to RDF for the subhierarchies under investigation. In practice, because the inferred relations can be between any two classes, we enrich the original hierarchy with the transitive closure of subClassOf relations. We load the files for the two sets of relations, inferred and original, into the triple store Virtuoso and use a SPARQL query to compute the set of hierarchical relations from the inferred set that is not part of the hierarchical relations originally in SNOMED CT (transitively closed). The SPARQL 1.1 operator MINUS makes such comparison between two graphs extremely easy.

====D. Evaluation====
We manually review for validity a random subset of 100 inferred relations that are not present in the original SNOMED CT hierarchy (transitively closed).

===IV. RESULTS===
====A. Creating logical definitions based on the lexical features of concept names====
We created logical definitions for the […] concepts of the subhierarchy rooted with the concept Disorder of head (disorder) (118934005) and for the 3795 concepts (1899 distinct words) of the subhierarchy rooted with the concept Operative procedure on head (procedure) (89901005).

====B. Inferring subClassOf relations from lexical features====
Running the ELK reasoner took a few seconds and resulted in the creation of 7079 inferred subClassOf relations among the concepts of the subhierarchy rooted with the concept Disorder of head (disorder). Similarly, 1357 relations were inferred in the subhierarchy rooted with the concept Operative procedure on head (procedure).

====C. Comparing the hierarchy inferred from lexical features to the original hierarchy====
After subtracting from the inferred subClassOf relations created by the reasoner those subClassOf relations already present in the original version of SNOMED CT (transitively closed), we obtained 1210 inferred subClassOf relations for the Disorder of head (disorder) hierarchy and 242 inferred subClassOf relations for the Operative procedure on head (procedure) hierarchy. Of these, 469 subClassOf relations for disorders and 90 for procedures met our criteria for review (i.e., the name of the child concept does not contain any of the prepositional and other expressions listed earlier).

====D. Evaluation====
The random subset of 100 inferred subClassOf relations we reviewed comprises 83 disorders and 17 procedures. Overall, 78 relations were deemed valid, 19 invalid and 3 questionable (i.e., these relations seem to have face validity, but may not be compliant with SNOMED CT editorial policies). Examples of such relations are listed in Table I.

===V. DISCUSSION===
====A. Findings====
As expected, a vast majority of the hierarchical relations suggested lexically were already present in the original SNOMED CT hierarchy (transitively closed). Specifically, only 1210 of the 7079 hierarchical relations for disorders (17%) and 242 of the 1357 hierarchical relations for procedures (18%) were not already represented in SNOMED CT.

However, it was somewhat surprising to us to see that a large number of potentially missing hierarchical relations had been generated from this simple technique based on lexical features. Assuming 80% of the 559 hierarchical relations generated are correct, we discovered 447 missing hierarchical relations among the 15,883 concepts under investigation. Interestingly, the proportion is roughly the same for disorders and procedures.

In addition to the evaluation, we performed a cursory review of the 559 potentially missing hierarchical relations, among which we identified a few patterns. In 31 cases, the missing relation was between “carcinoma in situ of [site]” and “carcinoma of [site]” (or “[site] carcinoma”), for example, between Carcinoma in situ of palate (disorder) (92670007) and Palate carcinoma (disorder) (274084007). Another such pattern was found in 23 cases between “congenital [disorder]” and the unqualified disorder, for example, between Congenital anterior staphyloma (disorder) (253230008) and Anterior staphyloma (disorder) (231888000).

====B. Technical significance====
The novel aspect of this work is to use a DL approach to lexical similarity. In practice, it means that no ad hoc programming is required for identifying partial ordering relations among sets of words for terms in an ontology reflecting hierarchical relations among the corresponding concepts. Instead, logical definitions created from lexical features can simply be represented in DL formalism and run through a reasoner to infer the relevant subClassOf relations. As shown here, this approach is easy to implement, efficient and scalable. The only programming required is for serializing the logical definitions in the appropriate DL format.

Moreover, given that SNOMED CT already uses DL techniques for representing its logical definitions based on biomedical knowledge and an EL reasoner for inferring its hierarchy, it can be expected that the IHTSDO could easily integrate the lexical approach to quality assurance proposed here.

Finally, having two kinds of logical definitions (from biomedical knowledge and from lexical features) represented in the same formalism would make it possible to integrate them into the same framework, for example to test the consistency between the two kinds of definitions.

====C. Limitations and future work====
This preliminary investigation is limited to two subhierarchies of SNOMED CT for diseases and procedures. However, we also generated definitions and inferred hierarchy for the whole SNOMED CT and did not notice any scalability issues. We did not leverage SNOMED CT synonyms for creating logical definitions, but this should be a natural extension of this investigation. In future work, we also would like to normalize terms before creating the definitions, since normalization is a common approach to managing term variation [17].

This bag-of-word approach to comparing terms tends to generate more false positives than a linguistically motivated approach, where the head of the noun phrase would be required to be the same in two hierarchically related concepts, as we did in other work [18]. In fact, many of the errors detected during the evaluation correspond to cases where the specific term is linked to a term that does not contain the head of the noun phrase of the specific term. However, the bag-of-word approach is much easier to implement than linguistically motivated approaches, and we showed that false positives can be mitigated in part by filtering out complex terms.

In this preliminary investigation, we performed a limited evaluation. Given the encouraging results, we plan to extend the investigation to the entirety of SNOMED CT, evaluate the results more thoroughly, and share them with the SNOMED CT developers at the IHTSDO.

Finally, the lexical approach to quality assurance proposed here could also complement structural approaches, such as the lattice-based approach we proposed earlier [19].

====D. Generalization====
This approach to identifying missing hierarchical relations would be applicable not only to the entirety of SNOMED CT, but to other biomedical ontologies as well. More specifically, it could be applied to any biomedical ontology for which concept names and hierarchical relations are available (i.e., most ontologies). The same approach could also be applied to the creation of partial mappings.

===ACKNOWLEDGMENT===
This work was supported by the Intramural Research Program of the NIH, National Library of Medicine. This work was conducted using the Protégé resource, which is supported by grant GM10331601 from the National Institute of General Medical Sciences of the United States National Institutes of Health. We would like to thank Dr. GQ Zhang for providing motivation and encouragement for this investigation.

===REFERENCES===
[1] J. Geller, et al., “Special issue on auditing of terminologies,” J Biomed Inform, vol. 42, no. 3, 2009, pp. 407-411.

[2] A.L. Rector, et al., “Getting the foot out of the pelvis: modeling problems affecting use of SNOMED CT hierarchies in practical applications,” J Am Med Inform Assoc, vol. 18, no. 4, 2011, pp. 432-440.

[3] O. Bodenreider, et al., “Assessing the consistency of a biomedical terminology through lexical knowledge,” Int J Med Inform, vol. 67, no. 1-3, 2002, pp. 85-95.

[4] K.E. Campbell, et al., “A "lexically-suggested logical closure" metric for medical terminology maturity,” Proc AMIA Symp, 1998, pp. 785-789.

[5] E. Mikroyannidi, et al., “Analysing Syntactic Regularities and Irregularities in SNOMED-CT,” J Biomed Semantics, vol. 3, no. 1, 2012, pp. 8.

[6] E. Pacheco, et al., “Detecting Underspecification in SNOMED CT concept definitions through natural language processing,” AMIA Annu Symp Proc, vol. 2009, 2009, pp. 492-496.

[7] D.A. Cruse, Lexical semantics, Cambridge University Press, 1986, p. xiv, 310.

[8] K. Dentler and R. Cornet, “Intra-axiom redundancies in SNOMED CT,” Artif Intell Med, vol. 65, no. 1, 2015, pp. 29-34.

[9] IHTSDO, “SNOMED CT,” 2016.

[10] F. Baader, et al., “Description logics,” Handbook on ontologies, International handbooks on information systems, S. Staab and R. Studer, eds., Springer, 2004, pp. 3-28.

[11] W3C, “OWL 2 Web Ontology Language Profiles (Second Edition),” 2012; https://www.w3.org/TR/owl2-profiles/#OWL_2_EL.

[12] Y. Kazakov, et al., “The Incredible ELK: From Polynomial Procedures to Efficient Reasoning with EL Ontologies,” Journal of Automated Reasoning, vol. 53, no. 1, 2013, pp. 1-61.

[13] X. Zhu, et al., “A review of auditing methods applied to the content of controlled biomedical terminologies,” J Biomed Inform, vol. 42, no. 3, 2009, pp. 413-425.

[14] C.J. Mungall, “Obol: integrating language and meaning in bio-ontologies,” Comp Funct Genomics, vol. 5, no. 6-7, 2004, pp. 509-520.

[15] P.V. Ogren, et al., “The compositional structure of Gene Ontology terms,” Pac Symp Biocomput, 2004, pp. 214-225.

[16] W3C, “OWL 2 Web Ontology Language Manchester Syntax (Second Edition),” 2012; https://www.w3.org/TR/owl2-manchester-syntax/.

[17] A.T. McCray, et al., “Lexical methods for managing variation in biomedical terminologies,” Proc Annu Symp Comput Appl Med Care, 1994, pp. 235-239.

[18] F. Dhombres and O. Bodenreider, “Interoperability between phenotypes in research and healthcare terminologies--Investigating partial mappings between HPO and SNOMED CT,” J Biomed Semantics, vol. 7, 2016, pp. 3.

[19] G.Q. Zhang and O. Bodenreider, “Large-scale, Exhaustive Lattice-based Structural Auditing of SNOMED CT,” AMIA Annu Symp Proc, vol. 2010, 2010, pp. 922-926.

TABLE I. EXAMPLES OF SUBCLASSOF RELATIONS INFERRED FROM LEXICAL FEATURES

{| class="wikitable"
! Hierarchy !! Child ID !! Child name !! Parent ID !! Parent name !! Valid
|-
| Procedure || 239405007 || Alveolar bone graft to mandible (procedure) || 178493006 || Alveolar bone graft (procedure) || yes
|-
| Disorder || 402819001 || Basal cell carcinoma of skin of lip (disorder) || 269515006 || Carcinoma of lip (disorder) || yes
|-
| Disorder || 92670007 || Carcinoma in situ of palate (disorder) || 274084007 || Palate carcinoma (disorder) || yes
|-
| Disorder || 232225005 || Chronic bacterial otitis externa (disorder) || 53295002 || Chronic otitis externa (disorder) || yes
|-
| Disorder || 700278007 || Congenital vascular anomaly of eyelid (disorder) || 69973000 || Vascular anomaly of eyelid (disorder) || yes
|-
| Procedure || 31230008 || Electrocoagulation of retina for repair of tear (procedure) || 450698009 || Repair of retina (procedure) || yes
|-
| Disorder || 40571009 || Hallucinogen intoxication delirium (disorder) || 50320000 || Hallucinogen intoxication (disorder) || no
|-
| Disorder || 609209009 || Infection of preauricular sinus (disorder) || 204271000 || Preauricular sinus (disorder) || no
|-
| Disorder || 237664006 || Pituitary stalk compression hyperprolactinemia (disorder) || 237723009 || Pituitary stalk compression (disorder) || no
|-
| Procedure || 440303005 || Suture of tongue to lip for micrognathia (procedure) || 3889008 || Suture of lip (procedure) || no
|}
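
The steps described in Section III can be illustrated with a small Python sketch. It is not the script used in the paper, and it replaces the ELK reasoner with a plain proper-subset test on word sets, which is the subsumption one would expect a reasoner to infer for definitions of this particular shape. The tokenization (lowercasing and splitting on whitespace), the `sct_` class-name prefix, all function names, and the single asserted edge in the toy data are assumptions made for illustration only; the stop words, the excluded expressions, and the concept names and identifiers are taken from the text.

```python
import re

# Words the paper omits when generating class definitions.
STOP_WORDS = {"the", "of"}
# Expressions listed in the paper: a relation is dropped if the child name contains one.
EXCLUDED = ("and", "or", "and/or", "with", "without", "from", "due to",
            "secondary to", "except", "by", "after", "revision", "ligation for")


def split_fsn(fsn):
    """Split 'Preferred term (tag)' into (preferred term, semantic tag)."""
    match = re.fullmatch(r"(.+)\s\(([^()]+)\)", fsn.strip())
    if match is None:
        raise ValueError(f"not a fully specified name: {fsn!r}")
    return match.group(1), match.group(2)


def word_set(term):
    """Assumed tokenization: lowercase, split on whitespace, drop stop words."""
    return frozenset(w for w in term.lower().split() if w not in STOP_WORDS)


def manchester_definition(concept_id, fsn):
    """Render a definition in the shape shown in Section III.A.
    Names are not escaped; a real serializer would need valid OWL names/IRIs."""
    term, tag = split_fsn(fsn)
    lines = [f"Class: sct_{concept_id}", "    EquivalentTo:", f"        {tag}"]
    lines += [f"        and (has_word some {w})" for w in sorted(word_set(term))]
    return "\n".join(lines)


def lexical_subclass_pairs(concepts):
    """(child, parent) pairs with the same tag where the parent's words
    are a proper subset of the child's words."""
    sets = {}
    for cid, fsn in concepts.items():
        term, tag = split_fsn(fsn)
        sets[cid] = (word_set(term), tag)
    return {(c, p)
            for c, (cw, ct) in sets.items()
            for p, (pw, pt) in sets.items()
            if ct == pt and pw < cw}


def has_excluded_expression(fsn):
    """True if the preferred term contains one of the excluded expressions."""
    term, _ = split_fsn(fsn)
    padded = f" {term.lower()} "
    return any(f" {expr} " in padded for expr in EXCLUDED)


def transitive_closure(edges):
    """All (descendant, ancestor) pairs implied by (child, parent) edges."""
    parents = {}
    for child, parent in edges:
        parents.setdefault(child, set()).add(parent)
    closure = set()
    for start in parents:
        stack, seen = list(parents[start]), set()
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(parents.get(node, ()))
        closure.update((start, ancestor) for ancestor in seen)
    return closure


def candidate_missing(concepts, asserted_edges):
    """Inferred pairs minus the closed original hierarchy, after filtering."""
    known = transitive_closure(asserted_edges)
    return {(c, p) for c, p in lexical_subclass_pairs(concepts) - known
            if not has_excluded_expression(concepts[c])}


CONCEPTS = {
    "13401001": "Ablepharon (disorder)",
    "708541009": "Complete ablepharon (disorder)",
    "45484000": "Partial ablepharon (disorder)",
    "101421000119107": "Dementia due to Parkinson's disease (disorder)",
    "52448006": "Dementia (disorder)",
    "49049000": "Parkinson's disease (disorder)",
}
# Hypothetical asserted edge, invented for this example (not taken from SNOMED CT).
ASSERTED = {("45484000", "13401001")}

if __name__ == "__main__":
    print(manchester_definition("708541009", CONCEPTS["708541009"]))
    print(sorted(candidate_missing(CONCEPTS, ASSERTED)))
```

On this toy input, the two relations inferred for Dementia due to Parkinson's disease (disorder) are discarded because the child name contains “due to”, and the relation from Partial ablepharon (disorder) is discarded because it is already covered by the (hypothetical) asserted edge. The set difference plays the role that the SPARQL MINUS query plays in the paper.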