Representing and sharing knowledge using SNOMED
Proceedings of the 3rd international conference on Knowledge Representation in Medicine (KR-MED 2008)
R. Cornet, K.A. Spackman (Eds)

Comparing the Effects of Two Semantic Terminology Models on Classification of Clinical Notes: A Study of Heart Murmur Findings

Guoqian Jiang, Ph.D. and Christopher G. Chute, M.D., Dr. P.H.
Division of Biomedical Informatics, Mayo Clinic College of Medicine, Rochester, MN (mailto:Jiang.Guoqian@mayo.edu)

Abstract

Objectives: We compared the effects of two semantic terminology models on the classification of clinical notes through a study in the domain of heart murmur findings. Methods: One schema was established from the existing SNOMED CT model (S-Model) and the other from a template model (T-Model), which uses base concepts and non-hierarchical relationships to characterize the murmurs. A corpus of clinical notes (n=309) was collected and annotated using the two schemas. The annotations were coded as input to a decision tree classifier for a text classification task. The standard information retrieval measures of precision, recall, f-score and accuracy, together with the paired t-test, were used for evaluation. Results: The S-Model performed better than the original T-Model (p<0.05 for recall and f-score). A revised T-Model, extended in both its structure and its corresponding values, performed better than the S-Model (p<0.05 for recall and accuracy). Conclusion: We found that content coverage is a more important factor than the terminology model for classification; however, a template style facilitates the discovery and completion of content gaps.

Introduction

While modern terminologies have advanced well beyond simple one-dimensional subsumption relationships through the introduction of composite expressions, there is an emerging convergence of approaches toward the use of a concept-based clinical terminology with an underlying formal semantic terminology model (STM) [1]. SNOMED CT, the most comprehensive clinically oriented medical terminology system, currently adopts a foundation based on a description logic (DL) model and uses the underlying DL-based structure to formally represent the meanings of concepts and the interrelationships between concepts [2-3]. The existing SNOMED CT model is mainly pre-coordination oriented, i.e. it contains many pre-coordinated terms, but it also supports post-coordination. For example, the compositional expression “[ hypophysectomy (52699005) ] + [ transfrontal approach (65519007) ]” can describe a more specific clinical statement than the term “hypophysectomy (52699005)” alone. For a specific domain, a template model having a semantic structure with a coherent class of terms can be used as a formal representation [4]. This kind of model is mainly post-coordination oriented: a list of atomic terms is organized within a semantic structure. For example, the latest version of the International Classification of Nursing Practice (ICNP) uses a 7-Axis model to support the representation of nursing concepts and integrates the domain concepts of nursing in a manner suitable for computer processing [5].

One of the main goals of semantic terminology models is to support the capture of structured clinical information, which is crucial for computer programs such as information retrieval systems and decision support tools [6]. Structured recording has the potential to improve information retrieval from a patient database in response to clinically relevant questions [1]. However, a functional difference in retrieval performance between these two kinds of semantic terminology model has not been clearly demonstrated.

In this study, we focus on the specific domain of heart murmur findings. Two schemas were established from two different semantic terminology models for evaluation: one schema is extracted from the existing SNOMED CT model (S-Model) and the other is a template model (T-Model) extracted from a concept-dependent attributes model recently published by Green, et al [7]. The objectives of the study are to annotate real clinical notes using the two schemas and to compare and evaluate the effects of the two models on classification of the clinical notes.

Methods and Materials

Defining the annotation schemas

We defined one schema for each of the S-Model and the T-Model and represented both schemas in Protégé (version 3.2 beta), an ontology editing environment developed by Stanford Medical Informatics [8].

For the S-Model, we established a schema by extracting concept trees from the existing sub-hierarchy of heart murmur findings in the January 2006 version of SNOMED CT (see Fig. 1). One root concept is “Heart murmur (SCTID_88610006)”, which includes 86 sub-concepts of pre-coordinated terms for heart murmur findings. The other root concept is “Anatomical concepts (SCTID_257728006)”, which includes two parts relevant to our schema. One part is the concept “Cardiac internal structure (SCTID_277712000)” and its sub-concepts. The other part contains only those anatomical concepts that appear in our clinical notes corpus, as determined by manual review. For all heart murmur concepts, two semantic attributes were derived from the SNOMED CT context model for heart murmur findings to frame post-coordination.
One is “procedure site”, which represents the auscultation site of a heart murmur, and the other is “finding site”, which represents the potential etiological site of a heart murmur. The values of the former were set as the instances of “Anatomical concepts (SCTID_257728006)” and the values of the latter were set as the instances of “Cardiac internal structure (SCTID_277712000)”.

Fig. 1 Schema of SNOMED CT Model (S-Model) for heart murmur findings represented in Protégé

Fig. 2 Schema of Template Model (T-Model) for heart murmur findings represented in Protégé

For the T-Model, a schema was established from a concept-dependent attributes model published in a recent paper by Green, et al [7]. In this schema (see Fig. 2), one root concept is “heart murmur”, which has eight semantic attributes: “has cardiac cycle timing”, “has murmur configuration”, “has murmur duration”, “has murmur intensity”, “has murmur pitch”, “has murmur quality”, “has point of maximum intensity” and “radiates towards”. The corresponding values of these eight attributes were set as the sub-concepts of the other root concept, “cardiac murmur characteristic values”. We adopted the model attributes directly from Green’s model, as well as their values (kindly provided by Green, personal communication).

Preparing clinical notes corpus

The Mayo Clinic has a repository of approximately twenty million clinical notes, consisting of documents dictated by physicians that are subsequently transcribed and filed as part of the patient’s electronic medical record. We sampled notes as follows. Firstly, we automatically extracted notes from the Mayo repository that met these criteria: 1) created between January 1, 2005 and January 31, 2005; 2) having a heart murmur description in the Physical Examination section; 3) age ≥ 21; 4) having a Hospital International Classification of Disease Adaptation (HICDA) code for Heart Valvular Disease; and 5) not having a code for status prosthetic valve or complication of a prosthetic valve. Secondly, we flagged the extracted documents containing a diagnosis of aortic stenosis (AS), yielding 103 documents. Thirdly, we randomly selected controls among the extracted documents having no diagnosis of AS, subject to the following conditions: 1) no history of valvular surgery; 2) matching gender, and age within 1 year, for each case (see Table 1). Two controls were retained for each case, for a total of 309 documents (103 cases and 206 controls). Finally, we parsed out the cardiac exam from the Physical Examination section of each document to create the annotation corpus.

Table 1. Control document selection by matching on gender and age

Age     Male cases  Male controls  Female cases  Female controls  Total
21-30        1            2              0              0             3
31-40        0            0              0              0             0
41-50        0            0              2              4             6
51-60        4            8              0              0            12
61-70        7           14              5             10            36
71-80       26           52              7             14            99
81-90       24           48             21             42           135
91-          2            4              4              8            18
Total       64          128             39             78           309

Annotation software and Annotators

A general purpose text annotation tool, Knowtator [9], was used to map text contents to our schemas. Knowtator is a Java plug-in for Protégé and is mainly used for creating gold-standard training and evaluation corpora for natural language processing (NLP) systems. The annotation schemas described in the section above were instantiated in Knowtator.

One author (GJ) performed the annotation task and the other author (CGC) then verified the annotations for 10% of all documents. Differences were mutually adjudicated and the lessons learned were generalized to the remaining 90% of cases.

Coding for machine learning classification

We coded the annotated corpora for classification by a machine learning algorithm. The target category of the classification is binary, i.e. aortic stenosis (AS) or non-AS. In other words, the goal of the classification is to predict whether a document with a heart murmur description belongs to the AS category or not. The annotations of each document were used as the predictive features and coded as binary values.

We used a Weka implementation of the decision tree (J4.8) [10], which is a well-known supervised approach to classification.

Outcome measures and statistical analysis

For the annotation task, we compared the description completeness of the two models. While performing the annotation task, the annotators were asked to judge whether the heart murmur descriptions of each document could be described completely using the schema of a model. If they judged a document as “incomplete”, they indicated a reason for the judgment.

To evaluate the data retrieval task, we used the standard evaluation metrics of precision, recall, f-score and accuracy. Precision is defined as the ratio of documents correctly assigned to the AS category (true positives) to the total number of hits (true positives and false positives). Recall is the ratio of documents correctly assigned to the AS category (true positives) to the number of documents of the target category in the test set (true positives and false negatives). The f-score is the harmonic mean of precision and recall. Accuracy is the ratio of correctly assigned categories (true positives and true negatives) to the total number of instances in the test dataset.

For the S-Model, one dataset (SM) was prepared, containing the annotations of both heart murmurs and anatomical concepts. For the T-Model, three datasets were prepared. The first (TM1) contains the annotations from Green’s original model. The other two datasets are extensions of TM1. We extended TM1 to create TM2 by completing the values for all eight semantic attributes whenever a description appearing in the clinical notes corpus did not have a corresponding value in TM1.
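A minimal sketch of this value-completion step follows, in Python. It is illustrative only: the function name `complete_value_sets`, the dictionary layout and the sample values ("harsh", "loud") are assumptions made for the example, not the tooling used in the study, which relied on Protégé and Knowtator. Only values for attributes that already exist in the schema are added, mirroring the TM1-to-TM2 extension, which leaves the semantic structure itself unchanged.

```python
from typing import Dict, Iterable, Set, Tuple

Schema = Dict[str, Set[str]]  # attribute name -> set of allowed values


def complete_value_sets(schema: Schema,
                        corpus_mentions: Iterable[Tuple[str, str]]) -> Schema:
    """Return a copy of `schema` with corpus values added to existing attributes."""
    extended = {attr: set(values) for attr, values in schema.items()}
    for attr, value in corpus_mentions:
        if attr in extended:  # value completion only; no new attributes
            extended[attr].add(value)
    return extended


# Hypothetical starting value set and corpus mentions (illustrative only).
tm1 = {"has murmur quality": {"harsh"}}
mentions = [
    ("has murmur quality", "loud"),            # missing value: added
    ("has inferences to", "ejection murmur"),  # attribute not in schema: skipped
]
tm2 = complete_value_sets(tm1, mentions)
print(sorted(tm2["has murmur quality"]))  # prints ['harsh', 'loud']
```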
For example, we added “upper sternal border”, “mid sternal border” and “lower sternal border” to the schema because they appeared frequently in our corpus to describe auscultation areas, whereas the original model contains only “sternal border”.

Building on TM2, we created our third model (TM3) by adding a new semantic attribute, “has inferences to (specific murmurs or etiological mentions)”, to the root concept “heart murmur” and completing its corresponding values from the descriptions appearing in the corpus. We re-annotated all documents using each of the extended models.

Ten-fold cross validation for retrieval was performed 10 separate times over all four datasets, and the paired t-test was used to test the statistical significance of differences in the performance measures between the S-Model dataset and each of the three T-Model datasets.

Results

For annotations

In the S-Model, we made 995 annotations across all 309 documents. The average number of annotations per document is 3.2. Among the annotations, 728 belonged to 33 different sub-concepts of heart murmur (88610006). Of the heart murmur annotations, 509 (70.0%) had a value filled for the attribute “procedure site” and 6 (0.8%) had a value filled for the attribute “finding site”.

In the T-Model, we made 1377 annotations against the original T-Model (TM1). The average number of annotations per document is 4.5. Among 335 discrete heart murmur annotations, 89.9% include timing, 79.7% include intensity and 69.0% include points of maximum intensity (POMI) (see Fig. 3).

Fig. 3 The annotation distribution of the eight attributes for all 335 heart murmurs annotated in original T-Model.
[Bar chart. Bar values: 89.9% (timing), 79.7% (intensity), 69.0% (POMI), 15.8%, 14.9%, 11.3%, 4.2%, 1.5%.]

By comparison, the average number of annotations per document was lower in the S-Model than in the T-Model, indicating that the S-Model supports a more abstract way of describing heart murmur findings than the T-Model.

Considering description completeness, 88 documents (28%) were judged “incomplete” in the S-Model, whereas 201 documents (65%) were judged “incomplete” in the original T-Model. Thus, the S-Model exhibits more complete domain coverage than the original T-Model.

The reasons for incompleteness in the four datasets from the two models are listed in Table 2. We found that the S-Model (SM) could describe most “auscultation area” mentions and the original T-Model (TM1) could not. Neither SM nor TM1 could describe “radiation” well (for SM this is due to the lack of a semantic attribute for “Radiation”, whereas for TM1 it is due to the lack of appropriate values for the “Radiation” attribute). In addition, SM could describe all “ejection murmur” mentions and part of the “aortic valve related” etiological mentions; TM1 could not. The results indicate that the strict template model, per Green, assumes that observers use strict descriptions and do not make inferences to specific murmurs and etiological mentions, whereas the SNOMED CT model partly accommodates the variability between inferences and strict descriptions by providing terms that cover both.

Table 2 Frequency of reasons for the incompleteness of four datasets from two models

Reason                        SM   TM1   TM2   TM3
Auscultation area              1    78     0     0
Radiation                     47    47     0     0
Configuration                  8     8     0     0
Quality                        7     5     0     0
Specific murmurs
  Ejection murmur              0   107   107     0
  Regurgitant murmur           3     3     3     0
  Flow murmur                  2     2     2     0
Etiological mentions
  Aortic valve related        19    25    25     0
  Mitral valve related         4     4     4     0
  Pulmonary valve related      1     1     1     0
  Septal defect                1     1     1     0

For TM2 and TM3, the zero values in Table 2 reflect our deliberate completion of the values of each corresponding attribute in the T-Model.
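As a quick arithmetic check, the completeness percentages follow directly from the reported counts of documents judged “incomplete” (88 for SM and 201 for TM1, out of 309). The Python sketch below is illustrative only; the helper name `completeness_pct` is an assumption introduced for this example.

```python
def completeness_pct(incomplete_docs: int, total_docs: int) -> float:
    """Percentage of documents whose murmur descriptions were fully describable."""
    return 100.0 * (total_docs - incomplete_docs) / total_docs


TOTAL_DOCS = 309
sm_pct = completeness_pct(88, TOTAL_DOCS)    # S-Model dataset (SM)
tm1_pct = completeness_pct(201, TOTAL_DOCS)  # original T-Model dataset (TM1)
print(f"SM: {sm_pct:.1f}%  TM1: {tm1_pct:.1f}%")  # prints "SM: 71.5%  TM1: 35.0%"
```

The TM1 value agrees with the 35.0% starting point quoted in the Discussions section.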
Correspondingly, the description completeness of TM2 rose to 57.6%, and that of TM3 to 100%. Table 3 provides examples (an AS case vs. a non-AS case) showing how annotations were made under all four schemas from the two models.

Table 3 The examples (AS Case vs. Non-AS Case) of annotations using four schemas

Textual Note
  AS Case:
    Heart: Loud 3 to 4/6 systolic ejection murmur heard best at the right upper sternal border. Absent of S2.
  Non-AS Case:
    Heart: Regular rate and rhythm with a 2/6 left upper sternal border systolic regurgitant murmur. P2 was slightly increased. There was an S4 but no S3. The apical impulse was not localizable.

SM Annotation
  AS Case:
    15157000: Cardiac murmur - intensity grade III (VI)
      procedure site: [117144008: upper parasternal region]
      laterality: [24028007: right]
    25311008: Cardiac murmur - intensity grade IV (VI)
      procedure site: [117144008: upper parasternal region]
      laterality: [24028007: right]
    77197001: Ejection murmur
      procedure site: [117144008: upper parasternal region]
      laterality: [24028007: right]
  Non-AS Case:
    36680007: Cardiac murmur - intensity grade II (VI)
      procedure site: upper parasternal region
      laterality: [7771000: left]
    31574009: Systolic murmur
      procedure site: [117144008: upper parasternal region]
      laterality: [7771000: left]

TM1 Annotation
  AS Case:
    Heart murmur:
      has cardiac cycle timing value: systolic timing
      has murmur intensity value: intensity grade III/VI
      has murmur intensity value: intensity grade IV/VI
      has point of maximum intensity: sternal border (laterality: right)
  Non-AS Case:
    Heart murmur:
      has cardiac cycle timing value: systolic timing
      has murmur intensity value: intensity grade II/VI
      has point of maximum intensity: sternal border (laterality: left)

TM2 Annotation
  AS Case:
    Heart murmur:
      has cardiac cycle timing value: systolic timing
      has murmur intensity value: intensity grade III/VI
      has murmur intensity value: intensity grade IV/VI
      has point of maximum intensity: upper sternal border (laterality: right)
      has murmur quality value: loud
  Non-AS Case:
    Heart murmur:
      has cardiac cycle timing value: systolic timing
      has murmur intensity value: intensity grade II/VI
      has point of maximum intensity: upper sternal border (laterality: left)

TM3 Annotation
  AS Case:
    Heart murmur:
      has cardiac cycle timing value: systolic timing
      has murmur intensity value: intensity grade III/VI
      has murmur intensity value: intensity grade IV/VI
      has point of maximum intensity: upper sternal border (laterality: right)
      has murmur quality value: loud
      has inferences to: ejection murmur
  Non-AS Case:
    Heart murmur:
      has cardiac cycle timing value: systolic timing
      has murmur intensity value: intensity grade II/VI
      has point of maximum intensity: upper sternal border (laterality: left)
      has inferences to: regurgitant murmur

For classification

As described in the section above, four datasets (SM, TM1, TM2 and TM3) from the two models were formed for evaluation. The evaluation metrics for the four datasets are shown in Table 4. The classification performance of SM was better than that of TM1 (i.e. Green’s original model), with statistical significance for recall and f-score (p<0.05, paired t-test). The probable reason is that TM1 did not contain a complete list of murmur characteristic values for many of its semantic attributes.

The performance of TM2 was better than that of TM1, but still lower than that of SM. This result indicates that the original T-Model, which uses strict physical descriptions, may not fully represent the descriptions of heart murmur findings in clinical notes, negatively affecting functional performance.

The classification performance of TM3 was the best among the datasets and significantly better than that of SM for recall and accuracy (p<0.05, paired t-test vs. SM). This result provides further evidence that inferences to specific murmurs and etiological mentions are an important part of the descriptions of heart murmur findings in real clinical notes, and that they influence the functional performance of the terminology model in this specific domain.

Table 4 The results of the evaluation metrics of the four datasets

Dataset   Precision        Recall            F-score           Accuracy
          (mean±sd)        (mean±sd)         (mean±sd)         (mean±sd)
SM        74.2% ±13.7%     59.4% ±15.6%      64.5% ±12.7%      79.0% ±6.1%
TM1       67.5% ±14.9%     *44.6% ±13.8%     *52.1% ±11.5%     73.6% ±5.4%
TM2       71.0% ±14.0%     53.2% ±18.9%      59.0% ±15.3%      76.9% ±6.8%
TM3       80.0% ±12.2%     *69.8% ±14.6%     73.5% ±10.4%      *83.6% ±5.8%
*p<0.05 (paired t-test vs. SM)

Discussions

In this study, we developed an approach to compare and evaluate the domain coverage (indicated by the description completeness) of two semantic terminology models and their effects on the classification of real clinical notes. We found that the description completeness of the S-Model was better than that of the original T-Model with its original value set; correspondingly, the classification performance of the S-Model was also better. The extensions of the T-Model that improved its description completeness also improved its performance on classification of clinical notes. The domain coverage of a terminology model thus tracked its performance on classification of clinical notes across all four datasets; this is not surprising.

The effect of a terminology model on functional performance in a specific domain depends mainly on its ability to represent the contents of the domain. In other words, the key issue for a terminology model is how to achieve complete domain coverage. If two different terminology models represent the contents of a domain with the same coverage, their performance on classification of clinical notes should not differ.

In the original T-Model, the description of a heart murmur can be fully post-coordinated through a semantic structure of eight semantic attributes. With the original value set, we found that its description completeness was sub-optimal. In the paper from which the model was derived [7], the authors stated that “to adequately capture the full spectrum of cardiac murmur descriptions, our model needed a complete list of murmur characteristics”. Our first extension (TM2) therefore completes the term values for all eight attributes of the original T-Model. The description completeness increased from 35.0% to 57.6%.

Thus, adding axis content to each attribute within the semantic structure did improve the domain coverage of the model; however, even with value completion, the original T-Model still could not achieve a complete description of the given corpus.

Therefore, we consider that the domain coverage of a terminology model depends not only on the full value set of its semantic structure, but also on the coverage of the semantic structure itself.

Our second extension (TM3) of the T-Model adds a semantic attribute together with its corresponding values. This overcame the limitation in the semantic structure of the original T-Model and achieved a complete description of the given corpus. In other words, the extended structure allows a systematic examination of where content gaps exist (e.g. missing values for references to specific murmurs and etiological mentions) and also guides the “completion” of the terms or missing contents informed by the extended structure.

In the S-Model, most of the contents are pre-coordinated, with post-coordination possible only for the two semantic attributes “procedure site” and “finding site”. We did not extend the SNOMED CT model in a similar fashion because the model is an international standard, although we believe that its performance would also improve were it extended. However, extending that model would be more complicated than extending the template model because it involves both pre-coordination and post-coordination. We consider that the template model would be more applicable for achieving complete domain coverage. An important implication of these experiments is that a template-style terminology model more readily identifies gaps in coverage and facilitates their completion for classification tasks.

Knowtator was used as our annotation tool and served our purpose well, showing the following merits. First, Knowtator uses the Protégé ontology editing environment to build the annotation schema. The frame-based knowledge representation system provides a flexible and expressive way to build schemas of the two model types in this study efficiently. Second, Knowtator provides visualization of annotations, making the annotation task and the confirmation process simple and efficient. Third, the Java API of the system supports annotation queries, which allowed us to export our coding of annotations to a classifier format automatically.

In order to improve the baseline performance on all standard evaluation measures, we performed control selection of clinical notes using strict criteria. This design did improve baseline performance (data not shown).

We regard the evaluation in this study in its comparative context across models; absolute measures of precision and recall are subject to factors beyond the scope of this study. A limitation of this study is that the annotation of clinical notes depends entirely on what clinicians decide to document for each patient, who they may or may not know has AS at the time. Given differences in local documentation culture, it seems possible that these findings would differ on another corpus. Second, we collected only a relatively small corpus of clinical notes because intensive annotation was required. We consider the annotation corpus valid, as both authors have a clinical medicine background. The ten-fold cross validation used in this study may facilitate efficient use of the data and yield a more reliable performance estimate. This kind of annotation corpus may be used to train a machine learning based annotation algorithm in order to build an automatic domain-specific annotation tool. In addition, because it was not our intention to evaluate which classifier performed better, we used only a Weka implementation of the decision tree (J4.8) algorithm.

In conclusion, the domain coverage of the two models and their performance on classification clearly differ when applied to real clinical notes. Our approach provides an effective framework for evaluating the coverage and functional performance of semantic terminology models in a specific domain with a view to potential improvement. Future work will focus on the scalability of the approach and the evaluation of interoperability among different semantic terminology models.

Acknowledgements

This study is partly supported by NIH R01 LM07319. The authors would like to thank Philip V. Ogren, Serguei V.S. Pakhomov, Guergana K. Savova, Pauline Funk and James D. Buntrock for their support.

References

[1] Brown PJ, Sonksen P. Evaluation of the quality of information retrieval of clinical findings from a computerized patient database using a semantic terminological model. J Am Med Inform Assoc. 2000 Jul-Aug;7(4):392-403.
[2] Spackman KA, Campbell KE. Compositional concept representation using SNOMED: towards further convergence of clinical terminologies. Proc AMIA Symp. 1998;:740-4.
[3] Yu AC. Methods in biomedical ontology. J Biomed Inform. 2006 Jun;39(3):252-66.
[4] Zhou L, Tao Y, Cimino JJ, Chen ES, Liu H, Lussier YA, Hripcsak G, Friedman C. Terminology model discovery using natural language processing and visualization techniques. J Biomed Inform. 2006 Dec;39(6):626-36.
[5] URL: http://icn.ch/icnp.htm; last visited at December 29, 2006.
[6] Rosenbloom ST, Miller RA, Johnson KB, Elkin PL, Brown SH. Interface terminologies: facilitating direct entry of clinical data into electronic health record systems. J Am Med Inform Assoc. 2006 May-Jun;13(3):277-88.
[7] Green JM, Wilcke JR, Abbott J, Rees LP. Development and evaluation of methods for structured recording of heart murmur findings using SNOMED-CT post-coordination. J Am Med Inform Assoc. 2006 May-Jun;13(3):321-33. Epub 2006 Feb 24.
[8] URL: http://protege.stanford.edu/index.html; last visited at December 29, 2006.
[9] URL: http://bionlp.sourceforge.net/Knowtator/; last visited at December 29, 2006.
[10] URL: http://www.cs.waikato.ac.nz/ml/weka/; last visited at December 29, 2006.
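
For readers who wish to recompute the evaluation measures defined in the Methods, the following Python sketch derives precision, recall, f-score and accuracy from confusion counts for the AS category. The function name `evaluation_metrics` and the counts in the usage lines are hypothetical and chosen only for illustration; they are not results from this study, whose classification was run in Weka.

```python
from typing import Dict


def evaluation_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Precision, recall, f-score and accuracy for the positive (AS) category."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # The f-score is the harmonic mean of precision and recall.
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f_score": f_score, "accuracy": accuracy}


# Hypothetical counts for a single test fold (illustrative only).
m = evaluation_metrics(tp=7, fp=2, fn=3, tn=19)
print({name: round(value, 3) for name, value in m.items()})
```

In a repeated ten-fold setting, such per-fold values would be averaged per dataset before two datasets are compared with a paired t-test.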