Representing and sharing knowledge using SNOMED
Proceedings of the 3rd international conference on Knowledge Representation in Medicine (KR-MED 2008)
R. Cornet, K.A. Spackman (Eds)

Exploratory Reverse Mapping of ICD-10-CA to SNOMED CT

Dennis Lee, M.Sc., Francis Lau, Ph.D.
School of Health Information Science, University of Victoria, Victoria, B.C., Canada
dlkh@uvic.ca, fylau@uvic.ca

ABSTRACT

This paper describes the findings of an exploratory study on reverse mapping of ICD-10-CA, the Canadian Adaptation, to SNOMED CT. For this study a set of the 5,000 most frequent ICD-10-CA codes from the health ministry of a Canadian province was used. The methods included applying six mapping algorithms to each ICD-10-CA description to find the matching SNOMED CT concepts, and comparing the output against the UK SCT-ICD10 cross map for accuracy. Overall, we found successful SNOMED CT matches for ~63% of the ICD-10-CA codes. Issues requiring further attention include ways to increase successful matches and independent validation of mapping output. This study provides a glimpse of the methods that could lead to a SNOMED CT to ICD-10-CA cross map. It should be of interest to those responsible for secondary use of discharge abstracts in epidemiological and statistical reporting.

INTRODUCTION

The Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) is a terminology system used to capture information relating to a patient’s condition and care in a consistent manner. Currently, there are ~376,000 concepts in SNOMED CT, organized into 19 hierarchies such as clinical finding, observations, body structure and social context. There are another ~1 million commonly used terms to describe these concepts, and ~1.4 million semantic relationships to define the logical connections between concepts [1].

While SNOMED CT is the terminology of choice for capturing details of a clinical encounter, it is considered too fine grained for non-clinical purposes such as the reporting of resource use and billing. Many have advocated the need to link SNOMED CT to established classification systems, such as the International Statistical Classification of Diseases and Related Health Problems Version 10 (ICD-10), that are already used extensively in statistical reporting [2,3]. Currently there is a cross map from SNOMED CT to ICD-10 in the UK, and one to ICD-9-CM (Clinical Modification) in the United States. Neither of these maps has been validated externally, and no map exists for ICD-10-CA, the Canadian Adaptation. There are other cross maps that have been created for specific domains including the SNOMED-to-ICD-O map for oncology, the SNOMED-to-LOINC map for laboratory test results, and those for nursing terminologies. Otherwise there is limited experience in cross mapping from SNOMED CT to existing classification systems to facilitate secondary uses.

In this paper, we describe the initial findings of an exploratory study to create a reverse map from ICD-10-CA to SNOMED CT. It originated as part of a Master of Science project by the lead author. We contend that reverse mapping could be one way to produce the SNOMED CT to ICD-10-CA cross map. This paper describes the mapping algorithms and process used, the key results on matches found, and the lessons and implications from the study.

METHODS

Overview of ICD-10-CA

The ICD-10-CA is an enhanced version of the ICD-10 published by the World Health Organization (WHO). The ICD-10-CA has 23 chapters and is used for classifying morbidity, diseases, injuries and causes of death in Canada. It also covers non-disease situations and conditions that pose a risk to health including occupational and environmental factors, lifestyle and psycho-social circumstances. The ICD-10-CA has an alphanumeric coding format of 3-6 characters. The major difference between ICD-10 and ICD-10-CA is that the latter has two additional chapters: XXII on morphology of neoplasms and XXIII on provisional codes for research and temporary assignment. There are also minor changes in some chapters in the form of addition, subdivision, deletion and revision of selected ICD codes [4].

Source Mapping Terms

For this study, we obtained a set of the 5,000 most frequently reported ICD-10-CA codes and their long descriptions for the fiscal year of 2005/06 from the health ministry of a Canadian province. These source mapping terms were from inpatient separations in acute care settings including designated sub-acute care facilities for patients that require more care and time before returning home. The profile of the discharge abstracts for the 5,000 ICD-10-CA codes selected for the study is in Table 1.

Description | Count
Total separations 2005/06 in province | 364,977
Total diagnosis codes reported | 1,481,285
Average no. of codes reported per separation | 4.1
Total discrete diagnosis codes (all) | 10,529
Frequency of top 5,000 diagnosis codes | 1,460,730
% of total diagnosis in top 5000 codes | 98.6%
% of total discrete diagnosis in top 5000 codes | 47.5%
Total discrete most responsible diagnosis codes | 6,651
Table 1. Profile of the Discharge Abstracts

Mapping Algorithms

After conducting a detailed review of the literature on cross mapping of terminology systems, we adopted five related mapping algorithms and created Web-based versions of these algorithms to find matching SNOMED concepts for each of the ICD-10-CA descriptions in the data set [5]. Four of the algorithms are lexical techniques for exact-match, match-all-words-only, match-all-words and partial-match. The fifth is semantic matching that involves retrieving the current concepts based on entries in the SNOMED historical relationship table if the initial concepts found are inactive. These mapping algorithms are summarized in Table 2.

Algorithm | Explanation
1. Exact match | Exact string match where all words are the same and in the same sequence for both source and target terms, including punctuation
2. Match all only | String match where all words are the same but not necessarily in the same order; additional words not allowed in target term
3. Match all | String match where all words are the same but not necessarily in the same order; additional words allowed in target term
4. Partial match | String match where one or more words in the source term are found in the target term
5. Semantic match | For inactive concepts found, use historical relationships of Was-A, Same-As, May-Be-A, Replaced-By to find current concepts
6. Unmappable | Assigned when no match is found
Table 2. Mapping algorithms used in this study

Normalization Steps

In addition to using the original SNOMED CT terms and the ICD-10-CA long descriptions in mapping, we normalized all of these original terms to remove “noise” such as genitives and spelling errors using the Unified Medical Language System (UMLS) normalization steps, as shown in Table 3a [6]. To improve successful mapping, we expanded step-2 to remove both “stop words” and “exclude words,” as well as SNOMED prefixes, shown in Table 3b. For step-5 we included both the lookup and stemming methods to uninflect the phrase. The lookup method uses the UMLS SPECIALIST Lexicon’s inflection table with ~1 million entries, whereas the stemming method uses the computational technique first published by Porter (Porter stemming), which reduces word variants to a single canonical form [7,8].

Steps 1 to 6 | Example
Remove genitive | Hodgkin’s disease, NOS → Hodgkin diseases, NOS
Remove stop words | Hodgkin diseases, NOS → Hodgkin diseases,
Convert to lowercase | Hodgkin diseases, → hodgkin diseases,
Strip punctuation | hodgkin diseases, → hodgkin diseases
Uninflect phrase | hodgkin diseases → hodgkin disease
Sort words | hodgkin disease → disease hodgkin
Table 3a. UMLS six normalization steps [7, slide 20]

Step-2 | Explanation
Stop words | Frequent short words that do not affect the phrase: and, by, for, in, of, on, the, to, with, no, and (nos)
Exclude words | Words that may change meaning of the word but if ignored help to locate a term otherwise missed: about, alongside, an, anything, around, as, at, because, before, being, both, cannot, chronically, consists, covered, does, during, every, find, from, instead, into, more, must, no, not, only, or, properly, side, sided, some, something, specific, than, that, things, this, throughout, up, using, usually, when, while, without
SNOMED Prefixes | [X] – concepts with ICD-10 codes not in ICD-9; [D] – concepts in ICD-9 XVI and ICD-10 SVII; [M] – morphology of neoplasm concepts in ICD-O; [SO] – concepts in OPCS-4 chapter Z in CTV3; [Q] – temporary qualifying terms from CTV3; [V] – concepts in ICD-9 and ICD-10 on factors influencing health status and contact with health services (V-codes and Z-codes)
Table 3b. Expanded UMLS normalization step-2

Reverse Mapping Process

The reverse mapping of ICD-10-CA terms to SNOMED CT concepts involved cycling through the mapping algorithms one at a time to find the best candidate SNOMED CT concepts as the target terms. For each algorithm we always started with the original terms, then the UMLS normalized terms, followed by the stemmed terms. In each cycle, we would review the candidate concepts found to see whether each was a match, and if so, what type of match it was based on the algorithm applied. When no matching concepts were found, we would label the term as unmappable. Our experience with the matching techniques was that the sooner we could find a match in the cycle, i.e. first-match, the greater the confidence we would have that the candidate concept is appropriate. The preferred order of matched terms was always exact-match first, match-all-only, then match-all, with partial-match last. Whenever inactive concepts were found a semantic-match was done to find the current concepts through their historical relationships. During mapping we tallied frequency statistics on the different types of matches with summary/detailed outputs. Only the first-matches were counted to determine the effectiveness of each mapping algorithm.

Comparison with UK SCT-ICD10 Map

To determine the accuracy of the mapping results from this study, we compared our output with the UK SNOMED CT to ICD-10 (SCT-ICD10) cross map. To do so, the 5,000 ICD-10-CA codes were matched with the TargetCodes of the SCT_CrossMapTargets table from the July 2007 version of the IHTSDO distribution set [1]. While the UK cross map is from SNOMED CT to ICD-10 and not ICD-10-CA, the two ICD versions share many similar codes. Thus, if the ICD-10-CA code was found among the TargetCodes of the UK map, we would look up the SCT_CrossMaps table to find the corresponding SNOMED concepts. If multiple similar SNOMED concepts were found, they would be filtered to include only the unique SNOMED concepts. Each of the concepts found was then compared with our mapping output from matches found by the exact-match, match-all-only and match-all algorithms.

RESULTS

Summary of Mapping Output

Of the 5,000 ICD-10-CA descriptions used in this study, we were able to match 1,619 source ICD terms (32.38%) to 2,625 target SNOMED concepts by the exact-match technique. Next, we matched 63 ICD terms (1.26%) to 87 SNOMED concepts by match-all-only; another 1,478 ICD terms to 4,829 concepts by match-all; and 1,839 ICD terms to ~25 million concepts by partial-match. One ICD term, C8800 Waldenstr, was unmappable. A summary of the mapping output by match-type is shown in Table 4.

Match Type | Source | Target | Percentage
Exact match | 1,619 | 2,625 | 32.38%
Match all only | 63 | 87 | 1.26%
Match all | 1,478 | 4,829 | 29.56%
Partial match | 1,839 | 24,950,238 | 36.78%
Unmappable | 1 | 0 | 0.02%
Total | 5,000 | 24,957,779 | 100.00%
Table 4. Summary of Mapping Output

Detailed Analysis of Mapping Output

Each ICD term was cycled through all the matching techniques to determine the number of candidate target SNOMED concepts found for each match type. The first-match reported for each match type excluded the target concepts already identified in previous iterations to avoid duplicate counting. We tracked not only the total matches but also which technique found the first match. The output produced suggested exact-match, match-all-only and match-all could be considered as successful matches, since they returned one or more identical or similar SNOMED concepts based on the ICD term provided. The numbers of first-matches found for these match types by ICD Chapter are shown in the Appendix. One can see that the percentages of matches were very low for Chapters IV Endocrine, nutritional and metabolic diseases at 36%; XIII Diseases of the musculoskeletal system and connective tissue at ~36%; and XV Pregnancy, childbirth and the puerperium at ~4%. Of the overall 3,160 ICD terms or ~63% that were mapped to one or more SNOMED concepts, most were found by exact-match and match-all during the first-match. The profiles of first-matches found by each match type are briefly described below.

Exact Match – Table 5 shows 1,237 original ICD terms had exact-matches with 2,064 candidate concepts. Another 364 ICD terms had exact-matches with 527 concepts using the UMLS normalized version, and 18 ICD terms with 34 concepts using the stemmed version. In all, 2,625 candidate SNOMED concepts were found, which means that there were multiple exact matches for some of the ICD terms.

Exact Match | First Match | Target
Original Term | 1,237 | 2,064
UMLS Version | 364 | 527
Stemmed Version | 18 | 34
Total | 1,619 | 2,625
Table 5. Exact match output

Match All Only – Table 6 shows 33 original ICD terms had match-all-only with 48 candidate concepts; 29 UMLS normalized terms had 37 concepts, and 1 stemmed term had 2 only. In all, 87 candidate SNOMED concepts were found, which means that there were multiple match-all-only for some terms.

Match All Words Only | First Match | Target
Original Term | 33 | 48
UMLS Version | 29 | 37
Stemmed Version | 1 | 2
Total | 63 | 87
Table 6. Match all only output

Match All Words – Table 7 shows 1,343 original ICD terms had match-all with 4,558 candidate concepts; 114 UMLS normalized terms had 217 concepts, and 21 stemmed terms had 54. In all, 4,829 SNOMED concepts were found, which means that there were multiple match-all for some terms.

Match All Words | First Match | Target
Original Term | 1,343 | 4,558
UMLS Version | 114 | 217
Stemmed Version | 21 | 54
Total | 1,478 | 4,829
Table 7. Match all words output

Partial Match – Table 8 shows 1,839 ICD terms had partial-matches with 25 million SNOMED concepts. We found the results of partial matches to be more unpredictable than the previous match types. If a source term was long and contained common words such as disorder or procedure, the results returned could be numerous as only one word from the source term needed to be present in the target term.

Partial Match | First Match | Target
Original Term | 1,839 | 24,950,238
UMLS Version | 0 | 0
Stemmed Version | 0 | 0
Total | 1,839 | 24,950,238
Table 8. Partial match output

Comparison with SCT-ICD10 Map

Six comparisons were made between our mapping output and the UK map to see if: (a) both contained the same results; (b) both contained similar results; (c) both contained dissimilar results; (d) only UK map contained the results; (e) only our mapping output contained the results; (f) both had unmappable results. The overall results are shown in Table 9. Only (b), (c) and (f) are illustrated in this paper.

Type of comparison | Frequency | Percentage
Contained exactly same results | 11 | 0.22%
Contained similar results | 2,401 | 48.02%
Contained dissimilar results | 122 | 2.44%
UK map with results only | 896 | 17.92%
Mapping outputs with results only | 370 | 7.40%
Both had unmappable results | 1,200 | 24.00%
Total | 5,000 | 100.00%
Table 9. Comparing UK map and mapping outputs

Similar Results – Where both maps contained similar results, the UK map usually had more mapped terms than our output, as shown in Table 10. An example is the ICD term Q61.2 Polycystic kidney, autosomal dominant, where the UK map had six SNOMED concepts but ours had only four.

Description | Total
UK map had more results than mapping outputs | 2,125
Mapping outputs had more results than UK map | 224
UK and mapping outputs had same no. of results | 63
Total | 2,401

ConceptId | Fully Specified Name | UK CA
66091009 | Congenital disease (disorder) | ✓
204955006 | Polycystic kidney disease | ✓
204962002 | Multicystic kidney (disorder) | ✓
28728008 | Polycystic kidney disease, adult type (disorder) | ✓ ✓
253878003 | Adult type polycystic kidney disease type I (disorder) | ✓ ✓
253879006 | Adult type polycystic kidney disease type II (disorder) | ✓ ✓
274567009 | [EDTA] Polycystic kidneys, adult type (dominant) associated with renal failure (disorder) | ✓
Table 10. Comparing both with similar results

Dissimilar Results – Where both had dissimilar results, our output was more specific as each concept must contain all the words in the source term. For 100 (82%) of these terms the UK map had more candidate concepts; for 9 terms (7.4%) both had the same number of concepts; whereas for 13 (10.7%) our mapping output had more concepts. An example is the ICD term S597 Multiple injuries of forearm, shown in Table 11, where both maps had four concepts but none are similar.

ConceptId | Fully Specified Name | UK CA
122549002 | Injury (disorder) | ✓
125596004 | Injury of elbow (disorder) | ✓
210557006 | Severe multi tissue damage lower arm (disorder) | ✓
210558001 | Massive multi tissue damage lower arm (disorder) | ✓
210860005 | Injury of multiple blood vessels at forearm level (disorder) | ✓
211290004 | Multiple superficial injuries of forearm (disorder) | ✓
212308001 | Injury of multiple nerves at forearm level (disorder) | ✓
212464002 | Injury of multiple muscles and tendons at forearm level (disorder) | ✓
Table 11. Comparing both with dissimilar results

Unmappable Results – These were in almost every ICD chapter but most notable in XVII: Congenital malformations, deformations and chromosomal abnormalities; XIX: Injury, poisoning and certain other consequences of external causes; and XIII: Diseases of the musculoskeletal system and connective tissue (Table 12). It is possible these ICD terms have further refinement, making it difficult to find concept and lexical matches. An example is the ICD-10-CA term O2450 Pre-existing Type 1 diabetes mellitus arising in pregnancy, which could be refined as: delivered with or without antepartum condition (1), delivered with postpartum complication (2), or antepartum condition or complication (3).

Chapter | Range | Freq | %
XVII: Congenital malformations, deformations, and chromosomal abnormalities | Q00-Q99 | 292 | 24.33%
XIX: Injury, poisoning and certain other consequences of external causes | S00-T98 | 278 | 23.17%
XIII: Diseases of the musculoskeletal system and connective tissue | M00-M99 | 207 | 17.25%
IV: Endocrine, nutritional and metabolic diseases | E00-E90 | 119 | 9.92%
XX: External causes of morbidity and mortality | V01-Y98 | 60 | 5.00%
Total | | 956 | 79.67%
Table 12. Unmappable ICD-10-CA terms

DISCUSSION

Lessons and Issues

This study was our initial effort to apply a set of mapping algorithms on a set of ICD-10-CA terms to find the matching target SNOMED concepts. Our output showed most of the matches were found using the exact-match and match-all algorithms. The match-all-words-only algorithm did not add a great deal to the number of matches found, and the partial-match was considered too unpredictable with respect to the candidate target concepts returned. Due to space limitation, we did not report on additional matches found after normalization with UMLS and stemming techniques were applied to the original ICD terms, or those found by semantic matching.

A major issue is how one should define “successful match.” In our output we had just over 60% of the matches found by exact-match and match-all, which we reviewed and deemed correct. However, more formal validation, preferably by an independent source, is needed. While our results showed successful matches in only ~63% of the 5,000 ICD-10-CA codes, we were surprised to find the UK cross map had similar successful matches of ~68% against the same 5,000 ICD-10-CA codes (see Table 9). Equally intriguing were the different matches found between the two maps. Almost 50% of the concepts found were similar but not identical, whereas ~20% were dissimilar or found only in the UK map. One possible explanation is the minor differences that exist between ICD-10 and ICD-10-CA with respect to the addition, subdivision, deletion and revision made in some ICD-10-CA chapters. Another is that a concept-based method was used to create the UK cross map, which seemed to outperform the lexical techniques in this study. One possible solution to improve mapping precision is to combine methods, such as the use of semantic and lexical mapping between SNOMED CT and ICD-9-CM by Fung [9].

Another issue is the extent that our semi-automated matching algorithms can aid in the cross-mapping process by health records staff when encoding the inpatient discharge abstracts. The current abstracting process is mostly an intellectual and manual exercise. As such, explicit cross-mapping guidelines need to be established, including the use of any computer-based mapping tools, to improve this abstracting process. With our mapping algorithms, a consensus-based process is needed for the health record staff to verify the accuracy of the ~63% successful matches. Guidelines are also needed to reconcile the remaining ~37% partially-matched terms [2,10].

Still, we contend there is merit in exploring the use of reverse mapping with lexical algorithms to identify candidate SNOMED concepts for a given set of ICD-10-CA terms. Our next steps are to enhance the mapping algorithms to include contexts, incorporate these algorithms into the abstracting process, and conduct further field evaluation. Last, the idea of applying reverse mapping to identify candidate SNOMED CT concepts for a set of mapping terms can be a helpful approach when creating a cross map from SNOMED CT to another terminology system.

Implications

This study provides a glimpse of the feasible mapping methods that could eventually lead to a SNOMED CT to ICD-10-CA cross map for Canada. We believe the intent, methods and results of this current study should be of interest to those responsible for secondary use of patient discharge abstracts in epidemiological and statistical reporting. The notion of reverse mapping is also highly generalizable to include the encoding of local terms that already exist in legacy systems within many health organizations to a reference terminology such as SNOMED CT.

Acknowledgments

We wish to thank the Provincial Ministry that provided the 5,000 ICD-10-CA codes for the study. We also thank Ms. Robyn Kuropatwa for facilitating the process to obtain the ICD codes from the ministry. Funding support for this work was provided by the Canadian Institutes of Health Research through its Strategic Training Initiative. Note that the views presented in this paper are those of the authors only and do not represent the official position of any Canadian government agencies.

REFERENCES

1. IHTSDO, International Health Terminology Standards Development Organization. SNOMED Clinical Terms Technical Reference Guide. International Release, July 2007.
2. Bowman S. Coordination of SNOMED CT and ICD-10: Getting the Most out of Electronic Health Record Systems. Perspectives in Health Information Management, Spring 2005.
3. McBride S, Gilder R, et al. Data mapping. Journal of American Health Information Management Association 2006; 77(1): 44-48.
4. CIHI, Canadian Institute for Health Information. Canadian Coding Standards for ICD-10-CA and CCI for 2006. Ottawa, Canada. 2006.
5. Lee DHK. Reverse Mapping ICD-10-CA to SNOMED CT. UVic Master of Science research project report, Oct 2007. Unpublished.
6. National Library of Medicine. The SPECIALIST Lexicon. http://lexsr3.nlm.nih.gov/LexSysGroup/Projects/Summary/lexicon.html
7. Kleinsorge R, Willis J, et al. UMLS Overview – Tutorial T12. AMIA Annual Symposium 2006. http://165.112.6.70/research/umls/pdf/AMIA_T12_2006_UMLS.pdf. Jan15/2006.
8. Goldsmith JA, Higgins D, Soglasnova S. Automatic Language-specific Stemming in Information Retrieval. Springer-Verlag Berlin Heidelberg 2001.
9. Fung KW, Bodenreider O, Aronson AR, Hole WT, Srinivasan S. Combining lexical and semantic methods of inter-terminology mapping using the UMLS. In Kuhn K. et al. (Eds) MedInfo 2007, p605-610. IOS Press, 2007.
10. Vikstrom A, Skaner Y, et al. Mapping of the categories of the Swedish primary health care version of ICD-10 to SNOMED CT concepts: Rule development and intercoder reliability in a mapping trial. BMC Medical Informatics and Decision Making 2007;7:9.

Appendix. Mapping Output for top 5,000 ICD-10-CA codes by ICD Chapter

Chapter | Title | Range | Source | Exact | Only | All | Total | Percent
I | Certain infectious and parasitic diseases | A00-B99 | 136 | 47 | 2 | 57 | 106 | 77.94%
II | Neoplasms | C00-D48 | 343 | 174 | | 58 | 232 | 67.64%
III | Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism | D50-D89 | 80 | 35 | 1 | 20 | 56 | 70.00%
IV | Endocrine, nutritional and metabolic diseases | E00-E90 | 225 | 56 | 1 | 24 | 81 | 36.00%
V | Mental and behavioural disorders | F00-F99 | 218 | 66 | 3 | 141 | 210 | 96.33%
VI | Diseases of the nervous system | G00-G99 | 196 | 75 | 1 | 56 | 132 | 67.35%
VII | Diseases of the eye and adnexa | H00-H59 | 89 | 56 | 3 | 18 | 77 | 86.52%
VIII | Diseases of the ear and mastoid process | H60-H95 | 42 | 24 | | 11 | 35 | 83.33%
IX | Diseases of the circulatory system | I00-I99 | 279 | 136 | 1 | 74 | 211 | 75.63%
X | Diseases of the respiratory system | J00-J99 | 165 | 67 | 4 | 41 | 112 | 67.88%
XI | Diseases of the digestive system | K00-K93 | 276 | 136 | 9 | 56 | 201 | 72.83%
XII | Diseases of the skin and subcutaneous tissue | L00-L99 | 105 | 42 | | 20 | 62 | 59.05%
XIII | Diseases of the musculoskeletal system and connective tissue | M00-M99 | 383 | 78 | 1 | 61 | 140 | 36.55%
XIV | Diseases of the genitourinary system | N00-N99 | 226 | 120 | 3 | 48 | 171 | 75.66%
XV | Pregnancy, childbirth and the puerperium | O00-O99 | 313 | 5 | 1 | 6 | 12 | 3.83%
XVI | Certain conditions originating in the perinatal period | P00-P99 | 169 | 57 | 17 | 47 | 121 | 71.60%
XVII | Congenital malformations, deformations, chromosomal abnormalities | Q00-Q99 | 205 | 105 | 2 | 57 | 164 | 80.00%
XVIII | Symptoms, signs and abnormal clinical and laboratory findings not elsewhere classified | R00-R99 | 181 | 99 | 2 | 52 | 153 | 84.53%
XIX | Injury, poisoning and certain other consequences of external causes | S00-T98 | 691 | 175 | 8 | 169 | 352 | 50.94%
XX | External causes of morbidity and mortality | V01-Y98 | 297 | 9 | 4 | 249 | 262 | 88.22%
XXI | Factors influencing health status and contact with health services | Z00-Z99 | 333 | 29 | | 199 | 228 | 68.47%
XXII | Morphology of neoplasms | 8000/0-9989/1 | 28 | 28 | | | 28 | 100.00%
XXIII | Provisional codes for research and temporary assignment | U00-U99* | 20 | | | 14 | 14 | 70.00%
Total | | | 5,000 | 1,619 | 63 | 1,478 | 3,160 | 63.20%
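
Illustrative sketch of the lexical matching cycle

The normalization steps of Table 3a, the four lexical match types of Table 2, and the first-match cycle of the Reverse Mapping Process can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the Web-based implementation used in the study. The function names (uninflect, normalize, match_type, first_match) are hypothetical. The uninflect step is a crude plural-stripping stand-in for the SPECIALIST Lexicon lookup and Porter stemming; the stemmed term version, the exclude words and SNOMED prefixes of Table 3b, and semantic matching through historical relationships are left out.

```python
import re

# Stop words from Table 3b; the "(nos)" entry is treated as the token "nos".
STOP_WORDS = {"and", "by", "for", "in", "of", "on", "the", "to", "with", "no", "nos"}
# Preferred order of match types, as described in the Reverse Mapping Process.
ORDER = ["exact", "match-all-only", "match-all", "partial"]


def uninflect(word):
    """Assumption: toy plural stripping, standing in for lexicon lookup or stemming."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(term):
    """Apply the six steps of Table 3a in order and return the normalized string."""
    term = re.sub(r"['’]s\b", "", term)                          # 1. remove genitive
    words = [w for w in term.split()
             if w.strip(",.;:()[]").lower() not in STOP_WORDS]   # 2. remove stop words
    term = " ".join(words).lower()                               # 3. convert to lowercase
    term = re.sub(r"[^\w\s]", " ", term)                         # 4. strip punctuation
    words = [uninflect(w) for w in term.split()]                 # 5. uninflect phrase
    return " ".join(sorted(words))                               # 6. sort words


def match_type(source, target):
    """Classify one source/target pair by the strongest lexical match of Table 2."""
    if source == target:
        return "exact"                      # same words, same sequence, same punctuation
    src = set(re.findall(r"\w+", source.lower()))
    tgt = set(re.findall(r"\w+", target.lower()))
    if not src:
        return None
    if src == tgt:
        return "match-all-only"             # same words, any order, no extra words
    if src <= tgt:
        return "match-all"                  # all source words present, extras allowed
    if src & tgt:
        return "partial"                    # at least one source word present
    return None


def first_match(source, targets):
    """Cycle through match types, trying the original term before the normalized one."""
    versions = [("original", lambda t: t), ("normalized", normalize)]
    for wanted in ORDER:
        for label, transform in versions:
            hits = [t for t in targets
                    if match_type(transform(source), transform(t)) == wanted]
            if hits:
                return wanted, label, hits
    return "unmappable", None, []
```

Called with the source term "Multiple injuries of forearm" and a target list containing two of the fully specified names from Table 11, first_match stops at the match-all stage on the original term, since "Multiple superficial injuries of forearm (disorder)" contains every source word plus extras. Iterating over match types in the outer loop mirrors the stated preference for exact-match first and partial-match last.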