Representing and sharing knowledge using SNOMED
Proceedings of the 3rd international conference on Knowledge Representation in Medicine (KR-MED 2008)
R. Cornet, K.A. Spackman (Eds)

Using SNOMED-CT For Translational Genomics Data Integration

Joel Dudley (1-3), David P. Chen (1-3), Atul J. Butte (1-3), M.D., Ph.D.
1 Stanford Center for Biomedical Informatics Research, Department of Medicine,
2 Department of Pediatrics, Stanford University School of Medicine, Stanford, CA/USA
3 Lucile Packard Children’s Hospital, Palo Alto, CA/USA
{jdudley,dpchen,abutte}@stanford.edu

As industrial, governmental, and academic agencies place increasing emphasis on translational research, biomedical researchers are now faced with entirely new challenges with regard to both biomedical data integration and knowledge discovery. There is now both a strong need and a tremendous opportunity to apply translational bioinformatics to address the fundamental challenges in integrating the vast bodies of -omics and clinical data. Here we report on our preliminary work in utilizing SNOMED-CT as both a tool for translational data discovery, and a major component in a framework for the large-scale integration of gene expression microarray data and clinical laboratory data. Annotations from microarray experiments in NCBI GEO were mapped to SNOMED-CT terms using UMLS, and these mappings were joined to clinical laboratory data using ICD9CM to SNOMED-CT mappings within UMLS. We find that microarray experiments characterizing 211 distinct diseases can be mapped to clinical laboratory data measurements for 13,452 distinct patients. We maintain that this work represents critical first steps in providing a foundation for large-scale translational data integration, and underlines the important role that controlled clinical terminologies, such as SNOMED-CT, can play in addressing such problems.

INTRODUCTION

Our ability to generate high-quality biomolecular data has advanced at a considerably faster rate than our ability to investigate the data generated. This imbalance, driven primarily by rapid advances in high-throughput biological data acquisition technologies and plummeting per-experiment costs, has created an entire spectrum of informatics challenges that are, in many instances, as intangible and complex as the fundamental biological questions that these technologies were designed to address. As a consequence, our ability to formulate and investigate important biological and medical questions is currently limited by our ability to manage and integrate the profusion of biomedical data.

Problems in data integration are moving towards the forefront of biomedical research, driven foremost by the sheer diversity of measurement technologies now available, and the tremendous volumes of such measurements finding their way into the public domain. The situation is further complicated by the fact that the majority of the public biomolecular data is annotated using unstructured free-text, making it difficult to discern the various biological and medical contexts of the data in an automated fashion. In previous work we demonstrated the feasibility of using controlled terminologies and straightforward text-mining techniques to elucidate clinical, environmental, and phenotypic contexts from free-text annotations associated with public microarray data [1, 2]. The establishment of experimental context is critical to linking genes to environment, phenotype, and ultimately medicine.

While most major types of biomolecular data can be found in the public domain, it is traditionally difficult for researchers to gain access to clinical data. This is unfortunate, as the data generated on a daily basis by hospitals and clinicians is perhaps the richest source of phenotypic biomarker data currently available. Fortunately, modern Electronic Health Record (EHR) systems such as the Stanford Translational Research Integrated Database Environment (STRIDE) [3] and the University of Virginia Health System Clinical Data Repository (CDR) [4] grant institutional researchers access to large volumes of de-identified, quantitative clinical data in digital form. In recent work, we demonstrated the utility of applying bioinformatics methods to quantitative clinical data to draw new inferences about disease severity [5] and to elucidate novel biomarkers [6].

Genome Wide Association studies have revealed that for many complex diseases, the pathogenesis of the disease may be facilitated by relatively minor changes across a large number of genes interacting through as yet poorly understood mechanisms [7]. These findings have therefore highlighted the importance of linking biomolecular data with phenotypic quantifications in order to uncover the full complexity of disease etiology. Recent work in integrating these two data types has offered new insights into disease etiology and pathology with direct clinical implications. Segal and colleagues correlated imaging traits from computed tomography (CT) images of liver cancers with gene expression data to reconstruct global expression signatures in cancer tumors that are linked to diagnosis, prognosis and treatment [8]. A number of studies have demonstrated the utility of patient microarrays in identifying gene expression patterns linked to disease diagnosis [9], subtypes [10, 11], outcome [12], and treatment [13, 14]. As significant as the aforementioned findings are, their underlying methods are limited by the fact that, in all instances, they require that the biomolecular and clinical data be derived from the same patient. Given the current high costs and logistical complexities involved in acquiring patient data in a clinical setting, it would be prohibitively expensive to scale the same approaches to address the broad spectrum of human disease. Furthermore, such an approach implicitly eschews the great wealth of public biomolecular data readily available.

A major problem in integrating clinical and biomolecular data derived from disparate sources is to identify attributes by which they can be appropriately joined. This task is complicated by the fact that the majority of biomolecular data is annotated around the concepts of genes and gene products, whereas clinical data is centered on the concept of a patient. We find one concept shared among both clinical data and vast amounts of biomolecular data, and that is the concept of a disease. Therefore it is possible to integrate anonymous biomolecular data characterizing an aspect of a particular disease state with quantitative clinical data derived from patients being treated for the same disease.

Central to this approach is the need for a comprehensive controlled disease terminology through which the biomedical and clinical data is joined in a systematic fashion. In general, we would want this disease terminology to maximize three primary criteria: coverage, defined by the number of unique disease terms defined; expressiveness, which is the richness of relationships between disease terms; and resolution, which is the level of detail offered by the terminology structure. A deficiency in any of these could negatively impact the amount and diversity of data that could be integrated, and potentially limit the types of analyses that can be performed on the data downstream. There are a number of well-established disease terminologies in active use that satisfy the above criteria to varying degrees. Chief among these are the International Classification of Diseases (ICD), Medical Subject Headings (MeSH), and the Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT). Each of these is suited for data integration, yet each of them presents particular pros and cons.

The ICD terminology, evolved from a lineage that spans more than 100 years, is the most widely utilized disease terminology, with widespread adoption among a large number of major healthcare providers, the U.S. Federal Government, as well as the World Health Organization. Consequently, the majority of clinical data is codified using ICD codes. Unfortunately, the ICD is poorly suited for data integration, as the approximately 14,000 unique terms it codifies make it quite small compared to other terminologies. Furthermore, the ICD is more a compendium of diagnosis and procedure codes, as it lacks any significant hierarchical or relational structure.

MeSH, which is used primarily for the purpose of indexing publications, is only slightly larger than ICD, with more than 22,000 unique terms. However, the design of MeSH is much more structured and diverse compared to ICD. MeSH terms are arranged into a hierarchy of 14 distinct top-level categories that organize terms by Anatomy, Disease, Chemicals and Drugs, and Geography, among other things. MeSH also contains a set of qualifier terms that can be used to narrow the specificity of a descriptor term (e.g. "Measles/epidemiology"). While MeSH possesses many of the attributes desirable for translational data integration, its attributes are modest in comparison to those of SNOMED-CT.

SNOMED-CT was born from a medical terminology lineage that traces back more than 75 years, and is currently in use by pathologists worldwide to perform precise classifications of human disease [15, 16]. With more than 340,000 unique biomedical concepts organized into 19 relational hierarchies linked by more than 1.3 million relationships, it is by far the most expansive and expressive disease terminology in existence. The sheer number of concepts coupled with the rich relational architecture in SNOMED-CT offers attributes superior to other disease terminologies. For example, SNOMED-CT establishes that a clear cell carcinoma of the kidney is both a malignant tumor of the kidney and a malignant tumor of the retroperitoneum. The ICD version 9 (ICD-9) simply asserts that a malignant neoplasm of the kidney is a malignant neoplasm of the genitourinary organs, which is a much coarser designation. We therefore assert that SNOMED-CT is currently the best-suited terminology for integrating biomolecular and clinical data by disease.

In this study we investigate the feasibility of using SNOMED-CT to integrate, by disease, gene expression data from a public microarray repository with de-identified clinical laboratory data obtained from a hospital EHR system. We propose that SNOMED-CT is well suited for this approach as it is the largest disease vocabulary currently available. We evaluate the effectiveness of this approach based on the extent of data successfully joined.

METHODS

A high level representation of the data integration approach is detailed in figure 1. The microarray experiment data was obtained from the NCBI GEO FTP site (downloaded 11/27/2007), parsed into a relational structure, and stored in a MySQL database. The de-identified clinical laboratory data was obtained from the Lucile Packard Children’s Hospital via STRIDE as delimited text files. UMLS release 2007AA was used as the vocabulary source. The integration steps were performed as follows.

Figure 1 – Schematic representation of the approach used to join gene expression data with clinical laboratory data. Annotations from GDS are first mapped to UMLS CUIs that map to at least one SNOMED CT term, and the ICD9 CM codes from the patient records are mapped to SNOMED CT terms using the relational architecture of UMLS.

Mapping microarray experiments to diseases

Clinically relevant microarray data was identified using a previously described method [17]. In brief, we queried the NCBI Gene Expression Omnibus (GEO) [18] to obtain all GEO DataSet experiments with associated PubMed identifiers. For each PubMed identifier we obtained the associated MeSH headings using NCBI eUtils. Each of the MeSH headings was mapped to a UMLS CUI using the MRCONSO table. Using the MRSTY table, we obtained the semantic type identifier (TUI) for the mapped CUIs. If any MeSH term was found to have a semantic type among Injury or Poisoning (T037), Pathologic Function (T046), Disease or Syndrome (T047), Mental or Behavioral Dysfunction (T048), Experimental Model of Disease (T050), or Neoplastic Process (T191), then the associated experiment was determined to be disease-associated and therefore clinically relevant. This resulted in the positive identification of 737 disease-associated experiments.

The disease-associated experiments were then investigated by a second previously described text-mining technique that examines GEO DataSet (GDS) subset annotations to identify when a disease state is being compared to a normal control state [2]. GDS are higher-level representations of microarray experiments in which samples are organized into biologically informative collections known as subsets. The subsets are representative of the experimental axis under examination (figure 2). An attempt was made to map the free-text annotations associated with the GDS subsets to SNOMED-CT disease terms using UMLS concepts. These mappings were subsequently reviewed manually for accuracy, and erroneous codifications were corrected where found.

Figure 2 – Example of microarray data subsets defined by GEO GDS experiments.

Mapping patient laboratory data to diseases

Clinical laboratory data for pediatric patients from the Lucile Packard Children’s Hospital was obtained digitally from the STRIDE system. All of the laboratory measurements were received pre-encoded with ICD-9 codes. These ICD-9 codes were mapped to SNOMED-CT codes by first querying UMLS to find the CUI associated with the ICD-9 code. We then took advantage of the inter-terminology mappings provided by the UMLS (MRMAP) table to translate the ICD-9 codes into SNOMED-CT concepts using associated CUIs.

Joining the microarray and patient lab data by disease

The GDS subsets with mappings to SNOMED-CT disease CUIs were joined with the clinical laboratory data using the UMLS CUIs derived from mapping the ICD-9 codes to SNOMED-CT terms using the UMLS MRMAP table. Of the 238 unique disease concepts mapped to the microarray data, approximately 90% (211) were mapped to quantitative clinical laboratory data for at least one patient.

RESULTS

Using automated methods, we were able to identify 737 GDS microarray experiments in NCBI GEO related to human disease. The GDS subsets were investigated for terms related to UMLS concepts that were linked to a SNOMED-CT disease term, resulting in the identification of 238 unique human disease concepts. In total, 29,451 microarray samples were codified with SNOMED-CT disease identifiers. Note, however, that the method was restricted to include only those GDS for which a disease and a normal control subset could be identified. This restriction ensures that a disease vs. normal vector of change can be extracted from the data to establish a baseline disease expression signature for downstream analysis.

We retrieved quantitative clinical laboratory data representing diagnostic biomarkers for 49,414 patients across 9,997 distinct diagnosis codes. These codes mapped to 20,049 distinct UMLS CUIs. It is interesting to note that in mapping ICD to UMLS we find twice as many UMLS concepts as ICD-9 terms. This likely results from the fact that ICD-9 is generally a more high-level terminology, and therefore terms related to rare genetic disorders, for example, may only be represented by one ICD-9 code, whereas UMLS may allow for more fine-grained attribution of specific rare genetic disorders.

In joining the ICD-9 disease codes from the clinical laboratory data to the microarray data using SNOMED-CT disease codes, we find that 211 of the unique disease concepts annotating the microarray data can be mapped to clinical laboratory data. In total, clinical laboratory data for 13,452 patients was mapped to SNOMED-CT disease codes that were used to annotate the microarray GDS experiments. Table 1 shows the top diseases by the number of patients mapped.

Disease                     SNOMED Terms   ICD9CM Terms   Ind
Allergic asthma             1              1              2240
Asthma                      1              1              2240
Allergic asthma NEC         1              1              2240
Esophageal Reflux           1              1              1895
H. pylori infection         1              2              1322
Colitis                     1              1              1299
Primary Hypertension        1              1              1017
Hypertension                1              1              1017
Obesity                     2              1              1010
Type 1 diabetes             1              1              843

Table 1 – Top ten data mappings ordered by the number of patient lab records matched.

Disease                     SNOMED Terms   ICD9CM Terms   Ind
Follicular lymphoma         4              3              136
Hamman-Rich syndrome        4              2              18
Mycobacterial infection     3              2              26
Mixed hyperlipidemia        3              2              90
Hepatoma                    3              2              67
Fetal alcohol syndrome      3              1              10
Diabetic nephropathy        3              2              30
Megakaryocytic leukemia     2              2              125
Acute monocytic leukemia    2              1              7
Status epilepticus          2              1              84

Table 2 – Top ten data mappings sorted by the number of SNOMED-CT terms matched.

As evident from the data listed in table 1, there are cases in which distinct SNOMED-CT terms will map to the same ICD-9 term. To explore the ambiguities of mapping terms between SNOMED-CT and ICD-9 using CUIs, we investigated the overall pattern of the mapping cardinalities. Table 2 shows cases in which a single UMLS CUI maps to multiple SNOMED-CT terms. This could indicate that there is some degree of ambiguity in the SNOMED-CT to ICD-9 UMLS mappings, and perhaps a dampening of SNOMED-CT term resolution when using UMLS concepts.

To better understand the influence of UMLS CUI definitions with regard to source identifier consolidation, we calculated summary statistics for several terminologies within UMLS and restricted the results to CUIs representing a disease. The summary statistics are listed in table 3.

Source        Total disease concepts   Identifiers per concept
SNOMED-CT     74,611                   1.4
ICD-9-CM      12,631                   1.1
NCI           12,257                   1.0
MeSH          6,613                    1.0

Table 3 – Summary statistics for select disease terminologies sorted by total number of disease concepts (CUI).

DISCUSSION

The profusion of large public data repositories of genome-scale measures, coupled with the pressing imperative to translate such data into medicine, has precipitated the need to develop informatics tools and techniques for integrating disparate forms of biomolecular and clinical data. The purpose of this investigation was to explore the feasibility of using SNOMED-CT for such integrative efforts. We assessed the feasibility of SNOMED-CT as a translational joining factor by using it to integrate anonymous gene expression data from a public microarray repository with de-identified clinical laboratory data by disease.

We find that SNOMED-CT is effective as a disease terminology for integrating these two types of biomolecular and clinical data. The cases in which microarray data could not be mapped to clinical laboratory data largely reflect the fact that only pediatric data was used. The unmapped terms include Parkinson’s disease, macular degeneration, Alzheimer’s disease and other diseases not generally found in children. Other failed mappings represent relatively rare disorders, such as Yersiniosis and Luteoma. Better mappings might be obtained by leveraging the relational structure of UMLS to map terms that are parents or children of the disease terms.

The many-to-many and many-to-one SNOMED-CT to ICD-9 mappings using UMLS CUIs do present an interesting problem. These could lead to ambiguities in the mappings such that a highly specific disease variant is mapped to a more generalized disease category. This could have a negative impact on the downstream utilization of the integrated data. The data in table 3 suggests that large source vocabularies like SNOMED-CT have been constrained and compressed by the smaller vocabularies within UMLS to the degree that original source vocabulary resolution is lost. This may suggest an alternative strategy in which the biomolecular samples are labeled only with SNOMED-CT identifiers and the translation between SNOMED-CT and ICD-9 is performed outside of UMLS CUI constraints.

There are several caveats in the interpretation of the results. First, the data sets were not general, in that the clinical laboratory data only represented pediatric patients and the microarray experiments were limited to those in which a disease and a normal control distinction was evident. Furthermore, this study focused only on SNOMED-CT and did not apply the same techniques to the alternative disease terminologies mentioned, so it offers no quantitative comparison. Although the investigation revealed that SNOMED-CT was capable of joining the two data types, it offers no statistical characterization of the joining to assess its overall quality and reliability. We also acknowledge that the text-mining aspects of this approach are prone to errors, such as miscodings of the data.

The results demonstrate that current and future translational data integration endeavors can leverage existing clinical terminologies, such as SNOMED-CT, to integrate clinical and biomolecular data types and shift valuable efforts to downstream discovery. Furthermore, this study provides support for the continued development and use of SNOMED-CT for translational data integration, and brings to light the importance of inter-terminology mapping resources such as UMLS. As demonstrated by our own work, and the work of others, the straightforward act of integrating data from the molecular and clinical worlds can have a profound and direct impact on human health.

Although our initial work focused on the integration of microarray data and patient lab data specifically, we are now working to expand the application of the underlying system to integrate additional data types. In order to integrate new forms of biomolecular data into our current framework we must develop improved text-mining methods to map the underlying experimental data to SNOMED-CT identifiers. From the clinical perspective we will continue to integrate new data obtained from the STRIDE system and look to incorporate additional clinical data types as well. We must also develop methods to test and improve the reliability of the clinical data, as hospital workers will inevitably miscode a small percentage of the data. We must also account for the fact that the application of clinical codes is subject to a number of non-scientific influences, such as hospital billing policies, insurance companies, and pharmaceutical regulations. Any future work in this area should also entail the development of statistical metrics to evaluate the joining terminology, such that a principled decision can be made to identify the most appropriate terminology for a particular integration scenario.

ACKNOWLEDGEMENTS

This work was supported in part by the Lucile Packard Foundation for Children’s Health, National Library of Medicine (K22 LM008261), National Institute of General Medical Sciences (R01 GM079719), National Human Genome Research Institute (P50 HG003389), Howard Hughes Medical Institute, and the Pharmaceutical Research and Manufacturers of America Foundation. The authors would also like to thank Alex Skrenchuck for High Performance Computing support.

REFERENCES

1. Butte AJ, Kohane IS. Creation and implications of a phenome-genome network. Nature biotechnology. 2006 Jan;24(1):55-62.
2. Dudley J, Butte AJ. Enabling Integrative Genomic Analysis of High-Impact Human Diseases Through Text Mining. Pacific Symposium on Biocomputing. 2008.
3. STRIDE. [http://stride.stanford.edu/STRIDE/]
4. CDR. [https://cdr.virginia.edu/]
5. Chen DP, Weber SC, Constantinou PS, Ferris TA, Lowe HJ, Butte AJ. Clinical Arrays of Laboratory Measures, or "Clinarrays", Built from an Electronic Health Record Enable Disease Subtyping by Severity. AMIA Annual Symposium Proceedings. 2007.
6. Chen DP, Weber SC, Constantinou PS, Ferris TA, Lowe HJ, Butte AJ. Novel Integration of Hospital Electronic Medical Records and Gene Expression Measurements to Identify Genetic Markers of Maturation. Pacific Symposium on Biocomputing. 2008.
7. Pickrell J, Clerget-Darpoux F, Bourgain C. Power of genome-wide association studies in the presence of interacting loci. Genetic epidemiology. 2007 Nov;31(7):748-62.
8. Segal E, Sirlin CB, Ooi C, Adler AS, Gollub J, Chen X, et al. Decoding global gene expression programs in liver cancer by noninvasive imaging. Nature biotechnology. 2007 Jun;25(6):675-80.
9. Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo M, et al. Multiclass cancer diagnosis using tumor gene expression signatures. Proceedings of the National Academy of Sciences of the United States of America. 2001 Dec 18;98(26):15149-54.
10. Pandita A, Zielenska M, Thorner P, Bayani J, Godbout R, Greenberg M, et al. Application of comparative genomic hybridization, spectral karyotyping, and microarray analysis in the identification of subtype-specific patterns of genomic changes in rhabdomyosarcoma. Neoplasia (New York, NY). 1999 Aug;1(3):262-75.
11. Lapointe J, Li C, Higgins JP, van de Rijn M, Bair E, Montgomery K, et al. Gene expression profiling identifies clinically relevant subtypes of prostate cancer. Proceedings of the National Academy of Sciences of the United States of America. 2004 Jan 20;101(3):811-6.
12. Chen HY, Yu SL, Chen CH, Chang GC, Chen CY, Yuan A, et al. A five-gene signature and clinical outcome in non-small-cell lung cancer. The New England journal of medicine. 2007 Jan 4;356(1):11-20.
13. Potti A, Dressman HK, Bild A, Riedel RF, Chan G, Sayer R, et al. Genomic signatures to guide the use of chemotherapeutics. Nature medicine. 2006 Nov;12(11):1294-300.
14. Komatsu M, Hiyama K, Tanimoto K, Yunokawa M, Otani K, Ohtaki M, et al. Prediction of individual response to platinum/paclitaxel combination using novel marker genes in ovarian cancers. Molecular cancer therapeutics. 2006 Mar;5(3):767-75.
15. SNOMED Intl. [http://www.snomed.org]
16. Chute CG. Clinical classification and terminology: some history and current observations. J Am Med Inform Assoc. 2000 May-Jun;7(3):298-303.
17. Butte AJ, Chen R. Finding disease-related genomic experiments within an international repository: first steps in translational bioinformatics. AMIA Annual Symposium proceedings / AMIA Symposium. 2006:106-10.
18. Barrett T, Suzek TO, Troup DB, Wilhite SE, Ngau WC, Ledoux P, et al. NCBI GEO: mining millions of expression profiles--database and tools. Nucleic acids research. 2005 Jan 1;33(Database issue):D562-6.
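
As an illustration of the semantic-type filter described under "Mapping microarray experiments to diseases", the following is a minimal Python sketch, not the authors' implementation. The six disease-related TUIs are taken from the text. The dictionaries `MESH_TO_CUI` and `CUI_TO_TUIS` are hypothetical, hand-made stand-ins for lookups against the UMLS MRCONSO and MRSTY tables, and the CUI strings and experiment names are placeholders rather than real identifiers.

```python
# Semantic types (TUIs) that the METHODS text treats as disease-related.
DISEASE_TUIS = {"T037", "T046", "T047", "T048", "T050", "T191"}

# Hypothetical stand-in for MRCONSO: MeSH heading -> CUI (placeholder CUIs).
MESH_TO_CUI = {
    "Asthma": "C_ASTHMA",
    "Liver": "C_LIVER",
    "Obesity": "C_OBESITY",
}

# Hypothetical stand-in for MRSTY: CUI -> set of TUIs (toy assignments).
CUI_TO_TUIS = {
    "C_ASTHMA": {"T047"},
    "C_LIVER": {"T023"},
    "C_OBESITY": {"T047"},
}


def is_disease_associated(mesh_headings):
    """Return True if any heading maps to a CUI with a disease-related TUI."""
    for heading in mesh_headings:
        cui = MESH_TO_CUI.get(heading)
        if cui is None:
            continue  # heading has no CUI in the toy table
        if CUI_TO_TUIS.get(cui, set()) & DISEASE_TUIS:
            return True
    return False


# Hypothetical experiments, each with the MeSH headings of its publication.
experiments = {
    "GDS_A": ["Asthma", "Liver"],
    "GDS_B": ["Liver"],
    "GDS_C": ["Unmapped Heading"],
}

disease_experiments = sorted(
    gds for gds, headings in experiments.items()
    if is_disease_associated(headings)
)
print(disease_experiments)
```

In this toy input only `GDS_A` carries a heading with a disease-related semantic type, so it is the only experiment retained. In a real setting the two dictionaries would be replaced by queries against a loaded UMLS release.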
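
The CUI-based join described under "Joining the microarray and patient lab data by disease" can be sketched along the following lines. This is a simplified illustration under stated assumptions, not the system used in the study: all identifiers (`GDS_*`, `ICD_*`, `C1` to `C3`, `p1` to `p3`) are hypothetical placeholders, and the dictionary `icd9_to_cuis` stands in for the ICD-9 to CUI links that the paper derives from UMLS.

```python
from collections import defaultdict

# Hypothetical disease CUIs assigned to GDS disease subsets.
gds_disease_cuis = {"GDS_A": {"C1"}, "GDS_B": {"C2"}, "GDS_C": {"C3"}}

# Hypothetical ICD-9 code -> CUI links (stand-in for the UMLS-derived mapping).
icd9_to_cuis = {"ICD_A": {"C1"}, "ICD_B": {"C2"}}

# Hypothetical de-identified records: (patient id, ICD-9 code).
patient_diagnoses = [
    ("p1", "ICD_A"),
    ("p2", "ICD_A"),
    ("p2", "ICD_B"),
    ("p3", "ICD_X"),  # code with no CUI link, so it cannot be joined
]


def patients_by_cui(diagnoses, code_map):
    """Group patient ids under every CUI their diagnosis codes map to."""
    grouped = defaultdict(set)
    for patient_id, code in diagnoses:
        for cui in code_map.get(code, ()):
            grouped[cui].add(patient_id)
    return grouped


def join_by_disease(gds_cuis, patients):
    """Return GDS -> patient ids for experiments sharing at least one CUI."""
    joined = {}
    for gds, cuis in gds_cuis.items():
        matched = set()
        for cui in cuis:
            matched |= patients.get(cui, set())
        if matched:
            joined[gds] = matched
    return joined


patients = patients_by_cui(patient_diagnoses, icd9_to_cuis)
joined = join_by_disease(gds_disease_cuis, patients)

# Coverage: fraction of microarray disease concepts with at least one patient.
all_concepts = set().union(*gds_disease_cuis.values())
mapped_concepts = {cui for cui in all_concepts if patients.get(cui)}
coverage = len(mapped_concepts) / len(all_concepts)

# Distinct patients reachable from any microarray disease concept.
all_patients = set().union(*joined.values())
print(joined, coverage, len(all_patients))
```

The `coverage` value corresponds to the kind of figure the paper reports as 211 of 238 disease concepts, and the size of `all_patients` to the count of distinct patients mapped. The toy numbers here carry no meaning beyond showing how those two quantities are computed.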