Representing and sharing knowledge using SNOMED
Proceedings of the 3rd international conference on Knowledge Representation in Medicine (KR-MED 2008)
R. Cornet, K.A. Spackman (Eds)


Using SNOMED-CT For Translational Genomics Data Integration
Joel Dudley (1-3), David P. Chen (1-3), Atul J. Butte (1-3), M.D., Ph.D.
(1) Stanford Center for Biomedical Informatics Research, Department of Medicine,
(2) Department of Pediatrics, Stanford University School of Medicine, Stanford, CA/USA
(3) Lucile Packard Children’s Hospital, Palo Alto, CA/USA
{jdudley,dpchen,abutte}@stanford.edu
As industrial, governmental, and academic agencies place increasing emphasis on translational research, biomedical researchers are now faced with entirely new challenges with regard to both biomedical data integration and knowledge discovery. There is now both a strong need and a tremendous opportunity to apply translational bioinformatics to address the fundamental challenges in integrating the vast bodies of -omics and clinical data. Here we report on our preliminary work in utilizing SNOMED-CT as both a tool for translational data discovery and a major component in a framework for the large-scale integration of gene expression microarray data and clinical laboratory data. Annotations from microarray experiments in NCBI GEO were mapped to SNOMED-CT terms using UMLS, and these mappings were joined to clinical laboratory data using ICD-9-CM to SNOMED-CT mappings within UMLS. We find that microarray experiments characterizing 211 distinct diseases can be mapped to clinical laboratory data measurements for 13,452 distinct patients. We maintain that this work represents critical first steps in providing a foundation for large-scale translational data integration, and underlines the important role that controlled clinical terminologies, such as SNOMED-CT, can play in addressing such problems.

INTRODUCTION

Our ability to generate high-quality biomolecular data has advanced at a considerably faster rate than our ability to investigate the data generated. This imbalance, driven primarily by rapid advances in high-throughput biological data acquisition technologies and plummeting per-experiment costs, has created an entire spectrum of informatics challenges that are, in many instances, as intangible and complex as the fundamental biological questions that these technologies were designed to address. As a consequence, our ability to formulate and investigate important biological and medical questions is currently limited by our ability to manage and integrate the profusion of biomedical data.

Problems in data integration are moving towards the forefront of biomedical research, driven foremost by the sheer diversity of measurement technologies now available, and the tremendous volumes of such measurements finding their way into the public domain. The situation is further complicated by the fact that the majority of the public biomolecular data is annotated using unstructured free text, making it difficult to discern the various biological and medical contexts of the data in an automated fashion. In previous work we demonstrated the feasibility of using controlled terminologies and straightforward text-mining techniques to elucidate clinical, environmental, and phenotypic contexts from free-text annotations associated with public microarray data [1, 2]. The establishment of experimental context is critical to linking genes to environment, phenotype, and ultimately medicine.

While most major types of biomolecular data can be found in the public domain, it is traditionally difficult for researchers to gain access to clinical data. This is unfortunate, as the data generated on a daily basis by hospitals and clinicians is perhaps the richest source of phenotypic biomarker data currently available. Fortunately, modern Electronic Health Record (EHR) systems such as the Stanford Translational Research Integrated Database Environment (STRIDE) [3] and the University of Virginia Health System Clinical Data Repository (CDR) [4] grant institutional researchers access to large volumes of de-identified, quantitative clinical data in digital form. In recent work, we demonstrated the utility of applying bioinformatics methods to quantitative clinical data to draw new inferences about disease severity [5] and elucidate novel biomarkers [6].

Genome-wide association studies have revealed that for many complex diseases, the pathogenesis of the disease may be facilitated by relatively minor changes across a large number of genes interacting through as yet poorly understood mechanisms [7]. These findings have therefore highlighted the importance of linking biomolecular data with phenotypic quantifications in order to uncover the full complexity of disease etiology. Recent work in integrating these two data types has offered new insights into disease etiology and pathology with direct clinical implications. Segal and colleagues correlated imaging traits from computed tomography (CT) images of liver cancers with gene expression




data to reconstruct global expression signatures in cancer tumors that are linked to diagnosis, prognosis and treatment [8]. A number of studies have demonstrated the utility of patient microarrays in identifying gene expression patterns linked to disease diagnosis [9], subtypes [10, 11], outcome [12], and treatment [13, 14]. As significant as the aforementioned findings are, their underlying methods are limited by the fact that, in all instances, they require that the biomolecular and clinical data be derived from the same patient. Given the current high costs and logistical complexities involved in acquiring patient data in a clinical setting, it would be prohibitively expensive to scale the same approaches to address the broad spectrum of human disease. Furthermore, such an approach implicitly eschews the great wealth of public biomolecular data readily available.

A major problem in integrating clinical and biomolecular data derived from disparate sources is to identify attributes by which they can be appropriately joined. This task is complicated by the fact that the majority of biomolecular data is annotated around the concepts of genes and gene products, whereas clinical data is centered on the concept of a patient. We find one concept shared by both clinical data and vast amounts of biomolecular data, and that is the concept of a disease. Therefore it is possible to integrate anonymous biomolecular data characterizing an aspect of a particular disease state with quantitative clinical data derived from patients being treated for the same disease.

Central to this approach is the need for a comprehensive controlled disease terminology through which the biomedical and clinical data is joined in a systematic fashion. In general, we would want this disease terminology to maximize three primary criteria: coverage, defined by the number of unique disease terms defined; expressiveness, which is the richness of relationships between disease terms; and resolution, which is the level of detail offered by the terminology structure. A deficiency in any of these could negatively impact the amount and diversity of data that could be integrated, and potentially limit the types of analyses that can be performed on the data downstream. There are a number of well-established disease terminologies in active use that satisfy the above criteria to varying degrees. Chief among these are the International Classification of Diseases (ICD), Medical Subject Headings (MeSH), and the Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT). Each of these is suited for data integration, yet each presents particular pros and cons.

The ICD terminology, evolved from a lineage that spans more than 100 years, is the most widely utilized disease terminology, with widespread adoption among a large number of major healthcare providers, the U.S. Federal Government, and the World Health Organization. Consequently, the majority of clinical data is codified using ICD codes. Unfortunately, the ICD is poorly suited for data integration, as its approximately 14,000 unique terms make it quite small compared to other terminologies. Furthermore, the ICD is more a compendium of diagnosis and procedure codes, as it lacks any significant hierarchical or relational structure.

MeSH, which is used primarily for the purpose of indexing publications, is only slightly larger than ICD, with more than 22,000 unique terms. However, the design of MeSH is much more structured and diverse compared to ICD. MeSH terms are arranged into a hierarchy of 14 distinct top-level categories that organize terms by Anatomy, Disease, Chemicals and Drugs, and Geography, among other things. MeSH also contains a set of qualifier terms that can be used to narrow the specificity of a descriptor term (e.g. "Measles/epidemiology"). While MeSH possesses many of the attributes desirable for translational data integration, its attributes are modest in comparison to those of SNOMED-CT.

SNOMED-CT was born from a medical terminology lineage that traces back more than 75 years, and is currently in use by pathologists worldwide to perform precise classifications of human disease [15, 16]. With more than 340,000 unique biomedical concepts organized into 19 relational hierarchies linked by more than 1.3 million relationships, it is by far the most expansive and expressive disease terminology in existence. The sheer number of concepts coupled with the rich relational architecture in SNOMED-CT offers attributes superior to other disease terminologies. For example, SNOMED-CT establishes that a clear cell carcinoma of the kidney is both a malignant tumor of the kidney and a malignant tumor of the retroperitoneum. The ICD version 9 (ICD-9) simply asserts that a malignant neoplasm of the kidney is a malignant neoplasm of the genitourinary organs, which is a much coarser designation. We therefore assert that SNOMED-CT is currently the best-suited terminology for integrating biomolecular and clinical data by disease.

In this study we investigate the feasibility of using SNOMED-CT to integrate gene expression data from a public microarray repository with de-identified clinical laboratory data obtained from a hospital EHR system by disease. We propose that SNOMED-CT is




well suited for this approach as it is the largest disease vocabulary currently available. We evaluate the effectiveness of this approach based on the extent of data successfully joined.

METHODS

A high-level representation of the data integration approach is shown in Figure 1. The microarray experiment data was obtained from the NCBI GEO FTP site (downloaded 11/27/2007), parsed into a relational structure, and stored in a MySQL database. The de-identified clinical laboratory data was obtained from the Lucile Packard Children’s Hospital via STRIDE as delimited text files. UMLS release 2007AA was used as the vocabulary source. The integration steps were performed as follows.

Clinically relevant microarray data was identified using a previously described method [17]. In brief, we queried the NCBI Gene Expression Omnibus (GEO) [18] to obtain all GEO DataSet experiments with associated PubMed identifiers. For each PubMed identifier we obtained the associated MeSH headings using NCBI eUtils. Each of the MeSH headings was mapped to a UMLS CUI using the MRCONSO table. Using the MRSTY table, we obtained the semantic type identifier (TUI) for the mapped CUIs. If any MeSH term was found to have a semantic type among Injury or Poisoning (T037), Pathologic Function (T046), Disease or Syndrome (T047), Mental or Behavioral Dysfunction (T048), Experimental Model of Disease (T050), or Neoplastic Process (T191), then the associated experiment was determined to be disease-associated and therefore clinically relevant. This resulted in the positive identification of 737 disease-associated experiments.
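
The semantic-type filter described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the two dictionaries stand in for lookups against the MRCONSO and MRSTY tables, and the concept identifiers, experiment names, and function name are hypothetical placeholders. Only the six disease-related semantic type identifiers are taken from the text.

```python
# Semantic types that mark a MeSH heading as disease-related (from the text).
DISEASE_TUIS = frozenset({"T037", "T046", "T047", "T048", "T050", "T191"})


def is_disease_associated(mesh_headings, mesh_to_cui, cui_to_tuis):
    """Return True if any MeSH heading maps to a CUI with a disease semantic type.

    mesh_to_cui and cui_to_tuis are assumed in-memory stand-ins for
    MRCONSO (heading -> CUI) and MRSTY (CUI -> semantic types).
    """
    for heading in mesh_headings:
        cui = mesh_to_cui.get(heading)
        if cui is None:
            continue  # heading has no mapping; skip it
        if DISEASE_TUIS & set(cui_to_tuis.get(cui, ())):
            return True
    return False


# Toy data with placeholder identifiers (not real UMLS CUIs).
mesh_to_cui = {"Asthma": "CUI_A", "Lung": "CUI_B"}
cui_to_tuis = {
    "CUI_A": ["T047"],  # Disease or Syndrome
    "CUI_B": ["T023"],  # a semantic type outside the disease set
}
experiments = {
    "GDS_example_1": ["Asthma", "Lung"],
    "GDS_example_2": ["Lung"],
}

relevant = sorted(
    gds
    for gds, headings in experiments.items()
    if is_disease_associated(headings, mesh_to_cui, cui_to_tuis)
)
print(relevant)
```

In this toy input only the first experiment carries a heading with a disease semantic type, so it is the only one retained.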

The disease-associated experiments were investigated by a second previously described text-mining technique that examines GEO DataSet (GDS) subset annotations to identify when a disease state is being compared to a normal control state [2]. GDS are higher-level representations of microarray experiments in which samples are organized into biologically informative collections known as subsets. The subsets are representative of the experimental axis under examination (Figure 2). An attempt was made to map the free-text annotations associated with the GDS subsets to SNOMED-CT disease terms using UMLS concepts. These mappings were subsequently reviewed manually for accuracy, and erroneous codifications were corrected if found.
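
As a rough illustration of the subset-annotation step, the sketch below checks whether a GDS has both a subset that maps to a disease concept and a subset that looks like a normal control. It is a simplification under stated assumptions: the text-mining method itself is described in the cited earlier work, whereas this example uses exact matching on normalized strings, a hypothetical list of control keywords, and a placeholder lexicon in place of UMLS-derived SNOMED-CT disease terms.

```python
import re

# Assumed keywords for spotting a control subset; not taken from the paper.
CONTROL_WORDS = {"control", "normal", "healthy"}


def normalize(text):
    """Lowercase and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def classify_subsets(subset_annotations, disease_lexicon):
    """Split subset annotations into disease hits and a control flag.

    disease_lexicon maps a disease term to a concept identifier and is a
    hypothetical stand-in for SNOMED-CT disease terms reached through UMLS.
    """
    lexicon = {normalize(term): concept for term, concept in disease_lexicon.items()}
    disease_hits = {}
    has_control = False
    for annotation in subset_annotations:
        norm = normalize(annotation)
        if norm in lexicon:
            disease_hits[annotation] = lexicon[norm]
        elif CONTROL_WORDS & set(norm.split()):
            has_control = True
    return disease_hits, has_control


# Placeholder concept identifier, not a real SNOMED-CT code or CUI.
lexicon = {"Type 1 diabetes": "CONCEPT_T1D"}

hits, has_control = classify_subsets(["type 1 diabetes", "control"], lexicon)
usable = bool(hits) and has_control  # keep only disease vs. control designs
print(hits, has_control, usable)
```

Any automatic match of this kind would still need the manual review step described above, since exact string matching misses synonyms and can miscode ambiguous annotations.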


Figure 1 – Schematic representation of the approach used to join gene expression data with clinical laboratory data. Annotations from GDS are first mapped to UMLS CUIs that map to at least one SNOMED-CT term, and the ICD-9-CM codes from the patient records are mapped to SNOMED-CT terms using the relational architecture of UMLS.

Mapping microarray experiments to diseases

Figure 2 – Example of microarray data subsets defined by GEO GDS experiments.

Mapping patient laboratory data to diseases

Clinical laboratory data for pediatric patients from the Lucile Packard Children’s Hospital was obtained digitally from the STRIDE system. All of the laboratory measurements were received pre-encoded with ICD-9 codes. These ICD-9 codes were mapped to SNOMED-CT codes by first querying UMLS to




find the CUI identifier associated with the ICD-9 code. We then took advantage of the inter-terminology mappings provided by the UMLS MRMAP table to translate the ICD-9 codes into SNOMED-CT concepts using associated CUIs.

Joining the microarray and patient lab data by disease

The GDS subsets with mappings to SNOMED-CT disease CUIs were joined with the clinical laboratory data using the UMLS CUIs derived from mapping the ICD-9 codes to SNOMED-CT terms with the UMLS MRMAP table. Of the 238 unique disease concepts mapped to the microarray data, approximately 90% (211) were mapped to quantitative clinical laboratory data for at least one patient.

RESULTS

Using automated methods, we were able to identify 737 GDS microarray experiments in NCBI GEO related to human disease. The GDS subsets were investigated for terms related to UMLS concepts that were linked to a SNOMED-CT disease term, resulting in the identification of 238 unique human disease concepts. In total, 29,451 microarray samples were codified with SNOMED-CT disease identifiers.

Note, however, that the method was restricted to include only those GDS for which a disease and normal control subset could be identified. This restriction ensures that a disease vs. normal vector of change can be extracted from the data to establish a baseline disease expression signature for downstream analysis.

We retrieved quantitative clinical laboratory data representing diagnostic biomarkers for 49,414 patients across 9,997 distinct diagnosis codes. These codes mapped to 20,049 distinct UMLS CUIs. It is interesting to note that in mapping ICD to UMLS we find roughly twice as many UMLS concepts as ICD-9 terms. This likely results from the fact that ICD-9 is generally a higher-level terminology, and therefore terms related to rare genetic disorders, for example, may only be represented by one ICD-9 code, whereas UMLS may allow for more fine-grained attribution of specific rare genetic disorders.

In joining the ICD-9 disease codes from the clinical laboratory data to the microarray data using SNOMED-CT disease codes, we find that 211 of the unique disease concepts annotating the microarray data can be mapped to clinical laboratory data. In total, clinical laboratory data for 13,452 patients was mapped to SNOMED-CT disease codes that were used to annotate the microarray GDS experiments. Table 1 shows the top diseases by the number of patients mapped.

Disease                     SNOMED Terms   ICD9CM Terms    Ind
Allergic asthma                        1              1   2240
Asthma                                 1              1   2240
Allergic asthma NEC                    1              1   2240
Esophageal Reflux                      1              1   1895
H. pylori infection                    1              2   1322
Colitis                                1              1   1299
Primary Hypertension                   1              1   1017
Hypertension                           1              1   1017
Obesity                                2              1   1010
Type 1 diabetes                        1              1    843

Table 1 – Top ten data mappings ordered by the number of patient lab records matched.

Disease                     SNOMED Terms   ICD9CM Terms    Ind
Follicular lymphoma                    4              3    136
Hamman-Rich syndrome                   4              2     18
Mycobacterial infection                3              2     26
Mixed hyperlipidemia                   3              2     90
Hepatoma                               3              2     67
Fetal alcohol syndrome                 3              1     10
Diabetic nephropathy                   3              2     30
Megakaryocytic leukemia                2              2    125
Acute monocytic leukemia               2              1      7
Status epilepticus                     2              1     84

Table 2 – Top ten data mappings sorted by the number of SNOMED-CT terms matched.

As evident from the data listed in Table 1, there are cases in which distinct SNOMED-CT terms will map to the same ICD-9 term. To explore the ambiguities of mapping terms between SNOMED-CT and ICD-9 using CUIs, we investigated the overall pattern of the mapping cardinalities. Table 2 shows cases in which a single UMLS CUI maps to multiple




              SNOMED-CT terms. This could indicate that there               in the mappings such that a highly specific disease
              is some degree of ambiguity in the SNOMED-CT to               variant is mapped to a more generalized disease
              ICD-9 UMLS mappings, and perhaps a dampening of               category. This could have a negative impact on the
              SNOMED-CT term resolution when using UMLS                     downstream utilization of the integrated data. The
              concepts.                                                     data in table 3 suggests that large source vocabularies
                                                                            like SNOMED-CT have been constrained and
              To better understand the influence of UMLS CUI                compressed by the smaller vocabularies within
              definitions with regard to source identifier                  UMLS to the degree that original source vocabulary
              consolidation, we calculated summary statistics for           resolution is lost. This may suggest an alternative
              several terminologies within UMLS and restricted the         strategy in which the biomolecular samples are
              results to CUIs representing a disease. The summary           labeled only with SNOMED-CT identifiers and the
              statistics are listed in table 3.                             translation between SNOMED-CT and ICD-9 is
                                                                            performed outside of UMLS CUI constraints.
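
A summary of the kind reported in Table 3 (disease concepts per source and mean source identifiers per concept) could be computed along these lines. This is a sketch under assumptions: the input is taken to be (source, CUI, source identifier) rows already restricted to disease CUIs, and the function name, source labels, and identifiers are hypothetical placeholders rather than UMLS content.

```python
from collections import defaultdict


def identifiers_per_concept(rows):
    """Summarize (source, cui, source_identifier) rows per source vocabulary.

    Returns {source: (number_of_concepts, mean_identifiers_per_concept)}.
    Duplicate rows are counted once.
    """
    ids_by_source_cui = defaultdict(set)
    for source, cui, code in rows:
        ids_by_source_cui[(source, cui)].add(code)

    concept_count = defaultdict(int)
    identifier_count = defaultdict(int)
    for (source, _cui), codes in ids_by_source_cui.items():
        concept_count[source] += 1
        identifier_count[source] += len(codes)

    return {
        source: (concept_count[source], identifier_count[source] / concept_count[source])
        for source in concept_count
    }


# Toy rows with placeholder sources and identifiers.
rows = [
    ("SRC_A", "CUI_1", "a1"),
    ("SRC_A", "CUI_1", "a2"),  # two identifiers consolidated under one CUI
    ("SRC_A", "CUI_2", "a3"),
    ("SRC_B", "CUI_1", "b1"),
    ("SRC_B", "CUI_2", "b2"),
]

stats = identifiers_per_concept(rows)
print(stats)
```

A mean above 1.0 for a source indicates that several of its identifiers share a single CUI, which is the consolidation effect examined here.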
                                Total disease      Identifiers per
                  Source                                                    There are several caveats in the interpretation of the
                                  concepts            concept
              SNOMED-CT                74,611            1.4                results. First off, the data sets were not generalized
               ICD-9-CM                12,631            1.1                in that the clinical laboratory data only represented
                  NCI                  12,257            1.0                pediatric patients and the microarray experiments
                 MeSH                   6,613            1.0                were limited to those in which a disease and a normal
                                                                            control distinction was evident. Furthermore, this
              Table 3 – Summary statistics for select disease terminologies sorted by total
              number of disease concepts (CUI).

              DISCUSSION

              The profusion of large public data repositories of genome-scale measures,
              coupled with the pressing imperative to translate such data into medicine, has
              precipitated the need to develop informatics tools and techniques for
              integrating disparate forms of biomolecular and clinical data. The purpose of
              this investigation was to explore the feasibility of using SNOMED-CT for such
              integrative efforts. We assessed the feasibility of SNOMED-CT as a
              translational joining factor by using it to integrate anonymous gene
              expression data from a public microarray repository with de-identified
              clinical laboratory data by disease.

              We find that SNOMED-CT is effective as a disease terminology for integrating
              these two types of biomolecular and clinical data. The cases in which
              microarray data could not be mapped to clinical laboratory data largely
              reflect the fact that only pediatric data was used. The unmapped terms contain
              diseases such as Parkinson’s disease, macular degeneration, Alzheimer’s
              disease and other diseases not generally found in children. Other failed
              mappings represent relatively rare disorders, such as yersiniosis and luteoma.
              Better mappings might be obtained by leveraging the relational structure of
              UMLS to map terms that are parents or children of the disease terms.

              The many-to-many and many-to-one SNOMED-CT to ICD-9 mappings using UMLS CUIs
              do present an interesting problem. These could lead to ambiguities

              study offered only a focus on SNOMED-CT and did not apply the same techniques
              to the alternative disease terminologies mentioned to offer any quantitative
              comparison. Although the investigation revealed that SNOMED-CT was capable of
              joining the two data types, it offers no statistical characterization of the
              joining to assess its overall quality and reliability. Of course we also
              acknowledge that the text mining aspects of this approach are prone to errors,
              such as miscodings of the data.

              The results demonstrate that current and future translational data integration
              endeavors can leverage existing clinical terminologies, such as SNOMED-CT, to
              integrate clinical and biomolecular data types and shift valuable efforts to
              downstream discovery. Furthermore, this study provides support for the
              continued development and use of SNOMED-CT for translational data integration,
              and brings to light the importance of inter-terminology mapping resources such
              as UMLS. As demonstrated by our own work, and the work of others, the
              straightforward act of integrating data from the molecular and clinical worlds
              can have a profound and direct impact on human health.

              Although our initial work focused on the integration of microarray data and
              patient lab data specifically, we are now working to expand the application of
              the underlying system to integrate additional data types. In order to
              integrate new forms of biomolecular data into our current framework we must
              develop improved text-mining methods to map the underlying experimental data
              to SNOMED-CT identifiers. From the clinical perspective we will continue to
              integrate new data obtained from the STRIDE system and look to incorporate
              additional clinical data types as well.
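
              As a rough illustration of the CUI-based joining discussed in this section,
              the following Python sketch maps SNOMED-CT codes to ICD-9 codes that share a
              UMLS CUI, retries unmapped codes through their direct parent and child terms,
              flags one-to-many results as ambiguous, and reports the fraction of codes that
              could be joined. It is not the system used in this study. All identifiers
              (sct_A, C1, icd9_x, and so on), the function names, and the single-hop
              fallback rule are assumptions made for the example.

```python
from collections import defaultdict


def invert(mapping):
    """Turn {key: {values}} into {value: {keys}}."""
    inverted = defaultdict(set)
    for key, values in mapping.items():
        for value in values:
            inverted[value].add(key)
    return dict(inverted)


def join_by_cui(snomed_codes, snomed_to_cui, icd9_to_cui, parents):
    """Map each SNOMED-CT code to the ICD-9 codes that share a CUI.

    A code with no direct match is retried through its direct parents
    and children (one hop only, an assumption of this sketch).
    Returns {snomed_code: (icd9_codes, how)} where how is
    'direct', 'relative', or 'unmapped'.
    """
    cui_to_icd9 = invert(icd9_to_cui)
    children = invert(parents)  # parent -> {children}

    def targets(code):
        found = set()
        for cui in snomed_to_cui.get(code, ()):
            found |= cui_to_icd9.get(cui, set())
        return found

    result = {}
    for code in snomed_codes:
        hits = targets(code)
        how = "direct"
        if not hits:
            relatives = parents.get(code, set()) | children.get(code, set())
            for relative in relatives:
                hits |= targets(relative)
            how = "relative" if hits else "unmapped"
        result[code] = (hits, how)
    return result


# Toy data: placeholder identifiers, not real SNOMED-CT, UMLS, or ICD-9 codes.
snomed_to_cui = {
    "sct_A": {"C1"},
    "sct_B": {"C2"},
    "sct_B_child": {"C3"},
    "sct_D": {"C4"},
}
icd9_to_cui = {
    "icd9_x": {"C1"},
    "icd9_y": {"C1"},  # two ICD-9 codes share one CUI: ambiguous target
    "icd9_z": {"C2"},
}
parents = {"sct_B_child": {"sct_B"}}  # child -> {parents}

joined = join_by_cui(sorted(snomed_to_cui), snomed_to_cui, icd9_to_cui, parents)

# Codes whose join yields more than one ICD-9 code need manual review.
ambiguous = sorted(code for code, (hits, _) in joined.items() if len(hits) > 1)

# Fraction of SNOMED-CT codes joined either directly or through a relative.
coverage = sum(how != "unmapped" for _, how in joined.values()) / len(joined)
```

              In this toy input, sct_A is ambiguous because two ICD-9 placeholders share its
              CUI, sct_B_child is joined only through its parent, and sct_D stays unmapped.
              Recording how each code was joined keeps the looser parent and child matches
              separate from direct ones, so they can be reviewed or excluded later.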




              We must also develop methods to test and improve the reliability of the
              clinical data, as hospital workers will inevitably miscode a small percentage
              of the data. We must also account for the fact that the application of
              clinical codes is subject to a number of non-scientific influences, such as
              hospital billing policies, insurance companies, and pharmaceutical
              regulations. Any future work in this area should also entail the development
              of statistical metrics to evaluate the joining terminology, such that a
              principled decision can be made to identify the most appropriate terminology
              for a particular integration scenario.

              ACKNOWLEDGEMENTS

              This work was supported in part by the Lucile Packard Foundation for
              Children’s Health, National Library of Medicine (K22 LM008261), National
              Institute of General Medical Sciences (R01 GM079719), National Human Genome
              Research Institute (P50 HG003389), Howard Hughes Medical Institute, and the
              Pharmaceutical Research and Manufacturers of America Foundation. The authors
              would also like to thank Alex Skrenchuck for High Performance Computing
              support.

              REFERENCES

              1.  Butte AJ, Kohane IS. Creation and implications of a phenome-genome
                  network. Nature biotechnology. 2006 Jan;24(1):55-62.
              2.  Dudley J, Butte AJ. Enabling Integrative Genomic Analysis of High-Impact
                  Human Diseases Through Text Mining. Pacific Symposium on Biocomputing.
                  2008.
              3.  STRIDE. [http://stride.stanford.edu/STRIDE/]
              4.  CDR. [https://cdr.virginia.edu/]
              5.  Chen DP, Weber SC, Constantinou PS, Ferris TA, Lowe HJ, Butte AJ. Clinical
                  Arrays of Laboratory Measures, or "Clinarrays", Built from an Electronic
                  Health Record Enable Disease Subtyping by Severity. AMIA Annual Symposium
                  Proceedings. 2007.
              6.  Chen DP, Weber SC, Constantinou PS, Ferris TA, Lowe HJ, Butte AJ. Novel
                  Integration of Hospital Electronic Medical Records and Gene Expression
                  Measurements to Identify Genetic Markers of Maturation. Pacific Symposium
                  on Biocomputing. 2008.
              7.  Pickrell J, Clerget-Darpoux F, Bourgain C. Power of genome-wide
                  association studies in the presence of interacting loci. Genetic
                  epidemiology. 2007 Nov;31(7):748-62.
              8.  Segal E, Sirlin CB, Ooi C, Adler AS, Gollub J, Chen X, et al. Decoding
                  global gene expression programs in liver cancer by noninvasive imaging.
                  Nature biotechnology. 2007 Jun;25(6):675-80.
              9.  Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo M, et al.
                  Multiclass cancer diagnosis using tumor gene expression signatures.
                  Proceedings of the National Academy of Sciences of the United States of
                  America. 2001 Dec 18;98(26):15149-54.
              10. Pandita A, Zielenska M, Thorner P, Bayani J, Godbout R, Greenberg M, et
                  al. Application of comparative genomic hybridization, spectral
                  karyotyping, and microarray analysis in the identification of
                  subtype-specific patterns of genomic changes in rhabdomyosarcoma.
                  Neoplasia (New York, NY). 1999 Aug;1(3):262-75.
              11. Lapointe J, Li C, Higgins JP, van de Rijn M, Bair E, Montgomery K, et al.
                  Gene expression profiling identifies clinically relevant subtypes of
                  prostate cancer. Proceedings of the National Academy of Sciences of the
                  United States of America. 2004 Jan 20;101(3):811-6.
              12. Chen HY, Yu SL, Chen CH, Chang GC, Chen CY, Yuan A, et al. A five-gene
                  signature and clinical outcome in non-small-cell lung cancer. The New
                  England journal of medicine. 2007 Jan 4;356(1):11-20.
              13. Potti A, Dressman HK, Bild A, Riedel RF, Chan G, Sayer R, et al. Genomic
                  signatures to guide the use of chemotherapeutics. Nature medicine. 2006
                  Nov;12(11):1294-300.
              14. Komatsu M, Hiyama K, Tanimoto K, Yunokawa M, Otani K, Ohtaki M, et al.
                  Prediction of individual response to platinum/paclitaxel combination using
                  novel marker genes in ovarian cancers. Molecular cancer therapeutics. 2006
                  Mar;5(3):767-75.
              15. SNOMED Intl. [http://www.snomed.org]
              16. Chute CG. Clinical classification and terminology: some history and
                  current observations. J Am Med Inform Assoc. 2000 May-Jun;7(3):298-303.
              17. Butte AJ, Chen R. Finding disease-related genomic experiments within an
                  international repository: first steps in translational bioinformatics.
                  AMIA Annual Symposium proceedings / AMIA Symposium. 2006:106-10.
              18. Barrett T, Suzek TO, Troup DB, Wilhite SE, Ngau WC, Ledoux P, et al. NCBI
                  GEO: mining millions of expression profiles--database and tools. Nucleic
                  acids research. 2005 Jan 1;33(Database issue):D562-6.

