Constructing a Lattice of Infectious Disease Ontologies from a Staphylococcus aureus Isolate Repository

Albert Goldfain1,*, Barry Smith2 and Lindsay G. Cowell3
1 Blue Highway Inc., Syracuse, NY, USA
2 University at Buffalo, Buffalo, NY, USA
3 University of Texas Southwestern Medical Center, Dallas, TX, USA
* To whom correspondence should be addressed: agoldfain@blue-highway.com

ABSTRACT
A repository of clinically associated Staphylococcus aureus (Sa) isolates is used to semi-automatically generate a set of application ontologies for specific subfamilies of Sa-related disease. Each such application ontology is compatible with the Infectious Disease Ontology (IDO) and uses resources from the Open Biomedical Ontology (OBO) Foundry. The set of application ontologies forms a lattice structure beneath the IDO-Core and IDO-extension reference ontologies. We show how this lattice can be used to define a strategy for the construction of a new taxonomy of infectious disease incorporating genetic, molecular, and clinical data. We also outline how faceted browsing and query of annotated data is supported using a lattice application ontology.

1 INTRODUCTION
One of the more ambitious goals of current clinical and biomedical research is the personalization of medicine, in which treatments are selected on the basis of patient-specific as well as disease-specific information. Recent advances in high-throughput technologies have resulted in a push for the use of patient-specific information in care decisions, particularly genomic and functional genomic data, but also proteomic, metabolomic, and cytometry data. It is widely believed that the increased precision of personalized medicine will yield more effective treatments, with better outcomes and fewer adverse side effects.

Personalized medicine requires that genomic (and other) data be effectively classified and associated with known clinical phenotypes and disease types. Currently available taxonomies of disease do not support this, however, and are in general not well suited for integration and analysis of high-throughput molecular and cellular data with clinical data, such as the data found in electronic medical records. Current disease taxonomies were developed primarily to support diagnosis and reimbursement coding rather than as biological representations of disease. As a consequence, they are based on single, rigid hierarchies that do not reflect the complex interconnections between disease types; they lack links to molecular- and cellular-level data and information; and they lack the sort of formal structure that would support their use for the kinds of computational analyses applied in biological and clinical research. For example, the International Classification of Diseases (ICD) version 9 includes catch-all codes such as “[041.19] Other Staphylococcus” and scattered exclusions such as “[041] Bacterial infection in conditions classified elsewhere and of unspecified site. Excludes: septicemia (038.0 – 038.9)”.

The National Academies of Science have recently called for a new taxonomy of disease, along with informatics tools to support its construction (Committee on the Framework for Developing a New Taxonomy of Disease, 2011). In support of such a taxonomy, an information commons would be developed to store “bedside” clinical data collected during clinical encounters, effectively treating each patient as a participant in a clinical study, and to integrate this information in a knowledge network that would formalize the relationships between different disease data sets. The long-term goal is to produce the new taxonomy of disease from a validated subset of the knowledge network.

We believe that biomedical ontologies will be essential to the construction of the envisioned taxonomy of disease, especially the ontologies in the Open Biomedical Ontology (OBO) Foundry (Smith et al., 2007). The OBO Foundry (OBOF) represents a coordinated effort to construct reference biomedical ontologies according to best practices and principles and to use these ontologies as the basis for OBOF-conformant application ontologies. The coordinated development of these ontologies and their use of a common formalism increase data interoperability and consistency for datasets annotated in their terms. The use of OBOF ontologies in construction of the new disease taxonomy can bring significant benefits. For example, the widespread use of OBOF ontologies for data annotation would link the disease taxonomy to many existing databases and information resources, and their underlying formalism allows the dynamic inference of different views and multiple interconnected hierarchies. In addition, many analysis algorithms for high-throughput data already utilize these ontologies.

The Infectious Disease Ontology (IDO) suite of ontologies is being developed within the OBO Foundry framework and includes a hub, the IDO-Core, consisting of terms and relations relevant to infectious diseases generally, together with a set of disease-specific extensions derived therefrom. The IDO ontologies are interoperable and jointly cover the infectious disease domain. Here we illustrate how the IDO ontologies can be used in the construction of a part of the new taxonomy of disease and to integrate clinically relevant phenotypic and genotypic data.

We take as our case study infectious diseases caused by Staphylococcus aureus (Sa) infection. We show how isolate data from the Network on Antimicrobial Resistance in Staphylococcus aureus (NARSA) can be annotated using IDO and its extensions.
We then demonstrate a faceted browser in which both phenotypic and genotypic aspects of the IDO-annotated isolate data can be exposed and queried.

Our goal is to provide a resource from which an IDO-conformant application ontology can be derived for a specific Sa infectious disease type. Such application ontologies can be generated in a semi-automated way and collectively form a lattice structure beneath IDO-Core (described below). While our example narrowly focuses on properties of infectious agents, this effort is part of a larger effort to create an ontological representation of Sa diseases, and we believe the same approach can be applied to host data and to the integration of host and pathogen data.

2 INFECTIOUS DISEASE ONTOLOGY
IDO-Core includes terms relevant to infectious diseases generally, such as ‘host’, ‘infectious agent’, ‘fomite’, and ‘virulence factor’, and the relations between the corresponding types. Disease- and pathogen-specific extensions are developed by extending the core to include terms and relations relevant to the corresponding infectious disease(s). For example, the IDO extension for Sa (IDO-Sa) includes terms such as ‘Staphylococcus aureus bacteremia’ and ‘Staphylococcal cassette chromosome mec’.

IDO extensions are currently being developed for influenza, malaria, brucellosis, HIV, and Sa. Further extensions will involve the creation of specific application ontologies by IDO user groups. It will be necessary for these ontologies to import terms from several OBO Foundry ontologies, as well as from existing IDO extension ontologies. This will give rise to a lattice structure beneath IDO-Core and its extensions, as illustrated in Figure 1. At the bottom of the lattice is IDO-ALL, the (pre-inference) closure of the possible IDO ontologies.

Fig 1. A possible lattice expansion of IDO

When a new application ontology is needed, its position in the lattice will be determined by the terms it needs to import. IDO-Core is agnostic to biological scale, host organism, and disciplinary perspective, but it will be desirable for some of the application ontologies in the lattice to hold some of these fixed (e.g., genetic aspects of influenza in birds), thus serving as granular partitions of the domain ontology they are extending. The lattice serves as a representation of some of the interdependencies in the existing IDO set of ontologies and the intended overall domain coverage.

2.1 OGMS/IDO Disease Model
The IDO ontologies represent disease according to the disorder, disease, and disease course framework provided by the Ontology for General Medical Science (OGMS), in which a disorder is the physical basis of a disease, which is itself a disposition to pathological processes realized in a disease course. For example, in IDO-Sa we assert the following in OWL-DL:

• Sa subClassOf obi:organism AND ido:‘infectious agent’
• SaI =def ido:‘infectious disorder’ AND has_part SOME Sa
• SaID =def ido:‘infectious disease’ AND has_material_basis_in SOME SaI
• SaID realized_by ONLY SaIDC

where ‘Staphylococcus aureus’ = Sa, ‘Sa Infectious Disorder’ = SaI, ‘Sa Infectious Disease’ = SaID, and ‘Sa Infectious Disease Course’ = SaIDC.

The primary classification of Sa is as an organism, but Sa bacteria are also infectious agents because they have a disposition to cause infectious disease in some hosts. Note that we define Sa infectious disorder as an infectious disorder that has Sa as part, but we do not assert “Sa part_of SOME SaI” because Sa can be among a host’s normal flora, for example on the skin or nasal mucosa.

We use the shortcut relation has_material_basis_in here to establish a link between the disease (disposition) and the disorder (material entity) (Goldfain, Smith and Cowell, under review). An infectious disorder is both an infection (a material entity composed of infectious agents) and a disorder (it has reached the threshold of clinical significance to dispose a host to infectious disease).

2.2 Classifying Staphylococcus aureus diseases
Infectious diseases can usefully be classified in terms of a number of differentia, including: host type, (sub-)species of infectious agent, route of transmission, antibiotic resistance, and anatomical site of infection.

For many species of infectious agent, including Sa, a further classification into strain categories is useful. Many different typing systems are used, including: Pulsed-Field Gel Electrophoresis (into strains), Multi-Locus Sequence Typing (into sequence types), BURST Clustering (into clonal complexes), and gram staining (into gram positive and gram negative classes). Each of these typing systems is tied to a particular type of assay that can be described using the Ontology for Biomedical Investigations (OBI).

For our present purpose, we are interested in a typing system specifically created to differentiate Sa isolates, the Staphylococcal cassette chromosome mec (SCCmec) typing system. SCCmec is further differentiated by its subparts: (a) cassette chromosome recombinases (ccr) and (b) mec gene complex (mec). The SCCmec is a mobile genetic element that carries the central determinant for broad-spectrum beta-lactam antibiotic resistance encoded by the mecA gene (Katayama, Ito and Hiramatsu, 2000). The genetic characteristics of SCCmec are of critical importance to the type of treatment and Sa disease course an infected host may undergo. The International Working Group on the Staphylococcal Cassette Chromosome elements [1] maintains a list with definitions of the latest known SCCmec types. At the time of this writing, there are 11 known SCCmec types. We include this information in IDO-Sa by leveraging the Sequence Ontology (SO) to assert the following:

• SCCMec subClassOf so:gene_cassette
• SCCMec subClassOf so:mobile_genetic_element
• ‘mec gene complex’ subClassOf so:gene_cassette_member
• ‘ccr gene complex’ subClassOf so:gene_cassette_member
• SCCMec has_part SOME ‘mec gene complex’
• SCCMec has_part SOME ‘ccr gene complex’

The classification of SCCmec as a gene cassette is to be preferred over its classification as a mobile genetic element because the former tells us what SCCmec is, while the latter tells us what SCCmec can do. However, we include both here, because most descriptions of SCCmec highlight its mobility. Description of an SCCmec subtype then proceeds as follows:

• SCCMecIV subClassOf SCCMec
• ‘mec Class B’ subClassOf ‘mec gene complex’
• ‘ccr Type 2’ subClassOf ‘ccr gene complex’
• SCCMecIV has_part SOME ‘mec Class B’
• SCCMecIV has_part SOME ‘ccr Type 2’

More fine-grained sequence information about the ccr and mec complexes can be captured using SO terms and relations.

3 CASE STUDY
We will now show how a lattice of Sa isolates can be constructed using IDO-Sa and isolate metadata indicating properties such as the mec and ccr gene complex types. The isolate lattice is then used as the basis for our desired lattice of infectious disease application ontologies. Ontologically speaking, isolates are particulars that instantiate the organism type Sa and have been extracted from a host organism. Here we do not represent the distinction between Sa as an ‘isolate’ and Sa as part of a ‘cell culture’; however, we believe these terms are general enough to infectious disease research to warrant inclusion in IDO-Core.

The ontology generated for this case study is stored across several OWL files. The full ontology, including external imports and automatically generated isolate information, is currently available in OWL-DL format at http://www.awqbi.com/LATTICE/narsa-complete.owl. The ontology was developed using Protégé 4.1 and was checked for inconsistency using the HermiT 1.3.5 and FaCT++ reasoners.

3.1 Resources
Wherever possible, we import and reuse terms (and URIs) from OBO Foundry ontologies via the MIREOT technique (Courtot et al., 2011) and use relations from the OBO Relation Ontology (RO) or proposed extensions thereto. The OBO Foundry ontologies we require for our case study are: Ontology for General Medical Science (OGMS [2]), Ontology for Biomedical Investigations (OBI [3]), Sequence Ontology (SO), Infectious Disease Ontology (IDO [4]), Information Artifact Ontology (IAO [5]), NCBI Taxonomy (NCBITaxon [6]), and Foundational Model of Anatomy (FMA [7]).

We also import drug file names from the National Drug File Reference Terminology (NDF-RT) to represent antibiotic resistance, and create links to two other resources: (1) the Antibiotic Resistance Ontology [8] and (2) the Antibiotic Resistance Database Ontology [9]. Various other stakeholders (such as the DebugIT European Union initiative) have ontologies and databases of antimicrobial resistance, but we link only to open resources for our case study.

3.2 NARSA Isolate Repository
The Network on Antimicrobial Resistance in Staphylococcus aureus [10] maintains a repository of Sa isolates for clinical research which includes genetic, phenotypic, and demographic information on each isolate. For this example, we use a subset of 101 NARSA isolates, those listed in the “Known Clinically Associated Strains – ABCs Collection from CDC” repository. All of the isolates in this subset have an SCCmec type annotation in the NARSA repository and have diverse geographic origin in the United States [11].

The NARSA subset was selected to demonstrate how a disease lattice could be constructed starting from only structured HTML content about isolates. NARSA maintains a database of extended information about such isolates; however, we used only the information publicly available on the web.

A script was created to extract each isolate’s NARSA id (NRSnnn), culture source, toxin profile, and antimicrobial profile. The script was implemented in Ruby and utilized the Hpricot HTML library and regular expressions to extract information. First, the NARSA id was used to assert the existence of an instance of the type Sa. Then the culture source data was extracted. The culture source was sometimes unspecified (‘other’) or underspecified (‘blood’ vs ‘wound’). Only culture sources for which FMA types existed were asserted to exist as such, but IDO allows for an even more complete representation of host anatomical entities if such information is known. For example, the anatomical location from which the infectious organism is isolated may also be a portal of entry.

The toxin profile for NARSA subset isolates included the presence or absence of the Panton-Valentine Leukocidin (PVL) and Toxic Shock Syndrome Toxin (TSST). These toxins are strong determinants of the virulence and clinical manifestation of Sa disease. We classify PVL and TSST as ido:exotoxin. The presence or absence of a toxin is not usually associated with drug resistance, but by representing both pieces of information we are able to query the application ontology for correlations between the presence of toxins and resistance to certain drug types.

The antimicrobial profile for the NARSA subset includes 15 drugs (see Figure 2 for a subset of these). For each drug, NARSA reports a minimum inhibitory concentration (a range or exact value) along with an interpretation of the antibiotic resistance indicated by this value following the Clinical and Laboratory Standards Institute guidelines.

Fig 2. Antimicrobial profile for an isolate in the NARSA subset

The NDF-RT was used to validate this profile by making sure that the set of drugs in the profile is a subset of:

{d | ndf-rt:‘Staph Infection’ ndf-rt:may_be_treated_by d}

For NARSA, or any other resource on antimicrobial resistance, there may be a good reason to restrict attention to a subset of antimicrobials. However, since new resistance evolves rapidly, a resource such as NDF-RT can be used to synchronize the latest antibiotics permissible in such a profile.

Minimum inhibitory concentration (MIC) data are represented using IAO and OBI as follows:

• ‘MIC assay’ subClassOf iao:assay
• ‘MIC assay’ has_specified_output SOME ‘MIC data item’
• ‘MIC scalar measurement datum’ is_about SOME ‘drug susceptibility of infectious agent’

Resistance is a disposition that an infectious agent bears towards some drugs and is realized in their presence. We have elsewhere modeled resistance in terms of pairwise complementary dispositions on the part of both the infectious agent and the drug (Goldfain, Smith and Cowell, 2011). Here we link resistance to MIC measurement data using the shortcut relation has_qualitative_basis as follows:

• ido:‘resistance to drug’ has_qualitative_basis SOME (is_quality_measured_as SOME ‘MIC measurement datum’)

Finally, for each drug D towards which the isolate Sa has a drug resistance we assert:

• ‘resistance to D’ subClassOf ido:‘resistance to drug’
• Sa has_disposition SOME ‘resistance to D’

3.3 From an Isolate Lattice to a Disease Lattice
The lattice of infectious diseases mirrors the isolate lattice by representing the types of infectious disease different isolates can give rise to. Infectious agents are parts of those infectious disorders which are the material basis for infectious disease. Using the representation developed above, we can begin to make assertions about the specific types of disease the isolates give rise to and the profiles of the disease courses which realize these diseases. For example, the presence of the PVL toxin in Sa can lead to necrotic lesions (ogms:disorder) and necrotizing pneumonia (ogms:disease).

4 FACETED BROWSING OF THE LATTICE
A faceted browser of the ontologically annotated NARSA isolates was constructed using the MIT Exhibit 2.0 library (http://www.awqbi.com/LATTICE/narsa-complete.html). This tool allows the user to visualize and correlate isolate information across different dimensions (see Figure 3).

Fig 3. Faceted browsing illustrates that most isolates with a resistance to Clindamycin are of SCCmec type II and lack PVL

Linking to external resources is facilitated by the fact that such facets are assigned ontology types from the IDO lattice. These are exactly the kinds of links that will be needed for the knowledge network supporting a new taxonomy of disease.

5 CONCLUSION
A lattice of infectious disease ontologies can serve as a mechanism to integrate pathogen-specific typing systems such as SCCmec with phenotypic data such as drug resistance. Such genotype-phenotype relations will be the key to a more effective taxonomy of disease that enables truly personalized medicine. The lattice of infectious diseases is expected to grow along predictable dimensions (host organism, infectious agent organism, drug resistance), but can accommodate lightweight application ontologies that are created for very specific purposes. Each such application ontology will have a place in the lattice on the basis of what IDO terms it imports.

We have shown that IDO-conformant annotation of isolate data (such as that in the NARSA repository) is possible without the need to reassemble OBO Foundry resources for new applications. Other benefits of our approach include exposing currently accepted SCCmec types in a computable format via an ontology and validating the NARSA antimicrobial profile using the NDF-RT.

We hope to reuse a technique similar to that outlined in this paper for isolate repositories across the infectious disease domain. In so doing, we hope to broaden the lattice and integrate organism-specific typing systems with the IDO suite of ontologies. We believe that such an effort can be a powerful enabler for a new taxonomy of infectious disease and its supporting knowledge network.

ACKNOWLEDGEMENTS
This work was funded by the National Institutes of Health through Grant R01 AI 77706-01. Smith’s contributions were funded through the NIH Roadmap for Medical Research, Grant U54 HG004028 (National Center for Biomedical Ontology).

REFERENCES
Committee on the Framework for Developing a New Taxonomy of Disease (2011). Toward Precision Medicine: Building a Knowledge Network for Biomedical Research and a New Taxonomy of Disease. The National Academies’ Findings Report.
Courtot, M., Gibson, F., Lister, A. L., Malone, J., Schober, D., Brinkman, R. R., and Ruttenberg, A. (2011). MIREOT: The minimum information to reference an external ontology term. Applied Ontology, 6(1), 23-33.
Goldfain, A., Smith, B., and Cowell, L. G. (under review). BFO Dispositions and their Bases: Two Shortcut Relations.
Goldfain, A., Smith, B., and Cowell, L. G. (2011). Towards an Ontological Representation of Resistance: The Case of MRSA. Journal of Biomedical Informatics, 44(1), 35-41.
Katayama, Y., Ito, T., and Hiramatsu, K. (2000). A New Class of Genetic Element, Staphylococcus Cassette Chromosome mec, Encodes Methicillin Resistance in Staphylococcus aureus. Antimicrobial Agents and Chemotherapy, 44(6), 1549-1555.
Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L. J., Eilbeck, K., Ireland, A., Mungall, C. J., The OBI Consortium, Leontis, N., Rocca-Serra, P., Ruttenberg, A., Sansone, S.-A., Scheuermann, R. H., Shah, N., Whetzel, P. L., and Lewis, S. (2007). The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol, 25(11), 1251-1255.

[1] http://www.sccmec.org/Pages/SCC_ClassificationEN.html
[2] http://code.google.com/p/ogms/
[3] http://obi-ontology.org/page/Main_Page
[4] http://infectiousdiseaseontology.org/page/Main_Page
[5] http://code.google.com/p/information-artifact-ontology/
[6] http://www.ncbi.nlm.nih.gov/Taxonomy/
[7] http://sig.biostr.washington.edu/projects/fm/
[8] http://arpcard.mcmaster.ca
[9] http://ardb.cbcb.umd.edu/antibio_resis.obo
[10] See http://www.narsa.net/
[11] See http://www.cdc.gov/abcs/reports-findings/surv-reports.html
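
As a supplementary sketch of the statement that an application ontology's place in the lattice is determined by the IDO terms it imports, the following Python fragment orders ontologies by inclusion of their import sets and adds a bottom element playing the role of IDO-ALL. Reading "position in the lattice" as set inclusion is our interpretation, not something the paper states formally; the function names, ontology names other than IDO-Core, IDO-Sa and IDO-ALL, and the term identifiers are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

# Each ontology is modeled (as an assumption) by the set of terms it contains or imports.
Imports = Dict[str, FrozenSet[str]]


def with_bottom(imports: Imports, name: str = "IDO-ALL") -> Imports:
    """Add a bottom element whose term set is the union of all other term sets."""
    bottom = frozenset().union(*imports.values())
    return {**imports, name: bottom}


def covers(imports: Imports) -> List[Tuple[str, str]]:
    """Return Hasse-diagram edges (upper, lower).

    `lower` strictly extends the term set of `upper`, and no third ontology
    has a term set strictly between the two.
    """
    edges = []
    for lower, low_terms in imports.items():
        for upper, up_terms in imports.items():
            if not up_terms < low_terms:
                continue
            if any(up_terms < mid < low_terms for mid in imports.values()):
                continue
            edges.append((upper, lower))
    return sorted(edges)


# Illustrative data only; these term identifiers are invented for the example.
CORE = frozenset({"ido:host", "ido:infectious agent"})
EXAMPLE_IMPORTS: Imports = {
    "IDO-Core": CORE,
    "IDO-Sa": CORE | {"ido-sa:SCCmec"},
    "IDO-Flu": CORE | {"ido-flu:example term"},
    "Sa-resistance-app": CORE | {"ido-sa:SCCmec", "ido:resistance to drug"},
}

if __name__ == "__main__":
    for upper, lower in covers(with_bottom(EXAMPLE_IMPORTS)):
        print(f"{upper} -> {lower}")
```

In this toy example the hypothetical application ontology sits directly beneath IDO-Sa because it imports everything IDO-Sa does plus one further term, and every chain ends at the IDO-ALL element.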
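
The NDF-RT validation mentioned above (Section 3.2) amounts to a subset check: every drug in an isolate's antimicrobial profile should appear in the set of drugs that NDF-RT relates to ‘Staph Infection’ via may_be_treated_by. A minimal Python sketch follows, assuming both sets have already been extracted as plain drug-name strings; the function name and the sample drug lists are hypothetical and are not taken from NARSA or NDF-RT.

```python
from typing import Iterable, Set


def normalize(name: str) -> str:
    """Normalize a drug name so that case and stray whitespace do not matter."""
    return name.strip().lower()


def unvalidated_drugs(profile: Iterable[str], treatable: Iterable[str]) -> Set[str]:
    """Return the profile drugs that are missing from the reference set.

    An empty result means the profile is a subset of the reference set.
    """
    reference = {normalize(d) for d in treatable}
    return {normalize(d) for d in profile} - reference


# Illustrative data only: stands in for {d | 'Staph Infection' may_be_treated_by d}.
TREATS_STAPH = ["Clindamycin", "Vancomycin", "Oxacillin", "Linezolid"]

if __name__ == "__main__":
    profile = ["clindamycin", "Oxacillin ", "Example-drug"]
    missing = unvalidated_drugs(profile, TREATS_STAPH)
    print("profile valid" if not missing else f"not in reference set: {sorted(missing)}")
```

Returning the offending names rather than a bare boolean makes it easier to see which profile entries would need review when the reference terminology is updated.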