Proceedings of the 9th International Conference on Biological Ontology (ICBO 2018), Corvallis, Oregon, USA
ICBO 2018, August 7-10, 2018

Transforming and Unifying Research with Biomedical Ontologies
The Penn TURBO project

Christian J. Stoeckert Jr.
Dept. of Genetics, Institute for Biomedical Informatics
Perelman School of Medicine, University of Pennsylvania
Philadelphia, PA, USA
stoeckrt@pennmedicine.upenn.edu

David Birtwell, Heather Williams
Penn Medicine BioBank, Institute for Translational Medicine and Therapeutics
Perelman School of Medicine, University of Pennsylvania
Philadelphia, PA, USA

Hayden Freedman, Mark A. Miller
Institute for Biomedical Informatics
Perelman School of Medicine, University of Pennsylvania
Philadelphia, PA, USA

TURBO is supported by the Institute for Biomedical Informatics and the Institute for Translational Medicine and Therapeutics at the University of Pennsylvania Perelman School of Medicine.

Abstract: The Penn TURBO (Transforming and Unifying Research with Biomedical Ontologies) project aims to accelerate finding and connecting key information from clinical records for research through semantic associations to the processes that generated the clinical data. Major challenges to using clinical data for research are integrating data from different sources, which may contain multiple references to the same entity (e.g., person, health care encounter), and handling incomplete or conflicting information (e.g., gender, BMI). There is also the need to track the provenance of information used when making decisions on what is the actual phenotype of a person. We take a realism-based ontology approach to address these problems through transformation and instantiation of clinical data with an OBO Foundry-based application ontology in a semantic graph database. We have developed an application stack and used it on a whole exome sequencing cohort of 11,237 patients, capturing key demographics, diagnosis codes, and prescribed medications. The anticipated payoff is to be able to make use of inferencing provided by the semantics to classify and search for instances of people and specimens with desired characteristics.

Keywords: realism-based ontology; OBO Foundry; referent tracking; clinical data; diagnosis codes; prescriptions

I. INTRODUCTION

The goal of the TURBO project is to transform and unify research data with biomedical ontologies. Typically, data are obtained in tabular form, often from relational databases. The column headers and row values are often idiosyncratic and, even when based on a standard, may be malformed, incomplete, and contradictory. Dependencies and deep relations between the headers (data variables) and values are rarely explicit. Transforming the data into a semantic graph instantiating a realism-based ontology allows us to state what is known about people and what has happened to them, what information is available about them, and what conclusions can be drawn based on that information. Clinical data often come from multiple sources (e.g., EPIC, REDCap). Instantiation of data from different sources in the same realism-based ontology [1] allows us to unify the data. Part of the unification comes through referent tracking [2], associating information for the same person, quality, or event with a unique identifier for that referent regardless of where and when the information was obtained.

The Open Biomedical Ontologies Foundry [3] provides, through its library of ontologies, the ability to create a biomedical ontology that is realism-based. We created the TURBO ontology as an application ontology based on these ontologies, drawing from the Ontology for Biomedical Investigations (OBI) [4] and the Ontology for Biobanking (OBIB) [5] in particular. By application ontology, we mean that we are primarily reusing terms (classes, instances, and relations) from existing ontologies and creating terms only as needed to move the project forward. Terms that potentially have broader usage are submitted to existing ontologies.

An application stack called Drivetrain was developed to perform part of the transformation, the unification, referent tracking, and generating conclusions as RDF statements about people and their qualities. Currently the Karma tool [6] is used to transform tabular data into initial RDF triples for Drivetrain to use. Ontology modeling is also used to capture provenance of data and conclusions drawn based on the data. After running the Drivetrain stack, the reasoning capabilities of the semantic graph database can be used to classify and aid search for instances of people and specimens with desired characteristics. For example, people can be identified who have been prescribed a particular class of drugs ('statins'). We intend to create phenotypic profiles in the form of equivalence axioms that will be used to infer which people or specimens match those profiles.

II. METHODS

A. Technologies used in TURBO

Ontotext GraphDB (version 8.4.1) [7] is the semantic graph database used. Scala (version 2.11) [8] is used for programmatic interaction with the database, leveraging the RDF4J (version 2.2.2) library [9]. UUIDs are generated using the randomUUID() method of the java.util.UUID class [10]. LIBSVM was used through the svm() function from R e1071 [11].

The TURBO ontology was generated following the approach described in [12]. Terms were selected from OBIB using Ontodog [13] and additional terms were imported using the OntoFox tool [14]. New terms were added using Protégé [15].

B. TURBO content

Data on a whole exome sequencing (WES) cohort of 11,237 participants ('biobank consenters') have been used to populate a GraphDB database. The data include information on gender identity, date of birth, and body mass index (BMI, calculated from height and weight) collected during 14,450 biobank encounters and 98,585 health care encounters. In addition, 181,420 diagnosis codes and 136,249 medications were obtained during health care encounters. The data were obtained from relational tables provided by the Penn Medicine Biobank from two sources, a data warehouse and REDCap.

In addition to RDF triples generated from the data, individual ontologies and terminologies were also loaded into the GraphDB database. The ontologies included the TURBO application ontology, RDF representations of ICD9 and ICD10 codes obtained from the NCBO Bioportal [16], all portions of the Drug Ontology [17] except NDC annotation, the "lite" component of ChEBI [18], and the Monarch Disease Ontology (MonDO) [19].

C. Generation of RDF triples to load into the TURBO GraphDB database

The Karma application (version 2.1) was used to generate RDF triples from the tabular data for loading into the GraphDB database. Karma models were based on the TURBO ontology.

D. TURBO code and documentation

The code base for the Drivetrain component is available at GitHub, including documentation of the full TURBO stack and a description of the ontology modeling.

https://pennturbo.github.io/Turbo-Documentation/

Figure 1. A graph depicting instantiated parts of the TURBO ontology including 'biobank consenter'. Nodes are classes whose size reflects usage in the instantiation of the WES cohort data. Edges are object properties (including the green 'subclass of' but with the exception of the pink edge) whose width also indicates usage. The one exception is a pink annotation property indicating that a 'retired placeholder for biobank consenter' was 'replaced with' 'biobank consenter' as a result of the referent tracking process.
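As an illustration of the UUID-based identifiers mentioned in Section II.A, the following minimal sketch mints an IRI that ends in a random UUID. It is written in Python rather than Drivetrain's Scala, so uuid.uuid4() stands in for java.util.UUID.randomUUID() (both produce random, version 4 UUIDs). The base IRI and the function name mint_iri are hypothetical and are not taken from the TURBO code base.

```python
import uuid

# Hypothetical namespace; the real TURBO IRI base is not given in the text.
BASE_IRI = "http://example.org/turbo/"


def mint_iri(base: str = BASE_IRI) -> str:
    """Return a new IRI that ends in a random (version 4) UUID."""
    return base + uuid.uuid4().hex


if __name__ == "__main__":
    # Each call yields a different IRI, e.g. one per instantiated entity.
    print(mint_iri())
    print(mint_iri())
```

Because the identifier carries no information from the source record, it can stay stable when records from different sources are later found to describe the same referent.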
III. RESULTS

A technology stack has been developed for the TURBO project that implements a pipeline to transform tabular data into semantic triples, stored in a Resource Description Framework (RDF) triple store, using terms from the TURBO Ontology (https://raw.githubusercontent.com/PennTURBO/Turbo-Ontology/master/ontologies/turbo_merged.owl). The TURBO ontology at time of writing consists of 727 terms (415 classes, 41 individuals, 271 properties). These are primarily drawn from 25 ontologies, with 161 new terms created for TURBO (69 classes, 19 individuals, 73 properties). URIs and all labels of terms instantiated in the current TURBO semantic repository are listed at the bottom of https://pennturbo.github.io/Turbo-Documentation/turbo-ontology.html (along with a discussion and an example of an instantiated triple higher on the page).

Terms in the TURBO ontology are focused on patients and their qualities along with information collected on them, health care encounters (http://purl.obolibrary.org/obo/OGMS_0000097) and their outputs (diagnoses, measurements), and biobank encounters and their outputs. The new terms mainly cover shortcut relations utilized in the Karma mapping and terms for managing UUIDs during referent tracking. At the Penn Medicine Biobank, data are collected when participants are consented, at which time they have not yet donated a specimen but have been assigned an ID. To capture this case, a 'biobank consenter' term has been generated, defined as a participant in a biobank consenting process (Figure 1). Incorporating the essence of this term is in progress with ICO [20] and OBIB developers.

The Karma tool was used to map relational data to ontology terms, with the mappings saved in an extended version of the R2RML language. The mappings were then used to publish the data as RDF triples. The initial RDF triples make use of shortcut relation properties to simplify the manual mapping. The essence of TURBO shortcut relations is to allow a minimal number of classes to be instantiated, frequently just one. For example, an input table nominally about health care encounters may include height, weight and body mass index (BMI) values. Those data items are not values of the encounters, but rather values of properties borne by the people who participated in the encounters. The shortcut relation "shortcut health care encounter to BMI" eliminates the need to instantiate a class that represents the encounter participants and instead says that there is some path from the encounter to the BMI value. The Drivetrain application (described next) contains all of the logic necessary to expand the shortcut into a semantically complete description of reality.

The Drivetrain application was built to load and process the RDF triples with the following steps:

A. Shortcut RDF Triples and TURBO ontology loaded to an Ontotext GraphDB repository

During the data import step, the input data are written to an isolated section of the graph. The triples are not expected to have globally unique identifiers and so must be sectioned off from all other data in the triple store.

B. EXPAND Queries create fully ontologized model from shortcut triples

The shortcut expansion phase takes all triples in the input data that use shortcut relations and expands them to fully ontologized forms. A single shortcut triple will likely expand to multiple ontologized triples. In addition to expanding the triples, the Internationalized Resource Identifiers (IRIs) in the imported data are made unique using Universally Unique Identifiers (UUIDs). After this phase is complete, the data in the isolated import graph have globally unique identifiers and are fully ontologized, though they may not yet be ready to be incorporated into the rest of the triple store.

Data integrity rules are applied to all triples in the isolated import graph to assure that the data meet the minimum level of integrity required by the Drivetrain application. Several conditions must be met to pass: all classes and properties present in the incoming data must also be present in the TURBO ontology, all denoted registries must be represented in the ontology, and all dates must be parseable, reasonable, and typed as dates. If all integrity checks have passed, then the data are ready to be connected to the rest of the graph.

C. Scala-based REFERENT TRACKER combines duplicate entities

During the Referent Tracking phase, all instantiated IRI-bearing terms that singularly and uniquely refer to a single thing in reality are replaced with a single Instance Unique Identifier (IUI), which is implemented by Drivetrain as an IRI that specifically contains a Universally Unique Identifier (UUID) value. After this phase is complete, the RDF data are normalized such that all entities in reality can be identified by a single unique identifier that is independent of, yet connected to, the source relational data (Figure 2).

Figure 2. Prototypical referent tracking. Blue nodes are literals. Edges are annotation properties providing provenance for referent tracking.

Since our data come from many sources, it is possible that the same 'biobank consenter' may appear in multiple data sources, each of which may contain different or contradicting information. It is the goal of the Referent Tracker to apply custom rules in order to determine when two consenters must be combined into one. Likewise, the same encounter may also appear in multiple data sources. A simple rule is to combine entities when the identifier and identifier source (central registry ID symbol and registry) associated with them are the same.

D. Scala-based ENTITY LINKER links Health care and Biobank Encounters to Biobank Consenters

Entity Linking is a generic term used here to mean the process of attaching consenters to their encounters based on data provided by a relational join table. This process is necessary because consenters and their encounters may be received in separate files. Drivetrain can make matches by comparing the literal values of encounter symbols and consenter symbols, and the values of the respective registries.

E. Scala-based CONCLUSIONATOR creates inferences about Dates of Birth, Biological Sex, and BMI

During the conclusionating phase, rules are applied to the data to generate statements about a person or event. Currently this is done to resolve potentially conflicting data to single conclusions, which can be used for querying purposes. The potentially conflicting data derived from the sources remain in the graph and can be queried. In the future, conclusionating will be used to combine data of different types (e.g., diagnosis code, medication, lab test result) to make a single statement (e.g., a person is diabetic). To facilitate easy querying, the conclusions, which are RDF triples, are placed in a separate named graph. After this phase is complete, there will be a named graph of conclusions, which contains simplified non-conflicting statements. Conclusionating is applied to generate statements about the consenter's biological sex, date of birth, and BMI at the date of each biobank encounter. The rules used for drawing conclusions are currently very simple, but the system is envisioned to handle more complex rules and be able to draw on a library of different rules in the future.

One way to calculate BMI is by performing a computation over a person's height and weight, which can be measured during a health care encounter or recorded on a case report form during study recruitment at a biobank encounter (when a person becomes a 'biobank consenter'). It is useful to know the BMI of biobank consenters at their date of recruitment. It is not guaranteed that the source data required to calculate BMI at the date of the biobank encounter will be both available and of sufficient quality. It may be that height and weight measurements were recorded at the health care encounter, the biobank encounter, neither, or both. Further, the data may have been recorded improperly, which would result in a calculated BMI that is outside the acceptable range.

The following rules are currently applied to account for these situations. For each date of recruitment for each person:

• If there are in-range height and weight measurements recorded in the health care encounter on the date of recruitment, compute the BMI and conclude that it is the person's BMI at the given date of recruitment.

• If the BMI cannot be computed from the health care encounter, but there are valid height and weight measurement records on the case report form filled out as part of the study recruitment process, compute the BMI from the case report form data and conclude that it is the person's BMI at the given date of recruitment.

• If neither the health care encounter nor the study recruitment encounter yields a BMI conclusion, then record that BMI for this given date of recruitment is inconclusive.

F. Diagnosis Data is mapped by cross-referencing ICD9/ICD10 hierarchies and MonDO ontologies

Diagnosis codes come to TURBO in the form of ICD9 and ICD10 codes [21]. In order to enable searches broader than a single code value, we load RDF versions of ICD9 and ICD10 downloaded from the NCBO Bioportal, which provide subClassOf relations. We also load MonDO (an aggregation of disease ontologies including the Human Disease Ontology [22]), which includes database cross references for ICD codes. We use these cross references to create 'mentions' links between diagnosis codes and diseases, thereby enabling disease-based searches.

G. Medication Order Name Data are mapped to ontologies using Solr indexed text search and a Support Vector Machine (SVM)

Medication orders are provided primarily as free text, often including dosage and route of administration information. Associating these orders to terms in ChEBI (Chemical Entities of Biological Interest) would enable searches based on the parent classes of active ingredients and their roles. To accomplish this, the orders are computationally mapped to terms from the Drug Ontology (DRON), which provides cross-references to ChEBI. About 30% of the distinct medications prescribed to our WES cohort also came with RxNorm identifiers [23] that could be directly associated to DRON and ChEBI via direct cross references. The RxNorm associations were then used as a training set for machine learning (LIBSVM) using results from the string matching output from Apache Solr [24]. For the WES cohort, we were able to map 86.1% of distinct medications (sensitivity = 0.98; specificity = 0.95), covering 88% of the total medications prescribed (excluding non-drug prescriptions).

H. Performance

The complete Drivetrain stack was run on a Linux application server with 8 GB RAM and 2 processors and a GraphDB database server with 64 GB RAM and 4 processors.

The run from loading of the graph through medication mapping (steps described in sections A through G above) took 82 minutes for the WES cohort data and supportive ontologies. It resulted in 25,521,235 triples. About 3.6 million triples were initially loaded and then expanded to about 12 million triples. Additional triples resulted from referent tracking, conclusionating, and adding diagnosis and medication terms and associations.

Searches for diagnosis classes take approximately a second. For example, a search for all participants in a health care encounter which resulted in a diagnosis that mentions 'myocardial infarction' will return those assigned an ICD10 code of I21.3 (acute myocardial infarction).

Searches for medications also take on the order of seconds. A search for all participants prescribed a 'statin' returned all appropriate statins and no inappropriate ones, based on drug name matches and their active ingredients, with one important exception. Crestor contains rosuvastatin but is not identified as a statin. That is because rosuvastatin, while present in both DRON and ChEBI, has a different IRI in each. We are able to address this issue locally by using equivalence statements between the two (we are also following up with DRON to resolve this issue).

IV. DISCUSSION

The TURBO project is currently in active development as a demonstration project for the Penn Institute for Biomedical Informatics. We have a stable application stack, Drivetrain, that, combined with the Karma tool, enabled us to transform, load, referent track, and make conclusions related to a real dataset of interest, a WES cohort of 11,237 participants. Unlike traditional data warehousing, the TURBO system performs integration through rules applied during referent tracking and conclusionating. The processes used to determine when entities are the same (people, encounters) in referent tracking or to make statements about a person (e.g., BMI) in conclusionating are modeled in the ontology and stored in the graph for provenance. Thus, Drivetrain provides an ontology-supported knowledge layer along with the loaded data.

User stories, common requests by researchers searching clinical data, are driving TURBO development. Competency questions based on these user stories are then used to evaluate the system. Examples include identification of people of specified age, biological sex, and BMI. These are possible, as is finding those who have been prescribed a particular class of drugs and assigned a diagnosis code linked to a particular class of disease. We are currently working on adding genotype data resulting from exome sequencing. Future additions will include laboratory tests.

Scalability of the system remains to be determined. We plan to expand both the number of participants and the types of data instantiated in the semantic graph database. At 25 million triples, our current graph database has room to grow. We run Drivetrain with reasoning off but can then load into a graph database with RDFS+ or OWL-Horst reasoning turned on. For the current datasets this takes less than an hour. We are also exploring loading shortcut triples generated by alternative methods to Karma that are less manual.

Our efforts at medication mapping have used standard tools with good success, but we would like to improve coverage as much as possible. Some prescriptions are not medications at all (e.g., wheelchairs, saline solutions) and we can generate lists to recognize these. We will explore use of other terminologies (e.g., NDFRT [25]) that may provide routes through active ingredients and equivalence matches to entries in ChEBI. Once we have a ChEBI IRI linked to a prescription, it can then be searched based on the structure or role of the active ingredient.

The TURBO project represents a new direction in applying ontologies to clinical data. Most efforts do not explicitly involve realism-based ontologies, or if they do use them, it is in the form of associations and not instantiations. However, there are related projects instantiating OBO and realism-based ontologies. These include ones by William Duncan (Roswell Park) [26], by Amanda Hicks and William Hogan (U. Florida) [27], and by Bjoern Peters (La Jolla Institute for Immunology) [28], although they don't do referent tracking or conclusionating as in TURBO. This growing number of independent efforts raises the exciting potential of linking such systems together.

Ultimately, we intend for the TURBO project to provide a Phenotype Storefront that users can query to find participants and specimens of interest. The current plan is to just return the number of hits as results and require IRB approval for accessing identifiable data. We also want to learn from searches made by investigators in order to generate defined classes of participants and specimens. For example, equivalence axioms for someone who has had a particular disease course could include an appropriate diagnosis code but also a relevant prescription and laboratory test result. Inferencing applications of this nature will bring to bear the power of ontologies to provide what can't be done by traditional relational systems.

ACKNOWLEDGMENT

All the authors have been approved under IRB protocol 813913 from the University of Pennsylvania to work with the described patient data. We thank Werner Ceusters and William Hogan for their advice and feedback on implementation of referent tracking. We also thank Jason Moore, Scott Damrauer, Michael Feldman, Peter Gabriel, John Holmes, and Daniel Rader for their support and guidance as the TURBO governance board.

REFERENCES

[1] B. Smith and W. Ceusters, "Ontological realism: A methodology for coordinated evolution of scientific ontologies," Appl Ontol. 2010 Nov 15;5(3-4):139-188.
[2] W. Ceusters and B. Smith, "Strategies for referent tracking in electronic health records," J Biomed Inform. 2006 Jun;39(3):362-78.
[3] B. Smith, et al., "The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration," Nat Biotechnol, 2007. 25(11): p. 1251-5.
[4] A. Bandrowski, et al., "The Ontology for Biomedical Investigations," PLoS One, 2016. 11(4): p. e0154556.
[5] M. Brochhausen, et al., "OBIB-a novel ontology for biobanking," J Biomed Semantics, 2016. 7: p. 23.
[6] C. A. Knoblock, et al., "Semi-Automatically Mapping Structured Sources into the Semantic Web," ESWC'2012.
[7] Ontotext GraphDB. https://ontotext.com/products/graphdb/
[8] The Scala Programming Language. https://www.scala-lang.org/
[9] Eclipse RDF4J. http://rdf4j.org/
[10] Class UUID. https://docs.oracle.com/javase/8/docs/api/java/util/UUID.html
[11] R e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. https://cran.r-project.org/web/packages/e1071/index.html
[12] J. Zheng, E. Manduchi, and C. J. Stoeckert, "Development of an application ontology for beta cell genomics based on the ontology for biomedical investigations," CEUR Workshop Proceedings, 1060, 62-67, 2013.
[13] J. Zheng, Z. Xiang, C. J. Stoeckert Jr., and Y. He, "Ontodog: a web-based ontology community view generation tool," Bioinformatics, 2014 May 1;30(9):1340-2.
[14] Z. Xiang, M. Courtot, R. R. Brinkman, A. Ruttenberg, and Y. He, "OntoFox: web-based support for ontology reuse," BMC Res Notes. 2010 Jun 22;3:175.
[15] M. A. Musen, "The Protégé project: A look back and a look forward," AI Matters, Association of Computing Machinery Specific Interest Group in Artificial Intelligence, 1(4), June 2015.
[16] P. L. Whetzel, et al., "BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications," Nucleic Acids Res. 2011 Jul;39(Web Server issue):W541-5.
[17] W. R. Hogan, et al., "Therapeutic indications and other use-case-driven updates in the drug ontology: anti-malarials, anti-hypertensives, opioid analgesics, and a large term request," J Biomed Semantics. 2017 Mar 3;8(1):10.
[18] J. Hastings, et al., "ChEBI in 2016: Improved services and an expanding collection of metabolites," Nucleic Acids Res. 2016 Jan 4;44(D1):D1214-9.
[19] Monarch Disease Ontology. http://obofoundry.org/ontology/mondo.html
[20] Informed Consent Ontology (ICO). https://github.com/ICO-ontology/ICO
[21] World Health Organization International Classification of Diseases. http://www.who.int/classifications/icd/en/
[22] W. A. Kibbe, et al., "Disease Ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data," Nucleic Acids Res. 2015 Jan;43(Database issue):D1071-8.
[23] RxNorm. https://www.nlm.nih.gov/research/umls/rxnorm/
[24] Apache Solr. http://lucene.apache.org/solr/
[25] NDFRT (National Drug File - Reference Terminology) - Synopsis. https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/NDFRT/
[26] T. Thyvalikakath, et al., "National Dental PBRN. Restorative/Endodontic Procedures Performed in National Dental PBRN Practices," J Dent Res 97 (Spec Iss): 2859794, 2018.
[27] PCORowl. https://zenodo.org/record/1241209#.WvoBFsgh2L4
[28] R. Vita, J. A. Overton, J. A. Greenbaum, A. Sette, OBI consortium, and B. Peters, "Query enhancement through the practical application of ontology: the IEDB and OBI," Journal of Biomedical Semantics 2013, 4(Suppl 1):S6.
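APPENDIX

The three BMI rules listed in Section III.E can be sketched as a small decision procedure. This is an illustrative Python sketch under stated assumptions, not Drivetrain code (which is written in Scala and operates on RDF triples): the function names, the units (metres and kilograms), and the plausibility bounds are all hypothetical, since the text does not state the ranges it uses.

```python
from typing import Optional, Tuple

# Hypothetical plausibility bounds; the paper does not state the ranges it uses.
HEIGHT_RANGE_M = (0.5, 2.5)
WEIGHT_RANGE_KG = (20.0, 400.0)

# (height in metres, weight in kilograms), or None when nothing was recorded.
Measurement = Optional[Tuple[float, float]]


def bmi_if_in_range(measurement: Measurement) -> Optional[float]:
    """Return weight / height**2 when both values are present and in range."""
    if measurement is None:
        return None
    height_m, weight_kg = measurement
    if not (HEIGHT_RANGE_M[0] <= height_m <= HEIGHT_RANGE_M[1]):
        return None
    if not (WEIGHT_RANGE_KG[0] <= weight_kg <= WEIGHT_RANGE_KG[1]):
        return None
    return weight_kg / height_m ** 2


def conclude_bmi(health_care: Measurement, case_report: Measurement):
    """Apply the three rules in order for one date of recruitment.

    Returns (bmi, source); bmi is None when the result is inconclusive.
    """
    # Rule 1: prefer the health care encounter on the date of recruitment.
    bmi = bmi_if_in_range(health_care)
    if bmi is not None:
        return bmi, "health care encounter"
    # Rule 2: fall back to the case report form from study recruitment.
    bmi = bmi_if_in_range(case_report)
    if bmi is not None:
        return bmi, "case report form"
    # Rule 3: neither source yields a usable value.
    return None, "inconclusive"
```

Returning the source label alongside the value mirrors the provenance requirement discussed in the text: a conclusion stays traceable to the data it was drawn from.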