Medical and Transmission Vector Vocabulary Alignment with Schema.org

William Smith*, Alan Chappell, and Courtney Corley
Pacific Northwest National Laboratory

* To whom correspondence should be addressed: william.smith@pnnl.gov
Copyright © 2015 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT

Available biomedical ontologies and knowledge bases currently lack formal and standards-based interconnections between disease, disease vector, and drug treatment vocabularies. The PNNL Medical Linked Dataset (PNNL-MLD) addresses this gap. This paper describes the PNNL-MLD, which provides a unified vocabulary and dataset of drug, disease, side effect, and vector transmission background information. Currently, the PNNL-MLD combines and curates data from the following research projects: DrugBank, DailyMed, Diseasome, DisGeNet, Wikipedia Infobox, Sider, and PharmGKB. The main outcomes of this effort are a dataset aligned to Schema.org, including a parsing framework, and extensible hooks ready for integration with selected medical ontologies. The PNNL-MLD enables researchers to query distinct datasets more quickly and easily. Future extensions to the PNNL-MLD may include Traditional Chinese Medicine, broader interlinks across genetic structures, a larger thesaurus of synonyms and hypernyms, explicit coding of diseases and drugs across research systems, and the incorporation of vector-borne transmission vocabularies.

1 INTRODUCTION

Medical vocabularies and ontologies have been developed over the last two decades and represent a large cross-section of Linked Open Datasets. Several research initiatives are now de facto authoritative data stores used by thousands of medical researchers daily, including: DrugBank (Law, et al. 2014), PharmGKB (Stanford University 2014), Vectorbase (National Institute of Allergy and Infectious Diseases; National Institutes of Health; Department of Health and Human Services 2014), Uniprot (Consortium 2014), Allen Institute for Brain Science (AIBS) Brain Map (Allen Institute for Brain Science 2014), and Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa, et al. 2014). However, as these advanced medical vocabularies and description logic rules were collected, their data classifications diverged.

Medical research groups rarely attempted to standardize vocabularies and ontologies with other research teams. This created data resources that are not natively interconnected with knowledge bases outside of a specific research objective. Furthermore, specific medical coding may exist on an entity level (OMIM, MeSH, eMedicine, etc.), but there is no inherent guarantee across data sources that these codes are available or properly represented in a standard format. Entity matching between datasets is complicated by the fact that most medical classes operate on a complex set of synonyms, hypernyms, or taxonomical naming schemas that typically are not standardized across research projects and communities.

This effort addresses the tracking of a disease and treatment regimen across vector-borne transmission variables, including geography and species. The variety of issues described renders any available single source of research data unusable to address realistic research questions across the breadth of this domain space. Table 1 lists the common diseases and transmission vectors for tracking vector-borne infections that were used as the starting point.

Disease                              Transmission Vector
Eastern Equine Encephalitis Virus    Culiseta melanura / Cs. morsitans
Western Equine Encephalitis Virus    Culex / Culiseta
Highlands J Virus                    Culiseta melanura
St. Louis Encephalitis Virus         Culex
West Nile Virus                      Many
La Crosse Encephalitis               Ochlerotatus triseriatus (synonym Aedes triseriatus)
Chikungunya                          A. albopictus and A. aegypti
Dengue Fever                         Genus Aedes, principally A. aegypti

Table 1. Common diseases and associated transmission vectors.

2 INITIAL VOCABULARIES AND ONTOLOGIES

One way of making use of the extensive previous work in disease descriptions by different research efforts, and of enabling associations across these vocabularies, is to assemble a knowledge base targeting the research area of interest. The more overlapping sets of information are present in the resulting knowledge base, the better chance a system has of making associations across vocabularies, simply because more information is available on which to base the associations. For tracking vector-borne infections, disease datasets are a primary focus. Therefore, the team initially collected authoritative resources with a large number of disease entities and extensive properties attached to each entity. The chosen datasets include: Diseasome (Goh, et al. 2007), PharmGKB, DisGeNet (DisGeNet 2014), and Wikipedia Infobox (Wikimedia Foundation 2014). Table 2 shows the datasets incorporated and the scale of the associated relevant vocabularies.

Dataset             Schema     Schema       Schema     Unaligned   Unaligned    Unaligned
                    Entities   Predicates   Objects    Entities    Predicates   Objects
Diseasome           4,213      7            31,538     3,938       13           43,836
PharmGKB            3,442      2            43,030     0           3            10,326
DisGeNet            13,172     1            13,172     0           3            39,516
Wikipedia Infobox   2,273      2            5,747      0           3            5,179
DailyMed            5,019      3            11,729     9,294       25           151,243
DrugBank            4,772      10           155,410    19,686      89           29,230
Sider               2,661      10           51,244     9           89           32,370

Table 2. Entity, predicate, and object counts after Schema.org alignment.

These datasets provided different levels of expression across diseases. For example, PharmGKB has a small number of diseases with many properties, whereas DisGeNet has several times more entities, each expressed with a single name property and medical code.

Drug datasets, while initially not appearing to be part of the use case of tracking vector-borne infections, are useful as a direct path for aligning diseases across naming conventions. The selected drug datasets are DailyMed (United States National Library of Medicine 2014) and DrugBank. In practice, drug datasets contain an extensive listing of medical codes, collected from prior research across databases, that are often missing from disease datasets. While these codes can be imprecise, they provide a starting point for entity interlinks and additional data enrichment through NLP and Linked Data techniques. When we focus on the disease medical codes affected by a specific treatment, the medical codes in the drug datasets enable us to programmatically create owl:sameAs relations across diseases in the disease datasets that are missing explicit matching medical codes or proper names. As a result, when drugs listing extensive medical codes are used as a reference point, diseases often can be more fully described, as missing medical codes are combined across datasets for more complete Linked Open Data.

Side effects were also included in the initial PNNL-MLD. This additional information enables detecting symptoms and matching the symptom to a disease or drug combination. The Sider (Kuhn, et al. 2010) dataset was selected as the lone source due to limited availability, but Sider contained dozens of different connections per entity across drugs, further helping to align the combined dataset.

3 TARGET VOCABULARY: SCHEMA.ORG

In order to facilitate easier query description through a consistent vocabulary, the project chose one primary vocabulary to encompass the collected data. Selection of this vocabulary is driven by two primary considerations: 1) it must be adequately expressive for the queries, and 2) it must not be so prescriptive that it creates conflicts with the individual dataset semantics. The selection of this primary vocabulary is important, as it is an opportunity to promote wider use of the assembled dataset through adoption of an impactful or widely used vocabulary.

Schema.org (Google Inc; Microsoft Inc; Yahoo Inc 2014) was released in June 2011, and has become the search industry's preferred standard for publishing search engine readable data. After the release of Schema.org an RDFS (W3C RDF Working Group 2004) mapping was created and hosted on http://schema.rdfs.org, and this mapping is now a standard for Linked Data research utilizing Schema.org. Finally, at the end of June 2011, Schema.org released an official OWL (W3C OWL Working Group 2012) version of the Schema.org ontology, bridging the gap between vocabulary and description logic.

Schema.org provides a base ontology class for medical entities, available as a subclass of Thing entitled MedicalEntity. The subclasses of the MedicalEntity class were selected to represent the disease, drug, and side effect entities available within the PNNL-MLD. Table 3 lists the selected subclasses.

Schema.org Class        Entity
MedicalCondition        Disease
MedicalCause            Disease Cause
MedicalSignOrSymptom    Disease Symptom
MedicalTherapy, Drug    Drug
MedicalCode             Entity Code
MedicalEntity           Side Effect

Table 3. Schema.org classes selected to represent use case entities.

4 VOCABULARY ALIGNMENT

Simply adding a primary vocabulary to the datasets is not adequate to simplify querying. The source datasets must be aligned with the primary vocabulary so that queries will return results that span and integrate all the available information. The central goal in this alignment is to provide a mapping of the source vocabularies to the new primary vocabulary that preserves the semantics of the source but bridges the divergence between the different knowledge representations.

4.1 Base dataset alignment

The project selected the URI http://beowulf.pnnl.gov/2014/ to serve as the RDF (W3C RDF Working Group 2004) prefix base for all aligned data. We used this new base URI to simplify software development later in the alignment process. Furthermore, all properties were immediately aligned by import dataset, with prefix associations of the following form:

    beo-<dataset>:propertyName

By first associating property and class values with an original prefix denoting the source dataset, we could track properties that were not explicitly aligned to Schema.org. The rdfs:label and owl:sameAs properties were left unmodified throughout the entire process, and schema:alternateName is used to track synonyms of rdfs:label.

4.2 Disease dataset alignment

Four large datasets of varying entity counts and properties were the first targets after the base import of the PNNL-MLD. The first substitution converted all unique entity IRIs to a common format:

    beo-disease:

We then added the Schema.org declaration of class:

    a schema:MedicalCondition

Primary preventions were added to diseases as drug IRIs were detected:

    schema:primaryPrevention beo-drug:, …

Finally, we use Table 4 to ensure we can match back to online medical resources and unify datasets:

Schema.org Class    Entity
MedicalCode         IRI
MedicalPage         URI
code                Unknown Code Type

Table 4. Alignment of Schema.org classes to medical resources.

4.3 Drug dataset alignment

[...] interlinks to side effect metadata. These entities were referenced in both disease and side effect datasets as potential treatments (disease) and causes of (side effect). The first substitution converted all unique entity IRIs to a common format:

    beo-drug:

We then added the Schema.org declaration of class:

    a schema:Drug

Drug, a subclass of MedicalTherapy, was selected due to the semantics of the original data. Drugs follow the same medical coding standards as in Table 4, but the attributes linking the drugs are more abstract, including two descriptions of the drug:

    schema:potentialAction
    schema:description

To link the drug entity to a disease we replace:

    beo-drugbank:possibleDiseaseTarget

with:

    schema:possibleTreatment

Finally, drugs can interact with each other, creating adverse reactions. The DrugBank dataset provides the interconnections for this possibility. We aligned these reactions by creating the entity type:

    a beo-drugbank:drug_interactions

and ensuring the new entity has at least two of the following relations:

    schema:interactingDrug beo-drug:id

4.4 Side Effect dataset alignment

The single Sider dataset provides the final links to drug entities, with each side effect's unique IRI converted to:

    beo-interaction:

We then added the Schema.org declaration of class:

    a schema:MedicalEntity

Completing the ontology requires one last step, linking drugs to side effects with the drug entity property:

    schema:seriousAdverseOutcome beo-interaction:

5 QUERY A DISEASE

Using Dengue Fever as an example disease, we can now use schema:MedicalCondition to query across all of the disease datasets. The SPARQL (W3C SPARQL Working Group 2013) query below locates the available information in the combined dataset about any medical condition with “dengue” in its name and collects the comments that describe the source of that information.

    PREFIX schema: <http://schema.org/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?label ?comment
    WHERE {
      ?item a schema:MedicalCondition .
      ?item rdfs:label ?label .
      FILTER (regex(?label, 'dengue', 'i')) .
      OPTIONAL { ?item rdfs:comment ?comment }}

Running this query on the PNNL-MLD returns Table 5.

?label                                ?comment
"Dengue shock syndrome"               "Imported from DISGENET"
"Dengue"                              "Imported from PharmGKB"
"Dengue Hemorrhagic Fever"            "Imported from PharmGKB"
"Dengue"                              "Imported from DISGENET"
"Dengue Hemorrhagic Fever"            "Imported from DISGENET"
"Dengue fever, protection against"
"Dengue_fever,_protection_against"

Table 5. Result of SPARQL query on Dengue Fever.

The results in Table 5 expose a current limitation of the system due to regex matching of the label property. Because the query can now reach across several different datasets with conflicting naming schemes, an additional process is needed during the data import to normalize labels for all of the entities linked with owl:sameAs.

The results in Table 5 show that one simple query now identifies data from three different sources. This begins to show the value of the combined dataset. However, to explore the full impact of the alignment, a more complex query is needed that requires the integration of information from multiple sources. Expanding on our previous query, we can search across all originally returned “dengue” conditions and append the drug links and treatments added with Schema.org.

    PREFIX schema: <http://schema.org/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT DISTINCT ?drugLabel ?diseaseTarget
    WHERE {
      GRAPH ?G {
        ?item a schema:MedicalCondition .
        ?item rdfs:label ?label .
        FILTER (regex(?label, 'dengue', 'i')) . }
      GRAPH ?G1 {
        ?item schema:primaryPrevention ?drug .
        ?drug rdfs:label ?drugLabel . }
      GRAPH ?G2 {
        ?drug schema:possibleTreatment ?target .
        ?target rdfs:label ?diseaseTarget }}

This query returns Table 6.

?drugLabel           ?diseaseTarget
"Alpha-D-Mannose"    "Dengue_fever,_protection_against"
"Fucose"             "Dengue_fever,_protection_against"

Table 6. Result of SPARQL query for drugs treating Dengue Fever showing value of aligned vocabulary.

Another limitation of the current PNNL-MLD is exposed when reviewing the results in Table 6. When creating interlinks across diseases, only the Diseasome entities were referenced in the corresponding drug datasets as possible targets for treatment. To correct this oversight we also need to include owl:sameAs associations within our queries, or select a logical reasoner capable of associating and returning all related entities upon a single link between a disease and drug.

Most importantly, Table 6 depicts the value of the combined and aligned PNNL-MLD dataset. Queries like the one given here, which require information linking diseases to treatments, symptoms, or side effects, are now greatly simplified and can focus on a single vocabulary. Schema.org provided classes and properties appropriate for drafting queries that can provide views of the data not visible using only a single source of data.

No technical limitation exists that would restrict a user from loading all of the datasets into separate graphs of an available triplestore and querying the different vocabularies across graphs. However, when we align these datasets into the PNNL-MLD we achieve four major benefits:

1. Queries are now simplified. Early drafts for querying across all of the graphs required queries that were dozens of lines in length, and portions of the queries varied drastically in format and language.
2. A standardized, industry-recognized vocabulary is now in place for application development.
3. All of the graphs, when aligned into the PNNL-MLD, are now equally extensible. Without alignment, additions would require special updates to each dataset and to each specific portion of a query using that dataset.
4. As shown in Table 2, when a dataset is converted using RDF, and not generated from a different file type (unaligned entities = 0), the Schema.org entities now have a much higher ratio of Schema.org predicates to object triple mappings. By flattening the ontology, a simplified query now has access to a much greater range of values and entities.
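The label normalization called for in the discussion of Table 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the PNNL-MLD import code: the function names, the example IRIs, the owl:sameAs pairs, and the rule of taking the shortest normalized label as the representative are all hypothetical choices made for the example. It groups entities by the symmetric, transitive closure of owl:sameAs and gives every member of a group the same normalized label.

```python
from collections import defaultdict


def normalize_label(label):
    """Lower-case a label and treat underscores as spaces."""
    return " ".join(label.replace("_", " ").lower().split())


def same_as_groups(entities, same_as_pairs):
    """Group IRIs by the symmetric, transitive closure of owl:sameAs (union-find)."""
    parent = {e: e for e in entities}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for e in parent:
        groups[find(e)].add(e)
    return list(groups.values())


def unify_labels(labels, same_as_pairs):
    """Map each entity to (shared label, other normalized labels in its group)."""
    result = {}
    for group in same_as_groups(labels.keys(), same_as_pairs):
        normalized = sorted({normalize_label(labels[e]) for e in group if e in labels})
        if not normalized:
            continue
        # Assumption: the shortest normalized label represents the group.
        primary = min(normalized, key=lambda s: (len(s), s))
        alternates = [n for n in normalized if n != primary]
        for e in group:
            result[e] = (primary, alternates)
    return result


# Hypothetical IRIs and sameAs links; the labels are taken from Table 5.
labels = {
    "beo-disease:diseasome-1": "Dengue_fever,_protection_against",
    "beo-disease:pharmgkb-1": "Dengue",
    "beo-disease:disgenet-1": "Dengue",
    "beo-disease:disgenet-2": "Dengue shock syndrome",
}
pairs = [
    ("beo-disease:diseasome-1", "beo-disease:pharmgkb-1"),
    ("beo-disease:pharmgkb-1", "beo-disease:disgenet-1"),
]
for iri, (primary, alternates) in sorted(unify_labels(labels, pairs).items()):
    print(iri, primary, alternates)
```

With these assumed links, the three linked entities share the label "dengue", while "Dengue shock syndrome" stays in its own group. Whether two source entities really denote the same condition is a curation decision that the sketch does not address.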
Additionally, because all modifications and additions made while aligning are programmatically defined rather than mediated by human experts, new versions of the PNNL-MLD can be easily created as source datasets produce new versions.

6 CURRENT LIMITATIONS

The complete PNNL-MLD is now capable of being queried through SPARQL using only Schema.org associations. However, there are still shortcomings in searching for drugs and diseases by name, including the corresponding regex filters. To resolve this, a primary label for a group of entities related by owl:sameAs should be selected upon entity interlinking, with the previous labels turned into schema:alternateName properties. Queries should then be composed to search for a primary name, an alternate synonym, or both. To remove duplicates imported from different datasets, a reasoner capable of merging owl:sameAs relations should be used when querying the complete PNNL-MLD.

Medical coding was not at first considered a feature of the application, and early versions of the PNNL-MLD did not prioritize accurately creating the properties in Table 3. As it became more apparent that diseases and drugs were not consistently labeled across datasets, while outside database entities generally were consistent across datasets, more focus was added to ensure medical codes were applied to drug and disease entities. However, this process was never finalized through Linked Data authentication to ensure the medical codes supplied were accurate for the attached entity.

6.1 Future work

To address current limitations we need to focus on best practices utilizing linked data (Heath and Bizer 2011), and on expanding vector transmission geo-properties.

1. Authenticate medical coding. Confirm the entity is correctly aligned to outside sources.
2. Add Gazetteer to provide formal geographic naming entities while also mapping a list of local colloquialisms for geographic regions.
3. Add Vectorbase (National Institute of Allergy and Infectious Diseases; National Institutes of Health; Department of Health and Human Services 2014).

7 CONCLUSIONS

The broader implications of aligning datasets under a common vocabulary, and making them available using Linked Open Data best practices, are to standardize and expand the original research objectives. When we augment the unique vocabulary and ontology mappings of individual research programs with the broader Schema.org vocabulary, we create data interlinks that enable conceptualization of new questions that bridge the earlier work without requiring replicating research with a broader focus. This combination of separate datasets with common data points aligned to nonexclusive properties and ontology rules simplifies queries, and creates a new superset built for application development and public discovery.

ACKNOWLEDGEMENTS

This work was funded by a contract with the Defense Threat Reduction Agency (DTRA), Joint Science and Technology Office for Chemical and Biological Defense under project number CB10082. Pacific Northwest National Laboratory is operated for the U.S. Department of Energy by Battelle under Contract DE-AC05-76RL01830.

REFERENCES

Allen Institute for Brain Science. Allen Human Brain Atlas. 2014. http://human.brain-map.org/ (accessed 2014).
Ashburner, Michael. BioPortal. 2014. http://bioportal.bioontology.org/ontologies/GAZ.
Consortium, The UniProt. "UniProt: a hub for protein information." Oxford Journals 43, no. D1 (2014).
DisGeNet. 10 2014. http://www.disgenet.org/web/DisGeNET/v2.1/dbinfo.
Goh, Kwang-Il, Michael Cusick, David Valle, Barton Childs, Marc Vidal, and Albert-László Barabási. "The Human Disease Network." Proc Natl Acad Sci USA, 4 2007.
Google Inc; Microsoft Inc; Yahoo Inc. 2014. http://schema.org/.
Heath, Tom, and Christian Bizer. Linked Data: Evolving the Web into a Global Data Space. 1. Berlin: Morgan & Claypool, 2011.
Kanehisa, M, S Goto, Y Sato, M Kawashima, M Furumichi, and M Tanabe. "Data, information, knowledge and principle: back to metabolism in KEGG." Nucleic Acids Res, Jan 2014.
Kuhn, M, M Campillos, I Letunic, LJ Jensen, and P Bork. "A side effect resource to capture phenotypic effects of drugs." Epub (NCBI), 1 2010.
Law, V, et al. "DrugBank 4.0: Shedding new light on drug metabolism." PubMed, no. 24203711 (2014).
National Institute of Allergy and Infectious Diseases; National Institutes of Health; Department of Health and Human Services. 2014. https://www.vectorbase.org.
Stanford University. 2014. https://www.pharmgkb.org/.
United States National Library of Medicine. 10 1, 2014. http://dailymed.nlm.nih.gov/.
W3C OWL Working Group. 2012. http://www.w3.org/TR/2012/REC-owl2-overview-20121211/.
W3C RDF Working Group. 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210/.
W3C SPARQL Working Group. "SPARQL 1.1 Query Language." W3C Recommendation. March 2013. http://www.w3.org/TR/sparql11-query/ (accessed October 2014).
Wikimedia Foundation. 2014. http://www.wikidata.org