=Paper=
{{Paper
|id=Vol-1515/regular11
|storemode=property
|title=Medical and transmission vector vocabulary alignment with Schema.org
|pdfUrl=https://ceur-ws.org/Vol-1515/regular11.pdf
|volume=Vol-1515
|dblpUrl=https://dblp.org/rec/conf/icbo/SmithCC15
}}
==Medical and transmission vector vocabulary alignment with Schema.org==
<pdf width="1500px">https://ceur-ws.org/Vol-1515/regular11.pdf</pdf>
<pre>
                         Medical and Transmission Vector Vocabulary
                                 Alignment with Schema.org
                                   William Smith*, Alan Chappell, and Courtney Corley
                                                      Pacific Northwest National Laboratory


ABSTRACT
Available biomedical ontologies and knowledge bases currently lack formal and
standards-based interconnections between disease, disease vector, and drug treatment
vocabularies. The PNNL Medical Linked Dataset (PNNL-MLD) addresses this gap. This paper
describes the PNNL-MLD, which provides a unified vocabulary and dataset of drug, disease,
side effect, and vector transmission background information. Currently, the PNNL-MLD
combines and curates data from the following research projects: DrugBank, DailyMed,
Diseasome, DisGeNet, Wikipedia Infobox, Sider, and PharmGKB. The main outcomes of this
effort are a dataset aligned to Schema.org, including a parsing framework, and extensible
hooks ready for integration with selected medical ontologies. The PNNL-MLD enables
researchers to query distinct datasets more quickly and easily. Future extensions to the
PNNL-MLD may include Traditional Chinese Medicine, broader interlinks across genetic
structures, a larger thesaurus of synonyms and hypernyms, explicit coding of diseases and
drugs across research systems, and incorporating vector-borne transmission vocabularies.

1    INTRODUCTION
Medical vocabularies and ontologies have been developed over the last two decades and
represent a large cross-section of Linked Open Datasets. Several research initiatives are
now de facto authoritative data stores used by thousands of medical researchers daily
including: DrugBank (Law, et al. 2014), PharmGKB (Stanford University 2014), Vectorbase
(National Institute of Allergy and Infectious Diseases; National Institutes of Health;
Department of Health and Human Services 2014), Uniprot (Consortium 2014), Allen Institute
for Brain Science (AIBS) Brain Map (Allen Institute for Brain Science 2014), and Kyoto
Encyclopedia of Genes and Genomes (KEGG) (Kanehisa, et al. 2014). However, with the
collection of these advanced medical vocabularies and description logic rules, a data
classification divergence occurred.
    Medical research groups rarely attempted to standardize vocabularies and ontologies
with other research teams. This created data resources that are not natively
interconnected with knowledge bases outside of a specific research objective. Furthermore,
specific medical coding may exist on an entity level (OMIM, MeSH, eMedicine, etc.), but
there is no inherent guarantee across data sources that these codes are available or
properly represented in a standard format. Entity matching between datasets is complicated
by the fact that most medical classes operate on a complex set of synonyms, hypernyms or
taxonomical naming schemas that typically are not standardized across research projects
and communities.
    This effort addresses the tracking of a disease and treatment regimen across
vector-borne transmission variables, including geography and species. The variety of
issues described renders any available single source of research data unusable to address
realistic research questions across the breadth of this domain space. Table 1 represents
common diseases and transmission vectors for tracking vector-borne infections that were
used as the starting point.

    Disease                               Transmission Vector
    Eastern Equine Encephalitis Virus     Culiseta melanura / Cs. morsitans
    Western Equine Encephalitis Virus     Culex / Culiseta
    Highlands J Virus                     Culiseta melanura
    St. Louis Encephalitis Virus          Culex
    West Nile Virus                       Many
    La Crosse Encephalitis                Ochlerotatus triseriatus (synonym Aedes triseriatus)
    Chikungunya                           A. albopictus and A. aegypti
    Dengue Fever                          Genus Aedes, principally A. aegypti
    Table 1. Common diseases and associated transmission vectors.

2    INITIAL VOCABULARIES AND ONTOLOGIES
One way of making use of the extensive previous work in disease descriptions by different
research efforts and enabling associations across these vocabularies is assembling a
knowledge base targeting the research area of interest. The more overlapping sets of
information present in the resulting knowledge base, the better chance a system has of
making associations across vocabularies, simply because of the availability of information
on which to make the associations. For tracking vector-borne infections, disease datasets

* To whom correspondence should be addressed: william.smith@pnnl.gov


 Copyright c 2015 for this paper by its authors. Copying permitted for private and academic purposes                                              1
Smith et al.


are a primary focus. Therefore, the team initially collected authoritative resources with
a large number of disease entities and extensive properties attached to each entity. The
chosen datasets and entity count estimates include: Diseasome (Goh, et al. 2007),
PharmGKB, DisGeNet (DisGeNet 2014), and Wikipedia Infobox (Wikimedia Foundation 2014).
Table 2 depicts the datasets incorporated and the scale of the associated relevant
vocabularies. These datasets provided different levels of expression across diseases, an
example being PharmGKB having a small number of diseases with many properties versus
DisGeNet having several times more entities expressed with a single name property and
medical code.

 Dataset              Schema     Schema       Schema     Unaligned   Unaligned    Unaligned
                      Entities   Predicates   Objects    Entities    Predicates   Objects
 Diseasome             4,213        7          31,538      3,938        13         43,836
 PharmGKB              3,442        2          43,030          0         3         10,326
 DisGeNet             13,172        1          13,172          0         3         39,516
 Wikipedia Infobox     2,273        2           5,747          0         3          5,179
 DailyMed              5,019        3          11,729      9,294        25        151,243
 DrugBank              4,772       10         155,410     19,686        89         29,230
 Sider                 2,661       10          51,244          9        89         32,370
 Table 2: Entity, predicate, and object counts after Schema.org alignment.

    Drug datasets, while initially not appearing to be part of the use case of tracking
vector-borne infections, are useful as a direct path for aligning diseases across naming
conventions. The selected drug datasets and estimated entity counts include: DailyMed
(United States National Library of Medicine 2014) and DrugBank. In practice, drug datasets
contain an extensive listing of medical codes, collected from prior research, across
databases often missing from disease datasets. While these codes can be imprecise, they
provide a starting point for entity interlinks and additional data enrichment through NLP
and Linked Data techniques. When we focus on the disease medical codes affected by a
specific treatment, the medical codes in the drug datasets enable us to programmatically
create owl:sameAs relations across diseases in the disease datasets that are missing
explicit matching medical codes or proper names. As a result, when drugs listing extensive
medical codes are used as a reference point, diseases often can be more fully described,
as missing medical codes are combined across datasets for more complete Linked Open Data.
    Side effects were also included in the initial PNNL-MLD. This additional information
enables detecting symptoms and matching the symptom to a disease or drug combination. The
Sider (Kuhn, et al. 2010) dataset was selected as the lone source due to limited
availability, but Sider contained dozens of different connections per entity across drugs,
further helping to align the combined dataset.

3    TARGET VOCABULARY: SCHEMA.ORG
In order to facilitate easier query description through a consistent vocabulary, the
project chose one primary vocabulary to encompass the collected data. Selection of this
vocabulary is driven by two primary considerations: 1) adequate expressiveness for the
queries, and 2) not overly prescriptive such that it creates conflicts with the individual
dataset semantics. The selection of this primary vocabulary is important, as it is an
opportunity to promote wider use of the assembled dataset through adoption of an impactful
or widely used vocabulary.
    Schema.org (Google Inc; Microsoft Inc; Yahoo Inc 2014) was released in June 2011, and
has become the search industry preferred standard for publishing search engine readable
data. After the release of Schema.org an RDFS (W3C RDF Working Group 2004) mapping was
created and hosted on http://schema.rdfs.org, and this mapping is now a standard for
Linked Data research utilizing Schema.org. Finally, at the end of June 2011, Schema.org
released an official OWL (W3C OWL Working Group 2012) version of the Schema.org ontology
bridging the gap between vocabulary and description logic.
    Schema.org provides a base ontology class for medical entities available as a subclass
of Thing entitled MedicalEntity. The subclasses of the MedicalEntity class were selected
to represent the disease, drug, and side effect entities available within the PNNL-MLD.
Table 3 lists the selected subclasses.

    Schema.org Class          Entity
    MedicalCondition          Disease
    MedicalCause              Disease Cause
    MedicalSignOrSymptom      Disease Symptom
    MedicalTherapy, Drug      Drug
    MedicalCode               Entity Code
    MedicalEntity             Side Effect
    Table 3. Schema.org classes selected to represent use case entities.
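
The code-based interlinking described in Section 2 (using the medical codes listed by a
drug as a bridge between disease entities) can be sketched as follows. This is a minimal
illustration and not the PNNL-MLD parsing framework: the IRIs, the codes, and the function
name same_as_links are hypothetical, and the sketch assumes each drug record lists codes
for a single target condition, which the paper notes may not always hold because the codes
can be imprecise.

```python
from itertools import combinations

# Hypothetical disease entities and the external medical codes each one carries.
DISEASE_CODES = {
    "beo-disease:diseasome-1": {"OMIM:0001"},
    "beo-disease:pharmgkb-7": {"MeSH:D0001"},
    "beo-disease:disgenet-3": {"MeSH:D0002"},
}

# Hypothetical drug record listing the codes of the condition it targets.
DRUG_TARGET_CODES = {
    "beo-drug:db-42": {"OMIM:0001", "MeSH:D0001"},
}


def same_as_links(disease_codes, drug_target_codes):
    """Return sorted candidate owl:sameAs pairs between disease IRIs.

    Two diseases are paired when a single drug's code list shares at
    least one code with each of them.
    """
    pairs = set()
    for codes in drug_target_codes.values():
        matched = sorted(d for d, dc in disease_codes.items() if dc & codes)
        pairs.update(combinations(matched, 2))
    return sorted(pairs)


LINKS = same_as_links(DISEASE_CODES, DRUG_TARGET_CODES)
TRIPLES = [f"{a} owl:sameAs {b} ." for a, b in LINKS]
```

Links produced this way are candidates only; Section 6 notes that the medical codes were
never authenticated against outside sources.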




4     VOCABULARY ALIGNMENT
Simply adding a primary vocabulary to the datasets is not adequate to simplify querying.
The source datasets must be aligned with the primary vocabulary so that queries will
return results that span and integrate all the available information. The central goal in
this alignment is to provide a mapping of the source vocabularies to the new primary
vocabulary that preserves the semantics of the source but bridges the divergence between
the different knowledge representations.

4.1    Base dataset alignment
The project selected the URI http://beowulf.pnnl.gov/2014/ to serve as the RDF (W3C RDF
Working Group 2004) prefix base for all aligned data. We used this new base URI to
simplify software development later in the alignment process. Furthermore, all properties
were immediately aligned by import dataset, using prefix associations of the following
form:
          beo-<dataset-name>:propertyName
By first associating property and class values with an original prefix denoting the
dataset, we could track properties that were not explicitly aligned to Schema.org. The
rdfs:label and owl:sameAs properties were left unmodified throughout the entire process,
and schema:alternateName is used to track synonyms of rdfs:label.

4.2    Disease dataset alignment
Four large datasets of varying entity counts and properties were the first targets after
the base import of the PNNL-MLD. The first substitution took place by converting all
unique entity IRIs to a common format:
          beo-disease:<disease-id>
We then added the Schema.org declaration of class:
          a schema:MedicalCondition
Primary preventions were added to diseases as drug IRIs were detected:
          schema:primaryPrevention beo-drug:<original-drug-id>, …
Finally, we use Table 4 to ensure we can match back to online medical resources and unify
datasets:

    Schema.org Class      Entity
    MedicalCode           IRI
    MedicalPage           URI
    code                  Unknown Code Type
    Table 4. Alignment of Schema.org classes to medical resources.

4.3    Drug dataset alignment
Two datasets comprised drug metadata and provided interlinks to side effect metadata.
These entities were referenced in both disease and side effect datasets as potential
treatments (disease) and causes of (side effect). The first substitution took place by
converting all unique entity IRIs to a common format:
          beo-drug:<drug-id>
We then added the Schema.org declaration of class:
          a schema:Drug
Drug, a subclass of MedicalTherapy, was selected due to the semantics of the original
data. Drugs have the same medical coding standards as Table 4, but the attributes linking
the drugs are more abstract, including two descriptions of the drug:
          schema:potentialAction
          schema:description
To link the drug entity to a disease we replace:
          beo-drugbank:possibleDiseaseTarget
with:
          schema:possibleTreatment
Finally, drugs can interact with each other, creating adverse reactions. The DrugBank
dataset provides the interconnections for this possibility. We aligned these reactions by
creating the entity type:
          a beo-drugbank:drug_interactions
and ensuring the new entity has at least two of the following relations:
          schema:interactingDrug beo-drug:<drug-id>

4.4    Side Effect dataset alignment
The single Sider dataset provides the final links to drug entities, with each side
effect's unique IRI converted to:
          beo-interaction:<effect-id>
We then added the Schema.org declaration of class:
          a schema:MedicalEntity
Completing the ontology requires one last step, linking drugs to side effects with the
drug entity property:
          schema:seriousAdverseOutcome beo-interaction:<id>

5     QUERY A DISEASE
Using Dengue Fever as an example disease, we can now use schema:MedicalCondition to query
across all of the disease datasets. The SPARQL (W3C SPARQL Working Group 2013) query below
locates the available information in the combined dataset about any medical condition with
"dengue" in its name and collects the comments that describe the source of that
information.

    PREFIX schema: <http://schema.org/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>




    SELECT ?label ?comment
    WHERE {
      ?item a schema:MedicalCondition .
      ?item rdfs:label ?label .
      FILTER (regex(?label, 'dengue', 'i')) .
    OPTIONAL { ?item rdfs:comment ?comment }}

Running this query on the PNNL-MLD returns Table 5.

    ?label                                    ?comment
    "Dengue shock syndrome"                   "Imported from DISGENET"
    "Dengue"                                  "Imported from PharmGKB "
    "Dengue Hemorrhagic Fever"                "Imported from PharmGKB"
    "Dengue"                                  "Imported from DISGENET"
    "Dengue Hemorrhagic Fever"                "Imported from DISGENET "
    "Dengue fever, protection against"        <diseasome>
    "Dengue_fever,_protection_against"        <diseasome>
    Table 5. Result of SPARQL query on Dengue Fever.

The results in Table 5 expose a current limitation of the system due to regex matching of
the label property. Because the query can now reach across several different datasets with
conflicting naming schemes, an additional normalization process is needed during the data
import to normalize labels for all of the entities linked with owl:sameAs.
    The results in Table 5 show that one simple query now identifies data from three
different sources. This begins to show the value of the combined dataset. However, to
explore the full impact of the alignment, a more complex query is needed that requires the
integration of information from multiple sources. Expanding on our previous query, we can
search across all originally returned "dengue" conditions and append the drug links and
treatments added with Schema.org.

    PREFIX schema: <http://schema.org/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT DISTINCT ?drugLabel ?diseaseTarget
    WHERE {
    GRAPH ?G{
      ?item a schema:MedicalCondition .
      ?item rdfs:label ?label .
      FILTER (regex(?label, 'dengue', 'i')) .}
    GRAPH ?G1{
      ?item schema:primaryPrevention ?drug .
      ?drug rdfs:label ?drugLabel .}
    GRAPH ?G2{
      ?drug schema:possibleTreatment ?target .
      ?target rdfs:label ?diseaseTarget }}

This query returns Table 6.

    ?drugLabel             ?diseaseTarget
    "Alpha-D-Mannose"      "Dengue_fever,_protection_against"
    "Fucose"               "Dengue_fever,_protection_against"
    Table 6. Result of SPARQL query for drugs treating Dengue Fever showing value of
    aligned vocabulary.

Another limitation of the current PNNL-MLD is exposed when reviewing the results of
Table 6. When creating interlinks across diseases, only the Diseasome entities were
referenced in the corresponding drug datasets as possible targets for treatment. To
correct this oversight we also need to include owl:sameAs associations within our queries,
or select a logical reasoner capable of associating and returning all related entities
upon a single link between a disease and drug.
    Most importantly, Table 6 depicts the value of the combined and aligned PNNL-MLD
dataset. Queries like the one given here that require information linking diseases to
treatments or symptoms or side effects are now greatly simplified and can focus on a
single vocabulary. Schema.org provided classes and properties appropriate for drafting
queries that can provide views of the data not visible using only a single source of data.
    No technical limitation exists that would restrict a user from loading all of the
datasets into separate graphs of an available triplestore and querying the different
vocabularies across graphs. However, when we align these datasets into the PNNL-MLD we
achieve four major benefits:
   1.   Queries are now simplified. Early drafts for querying across all of the graphs
        required queries that were dozens of lines in length, and portions of the queries
        varied drastically in format and language.
   2.   A standardized, industry-recognized vocabulary is now in place for application
        development.
   3.   All of the graphs, when aligned into the PNNL-MLD, are now equally extensible.
        Adding new vocabularies and ontologies to the original data would require special
        updates to each dataset, and require updates to each specific portion of a query
        using that dataset.
   4.   As shown in Table 2, when a dataset is converted using RDF, and not generated from
        a different file type (unaligned entities = 0), the Schema.org entities now have a
        much higher ratio of Schema.org predicates to object triple mappings. By
        flattening the ontology, a simplified query now has access to a much greater range
        of values and entities.
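
The label normalization and owl:sameAs merging that the discussion of Tables 5 and 6 calls
for can be sketched as below. This is an illustrative sketch only, not the project's
implementation: the IRIs and the single owl:sameAs link are hypothetical, the labels are
taken from Table 5, and choosing the alphabetically first normalized label as the primary
label is an arbitrary assumption made for the example.

```python
def normalize(label):
    """Replace underscores with spaces and collapse repeated whitespace."""
    return " ".join(label.replace("_", " ").split())


def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def merge_labels(labels, same_as):
    """Group entities linked by owl:sameAs and pick one primary label per group.

    labels:  {iri: rdfs:label string}
    same_as: iterable of (iri, iri) owl:sameAs pairs
    Returns {representative iri: (primary label, [alternate labels])}, where
    the alternates would become schema:alternateName values.
    """
    parent = {iri: iri for iri in labels}
    for a, b in same_as:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            # Keep the lexicographically smaller IRI as the representative.
            parent[max(ra, rb)] = min(ra, rb)
    groups = {}
    for iri, label in labels.items():
        groups.setdefault(find(parent, iri), set()).add(normalize(label))
    merged = {}
    for root, names in groups.items():
        ordered = sorted(names)
        merged[root] = (ordered[0], ordered[1:])
    return merged


# Hypothetical IRIs; labels as they appear in Table 5.
LABELS = {
    "beo-disease:a": "Dengue fever, protection against",
    "beo-disease:b": "Dengue_fever,_protection_against",
    "beo-disease:c": "Dengue",
    "beo-disease:d": "Dengue Hemorrhagic Fever",
}
SAME_AS = [("beo-disease:a", "beo-disease:b")]  # assumed link for illustration
MERGED = merge_labels(LABELS, SAME_AS)
```

With these inputs the two Diseasome spellings collapse into one entity with a single
normalized label, so a name search would return it once rather than twice.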




Additionally, because all modifications and additions made while aligning are
programmatically defined rather than mediated by human experts, new versions of the
PNNL-MLD can be easily created as source datasets produce new versions.

6        CURRENT LIMITATIONS
The complete PNNL-MLD is now capable of being queried through SPARQL using only Schema.org
associations. However, there are still shortcomings in searching for drugs and diseases by
name, including the corresponding regex filters. To resolve this conflict, a primary label
for a group of entities related by owl:sameAs should be selected upon entity interlinking,
with the previous labels turned into schema:alternateName properties. Queries should then
be composed to search for a primary name, an alternate synonym, or both. To remove
duplicates imported from different datasets, a reasoner capable of merging owl:sameAs
relations should be used when querying the complete PNNL-MLD.
    Medical coding was not at first considered a feature of the application, and early
versions of the PNNL-MLD did not prioritize accurately creating the properties in Table 3.
As it became more apparent that diseases and drugs were not consistently labeled across
datasets, and that outside database entities generally were consistent across datasets,
more focus was added to ensure medical codes were applied to drug and disease entities.
However, this process was never finalized through Linked Data authentication to ensure the
medical codes supplied were accurate for the attached entity.

6.1       Future work
To address current limitations we need to focus on best practices utilizing linked data
(Heath and Bizer 2011), and on expanding vector transmission geo-properties.
    1.    Authenticate medical coding. Confirm the entity is correctly aligned to outside
          sources.
    2.    Add Gazetteer to provide formal geographic naming entities while also mapping a
          list of local colloquialisms for geographic regions.
    3.    Add Vectorbase (National Institute of Allergy and Infectious Diseases; National
          Institutes of Health; Department of Health and Human Services 2014).

7        CONCLUSIONS
The broader implication of aligning datasets under a common vocabulary, and making them
available using Linked Open Data best practices, is to standardize and expand the original
research objectives. When we augment the unique vocabulary and ontology mappings of
individual research programs with the broader Schema.org vocabulary, we create data
interlinks that enable conceptualization of new questions that bridge the earlier work
without requiring replicating research with a broader focus. This combination of separate
datasets with common data points aligned to nonexclusive properties and ontology rules
simplifies queries, and creates a new superset built for application development and
public discovery.

ACKNOWLEDGEMENTS
This work was funded by a contract with the Defense Threat Reduction Agency (DTRA), Joint
Science and Technology Office for Chemical and Biological Defense under project number
CB10082. Pacific Northwest National Laboratory is operated for the U.S. Department of
Energy by Battelle under Contract DE-AC05-76RL01830.

REFERENCES
Allen Institute for Brain Science. Allen Human Brain Atlas. 2014.
    http://human.brain-map.org/ (accessed 2014).
Ashburner, Michael. BioPortal. 2014. http://bioportal.bioontology.org/ontologies/GAZ.
Consortium, The UniProt. "UniProt: a hub for protein information." Oxford Journals 43,
    no. D1 (2014).
DisGeNet. 10 2014. http://www.disgenet.org/web/DisGeNET/v2.1/dbinfo.
Goh, Kwang-Il, Michael Cusick, David Valle, Barton Childs, Marc Vidal, and Albert-László
    Barabási. "The Human Disease Network." Proc Natl Acad Sci USA, 4 2007.
Google Inc; Microsoft Inc; Yahoo Inc. 2014. http://schema.org/.
Heath, Tom, and Christian Bizer. Linked Data: Evolving the Web into a Global Data Space.
    1. Berlin: Morgan & Claypool, 2011.
Kanehisa, M, S Goto, Y Sato, M Kawashima, M Furumichi, and M Tanabe. "Data, information,
    knowledge and principle: back to metabolism in KEGG." Nucleic Acids Res, Jan 2014.
Kuhn, M, M Campillos, I Letunic, LJ Jensen, and P Bork. "A side effect resource to capture
    phenotypic effects of drugs." Epub (NCBI), 1 2010.
Law, V, et al. "DrugBank 4.0: Shedding new light on drug metabolism." PubMed,
    no. 24203711 (2014).
National Institute of Allergy and Infectious Diseases; National Institutes of Health;
    Department of Health and Human Services. 2014. https://www.vectorbase.org.
Stanford University. 2014. https://www.pharmgkb.org/.
United States National Library of Medicine. 10 1, 2014. http://dailymed.nlm.nih.gov/.
W3C OWL Working Group. 2012. http://www.w3.org/TR/2012/REC-owl2-overview-20121211/.
W3C RDF Working Group. 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210/.
W3C SPARQL Working Group. "SPARQL 1.1 Query Language." W3C Recommendation. March 2013.
    http://www.w3.org/TR/sparql11-query/ (accessed October 2014).
Wikimedia Foundation. 2014. http://www.wikidata.org



</pre>