Enabling Tailored Therapeutics with Linked Data

Anja Jentzsch
Freie Universität Berlin, Web-based Systems Group
Garystr. 21, 14195 Berlin, Germany
mail@anjajentzsch.de

Bo Andersson
AstraZeneca R&D Lund
221 87 Lund, Sweden
bo.h.andersson@astrazeneca.com

Oktie Hassanzadeh
University of Toronto, Database Group
10 King’s College Rd, Toronto, Canada
oktie@cs.toronto.edu

Susie Stephens
Eli Lilly and Company, Lilly Corporate Center
Indianapolis, Indiana 46285, USA
Stephens_Susie_M@Lilly.com

Christian Bizer
Freie Universität Berlin, Web-based Systems Group
Garystr. 21, 14195 Berlin, Germany
chris@bizer.de

ABSTRACT
Advances in the biological sciences are allowing pharmaceutical companies to meet the health care crisis with drugs that are more suitable for preventive and tailored treatment, thereby holding the promise of enabling more cost effective care with greater efficacy and reduced side effects. However, this shift in business model increases the need for companies to integrate data across drug discovery, drug development, and clinical practice. This is a fundamental shift from the approach of limiting integration activities to functional areas. The Linked Data approach holds much potential for enabling such connectivity between data silos, thereby enabling pharmaceutical companies to meet the urgent needs in society for more tailored health care. This paper examines the applicability and potential benefits of using Linked Data to connect drug and clinical trials related data sources and gives an overview of ongoing work within the W3C's Semantic Web for Health Care and Life Sciences Interest Group on publishing drug related data sets on the Web and interlinking them with existing Linked Data sources. A use case is provided that demonstrates the immediate benefit of this work in enabling data to be browsed from disease, to clinical trials, drugs, targets and companies.

Categories and Subject Descriptors
H.3.5 [Online Information Services]: Data Sharing

General Terms
Experimentation, Languages

Keywords
Linked Data, Semantic Web, Tailored Therapeutics, Drugs, Clinical Trials, Competitive Intelligence

Copyright is held by the author/owner(s).
LDOW2009, April 20, 2009, Madrid, Spain.

1. INTRODUCTION
The crisis in health care is changing the business model of pharmaceutical companies to discovering and developing drugs that are suitable for preventive and tailored treatment regimes [1, 2]. This shift requires a more systematic approach to integrating and interpreting information spanning genes, proteins, pathways, targets, diseases, drugs, and patients [3]. The amount of publicly available data that is relevant for drug discovery has grown significantly over recent years [4, 5], and has reached a point where present tools are no longer effective. Scientists need new, more efficient ways to interrogate data than simply jumping from one public data source to another. This is because there are too many disparate data sources for scientists to conceptualize their relationships and remember that they all exist, let alone master the different user interfaces and inconsistent terminology. Further, the prevalence of single query input fields makes it difficult for scientists to retrieve precise information of interest, and to retrieve data that spans different data sources.

Linked Data has the potential to ease access to these data for scientists and managers by making the connections between the data sets explicit in the form of data links. This can be accomplished using RDF as a standardized data representation format, HTTP as a standardized access mechanism, and through the development of algorithms for discovering the links between data sets. Such explicit links allow scientists to navigate between data sets and discover connections they might not have been aware of previously. The standardized representation and access mechanisms allow generic tools, such as Semantic Web browsers and search engines, to be employed to access and process the data.

The Linking Open Drug Data (LODD) task within the W3C's Semantic Web for Health Care and Life Sciences Interest Group¹ gathered a list of data sets that include information about drugs, and then determined how the publicly available data sets could be linked together. The review showed that this domain is promising for Linked Data as there are many publicly available data sets, and they frequently share identifiers for key entities. The complete evaluation results are posted on the W3C ESW Wiki².

Participants of the LODD task have undertaken to demonstrate the value of Linked Data to the health care and life sciences domain. This has been achieved by publishing and linking several drug related data sets on the Web, and investigating use cases that demonstrate how researchers in life science, as well as physicians and patients, can take advantage of the connected data sets.

This paper is structured as follows: Section 2 describes the published data sets, their linkage with other published data sources, and the methods that were used to create the links. Section 3 exemplifies how navigating linked data can be utilized within a competitive intelligence use case. Section 4 summarizes our findings and experiences from publishing and navigating the data sets.

¹ http://esw.w3.org/topic/HCLSIG/LODD
² http://esw.w3.org/topic/HCLSIG/LODD/Data/DataSetEvaluation

2. LINKED DATA SETS
In this project, data about pharmaceutical companies, drugs in clinical trials, mechanisms of action of drugs, safety information, and data about disease gene correlations were added to the Linked Data cloud. This selection of data sets enabled strong connections to existing Linked Data resources, while providing novel data of interest to the pharmaceutical industry. The existing Linked Data of primary interest to this work includes the many bioinformatics and cheminformatics data sources published by Bio2RDF [6], and the information on diseases and marketed drugs in DBpedia [7].

The Linked Clinical Trials (LinkedCT) data source³ is derived from a service provided by U.S. National Institutes of Health, ClinicalTrials.gov, a registry of more than 60,000 clinical trials conducted in 158 countries. Each trial is associated with a brief description, related disorders⁴ and interventions, eligibility criteria, sponsors, locations (investigators), and several other pieces of information. The data on LinkedCT is obtained by first transforming the XML data provided by ClinicalTrials.gov to relational data using the capabilities of a hybrid relational-XML Relational Database Management System such as IBM DB2. This transformation requires identification of the entities and facts in the XML data and storing them in reasonably normalized relational tables that are appropriate for transformation into RDF. The RDF data is then published using D2R server [8]. The RDF version of the dataset contains 7,011,000 triples and 290,000 links.

DrugBank [9] is a large repository of almost 5000 FDA-approved small molecule and biotech drugs. It contains detailed information about drugs including chemical, pharmacological and pharmaceutical data, along with comprehensive drug target data such as sequence, structure, and pathway information. The data was originally published as DrugBank DrugCards⁵ and was re-published as Linked Data using D2R server. The Linked Data version of DrugBank contains 1,153,000 triples and 60,300 links⁶.

Diseasome [10] contains information about 4,300 disorders and disease genes linked by known disorder–gene associations for exploring known phenotype and disease gene associations and indicating the common genetic origin of many diseases. The list of disorders, disease genes, and associations between them was obtained from the Online Mendelian Inheritance in Man (OMIM)⁷, a compilation of human disease genes and phenotypes. The data set is published by Diseasome in a flat file representation. The flat files were read into a relational database and made accessible as Linked Data using D2R server. The Linked Data version of Diseasome contains 88,000 triples and 23,000 links⁸.

DailyMed⁹ is published by the National Library of Medicine, and provides high quality information about marketed drugs. DailyMed provides much information including general background on the chemical structure of the compound and its mechanism of action, details on the clinical pharmacology of the compound, indication (disorder) and usage, contraindications, warnings, precautions, adverse reactions, overdosage, and patient counseling. The data was originally published in Structured Product Labeling¹⁰, an XML-based standard for exchanging medication information that has been recently introduced by the Food and Drug Administration in the United States. It was published using the D2R server. The Linked Data version of DailyMed contains 124,000 triples and 29,600 links¹¹.

The linkage of the newly published data sets to each other and relevant existing Linked Data is shown in Figure 1.

Figure 1. This figure shows the incorporation of LinkedCT, DailyMed, DrugBank, and Diseasome into the Linked Data cloud. These data are represented in dark gray, while light gray represents other Linked Data from the life sciences, and white indicates interlinked datasets covering geographic, person-related and conceptual data.

There are many commonly used identifiers in the life sciences that can be utilized for making links between data sets explicit. Links that were generated based on shared identifiers include the connections from LinkedCT to Bio2RDF's PubMed, and from DrugBank to DBpedia. The connections between bioinformatics and cheminformatics data sources are already provided by Bio2RDF, allowing us to interlink our drug-related data sets to their work. In cases where no shared identifiers exist, string and semantic matching techniques were applied for link discovery [11]. Approximate string matching was employed to interlink LinkedCT and Diseasome, where for instance "Alzheimer's disease" in LinkedCT was matched with "Alzheimer_disease" in Diseasome. Semantic matching is especially useful in matching clinical terms as many drugs and diseases have multiple names. Drugs tend to have generic names and brand names, for example, "Varenicline" has the synonym "Varenicline Tartrate" and the brand names "Champix" and "Chantix".

Table 1 summarizes the number of links from our published data sets to Linked Data within the LODD cloud and beyond. Table 2 differentiates the number and type of links between data sources and indicates their frequency. A double headed arrow in the first column indicates that the links are bidirectional, while a single headed arrow indicates unidirectional links.

Table 1. Numbers of outgoing data links from the published drug related data sets.

Data set     Number of links
LinkedCT     290,000 links; 50,000 of them inside the LODD cloud
DrugBank     23,000 links; 8,500 of them inside the LODD cloud
DailyMed     29,600 links; all of them inside the LODD cloud
Diseasome    23,000 links; 8,400 of them inside the LODD cloud

Table 2. Type and frequency of links between the LODD data sets, and between LODD and Bio2RDF.

Source / Target                               Link Type                         Count
LinkedCT (intervention) ↔ DailyMed (drug)     owl:sameAs                       27,685
LinkedCT (intervention) ↔ DrugBank (drug)     owl:sameAs                       12,127
LinkedCT (intervention) ↔ DBpedia (drug)      rdfs:seeAlso                      8,848
LinkedCT (condition) ↔ DBpedia (disease)      owl:sameAs                          444
LinkedCT (condition) ↔ Diseasome (disease)    owl:sameAs                          301
LinkedCT (trial) → Geonames                   foaf:based_near                 129,177
LinkedCT (reference) → Bio2RDF’s PubMed       owl:sameAs                       42,219
LinkedCT (trial) → ClinicalTrials.gov         foaf:page                        61,920
DrugBank (drug) ↔ Diseasome (disease)         drugbank:possibleDiseaseTarget    8,201
DrugBank (drug) ↔ DailyMed (drug)             drugbank:brandedDrug              1,593
DrugBank (drug) ↔ DBpedia (drug)              owl:sameAs                        1,522
DrugBank (drug target) → Bio2RDF’s PFAM       drugbank:pfamDomainFunction      19,028
DrugBank (drug) → Bio2RDF’s UniProt           drugbank:enzymeSwissprotId        4,660
DrugBank (drug) → Bio2RDF’s IUPAC             drugbank:iupacId                  4,592
DrugBank (drug target) → Bio2RDF’s PDB        drugbank:pdbId                    3,379
DrugBank (drug) → Bio2RDF’s CAS               drugbank:casRegistryNumber        2,240
DrugBank (drug) → Bio2RDF’s HGNC              drugbank:hgncId                   1,675
DrugBank (drug) → Bio2RDF’s KEGG Compound     drugbank:keggCompoundId           1,331
DrugBank (drug) → Bio2RDF’s KEGG Drug         drugbank:keggDrug                   913
DrugBank (drug) → Bio2RDF’s ChEBI             drugbank:chebiId                    736
Diseasome (gene) → Bio2RDF’s Symbol           diseasome:bio2rdfSymbol           9,743
Diseasome (disease) → Bio2RDF’s OMIM          diseasome:omim                    2,929
Diseasome (gene) → Bio2RDF’s HGNC             diseasome:hgncId                    688
Diseasome (gene) → Bio2RDF’s GeneID           diseasome:geneId                    688

³ http://linkedct.org
⁴ disorder is used as a synonym for disease and indication, http://en.wikipedia.org/wiki/Disease#Disorder
⁵ http://www.drugbank.ca/fields
⁶ http://www4.wiwiss.fu-berlin.de/drugbank/
⁷ www.ncbi.nlm.nih.gov/omim
⁸ http://www4.wiwiss.fu-berlin.de/diseasome/
⁹ http://dailymed.nlm.nih.gov/
¹⁰ http://www.fda.gov/oc/datacouncil/SPL.html
¹¹ http://www4.wiwiss.fu-berlin.de/dailymed/

3. COMPETITIVE INTELLIGENCE CASE STUDY
A use case has been developed that demonstrates the value of Linked Data about drugs to the pharmaceutical industry. Departments within pharmaceutical companies have typically decided independently which data sets need to be brought into their organization for integration and interrogation. Access to the data is provided to employees based upon their roles. The use case describes the value that can be gained by allowing employees to gain access to a more diverse and linked body of data. This approach enables new and novel questions to be explored. The following use case describes a scenario in competitive intelligence.

A neuroscience focused business manager is interested in seeing an update on new clinical trials that competitors are starting in Alzheimer’s Disease (AD). These updates influence future sales forecasts across geographies, and impact portfolio decisions as new drugs need to demonstrate improved safety and efficacy compared to the existing pharmacopeia.

Using a Semantic Web browser of choice, for instance Tabulator¹² or the Marbles data browser¹³, the manager is able to see all drugs in trials for AD in LinkedCT, including a new phase III trial planned by Pfizer for a drug called Varenicline. The business manager can see that more information is available about the drug, which is unusual because not much data is typically available for drugs that are under investigation. Following the data link the manager sees data from DailyMed that shows that the drug is already on the market for nicotine addiction.

As side effects are better understood for drugs that are already on the market, they tend to be more successful in trials. Out of curiosity, the manager scrolls down the page to see that side effects are listed as constipation, sleeping problems, vomiting, nausea, and gas; and that the typical dose is 1 mg twice daily. The dose stated on LinkedCT for the trial was no higher than that, so it is unlikely that this drug will have new safety problems.

Given the promising safety profile, the manager is curious to discover why a nicotine addiction drug might work for AD. Linking to DrugBank highlights to the manager that Varenicline is an alpha-4 beta-2 neuronal nicotinic acetylcholine receptor agonist. However, Diseasome indicates that the corresponding genes are only important in nicotine addiction, rather than AD. This suggests that there is a more complex relationship between the diseases than just sharing a drug target. Extending the browsing to the SWAN Knowledgebase¹⁴ [12] shows that there are hypotheses relating AD to nicotinic receptors through amyloid beta [13].

Using the Linked Data approach a business manager was able to browse data relating to companies, clinical trials, drugs, diseases and genetic variation. More specifically, the manager was able to determine when extra data was available, gain access to data without needing to map different identifiers and synonyms, and gain additional insights as to interesting questions to ask.

¹² http://www.w3.org/2005/ajar/tab
¹³ http://beckr.org/marbles

4. OUTLOOK
This paper describes the mapping of four drug related data sources into the Linked Data cloud, and the ensuing insights that can be gained in the area of competitive intelligence. However, this is just the beginning, because more interesting and novel questions will be able to be addressed as additional data sets are added. As a next step, it would be interesting to incorporate data relating to epidemiology, as that could provide information relating to geographical areas in which diseases are prevalent, and where there is a strong need for the development of a drug that meets the needs of a specific population. It would also be valuable to create links to the AD hypotheses data that is in RDF within the SWAN Knowledgebase.

Pharmaceutical companies need to make decisions based upon both internal and external data; it is therefore important that companies begin to make internal data available in a linked representation, both to break down the internal silos and to easily connect with external data. Such an approach would require organizations to understand where the linkage points occur across internal data sets, but this is ongoing work as it is a critical prerequisite for all data integration efforts relating to the effective tailoring of drugs.

Currently, when pharmaceutical companies bring copies of data within their organizations for integration, they each need to have experts who understand the connectivity across data sets. However, with the Linked Data approach, this responsibility is shifted to the data providers. This is a much more efficient approach, as the data providers are the individuals who understand the data best. It also means that the integration only has to happen one time. In addition, it becomes possible for data providers to incrementally add links to new data sets as they become aware of their existence, rather than needing to design a model to do everything in one go.

As stated in [14], reasoning and querying limitations can often be compensated for by integrating additional data resources. As the Linked Data cloud grows, focus in pharmaceutical companies will be moved to approaches for interpretation. One project with potential to utilize the value from Linked Data is the Large Knowledge Collider (LarKC), a platform for massive distributed incomplete reasoning that aims at removing the scalability barriers of currently existing reasoning systems for the Semantic Web¹⁵.
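
As a rough illustration of the approximate string matching that Section 2 reports for interlinking LinkedCT and Diseasome, the following Python sketch normalizes entity labels and compares them with the standard library's difflib.SequenceMatcher. This is not the framework of [11] and not necessarily how the published links were computed; the function names, the normalization rules, and the 0.9 threshold are assumptions chosen only for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(label: str) -> str:
    """Reduce a label to lower-case, space-separated word tokens (assumed rules)."""
    label = label.lower().replace("_", " ")
    label = re.sub(r"['’]s\b", "", label)        # drop possessive: "alzheimer's" -> "alzheimer"
    label = re.sub(r"[^a-z0-9 ]+", " ", label)   # remaining punctuation becomes a space
    return " ".join(label.split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between the normalized forms of two labels."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def discover_links(source_labels, target_labels, threshold=0.9):
    """For each source label, keep its best-scoring target if the score
    reaches the (hypothetical) threshold. Returns (source, target, score) tuples."""
    links = []
    for source in source_labels:
        best = max(target_labels, key=lambda t: similarity(source, t), default=None)
        if best is None:
            continue
        score = similarity(source, best)
        if score >= threshold:
            links.append((source, best, score))
    return links


if __name__ == "__main__":
    linkedct_conditions = ["Alzheimer's disease", "Nicotine dependence"]
    diseasome_diseases = ["Alzheimer_disease", "Asthma"]
    for source, target, score in discover_links(linkedct_conditions, diseasome_diseases):
        print(f"{source} -> {target} ({score:.2f})")
```

A purely lexical comparison of this kind would not connect a generic name such as "Varenicline" with a brand name such as "Chantix"; that case needs the semantic (synonym-aware) matching described in Section 2.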

The Linked Data approach is very promising for the pharmaceutical industry, and its value will increase as more data sources become available. However, our technical work as well as use case experiments revealed various challenges that need to be mitigated to make this approach robust enough to be deployed within an enterprise environment:

1. Progress needs to be made in finding links between data items across data sets where no commonly used identifiers exist. Discovering such links requires using specific record linkage [15] and duplicate detection [16] techniques developed within the database community as well as ontology matching [17] methods from the knowledge representation literature. Recent work has proposed frameworks for simplifying this task for RDF data sets [18] and relational data [11]. In order to benefit from these frameworks for setting links within the LODD data sets, domain experts need to identify linkage points and specific rules required for finding the links.

2. Work needs to be undertaken to make data browsers more robust and performant. In addition, the user interface of data browsers needs to be improved. Life Sciences data frequently consists of long lists of entities (e.g. genes, trials, diseases, patients) that need to be browsed, filtered, and queried. Benefits would be gained if hybrid interfaces that combine querying and browsing would be available and able to process the large amounts of data that are typically relevant within this domain. For such interfaces, it could be promising to combine live data retrieval with local caching and in-advance crawling of relevant data sets, as it is currently done by Semantic Web Search engines such as Sindice [19] and Falcons [20].

3. A significant challenge within the life sciences and health care is the strong prevalence of terminology conflicts, synonyms, and homonyms. These problems are not addressed by simply making data sets available on the Web using RDF as common syntax but require deeper semantic integration. For applications that focus on discovery and data navigation, having explicit links between data sources is often already a huge benefit even without semantic integration. For other applications that rely on expressive querying or automated reasoning deeper integration is essential. In order to also provide for such applications and lay the foundation for fusing data from several Linked Data sources, it would be beneficial if more community practices on publishing term and schema mappings would be established.

Figure 2. Data relating to Varenicline from LinkedCT, DrugBank and Diseasome shown within the Marbles data browser.

¹⁴ http://hypothesis.alzforum.org/swan/
¹⁵ http://www.larkc.eu/

5. ACKNOWLEDGEMENTS
This work was undertaken within the LODD task of the W3C's Semantic Web for Health Care and Life Sciences Interest Group. Significant contributions to the LODD task have also been made by Kei Cheung, Don Doherty, Matthias Samwald, and Jun Zhao. Anja Jentzsch and Chris Bizer received funding for this work from Eli Lilly.

6. REFERENCES
[1] Healthcare 2015: Win-win or lose-lose? www.Ibm.com/healthcare/hc2015.
[2] Gerhardsson de Verdier, M.: The Big Three Concept - A Way to Tackle the Health Care Crisis? Proc. Am. Thorac. Soc. 5: 800–805, 2008.
[3] Andersson B., Momtchev V.: D7a.1.1 LarKC Requirements summary and data repository, http://wiki.larkc.eu/LarkcProject/WP7a.
[4] Sharp, M., Bodenreider, O., and Wacholder, N.: A framework for characterizing drug information sources. AMIA Annu. Symp. Proc. 2008 Nov 6:662-666. http://www.ncbi.nlm.nih.gov/pubmed/18999182.
[5] Goble, C., Stevens, R.: State of the Nation in Data Integration for Bioinformatics. J. Biomed. Infor. 41: 687-693, 2008.
[6] Belleau F., Nolin., M.-A., Tourigny N., Rigault, P., and Morissette, J. Bio2RDF: Towards a mashup to build bioinformatics knowledge systems. J. Biomed. Infor. 41. 706-716, 2008.
[7] Auer, S., Bizer, C., Lehmann, J., Kobilarov, G., Cyganiak, R., Ives, Z. DBpedia: A Nucleus for a Web of Open Data. In proceedings of the 6th International Semantic Web Conference. Lecture Notes in Computer Science 4825 Springer, ISBN 978–3-540–76297–3, 2007.
[8] Bizer, C., Cyganiak, R.: D2R Server - Publishing Relational Databases on the Semantic Web. Poster at the 5th International Semantic Web Conference, 2006.
[9] Wishart D.S., Knox C., Guo A.C., Shrivastava S., Hassanali M., Stothard P., Chang Z., Woolsey J.: DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nuc. Acids Res. 1(34): D668-72, 2006.
[10] Goh K.-I., Cusick M.E., Valle D., Childs B., Vidal M., Barabási A.L.: The human disease network. Proc. Natl. Acad. Sci. USA 104:8685-8690, 2007.
[11] Hassanzadeh O., Lim L., Kementsietsidis A., and Wang M.: A Declarative Framework for Semantic Link Discovery over Relational Data. Poster at the 18th World Wide Web Conference, 2009.
[12] Gao Y., Kinoshita J., Wu E., Miller E., Lee R., Seaborne A., Cayzer S., Clark T.: SWAN: A Distributed Knowledge Infrastructure for Alzheimer Disease Research. J. Web Sem. 4(3): 222-228, 2006.
[13] Dineley, K.T., Westerman, M., Bui, D., Bell, K., Ashe K.H., Sweatt, J.D.: β-Amyloid Activates the Mitogen-Activated Protein Kinase Cascade via Hippocampal α7 Nicotinic Acetylcholine Receptors: In Vivo Mechanisms Related to Alzheimer’s Disease. J. Neurosci. 21(12):4125-4133, 2001.
[14] Sahoo, S., Bodenreider, B., Rutter, J., Skinner, K., and Sheth, A.: An ontology-driven semantic mashup of gene and biological pathway information: Application to the domain of nicotine dependence. Journal of Biomedical Informatics 41: 752-765, 2008.
[15] Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S. Duplicate record detection: A survey. IEEE Trans. Knowledge and Data Engineering, 19(1): 1–16, 2007.
[16] Winkler, W.: Overview of Record Linkage and Current Research Directions. Bureau of the Census, Technical Report, 2006.
[17] Euzenat, J., Shvaiko, P.: Ontology Matching. Springer, Heidelberg, 2007.
[18] Volz, J., Bizer C., Gaedke, M., and Kobilarov, G.: Silk – A Link Discovery Framework for the Web of Data. In: Linked Data on the Web workshop at WWW2009, 2009.
[19] Tummarello G. et al. Sindice.com: Weaving the Open Linked Data. In: 6th International Semantic Web Conference, 2007.
[20] Gong Cheng, H. W., Weiyi Ge, Qu Y.: Searching Semantic Web Objects Based on Class Hierarchies. In: Linked Data on the Web workshop at WWW2008, 2008.