Towards a Semantic Clinical Data Warehouse: A Case Study of Discovering Similar Genes

Benedikt Kämpgen¹, Horst Werner², Radwan Deeb², and Christof Bornhövd²

¹ FZI Research Center for Information Technology, Karlsruhe, Germany, kaempgen@fzi.de
² SAP AG, Karlsruhe, Germany, firstname.lastname@sap.com

Abstract. Physicians nowadays have to consider a diverse range of data sources when treating a patient. Semantic clinical data warehouses make it easy to add new data and proactively help the physician make sense of the data. In this work-in-progress paper we investigate an approach that uses Linked Data as the access mechanism and a graph database for storage and query processing. We describe lessons learned from a case study of discovering similar genes, in which we use an existing similarity metric to derive new information, the Gene Ontology as a data source, and SAP HANA as an efficient graph database.

1 Introduction

Examples of data sources that physicians nowadays have to consider when treating a patient include background information collected as part of trials or from publications and encyclopedias, as well as genomic information [5, 4]. An example tool to support the physician in accessing and analysing the data from such sources is the Patient Data Explorer (PDE), which is based on the SAP HANA in-memory database and deployed at the National Center for Tumor Diseases (NCT) in Heidelberg³. PDE gives an overview of patients, creates diagrams visualising the distribution of characteristics among patients, and lets the user zoom in to single patients to get detailed information about diagnoses and therapies.

PDE can be improved in several ways. PDE uses a broadly applicable entity-relationship data model about "interactions" and "observations" similar to a star schema; adding background information such as ICD codes or PubMed references would require manually modifying ETL pipelines and the schema.
Heterogeneities remain for information coming from different sources, e.g., different terminologies for diseases and drugs may be used, and inconsistencies and redundancies easily occur. Machine Learning (ML) algorithms, such as those for clustering genes, are difficult for physicians to apply, and the results are not written back to the data warehouse for provenance tracking and information sharing.

³ http://www.sap-innovationcenter.com/2013/09/19/medical-research-insights/

In this work-in-progress paper we argue that overcoming such challenges is possible using Linked Data, graph databases, and semantic algorithms (Section 2): we describe a use case for discovering similar genes (Section 3) and derive lessons learned (Section 4). We mention related work (Section 5) and conclude (Section 6).

2 Semantic Clinical Data Warehouse

See Figure 1 for the architecture. Information in the semantic clinical data warehouse is presented to the user by a visualisation and analysis tool. To store, query, and visualise arbitrary information we use a graph database and the following intuitive data model (property graph): Relevant objects such as patients, interactions, and observations are represented as vertices in the graph. Such objects have properties with values of primitive datatypes such as String and Integer, e.g., the surname of a patient. Objects are related to each other via edges in the graph, e.g., a patient is diagnosed with a disease. Such relationships can also have properties, e.g., provenance information about the algorithm or human expert that generated the relationship.

The integrator and reasoner component 1) translates an RDF graph to a property graph, 2) derives implicit information useful for data integration and decision support of users, and 3) imports the graph into the graph database. The RDF graph is crawled based on the Linked Data principles.
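As a minimal sketch of step 1) of the integrator, the following Python fragment translates a handful of RDF-like triples into a property graph. The rule that literal-valued triples become vertex properties while all other triples become edges, the `ex:` identifiers, and the `translate` helper are assumptions made for illustration; they are not taken from the HLA implementation.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyGraph:
    # vertex id -> dict of primitive-valued properties
    vertices: dict = field(default_factory=dict)
    # list of (source id, label, target id, edge properties)
    edges: list = field(default_factory=list)


def translate(triples, provenance):
    """Translate (subject, predicate, object, is_literal) tuples.

    Assumption: literal objects become vertex properties; every other
    triple becomes an edge annotated with where it came from.
    """
    graph = PropertyGraph()
    for subject, predicate, obj, is_literal in triples:
        graph.vertices.setdefault(subject, {})
        if is_literal:
            graph.vertices[subject][predicate] = obj
        else:
            graph.vertices.setdefault(obj, {})
            graph.edges.append((subject, predicate, obj, {"source": provenance}))
    return graph


# Hypothetical example mirroring the text: a patient with a surname
# who is diagnosed with a disease.
triples = [
    ("ex:patient1", "ex:surname", "Doe", True),
    ("ex:patient1", "ex:diagnosedWith", "ex:diseaseX", False),
]
graph = translate(triples, provenance="ex:expertA")
print(graph.vertices)
print(graph.edges)
```

Keeping provenance as an edge property, as in the sketch, matches the data model described above, where relationships may record the algorithm or human expert that generated them.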
Fig. 1. Semantic clinical data warehouse based on Linked Data and graph database. [Architecture diagram; recoverable labels: Data Sources (Linked (Open) Data, LD-Wrappers over CSV and XML sources), Crawler (HTTP Get, RDF), Integrator & Reasoner (Bulk Load), Graph Database, Visualisation & Analysis Tool (Graph Query, Results), User.]

This architecture has the following advantages: Large amounts of life science data are already published, directly or via LD wrappers, using such widely adopted access mechanisms and standard vocabularies [1]. A graph database is more schema-flexible than a relational database, i.e., if new data sources introduce new vertices, edges, and properties in the graph, no database administrator has to modify the schema. Linked Data makes it easy to add new data sources to the data warehouse by following new links to further objects on the Web. Implicit information can be derived in two ways: 1) by evaluating OWL axioms represented in RDF; for instance, semantics from the OWL 2 RL profile such as equality can be evaluated using rule engines, and 2) by ML algorithms that make use of ontological information, e.g., to discover similar genes. Also, graph databases are usually designed to efficiently process analytical operations over large graphs, i.e., they can be used to efficiently compute and write back results from ML algorithms.

3 Case Study of Discovering Similar Genes

In this section, we apply our approach to a use case for discovering similar genes from a plant [4]. Similarity is an important basis for other relationships. For instance, the effect of a drug depends on the genes it targets. If drugs target similar genes, they likely have similar effects. Relevant data sources for our prototype, HANA Linked Data AnnSim (HLA), are descriptions of genes⁴, gene annotations from experts⁵, and the Gene Ontology (GO) with a concept hierarchy⁶.
Using OpenRefine with the RDF extension, we translate the former two sources to RDF and reuse links from the GO RDF representation. Crawling such data results in one RDF graph with genes, concepts, and annotations between genes and concepts.

As its graph database, HLA uses HANA Graph, an extension to the HANA in-memory database for storing and querying property graphs [6]. Graphs are logically stored in HANA using two (virtual) tables: one table for vertices and one table for edges, each with columns for an id and every possible property. Graph queries over HANA Graph are issued using the so-called GEM language and are translated to SQL queries over the two tables. Being based on a column-oriented, in-memory database, HANA Graph allows fast query processing.

An importer program then maps the crawled RDF graph to a property graph and bulk loads the property graph into HANA Graph. Intuitively, the importer generates for every triple two vertices for the subject and object (if not yet existing) and an edge for the predicate. HANA Graph then contains genes (e.g., AT5G23810) and concepts (e.g., Amino Acid Transport) as vertices, and relationships between genes and concepts as edges. For instance, there are annotation relationships between genes and concepts as well as is-a relationships between concepts. Vertices and edges can have properties, e.g., a concept has a textual description.

The graph is then extended with edges between genes describing their similarity, and edges between concepts describing their distance in the is-a concept hierarchy. We compute this information with an existing algorithm, AnnSim [4]. AnnSim makes use of the distances between the concepts of two genes. Intuitively, the shorter the average path between any two concepts of two genes, the more similar the two genes. Both the computed distances and the computed similarities are written back to HANA Graph as edges between concepts and between genes, respectively.
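The intuition behind the similarity computation can be illustrated with a toy Python sketch. It is not the AnnSim implementation: the concept names and hierarchy are hypothetical, the normalised concept distance (steps to the lowest common ancestor divided by the summed depths) and the aggregation 2·Σ sim / (|A| + |B|) over a 1-to-1 matching are assumptions made here, and a brute-force search over permutations stands in for a proper matching algorithm.

```python
from itertools import permutations

# Hypothetical toy is-a hierarchy: child -> parent (a tree, for simplicity).
PARENT = {
    "transport": "root",
    "binding": "root",
    "amino_acid_transport": "transport",
    "ion_transport": "transport",
}


def path_to_root(concept):
    """Return the concepts from `concept` up to the root, in order."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def concept_distance(a, b):
    """Distance in [0, 1]: steps from both concepts to their lowest common
    ancestor, divided by the summed depths of both concepts."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    lca = next(c for c in path_a if c in path_b)
    steps_to_lca = path_a.index(lca) + path_b.index(lca)
    depths = (len(path_a) - 1) + (len(path_b) - 1)
    return steps_to_lca / depths if depths else 0.0


def gene_similarity(concepts_a, concepts_b):
    """Similarity in [0, 1] of two genes given their annotated concepts.

    Brute force: try every 1-to-1 assignment of the smaller concept set to
    the larger one and keep the assignment with the highest total similarity.
    """
    small, large = sorted((list(concepts_a), list(concepts_b)), key=len)
    if not small:
        return 0.0
    best = max(
        sum(1.0 - concept_distance(s, l) for s, l in zip(small, perm))
        for perm in permutations(large, len(small))
    )
    return 2.0 * best / (len(small) + len(large))


print(gene_similarity(["amino_acid_transport"], ["ion_transport"]))
print(gene_similarity(["amino_acid_transport", "binding"], ["ion_transport"]))
```

The brute-force search is only feasible for a handful of concepts per gene; for realistic annotation sets a polynomial-time matching algorithm would be needed.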
Every similarity or distance edge carries as a property a numeric value between 0 and 1 and, in case several different algorithms are used, the name of the algorithm.

⁴ ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_functional_descriptions
⁵ ftp://ftp.arabidopsis.org/home/tair/Ontologies/Gene_Ontology/ATH_GO_GOSLIM.txt
⁶ http://purl.obolibrary.org/obo/go.owl

As also shown in a screencast on our paper website⁷, the user of HLA gets an overview of genes and their similarities to other genes, can zoom into single genes to see textual descriptions of concepts, and can visit concepts along the concept hierarchy (Figure 2). Also, the user can ask for a graph view showing similarities between genes based on distances between concepts (Figure 3). For the visualisation, we used a visualisation engine called Symbiosis that can be configured with a JSON-based template language to visualise a graph. Symbiosis uses HANA Graph and GEM for querying the data.

Fig. 2. Zoom-in to single gene (left) and displayed concept hierarchy (right)

Fig. 3. Graph view of genes with smaller distances in darker color

4 Lessons Learned

To draw preliminary lessons about the applicability of our approach, we compared our HLA system with an implementation of AnnSim by Palma et al. (AnnSim 1.0) [4] for computing pair-wise similarities of 20 genes from the taxonomic class 1-aaap. For both approaches, we use a workstation with an Ubuntu 14.04 VM on W7 (Intel Core i5-3360M CPU at 2.80 GHz, 16 GB RAM) to execute the program logic. In addition, for HLA, we use a server with SUSE Linux Enterprise Server 11.1, 500 GB RAM, and 80 cores to host HANA Graph. Table 1 compares the two approaches. HLA uses the same data sources as AnnSim 1.0 but considers all contained information and the newest versions.

Table 1. Available relevant data for HLA and AnnSim 1.0

Approach     Size of data   Triples     Vertices   Edges
HLA          537 MB         7,337,447   601,519    1,658,322
AnnSim 1.0   2.80 MB        -           39,209     74,123

Correct Computation of Similarities. The mean squared error between the results of HLA and AnnSim 1.0 is 0.09. We expect this difference to be due to the newer, possibly more elaborate versions of the annotations and of GO (version 1.2) used by HLA. We compared the results of both approaches with the gold standard, a similarity metric based on the DNA sequence of genes (SeqSim). The mean squared error between HLA and SeqSim (0.19) is lower than between AnnSim 1.0 and SeqSim (0.36), indicating that AnnSim similarities improve with newer data sources; yet, further experiments are needed to confirm this claim.

⁷ http://people.aifb.kit.edu/bka/hla/

Scalable Computation of Similarities. Table 2 gives an overview of the time for the different steps in the execution. The time for downloading data is estimated for a connection with 6.7 Mbps download speed. Although HLA takes considerably longer than AnnSim 1.0, we argue that HLA's bottlenecks can be resolved and that HLA is more promising for larger datasets.

Table 2. Elapsed query processing time (in sec) for computing similarities of 20 genes

Approach     Prepare   Download   Map     Load    Compute   Read      Write
             Sources   Data       Graph   Graph   AnnSim    Queries   Queries
HLA          < 120     641        355     15      2,667     230       2,202
AnnSim 1.0   N/A       3          0       0       408       -         -

AnnSim 1.0 uses a proprietary graph data format with reduced information that is probably fast to generate (Prepare), download, and load. HLA uses a more verbose but also more expressive graph model (RDF) and has to generate (Prepare), download, transform to a property graph (Map), and load 15 times more vertices, 22 times more edges, and comprehensive properties such as textual descriptions. Loading graph data into HANA Graph proved fast, and we believe the preprocessing steps can be optimised by parallelisation.
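The size factors quoted above follow directly from Table 1, and the relative compute time follows from the "Compute AnnSim" column of Table 2. The short Python check below recomputes them; it is only an arithmetic illustration, and the variable names are chosen here.

```python
# Counts copied from Table 1.
GRAPH_SIZE = {
    "HLA": {"vertices": 601_519, "edges": 1_658_322},
    "AnnSim 1.0": {"vertices": 39_209, "edges": 74_123},
}
# "Compute AnnSim" times in seconds, copied from Table 2.
COMPUTE_SEC = {"HLA": 2_667, "AnnSim 1.0": 408}

vertex_ratio = GRAPH_SIZE["HLA"]["vertices"] / GRAPH_SIZE["AnnSim 1.0"]["vertices"]
edge_ratio = GRAPH_SIZE["HLA"]["edges"] / GRAPH_SIZE["AnnSim 1.0"]["edges"]
compute_ratio = COMPUTE_SEC["HLA"] / COMPUTE_SEC["AnnSim 1.0"]

print(f"vertices: {vertex_ratio:.1f}x")
print(f"edges:    {edge_ratio:.1f}x")
print(f"compute:  {compute_ratio:.1f}x")
```

The ratios come out at roughly 15.3 for vertices and 22.4 for edges, consistent with the rounded factors in the text, while the compute step of HLA takes about 6.5 times as long as that of AnnSim 1.0.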
AnnSim 1.0 uses program logic in C/C++ over arrays to compute the 400 similarities and displays the results to users. HLA loads the relevant data into a graph database and uses program logic in Java to issue database queries that compute the similarities and write the results back to the data warehouse. The query language GEM was useful and intuitive for graph-traversal queries. For instance, the following GEM read query is issued using a special-type function WIPE() to the SQL interface of HANA. It recursively visits one or more edges of type rdfs:subClassOf and returns a vertex table with all ancestors of a GO concept:

    RESULT uri:myResult FROM { GO:0005634 }-[@core:type = 'rdfs:subClassOf']->(1,*);

The program logic in HLA spent more than 90% of its time computing a specific part of AnnSim, a min-weight perfect matching (Blossom IV). We believe we can optimise the Blossom IV execution, e.g., by running part of it directly in HANA Graph via built-in and user-defined functions. Writing the results back to the data warehouse took a long time (2,202 sec in Table 2) because it was done using single write queries instead of a bulk load. Since read queries to HANA Graph proved fast, we expect HLA to also scale to larger datasets, in contrast to AnnSim 1.0, which does not outsource bulk loading, reading, and writing to an external database. Computing additional information can be done offline. Interactive visualisations over HANA Graph were possible using the Symbiosis engine.

Flexible Computation and Visualisation of Similarities. Whereas AnnSim 1.0 was implemented specifically for the problem of efficiently computing similarities of objects described in a proprietary format, HLA uses Linked Data as a unified data model and standard access mechanism. New data sources can be added to HLA by providing more links to crawlable Linked Data.
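To make the recursive traversal expressed by the GEM read query above concrete, the following Python sketch collects all ancestors of a concept from an edge table by following one or more rdfs:subClassOf edges. The edge rows and concept identifiers are hypothetical, and the code illustrates only the traversal semantics, not how HANA Graph evaluates GEM.

```python
from collections import deque

# Hypothetical edge table rows: (source vertex, edge type, target vertex).
EDGES = [
    ("GO:C", "rdfs:subClassOf", "GO:B"),
    ("GO:B", "rdfs:subClassOf", "GO:A"),
    ("GO:C", "rdfs:subClassOf", "GO:D"),
    ("GENE:1", "annotatedWith", "GO:C"),
]


def ancestors(start, edges, edge_type="rdfs:subClassOf"):
    """Return every vertex reachable from `start` via one or more edges
    of the given type (breadth-first traversal)."""
    parents = {}
    for source, etype, target in edges:
        if etype == edge_type:
            parents.setdefault(source, []).append(target)

    seen = set()
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        for parent in parents.get(vertex, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(ancestors("GO:C", EDGES)))
```

Edges of other types, such as the annotation edge in the example rows, are ignored by the traversal, which mirrors the edge-type filter in the query.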
We believe that efforts such as Bio2RDF [1] to release life science Linked Data will make it possible to semi-automatically resolve semantic conflicts using OWL semantics and rules.

Other objects such as patients can be compared in HLA; AnnSim only requires objects to be annotated with concepts and concepts to be described in an is-a hierarchy. Algorithms that use other relationships and derive other information can be added to HLA. The Symbiosis engine showed that, given sufficient understanding of the domain experts' problem, flexible visualisations over a graph-based data model can be provided with little effort (5 to 10 hours of manual work).

5 Related Work

According to Haussler et al. [3], a Million Genome Warehouse has to proactively process relevant data in data analysis pipelines to draw valid and useful medical inferences. HLA accesses the Gene Ontology and computes AnnSim [4], yet it can be extended with other biomedical ontologies and other semantic similarity measures [5]. HLA uses the HANA Graph in-memory database [6] but may also use other graph databases such as Graphium. Callahan and Dumontier [2] present an approach to represent and evaluate scientific hypotheses based on RDF and SPARQL.

6 Conclusions

In this work-in-progress paper, we illustrated in a small case study of discovering similar genes the potential of modular access mechanisms with Linked Data, queries over a schema-flexible graph database, and semantic algorithms to derive new information. Continuously adding new data sources and data items, new algorithms, and new visualisations remains exciting future work.

References

1. Callahan, A., Cruz-Toledo, J., Ansell, P., Dumontier, M.: Bio2RDF Release 2: Improved Coverage, Interoperability and Provenance of Life Science Linked Data. In: The Semantic Web: Semantics and Big Data (2013)
2. Callahan, A., Dumontier, M.: Evaluating Scientific Hypotheses Using the SPARQL Inferencing Notation. In: The Semantic Web: Research and Applications (2012)
3. Haussler, D., Patterson, D., Diekhans, M., Fox, A., Jordan, M., Joseph, A., Ma, S., Paten, B., Shenker, S., Sittler, T., Stoica, I.: A Million Cancer Genome Warehouse. Tech. rep., University of California at Berkeley (2012)
4. Palma, G., Vidal, M.E., Haag, E., Raschid, L., Thor, A.: Measuring Relatedness Between Scientific Entities in Annotation Datasets. In: International Conference on Bioinformatics, Computational Biology and Biomedical Informatics (2013)
5. Pesquita, C., Faria, D., Falcão, A.O., Lord, P., Couto, F.M.: Semantic Similarity in Biomedical Ontologies. PLOS Computational Biology 5(7) (2009)
6. Vasilyeva, E., Thiele, M., Bornhövd, C., Lehner, W.: Leveraging Flexible Data Management with Graph Databases. In: First International Workshop on Graph Data Management Experiences and Systems (2013)