Semantic Knowledge Graph Embeddings for biomedical Research: Data Integration using Linked Open Data

Jens Dörpinghaus (1,2), Marc Jacobs (1)

(1) Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany
(2) jens.doerpinghaus@scai.fraunhofer.de

Abstract. Knowledge Graphs are becoming a key instrument for biomedical knowledge discovery and modeling. These approaches rely on structured data, e.g. about related proteins or genes, and form cause-and-effect networks or, if enriched with literature data and other linked data sources, knowledge graphs. A key problem in the analysis of these graphs is the missing context. Here we present a novel semantic approach towards a context-enriched Knowledge Graph for biomedical research utilizing data integration with linked data. The result is a general graph concept that can be used for graph embeddings in different contexts or layers.

1 Introduction

Biological and medical researchers who consider computational approaches rely on structured data, e.g. about related proteins or genes, see [9]. Cause-and-effect networks are a special subtype of more general Knowledge Graphs. In principle, the integration of external data sources and manually curated data is key. Although several commercial solutions exist, Fakhry et al. state that the “adoption and extension of such methods in the academic community has been hampered by the lack of freely available, efficient algorithms and an accompanying demonstration of their applicability using current public networks” [4].

This and the emerging improvements in large-scale Knowledge Graphs and machine learning approaches are the motivation for our novel approach to semantic Knowledge Graph embeddings for biomedical research utilizing data integration with linked open data. Several similar approaches (often in the context of drug repurposing) have been described, such as Bio2RDF [2], hetionet [6], or Open PHACTS [5].
Our approach focuses more on integrating the literature itself into a FAIR [10] and open knowledge graph which is also accessible from a public resource: SCAIView (https://www.scaiview.com/). SCAIView is an information retrieval system that allows semantic searches in large textual collections by ontological representations of automatically recognized biological entities [7].

Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

[Figure 1: "Knowledge Graphs: Context Embeddings for Knowledge Discovery and Data Mining on biomedical data". The figure shows a stack of layers between macro-context and micro-context: Layer n (e.g. Document Layer), Layer n-1 (e.g. Mechanism Layer, BEL), ..., Layer 1 (e.g. Molecular Layer), Layer 0 (Data Integration). Further labels: Data Uplifting (Data Mining), Data Integration. Text panels:
Information Highway: To allow easy and FAIR access to the data and to benefit from semantic graph queries, we provide users with a SPARQL and Cypher frontend.
Knowledge Graph: The Knowledge Graph is a key concept connecting unstructured knowledge entities into a network structure. It allows easy data integration and data mining utilizing graph-theoretical algorithms and technologies.
Data Layers: Data within the Knowledge Graph can be ordered according to context and information into data layers (e.g. a molecular or mechanism layer). This helps to examine novel causal connections and context.
Knowledge Foundation: Adding more data will increase the Knowledge Foundation, gives a more precise view on the micro-context and helps to unveil new context and insights.]

Fig. 1. Illustration of the knowledge graph embedding between different layers. Here, every layer corresponds to a context defining new contexts on several other layers. Thus layers and contexts are flexible and can be defined in a feasible way for every application.
The basis for generating our large-scale Knowledge Graph representation is the biomedical literature, e.g. MedLine and PubMed (see https://www.ncbi.nlm.nih.gov/pubmed/). These articles or abstracts are the source for the biological relations mentioned above. In addition, meta information like authors, journals, keywords (so-called MeSH terms, Medical Subject Headings), etc. is freely available. Ontologies can be used to contextualize entities in the Knowledge Graph, providing biological or medical relations (cf. OLS, https://www.ebi.ac.uk/ols/index). Every ontology will form another knowledge (sub-)graph.

Using methods of natural language processing (NLP) and text mining, we can combine and link these knowledge graphs into a large and very dense new knowledge graph. This meets a very general definition of context: we can see every knowledge (sub-)graph as context to another. Biological expressions are context of the corresponding literature, authors are context of a text, and named entities from ontologies found in a text are context to it or to the corresponding biological expressions.

Our overarching integration schema is based on the Biological Expression Language (BEL, www.openbel.org), which is widely applied in the biomedical domain to convert unstructured textual knowledge into a computable form. The BEL statements that form knowledge graphs are semantic triples that consist of concepts, functions and relationships. Thus they can easily be added to a knowledge graph representing another layer or context. An example of a large Alzheimer's disease network can be found in [8].

In the next section we describe the novel concept of semantic graph embeddings within large-scale Knowledge Graphs. We will present several use cases and application examples as well as the semantic interoperability layer using RDF and SPARQL.
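To make the notion of a BEL statement as a semantic triple more concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes simple, non-nested statements of the form "subject relation object" and covers only a handful of BEL short-form relation symbols; the names `Triple`, `RELATIONS` and `bel_to_triple` are hypothetical and introduced here for illustration only. The example statement is the one shown in Fig. 2.

```python
import re
from typing import NamedTuple

# A subset of BEL short-form relation symbols and their long names.
RELATIONS = {
    "->": "increases",
    "=>": "directlyIncreases",
    "-|": "decreases",
    "=|": "directlyDecreases",
    "--": "association",
}


class Triple(NamedTuple):
    """Hypothetical container for one (subject, relation, object) triple."""
    subject: str
    relation: str
    obj: str


# Subject and object are kept as opaque BEL terms; only the relation
# symbol between them (surrounded by whitespace) is interpreted.
_PATTERN = re.compile(r"^(?P<s>.+?)\s+(?P<r>->|=>|-\||=\||--)\s+(?P<o>.+)$")


def bel_to_triple(statement: str) -> Triple:
    """Split a simple BEL statement into a semantic triple."""
    match = _PATTERN.match(statement.strip())
    if match is None:
        raise ValueError(f"not a simple BEL statement: {statement!r}")
    return Triple(match["s"], RELATIONS[match["r"]], match["o"])


triple = bel_to_triple("act(p(HGNC:KLC1)) => p(HGNC:MAPT)")
print(triple)
# Triple(subject='act(p(HGNC:KLC1))', relation='directlyIncreases', obj='p(HGNC:MAPT)')
```

Each such triple can then be stored as one edge between two nodes of the knowledge graph. Nested statements, annotations and evidence lines would require a full BEL parser and are deliberately left out of this sketch.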
2 Knowledge Graph architecture

A Knowledge Graph is a systematic way to connect information and data to represent common knowledge. As described above, context is the most important ingredient for generating knowledge or even wisdom. We define a knowledge graph $G = (E, R)$ with entities $e \in E$ coming from a formal structure such as an ontology $O$, see [1] and [11]. The relations $r \in R$ can be ontology relations; thus, in general, every ontology $O$ that is part of the data model is a subgraph of $G$, that is, $O \subseteq G$. In addition, we allow inter-ontology relations between two nodes $e_1, e_2$ with $e_1 \in O_1$, $e_2 \in O_2$ and $O_1 \neq O_2$. More generally, we define $R = \{R_1, ..., R_n\}$ as a list of ontologies, terminologies or any sort of controlled vocabulary, with or without relations.

We define contexts $C = \{c_1, ..., c_m\}$ as a finite, discrete set. Every node $v \in G$ and every edge $r \in R$ may have one or more contexts $c \in C$, denoted by $con(v)$ or $con(r)$. It is also possible to set $con(v) = \emptyset$. Thus we have a mapping $con : E \cup R \to \mathcal{P}(C)$. If we use a quite general approach towards context, we may set $C = E$. Thus every inter-ontology relation defines the context of two entities, but the relations within an ontology can also be seen as context, see Fig. 1 for an illustration. Here every context is identified with a layer (e.g. a document layer, a molecular layer, a mechanism layer, ...).

This allows new connections between different contexts or layers: if two edges $e_1, e_2 \in R_1$ are connected and $e'_1, e'_2 \in R_2$ with $con(e_1) = e'_1$ and $con(e_2) = e'_2$ are not connected, we may add another edge $(e_1, e_2)$ with provenance information stating that this connection comes from a different context, namely $R_2$. Since every layer or context can be seen as a subgraph forming a surface, we call the relation between two layers a knowledge graph embedding.
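The definitions above can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the data model used in the actual system: the class `KnowledgeGraph` and its methods are hypothetical, relations are stored as plain directed pairs, and the layer names are chosen to match the examples in Fig. 1. Using a document identifier as a context of a relation illustrates the general case $C = E$.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    """Toy model of G = (E, R) with a context mapping con: E ∪ R -> P(C)."""
    entities: set = field(default_factory=set)
    relations: set = field(default_factory=set)   # directed pairs (e1, e2)
    contexts: dict = field(default_factory=dict)  # entity or relation -> set of contexts

    def add_entity(self, e, contexts=()):
        self.entities.add(e)
        self.contexts.setdefault(e, set()).update(contexts)

    def add_relation(self, e1, e2, contexts=()):
        # Both endpoints become entities if they are not already known.
        self.add_entity(e1)
        self.add_entity(e2)
        self.relations.add((e1, e2))
        self.contexts.setdefault((e1, e2), set()).update(contexts)

    def con(self, item):
        """Return con(item); the empty set means no context is annotated."""
        return frozenset(self.contexts.get(item, ()))

    def layer(self, context):
        """All entities and relations annotated with the given context."""
        return {item for item, cs in self.contexts.items() if context in cs}


# Illustrative data only; layer names and annotations are assumptions.
g = KnowledgeGraph()
g.add_entity("HGNC:MAPT", contexts={"molecular"})
g.add_entity("HGNC:KLC1", contexts={"molecular"})
g.add_entity("PMID:22272245", contexts={"document"})
# With C = E, an entity (here a document) can itself serve as a context.
g.add_relation("HGNC:KLC1", "HGNC:MAPT", contexts={"mechanism", "PMID:22272245"})

print(g.con(("HGNC:KLC1", "HGNC:MAPT")))
print(g.layer("molecular"))
```

Keeping the context mapping separate from the entity and relation sets mirrors the formal definition and allows an item to belong to several layers at once.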
It is also possible to get the context of a subgraph $R_i \subseteq G$, which can be denoted by $con(R_i)$ or, in the notation of graph theory, as the extended induced subgraph of the vertex set $E_i$ from $R_i$, given by $G_c[E_i]$. This is quite trivial if context from $R_i$ can only be annotated to vertices in $G$. Then

$$G_c[E_i] = G[E_i] \cup \{(e, e') \ \forall e' \in N(e),\ e \in E_i\}$$

Here $con|_{E_i} = G_c[E_i]$ is the context of $E_i$ restricted to the set of edges (relations) in the graph. The two edges $e', e''$ are implicitly given by this context. It is easy to see that the restriction on context annotated to edges makes the problem easier from a computational perspective. Nevertheless, context on edges is needed from a real-world perspective.

The technical design follows the microservice architecture of SCAIView [3]. We offer both a REST API and a Java Message Service (JMS) interface. As a database backend, we use Neo4j, with Spring Data Neo4j to map objects to graphs. Thus our software can be used to perform Cypher and SPARQL queries. Data can be retrieved in JSON Graph Format or RDF format.
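The extended induced subgraph $G_c[E_i]$ defined above can be sketched in a few lines of Python. This is one possible reading of the formula, assuming an undirected graph given as a list of vertex pairs and context annotated to vertices only; the function name `extended_induced_subgraph` and the toy graph are hypothetical.

```python
def extended_induced_subgraph(edges, vertex_set):
    """Return the edge set of G_c[E_i] for an undirected graph.

    edges:      iterable of 2-tuples (u, v)
    vertex_set: the vertices E_i

    G[E_i] keeps the edges with both endpoints in E_i; the extension adds
    every edge (e, e') with e in E_i and e' a neighbour of e.
    """
    vertex_set = set(vertex_set)
    # Induced part: both endpoints lie in E_i.
    induced = {frozenset(edge) for edge in edges if set(edge) <= vertex_set}
    # Extension: at least one endpoint lies in E_i.
    extension = {frozenset(edge) for edge in edges if set(edge) & vertex_set}
    return induced | extension


# Toy path graph A - B - C - D - E with E_i = {B, C}.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
result = extended_induced_subgraph(edges, {"B", "C"})
print(sorted(sorted(edge) for edge in result))
# [['A', 'B'], ['B', 'C'], ['C', 'D']]
```

The induced part is a subset of the extension, so computing both is redundant; they are kept separate here only to mirror the two terms of the formula. The edge D - E is excluded because neither endpoint belongs to $E_i$.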
[Figure 2: three graph visualisations, panels (a), (b) and (c). Recoverable node labels include HGNC, GOBP, MESHD, CHEBI and dbSNP entities as well as subgraph and state nodes; recoverable edge labels include hasBel, BELRelation, hasFunction, hasSubgraph, hasDisease, hasCondition, hasEvidence, hasMeSHAnatomy and hasSTATEMENT_GROUP.]

Fig. 2. (a) This is an illustration of the context found for the BEL statement act(p(HGNC:KLC1)) => p(HGNC:MAPT) (found on the bottom of the graph). Both HGNC terms have an evidence in two different documents (purple) and both form a relation in another document (PMID:22272245 in the middle). The green nodes form a manually curated context (e.g. “Normal Healthy State” or “Tau protein subgraph”).
All HGNC entities are connected to other HGNC elements, documents and functions. (b) This is an illustration of the context of a single document (purple, left). (c) This is an illustration of the context of a context (green, left).

3 Application

The initial research question was how a general context could be added to biomedical knowledge graphs in order to answer generic questions according to context, e.g. time, location or biological layer. We have integrated subsets of PubMed data, several ontologies such as GO, HGNC and MGI, mappings, BEL networks for Parkinson's and Alzheimer's disease, as well as data obtained from KEGG. See Fig. 2 for some illustrations of different context layers. For example, semantic questions can be formulated as subgraph structures of the initial knowledge graphs. We may think of complex examples, e.g. “Give me all pathways from protein A to B in the context of disease C focusing on clinical trials”.

Hypothesis generation in medical research and digital health may involve searching for genomic or molecular patterns, supporting diagnosis, or building longitudinal models that form the basis for a multitude of ML and AI approaches in predictive and personalised medicine. This information system can be used to retrieve data by context (cohort size, settings, results, ...) and by content (imaging data, genomic or molecular measures, ...). For example, this system may answer questions like “Give me a clinical trial to reproduce my results or to apply my model” or “Give me literature for phenotype A, disease B, age between C and D and a CT scan with characteristic E”.

4 Conclusion

Here we presented a novel approach that annotates research data with context information. The result is a knowledge graph representation of data, the context graph. It contains computable statement representations (e.g. RDF or BEL).
This graph allows research data records from different sources to be compared and relevant data sets to be selected using graph-theoretical algorithms.

References

1. Guidelines for the construction, format, and management of monolingual controlled vocabularies. Standard, National Information Standards Organization, Baltimore, Maryland, U.S.A. (2005)
2. Belleau, F., Nolin, M.A., Tourigny, N., Rigault, P., Morissette, J.: Bio2RDF: towards a mashup to build bioinformatics knowledge systems. Journal of Biomedical Informatics 41(5), 706–716 (2008)
3. Dörpinghaus, J., Klein, J., Darms, J., Madan, S., Jacobs, M.: SCAIView – a semantic search engine for biomedical research utilizing a microservice architecture. In: Proceedings of the Posters and Demos Track of the 14th International Conference on Semantic Systems - SEMANTiCS2018 (2018)
4. Fakhry, C.T., Choudhary, P., Gutteridge, A., Sidders, B., Chen, P., Ziemek, D., Zarringhalam, K.: Interpreting transcriptional changes using causal graphs: new methods and their practical utility on public networks. BMC Bioinformatics 17(1), 318 (2016)
5. Harland, L.: Open PHACTS: A semantic knowledge infrastructure for public and commercial drug discovery research. In: International Conference on Knowledge Engineering and Knowledge Management. pp. 1–7. Springer (2012)
6. Himmelstein, D.S., Lizee, A., Hessler, C., Brueggeman, L., Chen, S.L., Hadley, D., Green, A., Khankhanian, P., Baranzini, S.E.: Systematic integration of biomedical knowledge prioritizes drugs for repurposing. eLife 6, e26726 (2017)
7. Hodapp, S., Madan, S., Fluck, J., Zimmermann, M.: Integration of UIMA Text Mining Components into an Event-based Asynchronous Microservice Architecture. In: Proceedings of the LREC 2016 Workshop “Cross-Platform Text Mining and Natural Language Processing Interoperability”. pp. 19–23. European Language Resources Association (ELRA), Portorož, Slovenia (2016)
8. Kodamullil, A.T., Younesi, E., Naz, M., Bagewadi, S., Hofmann-Apitius, M.: Computable cause-and-effect models of healthy and Alzheimer's disease states and their mechanistic differential analysis. Alzheimer's & Dementia 11(11), 1329–1339 (2015)
9. Martin, F., Sewer, A., Talikka, M., Xiang, Y., Hoeng, J., Peitsch, M.C.: Quantification of biological network perturbations for mechanistic insight and diagnostics using two-layer causal models. BMC Bioinformatics 15(1), 238 (2014)
10. Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.W., da Silva Santos, L.B., Bourne, P.E., et al.: The FAIR guiding principles for scientific data management and stewardship. Scientific Data 3 (2016)
11. Zeng, M.: Knowledge organization systems (KOS) 35, 160–182 (01 2008)