openCypher over RDF: Connecting Two Worlds

Michael Schmidt*, Brad Bebee, Willem Broekema, Mohamed Elzarei, Carlos Manuel Lopez Enriquez, Marcin Neyman, Florian Schmedding, Andreas Steigmiller, Bryan Thompson, Geo Varkey, Gregory Todd Williams and Amanda Xiang

Amazon Neptune Team, Amazon Web Services, Seattle, WA

Abstract
Today’s graph database space is divided into two largely separate technology stacks: Labeled Property Graphs (LPGs) and RDF. As the team working on Amazon Neptune, a graph database that supports both technologies, we aim to give our customers the choice and flexibility they need to address their graph use cases. What we learned when working with them on their graph applications – from social networks, recommendations, and fraud detection to Knowledge Graphs and LLM grounding – is that the two technology stacks each have their unique strengths. RDF, with its standardized serialization formats, global identifiers, and the availability of Linked Open Data sets, is of particular value for data architects who seek to build, integrate, and interchange graph data. Application development teams, on the other hand, often prefer LPG query languages to interact with graphs, due to their intuitive syntax, the maturity of developer ecosystems (client drivers, programming language integration, etc.), and graph-specific features such as built-in support for path extraction and algorithms. In this demo we will present our recent work on openCypher over RDF, which aims to combine the strengths of the two worlds by (a) allowing our customers to load LPG and RDF data into a single, connected graph and (b) querying this single graph using the openCypher query language. From a conceptual perspective, this functionality is achieved by an overarching graph metamodel called OneGraph, which encompasses both data models and provides LPG- and RDF-specific views that implicitly define query semantics.
We will use open data sets to showcase how we combine LPG and RDF data into a single data graph, demonstrate how this unified graph can be queried and modified using openCypher, and discuss concepts, design decisions, as well as remaining challenges in aligning the LPG and RDF stacks.

Posters, Demos, and Industry Tracks at ISWC 2024, November 13–15, 2024, Baltimore, USA
* Corresponding author: schmdtm@amazon.com.
© 2024 Copyright for this paper by Amazon Web Services. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction
The Semantic Web stack, which comes with the W3C standardized Resource Description Framework (RDF) as its foundational data model, originates from the vision of enhancing the Web with a globally connected, machine-understandable network of knowledge [1]. RDF decomposes data into (subject, predicate, object) triples that build upon globally unique identifiers for resources, so-called IRIs [2]. This clears the way for data serialization, exchange, and interlinking at global scale. Consequently, companies often choose RDF when defining a broader, top-down information architecture that aims to connect data across heterogeneous domains and feeds multiple applications.

Labeled Property Graphs (LPGs), on the other hand, were driven by companies, Open Source initiatives like Apache TinkerPop, and non-profit organizations such as LDBC [3] alike. As a result, different flavors of the LPG data model and different query languages have emerged. As a common denominator, LPGs abstract graphs as vertices connected through edges, both of which can be described by properties (essentially, key-value pairs) and may be tagged with labels. The absence of built-in semantics and a looser set of constraints gives LPG users a larger degree of freedom, e.g. when it comes to questions like what labels are used for, or whether identifiers are globally unique vs.
application scoped. While this can be seen as a downside of LPGs for building and interlinking graphs in Open World scenarios, it lowers complexity and makes LPGs an attractive choice, especially for closed application scenarios.

Differences in coverage, usability, and expressiveness can also be found when comparing LPG and RDF query languages. SPARQL, the W3C standardized language for RDF, comes with concise semantics and is well-suited for declarative extraction of subgraphs via pattern matching. While it offers some unique features compared to LPG query languages – built-in federation across SPARQL endpoints, convenient querying of schema information, and reasoning support – it lacks other functionality that is commonly found in LPG query languages, most notably when it comes to path queries. The SPARQL 1.1 extensions introduced support for property paths, but important graph use cases (e.g., binding paths to variables and returning them as query output) remain hard, and in some cases impossible, to achieve. In contrast, LPG languages like Gremlin [4], openCypher [5], and GQL [6] treat paths as first-class citizens, with built-in data types to represent paths and user-defined functions to process them.

The key take-away is that both LPG and RDF have unique characteristics which, from an end user perspective, translate into pros and cons for choosing either of the two stacks. When users build graph applications today, they typically opt into one of the two worlds – a decision with profound consequences that is not only hard to make, but also expensive to revert if requirements change. To address this situation, we proposed OneGraph [7] as an initiative to bring the two worlds together and allow graph users to benefit from the best of each world.

2. OneGraph
OneGraph aims to achieve seamless interoperability between LPGs and RDF by giving users the ability to load and manage LPG and RDF data in a unified data graph.
It can be understood as a metamodel that comprises both RDF and LPGs. The conceptual architecture depicted in Fig. 1 illustrates the basic idea (components marked in bold font are in scope for this demo). When loading data, LPG and RDF are both mapped into OneGraph’s internal data model. This facilitates the co-existence of (possibly interlinked) data from both stacks inside the same database. The idea of mapping both LPG and RDF into an overarching model, which comes with the merit of a lossless representation for both formats, is conceptually different from the majority of prior work in this space, which mostly focused on exploring direct mappings at the data model layer [8, 9] or the query language layer (e.g., [10, 11, 12]). Approaches that propose a unifying metamodel like OneGraph were only introduced recently, either developed in parallel to OneGraph (such as multilayer graphs [13]) or directly inspired by OneGraph (e.g., statement graphs [14]).

Figure 1: Conceptual overview of OneGraph and how it is used in Neptune Analytics

While a detailed formalization of the OneGraph model is out of scope for this demo, the basic idea behind OneGraph is to define data model agnostic elements that both LPG and RDF data can be mapped to. As an example, two such element types are relationship statements, which represent links between resources, and property statements, which describe properties of resources. Relationship statements, for instance, arise from the RDF-to-OneGraph mapping for RDF triples that have resources in object position, and also from the LPG-to-OneGraph mapping for edges in property graphs. Similarly, RDF triples with literals in object position are mapped to property statements, and so are property graph vertex properties.

On top of its internal format, OneGraph then allows querying the unified graph using an LPG or RDF query language of choice.
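The two mappings described above can be illustrated with a minimal Python sketch. It is not the OneGraph formalization or the Neptune Analytics implementation: all class and function names (Iri, Lri, Literal, RelationshipStatement, PropertyStatement, from_rdf_triple, from_lpg_edge, from_lpg_vertex_property) as well as the example.org identifiers and the vertex ids are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical identifier and value types (names are illustrative only).
@dataclass(frozen=True)
class Iri:
    value: str  # globally scoped identifier, as found in RDF data


@dataclass(frozen=True)
class Lri:
    value: str  # local resource identifier, as found in LPG data


@dataclass(frozen=True)
class Literal:
    value: object  # RDF literal value


Resource = Union[Iri, Lri]


# Hypothetical, data model agnostic statement types.
@dataclass(frozen=True)
class RelationshipStatement:
    source: Resource
    label: str
    target: Resource


@dataclass(frozen=True)
class PropertyStatement:
    subject: Resource
    key: str
    value: object


def from_rdf_triple(s: Iri, p: Iri, o: Union[Iri, Literal]):
    """Literal in object position -> property statement, otherwise relationship."""
    if isinstance(o, Literal):
        return PropertyStatement(s, p.value, o.value)
    return RelationshipStatement(s, p.value, o)


def from_lpg_edge(source_id: str, label: str, target_id: str):
    """An LPG edge becomes a relationship statement between two local ids."""
    return RelationshipStatement(Lri(source_id), label, Lri(target_id))


def from_lpg_vertex_property(vertex_id: str, key: str, value: object):
    """An LPG vertex property becomes a property statement."""
    return PropertyStatement(Lri(vertex_id), key, value)


# Made-up sample data: two RDF triples, one LPG edge, one LPG vertex property.
statements = [
    from_rdf_triple(Iri("http://example.org/SEA"),
                    Iri("http://example.org/locatedIn"),
                    Iri("http://example.org/Seattle")),
    from_rdf_triple(Iri("http://example.org/SEA"),
                    Iri("http://example.org/name"),
                    Literal("Seattle-Tacoma")),
    from_lpg_edge("v1", "route", "v2"),
    from_lpg_vertex_property("v1", "code", "SEA"),
]

for statement in statements:
    print(type(statement).__name__, statement)
```

In this sketch, both the RDF triple with a resource in object position and the LPG edge end up as the same kind of statement, which is what allows queries to operate on the statements regardless of which source format produced them.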
Conceptually, this is achieved through logical views – virtually defined mappings from the OneGraph model back into either LPG or RDF that implicitly define the semantics of read queries for LPG and RDF languages. Relationship statements, for instance, are mapped to property graph edges, no matter whether they originated from RDF or LPG data. To bridge the gap between identifiers, OneGraph also allows for the co-existence of IRIs (as globally scoped identifiers originating from RDF datasets) and so-called local resource identifiers (LRIs) originating from LPG data. In the RDF view, LRIs are exposed as IRIs that are prefixed with a system-local namespace; vice versa, in LPG languages full IRIs are identified through syntactic conventions that set them apart from LRIs (we provide an example below in Section 3).

3. Demo
In the demo, we sketch the life cycle of building, querying, and maintaining graphs composed from both LPG and RDF data using an interactive Jupyter notebook running over Neptune Analytics – our memory-optimized graph database engine – which implements OneGraph concepts to support openCypher over RDF. The demo scenario uses a mix of public LPG and RDF graphs: Air Routes¹, an LPG dataset describing airports and flight routes, contextualized with RDF graphs from GeoNames² and Wikidata³. We will highlight serialization formats (CSV for LPG, N-Triples for RDF) and showcase mechanisms to integrate data into a connected graph.

¹ Available at https://github.com/krlawrence/graph/tree/master/sample-data (last accessed: Sep 2, 2024).
² Available for download at https://www.geonames.org/ontology/documentation.html (last accessed: Sep 2, 2024).
³ See https://www.wikidata.org/ (last accessed: Sep 2, 2024).

The central part of the demo will be the interactive execution of openCypher queries against the unified graph. We will highlight that our design supports RDF querying purely via syntactic
conventions within openCypher, allowing users to reuse existing tooling (such as syntax validators) without modifications. A focus will be on the syntax to disambiguate local identifiers (as found in Air Routes) and global IRIs (stemming from GeoNames and Wikidata). In addition, we will discuss minor syntactic extensions to openCypher for prefixed IRI support as an opt-in feature for increased readability. To provide a simple example of an openCypher query that leverages these syntactic extensions, the following query matches Wikidata airports (where the prefixed Wikidata IRI entity::Q644371 identifies the class “Airport”) and returns their IRIs as well as their opening dates (identified through Wikidata property prop::P1619):

    PREFIX entity :
    PREFIX prop :
    MATCH (airport : entity::Q644371)
    RETURN id(airport) AS airportIri, airport.prop::P1619 AS openingDate

Beyond such simple queries that highlight syntactic aspects, our demo will also cover areas in which openCypher as a query language for RDF provides benefits over SPARQL. This includes (a) classes of openCypher path queries that cannot be expressed in SPARQL today, (b) support for composite types (e.g., lists and maps) and result transformations such as folding and unfolding, and (c) built-in support for stored procedures and graph analytics, which we demonstrate via queries that invoke graph algorithms like Breadth-First Search and PageRank.

In addition to read queries, we will include openCypher update queries, with a focus on interoperability aspects between LPG and RDF such as connecting LPG and RDF subgraphs, introducing edge properties over RDF data, or linking LPG subsets of the graph against RDF ontologies. Last but not least, on the conceptual side we will highlight the idea behind the OneGraph data model, how it helps Neptune Analytics to manage LPGs and RDF in a unified way, and discuss key challenges that we faced when designing openCypher over RDF.
We will make the demo publicly available to users who want to explore the functionality outside of the conference.

4. Conclusion and Future Work
We see the ability to manage integrated RDF and LPG graphs and query them with openCypher as a first step towards the broader vision behind OneGraph. Directions that we plan to explore in the future include extended feature coverage in openCypher for RDF (e.g., named graph and RDF blank node support), LPG-RDF interoperability for other query languages (like SPARQL, Gremlin, and GQL), as well as combining fragments from different query languages within a single query against OneGraph. We are also engaged in standardization activities that aim to bridge the RDF and LPG worlds. Examples include composite type support for RDF and SPARQL [15], which aims to close usability and expressiveness gaps between the RDF and LPG data models and query languages, as well as the RDF-star Working Group [16], which seeks to extend RDF with user-friendly reification mechanisms similar to LPG edge properties.

We believe that interoperability between LPGs and RDF contains an interesting set of open research challenges, especially for the Semantic Web community, and are convinced that researching and building technology to overcome the separation of the two worlds is a unique opportunity to accelerate the adoption of Semantic Web technology in industry. We would like to encourage researchers who are interested in this space to reach out to us.

References
[1] O. Lassila, J. Hendler, T. Berners-Lee, The Semantic Web, Scientific American 284 (2001) 34–43.
[2] M. Duerst, M. Suignard, RFC 3987: Internationalized Resource Identifiers (IRIs), 2005.
[3] P. Boncz, LDBC: Benchmarks for Graph and RDF Data Management, in: Proceedings of the 17th International Database Engineering & Applications Symposium, 2013, pp. 1–2.
[4] M. A.
Rodriguez, The Gremlin Graph Traversal Machine and Language (invited talk), in: Proceedings of the 15th Symposium on Database Programming Languages, 2015, pp. 1–10.
[5] A. Green, M. Junghanns, M. Kießling, T. Lindaaker, S. Plantikow, P. Selmer, openCypher: New Directions in Property Graph Querying, in: EDBT, 2018, pp. 520–523.
[6] A. Deutsch, N. Francis, A. Green, K. Hare, B. Li, L. Libkin, T. Lindaaker, V. Marsault, W. Martens, J. Michels, et al., Graph Pattern Matching in GQL and SQL/PGQ, in: Proceedings of the 2022 International Conference on Management of Data, 2022, pp. 2246–2258.
[7] O. Lassila, M. Schmidt, O. Hartig, B. Bebee, D. Bechberger, W. Broekema, A. Khandelwal, K. Lawrence, C. M. Lopez Enriquez, R. Sharda, et al., The OneGraph Vision: Challenges of Breaking the Graph Model Lock-in, Semantic Web 14 (2023) 125–134.
[8] H. Chiba, R. Yamanaka, S. Matsumoto, G2GML: Graph to graph mapping language for bridging RDF and property graphs, in: International Semantic Web Conference, Springer, 2020, pp. 160–175.
[9] S. Khayatbashi, S. Ferrada, O. Hartig, Converting property graphs to RDF: a Preliminary Study of the Practical Impact of Different Mappings, in: Proceedings of the 5th ACM SIGMOD Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA), 2022, pp. 1–9.
[10] R. Mutharaju, A SPARQL to Cypher Transpiler, Ph.D. thesis, Indraprastha Institute of Information Technology New Delhi, 2022.
[11] Z. Zhao, X. Ge, Z. Shen, C. Hu, H. Wang, S2CTrans: Building a bridge from SPARQL to Cypher, in: International Conference on Database and Expert Systems Applications, Springer, 2023, pp. 424–430.
[12] H. Thakkar, D. Punjani, J. Lehmann, S. Auer, Two for one: Querying property graph databases using SPARQL via gremlinator, in: Proceedings of the 1st ACM SIGMOD Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA), 2018, pp. 1–5.
[13] R. Angles, A.
Hogan, O. Lassila, C. Rojas, D. Schwabe, P. Szekely, D. Vrgoč, Multilayer graphs: A unified data model for graph databases, in: Proceedings of the 5th ACM SIGMOD Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA), 2022, pp. 1–6.
[14] E. Gelling, G. Fletcher, M. Schmidt, Bridging graph data models: RDF, RDF-star, and property graphs as directed acyclic graphs, arXiv preprint arXiv:2304.13097 (2023).
[15] O. Hartig, G. Williams, M. Schmidt, O. Lassila, C. M. L. Enriquez, B. Thompson, Datatypes for Lists and Maps in RDF Literals, in: European Semantic Web Conference, Springer, 2024.
[16] World Wide Web Consortium, RDF-star Working Group (2024). URL: https://www.w3.org/groups/wg/rdf-star/, last accessed: Sep 2, 2024.