Extracting Graphs from Tables via Conceptual Models SebastiΓ‘n Ferrada1 1 IMFD Chile & Data and Artificial Intelligence Initiative, Universidad de Chile, Beauchef 851, Santiago, Chile Abstract This poster presents initial progress on a mapping to convert relational databases into knowledge graphs by utilizing the conceptual model of the database as a means of capturing its underlying semantics. We leverage the ERDoc language for defining Entity-Relationship Diagrams, for which we provide semantics. Unlike previous approaches, this method assumes the conceptual model as part of the input and emphasizes the formal definition, semantic correctness, and other properties of the mapping. Keywords Data Mapping, Knowledge Graphs, Relational Databases, Conceptual Models 1. Introduction Knowledge graphs (KGs) model the data of a given domain as a set of entities or objects connected through a rich network of relationships [1]. Similarly, the conceptual model usually conceived when designing a relational database defines the types of entities that will inhabit the database and the types of relationships in which they can participate. As most data is stored in relational databases, we propose leveraging their conceptual model to capture the underlying semantics and define a mapping procedure that produces a KG from its data. Such a mapping can be useful to be able to apply richer querying [2] and analytics [3] over the mapped data and, further, the mere definition of the mapping can allow for a virtual graph view of the relational data which can be accessed employing query translation. In this poster, we present progress on the development of such a mapping that transforms a relational database into either an RDF/RDF-star graph or a property graph (PG), using the conceptual model of the input database. We leverage the ERDoc language [4] used to define Entity-Relationship Diagrams (ERDs) [5], which are a common way to design and communicate conceptual models. Our mapping, differently from Stoica et al. [6], is not direct (it requires extra input) but considers the semantics embedded in the conceptual model, yielding a semantically more accurate graph. For instance, in N-to-N relationships, [6] would create a node for each tuple in the table storing the relationship, whereas our approach translates such tuples directly to edges. Similarly, multivalued attributes would each be mapped to a node by [6], whereas our approach produces multivalued properties. Differently from Barret et al. [7], we assume that the conceptual model is part of the input of our mapping, and we shall focus on the formal definition of the mapping and its properties (e.g., information and query preserving [8]). Posters, Demos, and Industry Tracks at ISWC 2024, November 13–15, 2024, Baltimore, USA $ sebastian.ferrada@uchile.cl (S. Ferrada) Β€ https://sferrada.com (S. Ferrada)  0000-0002-9834-8376 (S. Ferrada) Β© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings 2. Conceptual Models The conceptual model, defined as ERDs by Chen [5], aims to provide a unified view of data that allows to interoperate the relational model [9] and the network model [10]. As such, it also can allow us to leverage it to interoperate with RDF graphs [11], RDF-star graphs [12] or PGs [13]. ERDoc is a scripting language designed to specify ERDs. The idea for our mapping is to receive an input database along with the ERDoc document that codifies its conceptual model. To formalize our mapping, we provide semantics to a fragment of ERDoc [4]. Definition 1. An Entity-Relationship Diagram (ERD) π’ž is a tuple (β„°, β„›) such that: β€’ β„° is a set of entities. An entity is an expression of the form E{𝐾1 , ... , 𝐾𝑛 }{𝐴1 , ... , π΄π‘š }, where E ∈ β„° is the name of the entity, {𝐾1 , ... , 𝐾𝑛 }, is the non-empty set of prime attribute names of E, and {𝐴1 , ... , π΄π‘š } is the possibly empty set of non-prime attribute names. β€’ β„› is a set of relationships. A relationship R⟨(E1 , 𝐢1 ), ... , (En , 𝐢𝑛 )⟩{𝐴1 , ... , π΄π‘š } is such that R ∈ β„› is the relationship name, E1 , ... , En , elements of β„°, are the participating entities, 𝐢1 , ... , 𝐢𝑛 are the cardinalities and participation constraints of each participating entity, and {𝐴1 , ... , π΄π‘š } is the possibly empty set of relationship attribute names. β€’ A weak entity W⟨(E1 , R1 ), ... , (El , Rl )⟩{𝑃1 , ... , 𝑃𝑛 }{𝐴1 , ... , π΄π‘š }, is such that W ∈ β„° is the name of the weak entity, Ei ∈ β„° for 𝑖 ∈ {1, ... , l} are the names of the entities that W depends on, Ri ∈ β„› for 𝑖 ∈ {1, ... , l} are the names of the relationships through which W depends on each Ei , {𝑃1 , ... , 𝑃𝑛 } is the non-empty set of attribute names that are part of the partial key of W, {𝐴1 , ... , π΄π‘š } is the possibly empty set of non-prime attribute names. An example of an ERD can be found in Figure 1b, where Person and Bank are Entities, Account is a Weak Entity depending on Bank via relationship ofBank (which means that the primary key of Account is unique only for a given Bank), and hasAccount is an N-to-N relationship. 3. Mapping Relational Databases to Knowledge Graphs Our mapping β„³ is such that, given a relational database D with primary and foreign keys Ξ£ (following the definition of Sequeda et al. [8] over a domain of values 𝒱), and an ERD π’ž, β„³(D, π’ž) returns a PG 𝐺 = (𝑉, 𝐸, 𝜎, πœ‘, πœ†) (following the definition of Angles [13]). RDF graphs can also be produced later on (e.g., by using the mapping of [14]). We assume that D is in BCNF [15]. Non-weak Entities E{𝐾1 , ... , 𝐾𝑛 }{𝐴1 , ... , π΄π‘š } are mapped to a relation 𝑅E ∈ D, such that the attributes of 𝑅E are π‘Žπ‘‘π‘‘(𝑅E ) = {𝐾1 , ... , 𝐾𝑛 , 𝐴1 , ... , π΄π‘š }, and 𝑅E [𝐾1 , ... , 𝐾𝑛 ] is a primary key. To map such a relation to a graph, we take each tuple 𝑑 ∈ 𝑅E , where 𝑑 = (𝐾1 : 𝑣1 , ... , 𝐾𝑛 : 𝑣𝑛 , 𝐴1 : 𝑣𝑛+1 , ... , π΄π‘š : 𝑣𝑛+π‘š ), with 𝑣1 , ... , 𝑣𝑛 ∈ 𝒱 and 𝑣𝑛+1 , ... , 𝑣𝑛+π‘š ∈ 𝒱 βˆͺ {null}, and map it to a node πœ‚ = 𝑓𝑛 (𝑣1 , ... , 𝑣𝑛 ), where 𝑓𝑖 : 𝒱 𝑖 β†’ 𝒱 is a function that returns identifiers . Then, we extend the property assigning function 𝜎 to contain 𝜎(πœ‚, 𝑝) = 𝑣 for each 𝑝 : 𝑣 ∈ 𝑑. Finally, we extend the label assigning function πœ† to include πœ†(πœ‚) = 𝑅E . It can be seen that if 𝑓𝑖 is bijective, this mapping is reversible, and thus information preserving [8]. The mapping of relationships to the relational model is more nuanced. It depends on the number of participating entities, the presence or absence of attributes, and even the cardinalities (a) The Tables. (b) The ERD. (c) The Graph. Figure 1: The proposed mapping process. and participation constraints. We will summarize two acceptable mappings that preserve BCNF. Let us consider the generic relationship R⟨(E1 , 𝐢1 ), ... , (En , 𝐢𝑛 )⟩{𝐴1 , ... , π΄π‘š }. If 𝑛 = 2, {𝐴1 , ... , π΄π‘š } = βˆ…, and either 𝐢1 or 𝐢2 are one and only one cardinalities, the first mapping applies. W.l.o.g., we assume that 𝐢2 is one and only one, and that the partic- ipating entities are E1 {𝐾1 , ... , 𝐾𝑛1 }{𝐡1 , ... , π΅π‘š1 } and E2 {𝐽1 , ... , 𝐽𝑛2 }{𝐹1 , ... , πΉπ‘š2 }, which are mapped to relations 𝑅E1 and 𝑅E2 respectively. To map this relationship, we extend 𝑅E2 so that att(𝑅E2 ) ← att(𝑅E2 ) βˆͺ {𝐾1 , ... , 𝐾𝑛1 } and 𝑅E2 [𝐾1 , ... , 𝐾𝑛1 ] REF 𝑅E1 [𝐾1 , ... , 𝐾𝑛1 ] is a foreign key. To convert this foreign key to graph elements, for each tuple 𝑑 ∈ 𝑅E2 , where 𝑑 = (𝐾1: 𝑣1 , ... , 𝐾𝑛1 : 𝑣𝑛1 , 𝐽1: 𝑣𝑛1 +1 , ... , 𝐽𝑛2 : 𝑣𝑛1 +𝑛2 , 𝐹1: 𝑣𝑛1 +𝑛2 +1 , ... , πΉπ‘š2 : 𝑣𝑛1 +𝑛2 +π‘š2 ) we create an edge 𝑒 = 𝑓𝑛1 +𝑛2 (𝑣1 , ... , 𝑣𝑛1 +𝑛2 ), with πœ‘(𝑒) = (𝑓𝑛1 (𝑣1 , ... , 𝑣𝑛1 ), 𝑓𝑛2 (𝑣𝑛1 +1 , 𝑣𝑛1 +𝑛2 )), and πœ†(𝑒) = R. Yet again, if 𝑓𝑖 is bijective, then this transformation is reversible. For any other case of relationship R (e.g., 𝑛 > 2 or {𝐴1 , ... , π΄π‘š } = ΜΈ βˆ…), a relation 𝑅R for the relationship is created and it contains foreign keys referencing all the relations mapping the participating entities. The primary key of 𝑅R depends on the cardinalities and participation constraints [5]. Each tuple 𝑑 ∈ 𝑅R can be mapped to an edge only if 𝑛 = 2. Otherwise, 𝑑 should be mapped to a node, and an edge for each foreign key must be created, as is done in [6]. Weak entities W⟨(E1 , R1 ), ... , (El , Rl )⟩{𝑃1 , ... , 𝑃𝑛 }{𝐴1 , ... , π΄π‘š } are mapped into a relation 𝑅W , similar to the first relationship case, as the cardinality with which W participates in every Ri is one and only one. 𝑅W has attributes att(𝑅W ) = 𝒦 βˆͺ {𝑃1 , ... , 𝑃𝑛 , 𝐴1 , ... , π΄π‘š }, where 𝒦 is the set of all the prime attributes of all the relations 𝑅Ei of each Ei . 𝑅W has therefore foreign keys referencing to each 𝑅Ei . Each tuple 𝑑 ∈ 𝑅W is mapped to a node πœ‚ with id 𝑓|𝒦|+𝑛 (𝑑[𝒦 βˆͺ {𝑃1 , ... , 𝑃𝑛 }]), label πœ†(πœ‚) = W, and each foreign key is mapped to an edge without properties and the label of the respective relationship Ri going from πœ‚ to the respective node representing the referenced tuple in 𝑅Ei . Example. In Figure 1a, we present 4 relations that follow the ERD of Figure 1b. These are Person(SSN, name), Bank(SWIFT, name), Account(number, SWIFT, name), and HasAc- count(SSN, number, SWIFT). Note that Account is a weak entity and HasAccount is an N-to-N relationship. The tuples in Figure 1a are translated to the graph of Figure 1c, following the rules of each case. See how, for instance, the tuple (number: 333, SWIFT: TTQCL, type: checking) from relation Account is mapped to a node with id πœ‚ = 𝑓1 (333), label πœ†(πœ‚) = β€œAccount”, properties 𝜎(πœ‚, number) = 333 and 𝜎(πœ‚, type) = β€œchecking”, and to and edge 𝑒, such that πœ‘(𝑒) = (πœ‚, 𝑓1 (β€œTTQCL”)) and πœ†(𝑒) = β€œofBank”. This mapping is similar to [6]. However, we can extract the appropriate label for 𝑒 from the conceptual model. Our mapping presents its difference particularly when mapping the relation HasAccount. The tuple (SSN: 111, number: 333, SWIFT: TTQCL) is mapped, according to [6], to a node with label β€œHasAccount”, with one edge to the node mapping the person with ID 𝑓1 (111), and another to the node of the Account with ID 𝑓2 (333, TTQCL). Our mapping simply creates one edge 𝑒′ , with πœ†(𝑒′ ) = β€œHasAccount”, and πœ‘(𝑒′ ) = (𝑓1 (111), 𝑓2 (333, TTQCL)). This is not only more semantically accurate but also implies the use of fewer joins when querying the resulting graph. 4. Conclusion and Future Directions In this poster, we present initial progress on a formal mapping that, given a relational database and its conceptual model, produces a knowledge graph that behaves differently from [6]. Initially, we consider property graphs, but RDF and RDF-star graphs can also be produced. Further, we provide semantics for a fragment of the ERDoc language and a formalization of the elements present in an ERD. We are currently defining the translations of the rest of the ERD constructs (class hierarchies, aggregations, multivalued attributes, etc.) and studying the general properties of information and query preservation [8] of the mapping. We note that in the future, we may leverage the work by Barret et al. [7] to automatically obtain the conceptual model for the mapping. We are also working on a mapping algorithm and a tool implementation. Furthermore, we can explore using the conceptual model to generate SHACL constraints [16] to validate the mapping output. Acknowledgments Partly funded by ANID, Millennium Science Initiative Program, Code ICN17_002. References [1] A. Hogan, E. Blomqvist, M. Cochez, C. d’Amato, G. de Melo, C. Gutierrez, S. Kirrane, J. E. Labra Gayo, R. Navigli, S. Neumaier, A.-C. Ngonga Ngomo, A. Polleres, S. M. Rashid, A. Rula, L. Schmelzeisen, J. Sequeda, S. Staab, A. Zimmermann, Knowledge Graphs, volume 22 of Synthesis Lectures on Data, Semantics, and Knowledge, Springer Nature, 2021. [2] L. Libkin, D. Vrgoč, Regular path queries on graphs with data, in: Proceedings of the 15th International Conference on Database Theory, ACM, Berlin Germany, 2012, pp. 74–85. doi:10.1145/2274576.2274585. [3] A. Hogan, J. L. Reutter, A. Soto, In-Database Graph Analytics with Recursive SPARQL, in: The Semantic Web – ISWC 2020, volume 12506, Springer International Publishing, Cham, 2020, pp. 511–528. doi:10.1007/978-3-030-62419-4_29. [4] M. LΓ³pez, S. Ferrada, A. Hogan, ERDoc: A Web Interface for Entity-Relation Modelling, in: Proceedings of the 3rd International Workshop on Data Systems Education, ACM, 2024. [5] P. P.-S. Chen, The entity-relationship modelβ€”toward a unified view of data, ACM Transactions on Database Systems 1 (1976) 9–36. doi:10.1145/320434.320440. [6] R. Stoica, G. Fletcher, J. F. Sequeda, On directly mapping relational databases to property graphs, in: 13th Alberto Mendelzon International Workshop on Foundations of Data Management, AMW 2019, CEUR-WS. org, 2019, p. 06. [7] N. Barret, I. Manolescu, P. Upadhyay, Computing Generic Abstractions from Applica- tion Datasets, in: EDBT 2024 - 27th International Conference on Extending Database Technology, volume 27, 2024, pp. 94–107. [8] J. F. Sequeda, M. Arenas, D. P. Miranker, On directly mapping relational databases to RDF and OWL, in: Proceedings of the 21st International Conference on World Wide Web, ACM, 2012, pp. 649–658. doi:10.1145/2187836.2187924. [9] E. F. Codd, A relational model of data for large shared data banks, Communications of the ACM 13 (1970) 377–387. doi:10.1145/362384.362685. [10] C. W. Bachman, Data structure diagrams, ACM SIGMIS Database: the DATABASE for Advances in Information Systems 1 (1969) 4–10. doi:10.1145/1017466.1017467. [11] R. Cyganiak, D. Wood, M. Lanthaler, G. Klyne, J. J. Carroll, B. McBride, RDF 1.1 Concepts and Abstract Syntax, W3C Recommendation, W3C, 2014. [12] O. Hartig, Foundations of RDF* and SPARQL* : (An Alternative Approach to Statement- Level Metadata in RDF), in: 11th Alberto Mendelzon International Workshop on Founda- tions of Data Management and the Web., volume 1912, Montevideo, Uruguay, 2017. [13] R. Angles, The Property Graph Database Model, in: Proc. Alberto Mendelzon International Workshop on Foundations of Data Management (AMW), volume 2100, CEUR Workshop Proceedings, 2018. [14] O. Hartig, Foundations to query labeled property graphs using SPARQL, in: Joint Proceed- ings of the 1st International Workshop on Semantics for Transport and the 1st International Workshop on Approaches for Making Data Interoperable, volume 2447, CEUR Workshop Proceedings, 2019. [15] I. J. Heath, Unacceptable file operations in a relational data base, in: Proceedings of the 1971 ACM SIGFIDET Workshop on Data Description, Access and Control - SIGFIDET ’71, ACM Press, San Diego, California, 1971, p. 19. doi:10.1145/1734714.1734717. [16] H. Knublauch, D. Kontokostas, Shapes constraint language (SHACL), Technical Report, W3C, 2017. URL: https://www.w3.org/TR/shacl/.