Towards the Definition of a Language-Independent Mapping Template for Knowledge Graph Creation

Ana Iglesias-Molina (ana.iglesiasm@upm.es)
David Chaves-Fraga (dchaves@fi.upm.es)
Freddy Priyatna (fpriyatna@fi.upm.es)
Oscar Corcho (ocorcho@fi.upm.es)
Ontology Engineering Group, Universidad Politécnica de Madrid, Spain

Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
SciKnow’19, November, 2019, Iglesias-Molina et al.

ABSTRACT

The use of knowledge graphs is spreading in the scientific community across different domains, from social sciences to biomedicine. The creation of knowledge graphs usually requires the integration of multiple heterogeneous data sources in different formats and schemas. One common way to achieve this is to use declarative mappings, which establish the relationships between the source data and the ontology, improving relevant aspects such as maintainability, readability and understandability. Learning how to use and create mappings is not an easy task, which hinders the adoption of this technology by anyone outside the area. As a result, this task is usually carried out by experts. To ease the mapping creation, several mapping editors have been developed, but their success is limited. In this paper, we propose using spreadsheets, a well-known tool commonly used in the scientific community, to specify the mapping rules in a language-independent way. Our aim is to ease the mapping creation and make it more accessible for the community. We also show a real use case, in which using spreadsheets helps in the mapping creation process and enables a handy way of editing and visualizing mapping rules.

CCS CONCEPTS

• Computing methodologies → Artificial intelligence; Knowledge representation and reasoning.

KEYWORDS

Knowledge graph, spreadsheet, declarative mapping

1 INTRODUCTION

The expansion of Semantic Web technologies has reached users across several domains, such as legal and biomedical. An increasing number of knowledge graphs from these areas are being created, restructuring knowledge in a machine-readable way [4]. Their construction requires integrating different data sources; once built, they allow search optimization and the application of machine learning techniques to obtain new knowledge, among other possibilities. Some examples are DBpedia [1] and Wikidata [18].

There are multiple approaches to create knowledge graphs, from using ad-hoc tools to declarative mappings. The latter define rules to establish relationships between the global schema and the data sources. Examples of mapping languages are the W3C recommendation R2RML [7] and its extension RML [9].

The use of declarative mappings is often complicated for semantic web non-experts. That is one of the reasons why the mapping creation is usually carried out by knowledge engineers. This poses a barrier for potential users from other domains. To face this issue, several mapping editors have been proposed. They aim at making the mapping creation and editing easier and more intuitive [11, 16]. Despite these efforts, users prefer to use tools like OpenRefine¹, which is non-declarative, thus hindering the reproducibility and maintainability of the transformations performed.

¹ http://openrefine.org/

Mapping languages consist of common elements to be created (e.g. the source data, subjects, predicates and objects). In this paper we propose the use of spreadsheets to specify these elements, the mapping rules, in a language-independent way, so that they can be translated into the most convenient specification [6]. Spreadsheets are a well-known tool commonly used in the scientific community, versatile and easy to understand, which makes them a suitable target to specify mapping rules. With this proposal, our aim is to lower the barrier of mapping creation and motivate the scientific community to use this technology.

This paper is organized as follows: Section 2 presents the related work done on mapping creation. Section 3 shows the common mapping structure. Section 4 describes the spreadsheet template we propose for the creation of mapping rules. Section 5 shows a real case in which we use spreadsheets to create mappings. Finally, Section 6 presents the conclusions and areas for future work.

2 RELATED WORK

A wide variety of mapping languages has been proposed over the last decades [8]. The W3C Recommendation is R2RML [7], a declarative mapping language that allows the generation of adapters to transform relational databases into RDF. There are other declarative languages that deal with more data formats, such as RML [9] (extension of R2RML for CSV, JSON and XML), YARRRML [10] (a user-friendly serialization of RML), xR2RML [15] (for non-relational databases) and RMLC-Iterator [5] (for statistical data).

There are not as many mapping editors as languages; in fact, the majority of them support R2RML or RML. Some of the most used tools implement graphical visualization and editing of the mappings as graphs, such as Karma [13] and Map-On [17] for R2RML, and RMLEditor [11] for RML. Others provide an environment to write them, like OntopPro², an extension of Protégé that allows mapping creation in its custom language and import/export of R2RML.

² https://github.com/ontop/ontop/wiki/ontopProUserManual

The current mapping editors are language-oriented or create the mapping rules through graphical visualization. Thus, the user either has to know the language, or creates the mapping by building a visual graph. Using spreadsheets enables a language-independent declarative approach to write the mapping rules concisely, taking advantage of the functionalities of a spreadsheet. In other words, the rules can be created by specifying only the essential elements, without knowing any mapping language, and the repetitive elements can be autocompleted. Moreover, the compact structure of a spreadsheet allows a quick visualization of all the rules.

There are other approaches that use spreadsheets to capture the knowledge of domain experts [12, 19]. These tools enable the specification of ontologies in tables and generate the corresponding RDF. Similarly, with our proposal the mapping rules for data conversion are declared in spreadsheets, to be later translated into different mapping languages.

3 STRUCTURE OF DECLARATIVE MAPPINGS

Mapping languages usually have a similar structure, as many of them are based on the W3C standard (R2RML). The earliest languages (e.g. R2O [2]) and the non-declarative ones (e.g. SPARQL-Generate [14]) differ in structure, but they all share the same elements: an identifier of the data sources (URL, path, table name) and the rules for generating the corresponding RDF triples. An RML mapping example is shown in Figure 1. It organizes the transformation rules in two triples maps, one for each data source (Figure 2) used to generate RDF triples.

(a) Triples Map for PERSON

    rml:logicalSource [
        rml:source "/home/user/data/people.csv";
        rml:referenceFormulation ql:CSV;
    ];
    rr:subjectMap [
        rr:class ex:Person;
        rr:template "http://ex.com/Person/{name}";
    ];
    rr:predicateObjectMap [
        rr:predicateMap [ rr:constant ex:name ];
        rr:objectMap [ rml:reference "name" ];
    ];
    rr:predicateObjectMap [
        rr:predicateMap [ rr:constant ex:sport ];
        rr:objectMap [
            rr:parentTriplesMap ;
            rr:joinCondition [ rr:child "sport_id"; rr:parent "id"; ];
        ];
    ];

(b) Triples Map for SPORT

    rml:logicalSource [
        rml:source "/home/user/data/sports.csv";
        rml:referenceFormulation ql:CSV;
    ];
    rr:subjectMap [
        rr:class ex:Sport;
        rr:template "http://ex.com/Sport/{sport}";
    ];
    rr:predicateObjectMap [
        rr:predicateMap [ rr:constant ex:name ];
        rr:objectMap [ rml:reference "sport" ];
    ];
    rr:predicateObjectMap [
        rr:predicateMap [ rr:constant ex:code ];
        rr:objectMap [ rml:reference "id"; ];
    ];

Figure 1: RML mapping. Fig. 1a shows the triples map that generates instances of the class ex:Person, with two predicate-object maps, the latter being a join to the triples map shown in Fig. 1b, which creates the instances of the class ex:Sport and also has two predicate-object maps.

(a) people.csv

    "name","birthdate","sport_id"
    "Serena Williams",19810926,1
    "Alexander Ovechkin",19850917,4
    "Emily Scarratt",19900208,3
    "Javier Fernández",19910415,2

(b) sports.csv

    "id","sport"
    1,"Tennis"
    2,"Ice skating"
    3,"Rugby"
    4,"Hockey"

Figure 2: CSV data example. Example of the source data in CSV format for the RML mapping example from Figure 1.

We now define in more detail the essential elements that declarative mapping rules contain, providing examples based on the RML mappings shown in Figure 1:

• An element that specifies where the data sources are stored. In the case of RML, these elements are defined using the property rml:logicalSource.
• A set of rules that defines the subjects and classes of the triples. In RML, the rr:subjectMap property is used to specify these characteristics.
• Pairs (rr:predicateObjectMap property in RML) that specify rules for generating the predicate (rr:predicateMap) and object (rr:objectMap) of the triples.
• A join condition to another triples map, where the subject of the referenced triples map is to be the object in the new triple. This is defined in RML using the rr:joinCondition property.

As the example mapping shows, these rules usually contain multiple repetitive elements. This characteristic makes it easy to make mistakes when writing them manually. Using a spreadsheet template can ease this process for non-experts in mapping creation. It enables manual writing, while helping with the repetitive parts through autocompleting functions. Moreover, all the language's syntax and formatting is later written automatically by the tool, not by the user.

4 SPREADSHEET DESIGN

Table 3: Subject sheet.
The class of the subject is specified in Class, along with the URI that is to be created in URI and a unique identifier in ID. In the URI column, the words between curly braces refer to fields in the data.

    ID      Class      URI
    PERSON  ex:Person  http://ex.com/Person/{name}
    SPORT   ex:Sport   http://ex.com/Sport/{sport}

In this section we present the designed spreadsheet template³, which contains the essential elements to create a mapping. It consists of at least four sheets: prefixes, source data, subject and predicate-object maps; and optionally, a sheet with transformation functions.

Prefixes sheet. In this sheet the namespace prefixes for URLs are specified. They can be found at the beginning of mappings in most mapping languages, as they make the mappings easier and shorter to write. This sheet is composed of two columns: the prefix is defined in the column Prefix, and the whole link is written in the column URI (Table 1).

Table 1: Prefix sheet. The whole link is written in the column URI, and its abbreviation in the column Prefix.

    Prefix  URI
    rdf     http://www.w3.org/1999/02/22-rdf-syntax-ns#
    ex      http://ex.com/
    sql     http://w3.org/ns/sql#

Source sheet. Here we specify where the data is taken from (Table 2). It consists of three columns: ID, Feature and Value. The column Value contains the path to the source data, the format, and optionally the iterator (the loop used to map the data of JSON and XML files). In Feature we declare the type of information provided in Value. Finally, ID refers to the corresponding subject in the Subject sheet.

Table 2: Source sheet. The information about the source data is specified, such as where the data is stored and its format. The kind of information is defined in Feature, the information itself in Value, and the subject it refers to in ID.

    ID      Feature  Value
    PERSON  source   /home/user/data/people.csv
    PERSON  format   CSV
    SPORT   source   /home/user/data/sports.csv
    SPORT   format   CSV

Subject sheet. The subjects of the triples to generate and their corresponding classes are defined in three columns (Table 3). ID specifies an identifier for each subject so that it can be referred to from other sheets; Class, the class to which the subject belongs; and URI, the template for the URI of the subjects that are to be created. In the latter field, there is a variable part between curly braces that refers to a field in the data (in the first line, name, and in the second, sport).

Predicate-Object Maps sheet. In this sheet, the triples are defined through the predicates and their corresponding objects (Table 4). The columns Predicate and Object are responsible for their specification. The kind of data declared in Object is defined in DataType (e.g. string, float, etc.). When there is a referencing object map, the triple is defined differently. Three fields specify the join between the object of the new triple and the referenced subject: the ID of the subject to join (ReferenceID), and the fields of the source data they share (InnerRef for the field of the current triple, and OuterRef for the field of the referred subject). These fields are left blank unless a join is needed. When it is, the aforementioned fields referring to the object (Object and DataType) are not necessary. The last item to specify is the subject each triple belongs to. For that purpose the column ID exists. It links each predicate-object pair to its corresponding subject.

Function sheet. Some languages support the use of transformation functions over the data (e.g. FnO+RML), so the template allows including an additional sheet to detail these functions (Table 5). The most used are the SQL and GREL functions, but any can be used. Functions are referenced from the Predicate-Object Maps sheet, or from another row of the Function sheet, through the identifier specified in FunctionID. The function to use is defined in Function, and the parameters in Params (if there are several, they are separated by commas).

5 USE CASE: THE BIO2RDF PROJECT

Bio2RDF [3] is an open source project, started in 2008, that integrates heterogeneous sources of biomedical data into Linked Data. For each biological database in its catalogue, Bio2RDF provides an ontology and a PHP script to transform data into RDF. With the aim of enhancing the maintainability and understandability of the transformation, we show the first steps to change the RDF transformation methodology from using ad-hoc PHP scripts to declarative mappings using spreadsheets.

In this use case, we create mappings for the datasets of the project that have their data published as CSVs and relational databases. With the information provided by the PHP scripts and the source data, the mapping rules are specified in the spreadsheets. Then, they are translated into the most suitable mapping language depending on the format of the data source and on the engine used to build the knowledge graph. In this specific case, we translate them into R2RML for relational databases and RML for CSVs.

For most of the data sources more than one subject is created, or the database is distributed in several files, or there is a high number of triples (predicate-object maps) to generate. Moreover, there are joins between subjects within the same dataset and across other datasets. Having to represent so many mapping rules raises the need to visualize them quickly and to write the repetitive parts of the mappings easily, which can be done thanks to the structure and functions of the spreadsheets. Moreover, the fact that the spreadsheets are an intermediate step in the mapping creation process makes it possible to write the transformation rules only once, and translate them into one or more languages.
The tool developed to perform the translation, Mapeathor, is still under development; it is available on GitHub⁴, along with the spreadsheet mappings created for this use case.

Table 4: Predicate-Object Map sheet. It specifies the predicates (Predicate), the objects (Object), the kind of data of the object (DataType), the references to other subjects (ReferenceID, InnerRef, OuterRef) and the subject that forms the triple (ID).

    Predicate     Object       DataType  ReferenceID  InnerRef  OuterRef  ID
    ex:name       {name}       string                                     PERSON
    ex:birthdate  {birthdate}  date                                       PERSON
    ex:sport                             SPORT        sport_id  id        PERSON
    ex:name       {sport}      string                                     SPORT
    ex:code       {id}         integer                                    SPORT
    ex:comment                                                            SPORT

Table 5: Function sheet. The function sql:upper is specified. It only takes one parameter, the field sport from the source data.

    FunctionID  Function   Params
                sql:upper  {sport}

6 CONCLUSIONS AND FUTURE WORK

This paper shows a first approach to designing a template spreadsheet able to specify the mapping rules used to create knowledge graphs. The full design is described in detail to show all the essential elements contained in a mapping file that can be specified in a spreadsheet in a language-independent manner. Moreover, we present a real use case in which the use of spreadsheets has facilitated the mapping construction and editing.

Both the template spreadsheet and the tool developed to translate the spreadsheets into different mapping languages are still under development. Our objective is to keep improving the template's structure in order to remove the existing influence of the current mapping languages, and make it language-independent. For that purpose, the design has to contain the essential information needed to express the mapping rules, and the translation has to take from it the elements each language requires.

Moreover, an evaluation has to be carried out to test whether using spreadsheets really helps in the mapping creation process, and to give some guidelines on how the template can be improved. The tool also has to evolve as the template changes, with the aim of being able to translate the spreadsheets into any mapping language.

REFERENCES

[1] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In The semantic web. Springer, 722–735.
[2] Jesús Barrasa Rodríguez, Óscar Corcho, and Asunción Gómez-Pérez. 2004. R2O, an extensible and semantically based database-to-ontology mapping language. (2004).
[3] François Belleau, Marc-Alexandre Nolin, Nicole Tourigny, Philippe Rigault, and Jean Morissette. 2008. Bio2RDF: towards a mashup to build bioinformatics knowledge systems. Journal of biomedical informatics 41, 5 (2008), 706–716.
[4] Christian Bizer, Tom Heath, and Tim Berners-Lee. 2011. Linked data: The story so far. In Semantic services, interoperability and web applications: emerging concepts. IGI Global, 205–227.
[5] David Chaves-Fraga, Freddy Priyatna, Idafen Perez-Santana, and Oscar Corcho. 2018. Virtual Statistics Knowledge Graph Generation from CSV files. In Emerging Topics in Semantic Technologies: ISWC 2018 Satellite Events (Studies on the Semantic Web), Vol. 36. IOS Press, 235–244.
[6] Oscar Corcho, Freddy Priyatna, and David Chaves-Fraga. 2019. Towards a New Generation of Ontology Based Data Access. Semantic Web Journal (2019).
[7] Souripriya Das, Seema Sundara, and Richard Cyganiak. [n. d.]. R2RML: RDB to RDF Mapping Language. https://www.w3.org/TR/r2rml/
[8] Ben De Meester, Pieter Heyvaert, Ruben Verborgh, and Anastasia Dimou. 2019. Mapping Languages: Analysis of Comparative Characteristics. In 1st International Workshop on Knowledge Graph Building.
[9] Anastasia Dimou, Miel Vander Sande, Pieter Colpaert, Ruben Verborgh, Erik Mannens, and Rik Van de Walle. 2014. RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data. In LDOW.
[10] Pieter Heyvaert, Ben De Meester, Anastasia Dimou, and Ruben Verborgh. 2018. Declarative Rules for Linked Data Generation at Your Fingertips!. In European Semantic Web Conference. Springer, 213–217.
[11] Pieter Heyvaert, Anastasia Dimou, Aron-Levi Herregodts, Ruben Verborgh, Dimitri Schuurman, Erik Mannens, and Rik Van de Walle. 2016. RMLEditor: a graph-based mapping editor for linked data mappings. In European Semantic Web Conference. Springer, 709–723.
[12] Simon Jupp, Matthew Horridge, Luigi Iannone, Julie Klein, Stuart Owen, Joost Schanstra, Katy Wolstencroft, and Robert Stevens. 2012. Populous: a tool for building OWL ontologies from templates. BMC bioinformatics 13, 1 (2012), S5.
[13] Craig A Knoblock, Pedro Szekely, José Luis Ambite, Aman Goel, Shubham Gupta, Kristina Lerman, Maria Muslea, Mohsen Taheriyan, and Parag Mallick. 2012. Semi-automatically mapping structured sources into the semantic web. In Extended Semantic Web Conference. Springer, 375–390.
[14] Maxime Lefrançois, Antoine Zimmermann, and Noorani Bakerally. 2017. A SPARQL extension for generating RDF from heterogeneous formats. In European Semantic Web Conference. Springer, 35–50.
[15] Franck Michel, Loïc Djimenou, Catherine Faron Zucker, and Johan Montagnat. 2015. Translation of relational and non-relational databases into RDF with xR2RML. In 11th International Conference on Web Information Systems and Technologies (WEBIST’15). 443–454.
[16] Kunal Sengupta, Peter Haase, Michael Schmidt, and Pascal Hitzler. 2013. Editing R2RML mappings made easy. (2013).
[17] Álvaro Sicilia, German Nemirovski, and Andreas Nolle. 2017. Map-On: A web-based editor for visual ontology mapping. Semantic Web 8, 6 (2017), 969–980.
[18] Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledge base. Commun. ACM 57, 10 (2014), 78–85.
[19] Katy Wolstencroft, Stuart Owen, Matthew Horridge, Olga Krebs, Wolfgang Mueller, Jacky L Snoep, Franco du Preez, and Carole Goble. 2011. RightField: embedding ontology annotation in spreadsheets. Bioinformatics 27, 14 (2011), 2021–2022.

³ https://doi.org/10.5281/zenodo.3526141
⁴ https://github.com/oeg-upm/Mapeathor