Deducing Federated SPARQL queries from RDF Mappings

Benjamin Moreau1, Manoé Kieffer2, and Patricia Serrano-Alvarado2

1 OpenDataSoft
{Name.Lastname}@opendatasoft.com
2 Nantes University, LS2N, CNRS, UMR6004, 44000 Nantes, France
{Name.LastName}@univ-nantes.fr

Abstract. Datasets can be virtually integrated into the Linked Data space through RDF mappings. An RDF mapping consists of rules that map data from an input dataset to RDF triples. Defining SPARQL queries is an arduous task. As RDF mappings define semantic schemas, it is possible to use them to deduce federated queries. In this demonstration, we propose MaRQ, a tool that deduces federated queries from a set of RDF mappings. The goal is to provide a clear idea of the conjunction possibilities of a dataset with other virtually integrated datasets.

1 Introduction and motivation

Publishing data as Linked Data enables vast numbers of datasets to be interconnected, creating new and innovative applications. To avoid expensive investments in terms of storage and time, some data providers use RDF mappings to integrate non-RDF data into the Linked Data virtually and on demand [6, 7]. Making mappings is not easy, but some tools help to generate them [2, 8].

For data providers, it would be interesting to know how a particular dataset might benefit from existing semantic datasets. That is, to what extent can their dataset be combined with datasets of the Linked Data, i.e., which conjunctive queries could be processed by a federation that includes their dataset?

In the state of the art, [1] generates federated queries based on RDF datasets. This approach provides flexible parameterization of realistic conjunctive benchmark queries. Query parameterization includes structure, complexity, and cardinality constraints. Similarly, [9] can generate a variety of federated queries over a given set of RDF datasets to facilitate benchmarking for federated query processing.
The generated queries are conjunctive and use the OPTIONAL and FILTER keywords. [3] proposes a solution for generating federated queries from logs of previously executed queries. The generated queries include conjunctive queries as well as queries using UNION, OPTIONAL, FILTER, etc.

This work provides a low-cost solution that uses RDF mappings instead of RDF datasets or query logs. We consider that the particular dataset that a provider wants to integrate virtually into the Linked Data is not materialized in the RDF format, and that no log of executed SPARQL queries exists. Our tool, named MaRQ, analyses RDF mappings to deduce conjunctive basic graph patterns (BGPs). An RDF mapping can be seen as an RDF summary of the dataset it describes. Thus, its analysis can provide helpful information with limited overhead. MaRQ identifies the conjunction capabilities between a particular dataset and a set of datasets of the Linked Data that are virtually integrated through RDF mappings. It provides a list of RDF datasets ranked according to their degree of conjunction with a particular dataset.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 MaRQ: a tool to detect joins from RDF mappings

Consider Figure 1, which shows two mappings. Mapping 1 concerns a doctors' directory dataset. Mapping 2 concerns an Airbnb accommodations dataset. In bold, we highlight the templates or references existing in the RDF mappings. We want to know how the corresponding datasets could enrich one another. We recall that to deduce federated SPARQL queries, MaRQ uses nothing but RDF mappings. Deduced queries are therefore not verified against instances of the corresponding datasets.

MaRQ deduces queries of type subject-subject, object-object, and subject-object. The matching of terms is based on the pairwise Jaccard similarity of the types that describe them.
In MaRQ, the Jaccard similarity is the number of types in common divided by the total number of types of two subjects (templates) described in two mappings. Two terms from different mappings are joinable if their similarity is at least equal to a configurable threshold between 0 and 1. For instance, the Jaccard similarity between Address and Home is 0.20, i.e., 1/5. This is because these templates have only one type in common (schema:Place), and the total number of types of both templates is 5. If the threshold of MaRQ is 0.2 or lower, a join between these templates is deduced.

Subject-subject queries. To identify joinable subjects, MaRQ calculates the similarity of all pairs of subjects. With a similarity threshold of 0.2, in our example, Doctor is similar to Host, and Address is similar to Home and Neighbourhood. Thus MaRQ deduces three queries, each containing one join BGP. A BGP contains all types (rdf:type and corresponding classes) and all predicates, with the same variable in subject position and, for predicates other than rdf:type, a different variable in each object position. Listing 1.1 shows one of these queries. It asks for Doctors that are also BusinessEntities.

Object-object queries. To identify joinable objects, MaRQ uses the objects that are described semantically in the mappings. In our example, three objects are described: Address, Host and Neighbourhood. If two of them are pairwise similar, a query is deduced. Only Address is similar to Neighbourhood. The BGP contains different variables in all subjects, the same variable in the object, and the predicates where the objects are used. Listing 1.2 shows this query. It asks for Doctors' locations that are neighbourhoods of Airbnb homes. It is also possible to deduce object-object queries based only on types. In our example, MaRQ deduces two queries of this type, one with the BGP {?s1 rdf:type schema:Person} and another with the BGP {?s1 rdf:type schema:Place}.
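The similarity measure and the subject-subject matching described above can be sketched as follows. This is a minimal illustration, not MaRQ's actual code: the function names are hypothetical, the short names (Doctor, Address, Home, Host, Neighbourhood) stand for the templates of Figure 1, and the type sets are transcribed from that figure as an assumption. Following the worked example (1/5 for Address and Home), the denominator is the sum of the sizes of both type sets, and a pair is kept when its similarity is at least the threshold.

```python
from itertools import product

# Type sets per subject template, transcribed from Figure 1 (assumption).
MAPPING_1 = {
    "Doctor": {"schema:Person", "lgdo:Doctor"},
    "Address": {"juso:Address", "schema:Place"},
}
MAPPING_2 = {
    "Home": {"vcard:Home", "schema:Residence", "schema:Place"},
    "Host": {"schema:Person", "gr:BusinessEntity"},
    "Neighbourhood": {"schema:Place", "dbo:PopulatedPlace"},
}


def similarity(types_a, types_b):
    """Types in common divided by the total number of types of both terms."""
    total = len(types_a) + len(types_b)
    return len(types_a & types_b) / total if total else 0.0


def joinable_subjects(mapping_a, mapping_b, threshold=0.2):
    """Return the pairs of subjects whose similarity reaches the threshold."""
    return [
        (name_a, name_b)
        for (name_a, types_a), (name_b, types_b)
        in product(mapping_a.items(), mapping_b.items())
        if similarity(types_a, types_b) >= threshold
    ]


pairs = joinable_subjects(MAPPING_1, MAPPING_2)
print(pairs)
```

With the threshold at 0.2, the sketch yields the three pairs named in the text; raising it to 0.25 drops the pair (Address, Home), whose similarity is exactly 0.2.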
The two type-only queries ask for all persons and all places of both datasets, respectively.

[Figure: two graphs, one per mapping. Mapping 1 shows the templates and references ex:Directory/$Doctor, ex:Directory/$Address, ex:Directory/$codeProfession, ex:Directory/$address and ex:Directory/$code_postal, the classes schema:Person, lgdo:Doctor, juso:Address and schema:Place, and the predicates rdf:type, schema:location, dbo:speciality, juso:fullAddress and schema:postalCode. Mapping 2 shows the templates ex:AirBnB/$Home, ex:AirBnB/$Host and ex:AirBnB/$Neighbourhood, the classes vcard:Home, schema:Residence, schema:Place, schema:Person, gr:BusinessEntity and dbo:PopulatedPlace, and the predicates rdf:type, dbo:owner and schema:containedInPlace.]
Fig. 1. Two RDF mappings.

    SELECT * WHERE {
      ?s1 rdf:type lgdo:Doctor .
      ?s1 rdf:type gr:BusinessEntity .
      ?s1 schema:location ?o1 .
      ?s1 dbo:speciality ?o2 }
Listing 1.1. A subject-subject query.

    SELECT * WHERE {
      ?s1 rdf:type schema:Person .
      ?s1 schema:location ?o1 .
      ?s2 schema:containedInPlace ?o1 }
Listing 1.2. An object-object query.

    SELECT * WHERE {
      ?t1 rdf:type lgdo:Doctor .
      ?t1 schema:location ?f1 .
      ?t1 dbo:speciality ?f2 .
      ?f3 dbo:owner ?t1 }
Listing 1.3. A subject-object query.

    SELECT * WHERE {
      ?t1 rdf:type schema:Person .
      ?t2 rdf:type schema:Place .
      ?t2 rdf:type dbo:PopulatedPlace .
      ?f1 schema:location ?t2 }
Listing 1.4. An object-subject query.

Subject-object queries. To identify subject-object joins, MaRQ identifies the objects that are described semantically in the second mapping. In our example, the objects Host and Neighbourhood are described in the second mapping. Then, it identifies the subjects in the first mapping that are joinable with these objects. In our example, the subject Doctor is similar to the object Host, and the subject Address is similar to the object Neighbourhood. Thus, two subject-object queries are deduced.
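The identification of subject-object joins described above can be sketched as follows. This is an illustrative reading of the text, not MaRQ's implementation: the function names are hypothetical, an object is assumed to be "described semantically" when it also appears as a typed subject in its mapping, and the triples are a partial transcription of Figure 1 under the short names used in the text.

```python
RDF_TYPE = "rdf:type"

# (subject, predicate, object) rules, partially transcribed from Figure 1
# (assumption); short names stand for the templates.
MAPPING_1 = [
    ("Doctor", RDF_TYPE, "schema:Person"),
    ("Doctor", RDF_TYPE, "lgdo:Doctor"),
    ("Doctor", "schema:location", "Address"),
    ("Doctor", "dbo:speciality", "codeProfession"),
    ("Address", RDF_TYPE, "juso:Address"),
    ("Address", RDF_TYPE, "schema:Place"),
]
MAPPING_2 = [
    ("Home", RDF_TYPE, "vcard:Home"),
    ("Home", RDF_TYPE, "schema:Residence"),
    ("Home", RDF_TYPE, "schema:Place"),
    ("Home", "dbo:owner", "Host"),
    ("Home", "schema:containedInPlace", "Neighbourhood"),
    ("Host", RDF_TYPE, "schema:Person"),
    ("Host", RDF_TYPE, "gr:BusinessEntity"),
    ("Neighbourhood", RDF_TYPE, "schema:Place"),
    ("Neighbourhood", RDF_TYPE, "dbo:PopulatedPlace"),
]


def types_by_term(triples):
    """Collect the rdf:type classes declared for each subject."""
    types = {}
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            types.setdefault(subject, set()).add(obj)
    return types


def described_objects(triples):
    """Objects of non-type predicates that are themselves typed subjects."""
    typed = types_by_term(triples)
    found = []
    for _, predicate, obj in triples:
        if predicate != RDF_TYPE and obj in typed and obj not in found:
            found.append(obj)
    return found


def similarity(types_a, types_b):
    """Types in common divided by the total number of types of both terms."""
    total = len(types_a) + len(types_b)
    return len(types_a & types_b) / total if total else 0.0


def subject_object_joins(first, second, threshold=0.2):
    """Pair subjects of `first` with described objects of `second`."""
    subject_types = types_by_term(first)
    object_types = types_by_term(second)
    return [
        (subject, obj)
        for obj in described_objects(second)
        for subject, types in subject_types.items()
        if similarity(types, object_types[obj]) >= threshold
    ]


print(subject_object_joins(MAPPING_1, MAPPING_2))
```

Calling the function with the two mappings swapped pairs the subjects of Mapping 2 with the described objects of Mapping 1.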
The BGP of a subject-object query contains all types (rdf:type and corresponding classes) and properties of the joinable subject, with the same variable in subject position, together with the predicates where the joinable object is used, with the subject's variable in object position. Listing 1.3 shows one of these queries. It asks for doctors that are also owners of Airbnb homes.

Deduction of object-subject queries follows the algorithm of subject-object ones, but the mappings are taken in inverse order. In our example, the object Address is similar to the subjects Home and Neighbourhood. Thus, two more queries are deduced, and they are considered subject-object queries. Listing 1.4 shows one of these queries. It asks for Airbnb places that are also doctors' locations.

3 Demonstration

The MaRQ implementation uses YARRRML mappings [4]. It can be tested from the command line (https://github.com/Manoe-K/MaRQ) and is also available as a Web application at https://marq-priloo.univ-nantes.fr/.

During the demonstration, attendees will be able to choose some RDF mappings and then select one of them to compare with all the others. A graph representing the number of triple patterns that MaRQ generates per query type will be shown for each mapping. The similarity threshold used by the Web application is 0.2 (this threshold is configurable in the command-line version). Deduced queries will be shown, and they can also be downloaded.

Fig. 2. MaRQ screenshot.

Figure 2 shows a screenshot of the MaRQ Web application. In this example, the dataset evenements-publics-cibul is compared with three other datasets. The graph shows the number of potential join queries and their types (subject-subject, subject-object, object-subject).

Final remarks. MaRQ identifies possible conjunctive queries from RDF mappings. The pertinence of deduced queries depends on the quality of mappings and the similarity threshold.
In addition, as each BGP contains all the possible triple patterns of joinable terms, it is unlikely that these queries, with such a large number of constraints, return results. However, they provide a valuable set of BGPs whose triple patterns can be analyzed to identify joins that may return instances. Moreover, join queries that integrate different datasets face the problem of entity matching. For instance, URIs from different domains can refer to the same entity. This problem is out of the scope of this work, but several solutions exist; see [5] for a survey.

Acknowledgment. The authors thank Fatim Touré (Master student of the University of Nantes) for her participation in the early stages of this work.

References

1. Görlitz, O., Thimm, M., Staab, S.: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data. In: International Semantic Web Conference (ISWC) (2012)
2. Gupta, S., Szekely, P., Knoblock, C.A., Goel, A., Taheriyan, M., Muslea, M.: Karma: A System for Mapping Structured Sources into the Semantic Web. In: Extended Semantic Web Conference (ESWC), Poster&Demo (2012)
3. Hacques, F., Skaf-Molli, H., Molli, P., Hassad, S.E.: PFed: Recommending Plausible Federated SPARQL Queries. In: International Conference on Database and Expert Systems Applications (DEXA) (2019)
4. Heyvaert, P., De Meester, B., Dimou, A., Verborgh, R.: Declarative Rules for Linked Data Generation at Your Fingertips! In: Extended Semantic Web Conference (ESWC), Poster&Demo (2018)
5. Köpcke, H., Rahm, E.: Frameworks for entity matching: A comparison. Data & Knowledge Engineering 69(2), 197–210 (2010)
6. Michel, F., Faron Zucker, C., Montagnat, J.: A Mapping-based Method to Query MongoDB Documents with SPARQL. In: International Conference on Database and Expert Systems Applications (DEXA) (Sep 2016), https://hal.archives-ouvertes.fr/hal-01330146
7. Moreau, B., Serrano Alvarado, P., Desmontils, E., Thoumas, D.: Querying non-RDF Datasets using Triple Patterns. In: International Semantic Web Conference (ISWC), Poster&Demo (Oct 2017)
8. Moreau, B., Terpolilli, N., Serrano-Alvarado, P.: A Semi-Automatic Tool for Linked Data Integration. In: International Semantic Web Conference (ISWC), Poster&Demo (Oct 2019)
9. Rakhmawati, N.A., Saleem, M., Lalithsena, S., Decker, S.: QFed: Query Set for Federated SPARQL Query Benchmark. In: International Conference on Information Integration and Web-based Applications & Services (2014)