=Paper= {{Paper |id=Vol-3263/abstract-11 |storemode=property |title=Ontology-based Data Federation (Extended Abstract) |pdfUrl=https://ceur-ws.org/Vol-3263/abstract-11.pdf |volume=Vol-3263 |authors=Zhenzhen Gu,Davide Lanti,Alessandro Mosca,Guohui Xiao,Jing Xiong,Diego Calvanese |dblpUrl=https://dblp.org/rec/conf/dlog/GuL00XC22 }} ==Ontology-based Data Federation (Extended Abstract)== https://ceur-ws.org/Vol-3263/abstract-11.pdf
Ontology-Based Data Federation
Extended Abstract

Zhenzhen Gu1 , Davide Lanti1 , Alessandro Mosca1 , Guohui Xiao2,3 , Jing Xiong1 and
Diego Calvanese1,3,4
1
  KRDB Research Centre for Knowledge and Data, Free University of Bozen-Bolzano, 39100 Bolzano, Italy
2
  Faculty of Social Sciences, University of Bergen, Bergen, Norway
3
  Ontopic S.R.L., 39100 Bolzano, Italy
4
  Department of Computing Science, Umeå University, 901 87 Umeå, Sweden


                                         Abstract
                                         We formally introduce ontology-based data federation (OBDF), to denote a framework combining
                                         ontology-based data access (OBDA) with a data federation layer, which virtually exposes multiple het-
                                         erogeneous sources as a single relational database. In this setting, the SQL queries generated by the
                                         OBDA component by translating user SPARQL queries are further transformed by the data federation
                                         layer so as to be efficiently executed over the data sources. The structure of these SQL queries directly
                                         affects their execution time in the data federation layer and their optimization is crucial for performance.
                                         We propose here novel optimizations specific for OBDF, which are based on “hints” about existing data
                                         redundancies in the sources, empty join operations, and the need for materialized views. Such hints can
                                         be systematically inferred by analyzing the OBDA mappings and ontology and exploited to simplify the
                                         query structure. We also carry out an experimental evaluation in which we show the effectiveness of
                                         our optimizations.

                                         Keywords
                                         Ontology-based data access, OBDA, data federation, query optimization




1. Research Problem and Contribution
Ontology-based data access (OBDA) [1, 2, 3], also known as Virtual Knowledge Graphs, is a
well-established paradigm for querying data sources via a mediating ontology. Such paradigm
has been successfully applied in many different domains [4]. In OBDA, the ontology is expressed
in a lightweight ontology languages, such as OWL 2 QL [5], which has its formal foundations in
the DL-Lite family [6]. Typically, it is assumed that the underlying data are stored in a single
relational data source, to which the ontology elements are mapped in a declarative way. Notably,
for query answering, OBDA follows a virtual approach, i.e., the data are not actually extracted
from the source to populate the classes and properties, but instead a SPARQL query posed over
the ontology is transformed in a sequence of steps into a SQL query over the data source [6, 1, 7].

   DL 2022: 35th International Workshop on Description Logics, August 7–10, 2022, Haifa, Israel
" zhenzhen.gu@unibz.it (Z. Gu); davide.lanti@unibz.it (D. Lanti); alessandro.mosca@unibz.it (A. Mosca);
guohui.xiao@uib.no (G. Xiao); jing.xiong@unibz.it (J. Xiong); calvanese@inf.unibz.it (D. Calvanese)
 0000-0002-7346-6093 (Z. Gu); 0000-0003-1097-2965 (D. Lanti); 0000-0003-2323-3344 (A. Mosca);
0000-0002-5115-4769 (G. Xiao); 0000-0002-3604-9645 (J. Xiong); 0000-0001-5174-9693 (D. Calvanese)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
                                                             DATA FEDERATION             DATA SOURCES
                                       OBDA SYSTEM                SYSTEM

                            SPARQL Q                 SQL q                 sub-queries
                                         M                        V
                                                                  D
                                        T                         B
                            answers             answers                 sub-answers                     …




Figure 1: The general framework of OBDF and the full query answering procedure.


Sophisticated optimization techniques have been proposed and actually implemented in most
commercial and open source OBDA systems that are currently available [8, 9, 2, 10].
   So far, such techniques have been tailored towards optimizing queries that are executed over
a single data source to which the OBDA system is mapped. In many settings, however, there
is the need to virtually access multiple, possibly heterogeneous, data sources in an integrated
way. In this case, one can resort to data federation [11, 12], where multiple autonomous data
sources are exposed transparently as a unified federated relational schema, usually called virtual
database. Data federation has been studied extensively over the years, and is still an active
research area, and many mature and highly-optimized data federation systems are currently
available, both in the database community1 and in the Semantic Web community (such as [13]).
   Data federation systems are already supported by OBDA systems, by accessing them as if
they were a single relational data source2 . However, to the best of our knowledge, in current
OBDA systems no provision is taken for the optimization of the generated SQL query, to account
for the fact that the evaluation of a SQL query in a data federation system is fundamentally
different from query evaluation by a standard relational DBMS engine. Essentially, this is due
to federated joins, i.e., joins that involve different data sources of the federation.
   In this work, we consider specifically this setting, which we call Ontology-based Data Federa-
tion (OBDF), and provide the following contributions. (i) We formalize OBDF systems, where
a collection of multiple, possibly heterogeneous and federated data sources are accessed via
mappings from an ontology. (ii) We study query optimization in OBDF, by devising a set of
optimization techniques specifically tailored towards federated data sources. (iii) We carry out
an experimental evaluation in which we assess the effectiveness of the proposed optimizations.


2. Ontology-based Data Federation
Data Federation. We first formalize data federation, where multiple, possibly heterogeneous
data sources are integrated in a unified view, usually called virtual database (VDB). A data
source 𝑆 can be an RDB, a NoSQL DB, a CSV source, or of some other type, with the only
requirement that it supports a query language. Given a set S = {𝑆1 , . . . , 𝑆𝑛 } of sources to be
federated, and a function (given implicitly with S) transforming the (possibly, non-relational)
schema of each source ⋃︀𝑆𝑖 into a corresponding relational schema Σ𝑖 , a VDB schema (for S) is
the disjoint union Σ = 𝑛𝑖=1 Σ𝑖 . In the following, we use letters 𝑇 , 𝑈 to denote database tables,
and the subscript 𝑖 (e.g., 𝑇𝑖 ) to indicate that table 𝑇 belongs to source 𝑆𝑖 . A data federation
   1
       E.g., Denodo (www.denodo.com), Dremio (www.dremio.com), and Teiid (teiid.io).
   2
       See, e.g., ontop-vkg.org/tutorial/federation/.
instance D for Σ is the relational instance 𝑖 𝐷𝑖 made of the union of all instances of the source
                                           ⋃︀
schemas in Σ. For a query 𝑞, ans(𝑞, D) denotes the set of answers of 𝑞 evaluated over the data
federation instance D.
OBDF. Given an ontology 𝒯 , a VDB schema Σ for a set S of data sources, and a set ℳ of
mappings from Σ to 𝒯 , an ontology-based data federation specification is the OBDA specification
ℱ = (𝒯 , ℳ, Σ). The notions of OBDF instance and answers to a query over an OBDF instance
coincide with their OBDA counterpart. Figure 1 depicts the OBDF approach, where query
answering is carried out by translating an input SPARQL query into a SQL query over the VDB,
and then having the federation system evaluate such query.
Challenges. In OBDF, multiple sources are mapped to the same ontology, therefore the
translated SQL query typically contains federated joins, i.e., joins involving different data
sources. Computing such joins is expensive [13], since they cannot be evaluated locally and
may cause large amounts of data to be transfered from the sources to the federation system.
Besides, as also observed in our preliminary experiments, evaluating queries over inefficient
sources (e.g., CSV files) slows down the overall query answering significantly. This problem is
not addressed by optimization techniques in current OBDA systems, since they treat all tables
present in the VDB schema in the same way. Therefore, new techniques to optimize query
answering in OBDF are needed.


3. Query Optimization in Ontology-based Data Federation
We study query optimization in OBDF, by devising a set of optimization techniques specifically
tailored to federated data sources. We propose a set of “hints” based on empty federated joins,
data redundancy, and materialized views, which can be pre-computed by analyzing the sources,
the mappings, and the ontology, and we provide a novel optimized query unfolding algorithm
that minimizes the number of federated joins and reduces the access to inefficient data sources.
Method. The overall method consists of two parts: (1) an off-line hint pre-computation part,
and (2) an on-line translation optimization part. We provide an algorithm to pre-compute the
mentioned types of “hints”, based on a static analysis of the mappings and the ontology. The
number of hints is bound to 𝑛2 , where 𝑛 is the number of attributes in the DB schema. If the data-
source is updated, then the hints are no-longer valid. Hence, our approach works best in mostly
static scenarios, e.g., where new data undergoes a period of validation before being released.
   Our concrete approach to query optimization consists of a set of query optimization rules
that take into account the pre-computed hints, a simple cost model that distinguishes efficien-
t/inefficient and static/dynamic data sources, and a novel optimized query unfolding algorithm
that minimizes the number of federated joins and the access to inefficient data sources.
Example. Consider an OBDF specification ℱ = (𝒯 , ℳ, Σ) and a data federation instance
D of Σ containing the static sources 𝑆1 –𝑆4 to be federated. The VDB schema Σ contains,
possibly among others, the following relation schemas, which are mapped to the ontology
𝒯 : Product1 (id , name, reviewer ) for 𝑆1 , Product2 (num, nam, rew ) for 𝑆2 (where id and
num are primary keys of Product1 and Product2 respectively) Person3 (pid , pname) for 𝑆3 ,
and PerInf4 (id , name, address) for 𝑆4 . We also assume to have the empty federated join
hint Product1 ⋊  ⋉id=num Product2 =D ∅ and the redundancy hint 𝜋pid,pname (Person3 ) ≡D
 𝜋!, #, $, %# (( 𝜋!/!' (Product1) ∪ 𝜋!/#() (Product2) )⋈!*!+                    q
                                                                                          𝜋!, #, $, %# (( 𝜋!/!',!+/!',#/#,)- (Product1) ∪ 𝜋!/#(),!+/#(),#/#,) (Product2)) ⋈!+*!.   q1
               ( 𝜋!+/!',#/#,)- (Product1) ∪ 𝜋!+/#(),#/#,) (Product2)) ⋈!+*!.                            ( 𝜋!./!',$/$-/!-0-$ (Product1) ∪ 𝜋!./#(),$/$-0 (Product2)) ⋈$*%!
               ( 𝜋!./!',$/$-/!-0-$ (Product1) ∪ 𝜋!./#(),$/$-0 (Product2)) ⋈$*%!
                                                                                                     ( 𝜋%!/%!',%#/%#,)- (Person3) ∪ 𝜋%!/!',%#/#,)- (PerInf4 )))
               ( 𝜋%!/%!',%#/%#,)- (Person3) ∪ 𝜋%!/!',%#/#,)- (PerInf4 )))


 SELECT * WHERE {                           Q
                                                    𝜋!, #, $, %# ((𝜋!/!',#/#,)-,$/$-/!-0-$ (Product1) ∪   q3
                                                                                                                   𝜋!, #, $, %# (                                                           q2

 ?x a Product; proName ?y; hasReviewer ?z .             𝜋!/#(),#/#,),$/$-0 (Product2)) ⋈$*%!                        ( 𝜋!/!',#/#,)-,$/$-/!-0-$ (Product1) ∪ 𝜋!/#(),#/#,),$/$-0 (Product2)) ⋈$*%!
 ?z preName ?n }                                       (𝜋%!/%!',%#/%#,)- (Person3)))                                ( 𝜋%!/%!',%#/%#,)- (Person3) ∪ 𝜋%!/!',%#/#,)- (PerInf4 )))



Figure 2: An example of optimizing a translated SQL query based on the pre-computed hints.

                                       200K products                                                                                     2M products
             1000000                                                                                           10000000

              100000                                                                                           1000000

                                                                                                                100000
                10000                                                         hom                                                                                            hom
                                                                                                                 10000
                 1000                                                         het                                                                                            het
                                                                                                                  1000
                                                                              hom-opt                                                                                        hom-opt
                    100
                                                                              het-opt                               100                                                      het-opt
                     10                                                                                              10
                      1                                                                                               1
                          Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10Q11Q12                                                            Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10Q11Q12



Figure 3: Results (time in ms) of SQL query evaluation for the BSBM experimentation.


𝜋pid/id,pname/name (PerInf4 ), and that 𝑆1 , 𝑆2 , and 𝑆3 are labelled as efficient, while 𝑆4 as ineffi-
cient.
   Figure 2 illustrates how the above hints and labelling are exploited to unfold a SPARQL
query 𝑄. The translation proceeds as follows, where we assume that query operators are
processed from left to right. (1) The federated join ⋊   ⋉𝑖=𝑖1 is first unfolded into a union of 4
joins and then traslated into 𝜋𝑖/𝑖𝑑,𝑖1/id,𝑛/name (Product1 ) ∪ 𝜋𝑖/num,𝑖1/num,𝑛/nam (Product2 ) on the
basis of the empty federated join hint and the elimination of self-joins. (2) Similarly to the
previous step, the intermediate query 𝑞1 is translated into 𝑞2 . (3) Finally, on the basis of the
data redundancy hint and the given source labeling, the union between Person3 and PerInf4
is removed, and only the projection over the efficient source Person3 is kept in the resulting
query 𝑞3 . Each step reduces the cost of 𝑞.
Experimentation. We have carried out an extensive experiment to verify the effectiveness of
the proposed optimizations, where we have employed main-stream DB engines (like PostgreSQL,
MySQL, and MongoDB), the OBDA system Ontop, and the data federation engine Teiid. Based
on the BSBM benchmark [14], we have created two OBDF specifications: a homogeneous one,
where we have used as sources five relational databases, and a heterogeneous one, where some
of these five data sources are translated into NoSQL DBs and CSV files. For scalability testing,
we have generated three groups of instances using the BSBM data generation tool, setting the
number of products to be 20K, 200K, and 2M, respectively. Thus, each OBDF specification has
three instances. For space reasons, we show in Figure 3 only the results of evaluating the SQL
translations of 12 SPARQL queries in BSBM for the instances with 200K and 2M products, where
hom and het denote the results of evaluating the SQL queries translated by Ontop over the homo-
geneous and heterogeneous situation respectively, and homopt and hetopt denote the results of
evaluating the optimized SQL queries over the two situations. From Figure 3, we can observe that
our optimization strategies are effective for OBDF (e.g, for 200K products and 𝑄2, we observe
a reduction from 2942.2 ms to 24.9 ms in query answering time for the homogeneous situation)
Acknowledgments
This research has been partially supported by the EU H2020 project INODE (grant agreement
No. 863410), by the Italian PRIN project HOPE (2019-2022), by the Free University of Bozen-
Bolzano through the project MP4OBDA, and by the “Fusion Grant” project HIVE. D. Calvanese
is supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded
by the Knut and Alice Wallenberg Foundation. We thank our colleagues, in particular Francesco
Corcoglioniti, for their discussions and feedback.


References
 [1] A. Poggi, D. Lembo, D. Calvanese, G. De Giacomo, M. Lenzerini, R. Rosati, Linking data to
     ontologies, J. on Data Semantics 10 (2008) 133–173. doi:10.1007/978-3-540-77688-8_
     5.
 [2] D. Calvanese, B. Cogrel, S. Komla-Ebri, R. Kontchakov, D. Lanti, M. Rezk, M. Rodriguez-
     Muro, G. Xiao, Ontop: Answering SPARQL queries over relational databases, Semantic
     Web J. 8 (2017) 471–487. doi:10.3233/SW-160217.
 [3] G. Xiao, D. Calvanese, R. Kontchakov, D. Lembo, A. Poggi, R. Rosati, M. Zakharyaschev,
     Ontology-based data access: A survey, in: Proc. of the 27th Int. Joint Conf. on Artificial
     Intelligence (IJCAI), IJCAI Org., 2018, pp. 5511–5519. doi:10.24963/ijcai.2018/777.
 [4] G. Xiao, L. Ding, B. Cogrel, D. Calvanese, Virtual Knowledge Graphs: An overview of
     systems and use cases, Data Intelligence 1 (2019) 201–223. doi:10.1162/dint_a_00011.
 [5] B. Motik, B. Cuenca Grau, I. Horrocks, Z. Wu, A. Fokoue, C. Lutz, OWL 2 Web Ontology
     Language Profiles (Second Edition), W3C Recommendation, World Wide Web Consortium,
     2012. Available at http://www.w3.org/TR/owl2-profiles/.
 [6] D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, R. Rosati, Tractable reasoning
     and efficient query answering in description logics: The DL-Lite family, J. of Automated
     Reasoning 39 (2007) 385–429. doi:10.1007/s10817-007-9078-x.
 [7] F. Priyatna, O. Corcho, J. F. Sequeda, Formalisation and experiences of R2RML-based
     SPARQL to SQL query translation using morph, in: Proc. of the 23rd Int. World Wide Web
     Conf. (WWW), 2014, pp. 479–490. doi:10.1145/2566486.2567981.
 [8] D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, A. Poggi, M. Rodriguez-Muro,
     R. Rosati, M. Ruzzi, D. F. Savo, The Mastro system for ontology-based data access, Semantic
     Web J. 2 (2011) 43–53.
 [9] J. F. Sequeda, D. P. Miranker, Ultrawrap: SPARQL execution on relational data, J. of Web
     Semantics 22 (2013) 19–39.
[10] G. Xiao, D. Lanti, R. Kontchakov, S. Komla-Ebri, E. Güzel-Kalayci, L. Ding, J. Corman,
     B. Cogrel, D. Calvanese, E. Botoeva, The virtual knowledge graph system Ontop, in: Proc.
     of the 19th Int. Semantic Web Conf. (ISWC), volume 12507 of Lecture Notes in Computer
     Science, Springer, 2020, pp. 259–277. doi:10.1007/978-3-030-62466-8_17.
[11] A. P. Sheth, J. A. Larson, Federated database systems for managing distributed, heteroge-
     neous, and autonomous databases, ACM Computing Surveys 22 (1990) 183–236.
[12] L. M. Haas, E. T. Lin, M. A. Roth, Data integration through database federation, IBM
     Systems J. 41 (2002) 578–596.
[13] A. Schwarte, P. Haase, K. Hose, R. Schenkel, M. Schmidt, FedX: Optimization techniques
     for federated query processing on linked data, in: Proc. of the 10th Int. Semantic Web Conf.
     (ISWC), volume 7031 of Lecture Notes in Computer Science, Springer, 2011, pp. 601–616.
[14] C. Bizer, A. Schultz, The Berlin SPARQL benchmark, Int. J. on Semantic Web and
     Information Systems 5 (2009) 1–24.