<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Towards Faster Reformulation-based Query Answering on RDF Graphs with RDFS Ontologies</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Maxime</forename><surname>Buron</surname></persName>
							<email>maxime.buron@cs.ox.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="institution">Univ Oxford</orgName>
								<address>
									<settlement>Oxford</settlement>
									<country key="GB">United Kingdom</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Univ Rennes</orgName>
								<address>
									<settlement>Lannion</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Cheikh-Brahim</forename><surname>El Vaigh</surname></persName>
							<email>cheikh-brahim.el-vaigh@irisa.fr</email>
						</author>
						<author>
							<persName><forename type="first">François</forename><surname>Goasdoué</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Univ Rennes</orgName>
								<address>
									<settlement>Lannion</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Towards Faster Reformulation-based Query Answering on RDF Graphs with RDFS Ontologies</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">61D212CFC4F56252DB0D49CD3F37BD0E</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T01:34+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>RDF/S</term>
					<term>query answering</term>
					<term>reformulation</term>
					<term>optimization</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Answering queries on RDF knowledge bases is a crucial data management task, usually performed through either graph saturation or query reformulation. In this short paper, we optimize our recent stateof-the-art query reformulation technique for RDF graphs with RDFS ontologies [2], and we report on preliminary encouraging experiments showing performance improvement by up to two orders of magnitudes!</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The RDF graph data model and its associated SPARQL query language are the two well-established W3C standards for sharing data and knowledge, especially on the Semantic Web through DBpedia, Wikidata, or more generally the Linked Open Data cloud. Their rapid adoption over the last decade has made the study of dedicated query answering techniques a hot topic in the data management community at large, to design database management or data integration systems, e.g., <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b2">3]</ref>.</p><p>Answering queries on an RDF graph (graph in short) is not an easy task as it combines database-style query evaluation with AI-style reasoning, to answer queries from both the explicit and implicit information that the graph models. Saturation-based query answering aims at computing first the saturation of the graph by adding to it all its implicit information obtained through reasoning. Then, the answers to a query on the graph are obtained by simply posing the query to a data management system (DMS) that stores the graph saturation. This technique provides fast query answering in general, but the saturation needs possibly costly maintenance upon updates <ref type="bibr" target="#b5">[6]</ref>. By contrast, reformulation-based query answering amounts to transforming the query into a query reformulation, so that the answers to the query are obtained by posing the query reformulation to a DMS that stores only the explicit graph part. This technique does not need maintenance upon updates, as it does not rely on the graph saturation, but query reformulations may be too complex to be efficiently evaluated by a DMS <ref type="bibr" target="#b3">[4]</ref>.</p><p>In this paper, we consider the optimization of the reformulation-based query answering approach that raises so far unsolved performance issues. In particular, we start with our recent technique <ref type="bibr" target="#b1">[2]</ref> that applies to the core SPARQL conjunctive queries, a.k.a. Basic Graph Pattern Queries (BGPQs), and RDF graphs with RDF Schema (RDFS) ontologies; this technique supports the four RDFS ontological constraints (subclass, subproperty, domain, and range) between classes (i.e., node types) and properties (i.e., edge labels), and all the RDF entailment rules <ref type="bibr" target="#b9">[10]</ref> that allow reasoning with these constraints. Other techniques only focus on a subset of all these RDF entailment rules <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b3">4]</ref>. Crucially, query answering performance decreases as the number of supported RDF entailment rules increases (because the complexity of its query reformulation increases), hence efficient reformulation-based query answering in the setting of <ref type="bibr" target="#b1">[2]</ref> is particularly challenging.</p><p>Our contributions are twofold. First, in Sec. 2, we identify two causes of limited performance for the reformulation-based query answering technique in <ref type="bibr" target="#b1">[2]</ref> and those it encompasses (e.g., <ref type="bibr" target="#b5">[6]</ref>), and we propose three optimizations to address them. They all consists in identifying and avoiding useless query reformulation evaluation efforts either by reasoning on a query reformulation itself via query containment or by reasoning on a query reformulation based on the queried graph using the cardinalities (number of instances) of its classes and properties, or a summary of it. Second, in Sec. 3, we discuss in which order these optimizations must be applied to provide best performance and we report on experiments we made to assess the effectiveness of our optimization workflow. As we shall see, our workflow makes reformulation-based query answering (i) feasible when it was failing or timing out due to too complex query reformulations, (ii) faster when it was feasible (up to 2 orders of magnitude), and (iii) better than saturation-based query answering in a non-negligible number of occasions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Optimizing reformulation-based query answering</head><p>The reformulation-based query answering technique we devised in <ref type="bibr" target="#b1">[2]</ref> computes for a given BGPQ q a reformulation q ′ of it w.r.t. the RDFS ontology O of the queried graph G and the set R of entailment rules. q ′ is expressed as a union of BGPQs (UBGPQ) that enumerates the -worst case exponentially manyspecializations of q w.r.t. O and R. We present next three optimizations to apply to q ′ before it is sent for evaluation to some DMS that stores G; they consist in removing useless BGPQs from q ′ to speed up its evaluation time. Containment-based BGPQ elimination. Because every BGPQ within the UBGPQ reformulation q ′ is computed independently from the others through different reasoning paths, it may happen that q ′ contains redundant BGPQs, i.e., whose answer sets are subsets of those of other BGPQs in q ′ . These can be identified thus pruned away from q ′ by comparing every pair of its BGPQs through containment checking. We name this first optimization UBGPQ minimization (M). We note that such minimization has been used in the literature in other reformulation-based query answering settings, e.g., existential rules <ref type="bibr" target="#b7">[8]</ref>.</p><p>Cardinality-based and summary-based empty BGPQ elimination. An essential property of reformulation-based query answering, which is common to all reformulation-based techniques and not specific to <ref type="bibr" target="#b1">[2]</ref>, is that the reformulation q ′ computed from q w.r.t. O and R allows retrieving the answers to q on all the graphs with ontology O. Though theoretically nice, the generality of q ′ needlessly limits its performance: q ′ may contain BGPQs with no answer on the specific queried graph G, while evaluating them may take (significant) time.</p><p>BGPQs in q ′ may have no answer just because G has no instance for classes or properties these BGPQs mention. To prune them away from q ′ , we compute and store the cardinalities of the classes and properties in G, which, clearly, can be maintained very fast upon graph updates. We name this second optimization empty relation pruning (ER). Also, BGPQs in q ′ may have no answer because, though every class and property they mention has instances, their selection or join conditions cannot be matched in G. Thus, for each BGPQ in q ′ , we check whether it has no answer on a summary of G, in which case it is pruned away from q ′ . We call this optimization summary pruning (S). To do this, we reuse the notion of RDF graph summary we defined in <ref type="bibr" target="#b4">[5]</ref>: a summary is an RDF graph G ≡ computed as the quotient graph of the graph G to summarize, using an RDF equivalence relation ≡ over the nodes of G (discussed shortly); the particular ≡ to use depends on the target summary usage. We recall that the quotient operation basically fuses every set of equivalent nodes into a single new node representing them all in G ≡ ; we say that this latter node represents the former equivalent ones, in particular it inherits of all their types and edges. We pointed out in <ref type="bibr" target="#b0">[1]</ref> an important property of a summary: a BGPQ q has no answer on G if the BGPQ q ≡ has no answer on G ≡ , where q ≡ is obtained from q by replacing each mention of a G's node by the G ≡ 's node it is represented by. We remark here that q may have no answer on G while q ≡ does have some answer on G ≡ . Importantly, the ability of a summary to prune BGPQs depends on the choice of RDF equivalence relation ≡, all of those presented in <ref type="bibr" target="#b4">[5]</ref> have been devised for graph visualization. Therefore, we define a novel RDF equivalence relation for BGPQ pruning that behaves as follows: two nodes a,b in G are equivalent, noted a ≡ b, iff (i) they both have the same class (i.e., node type), or (ii) they are both equivalent to a third node c in G, i.e., a ≡ c and b ≡ c. The rationale for this definition is that w.r.t. data compression we want a single representative for every class/type (item (i) above) and w.r.t. of the evaluation of BGPQs we want to capture joins between different classes/types (item (ii) above). </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Optimization workflow and experimental results</head><p>Optimization workflow. With the above three optimizations in place, the crucial question is now in which order they should be applied to a reformulation q ′ to maximize the performance gain. After experimenting on the possible orders, the best we found is: ER, then M, and finally S. The intuition behind this optimization workflow is that ER is very fast to perform (linear in the size of q ′ ) and may reduce the cost of both M by avoiding containment checks and S by avoiding issuing BGPQs to the DMS while we "statically" know that they have no answer. In practice, we observed that ER does help M but not S, because DMSs are smart enough to efficiently detect queries with an empty relation. Then, we chose to apply M because it is independent of the size of G and, though it consists in a number quadratic in the size of q ′ of NP-Complete BGPQ containment checks, it is performed (i) in-memory and (ii) on BGPQs that are not large in practice (although the UBGPQ q ′ that unions them may be large). By contrast, the difficulty of S augments with the size of q ′ and of G ≡ that grows with the size of G. Applying M before S allows handling larger graphs. In our experiments, the percentage of BGPQs pruned by ER, M and S is respectively in average 32%, 19% and 25%, i.e., 76% in total! Further, while the time spent in applying ER is negligible (always less than 2ms), in average, 3.73% of the query answering time is spent in applying M and 24.97% in applying S (dropping to 14.32%, if we disregard the four queries with no answer for which most of the query answering time is spent in applying S as expected). Experiments. We have implemented our optimizations in OntoSQL<ref type="foot" target="#foot_0">3</ref> [1,2,4,6], a Java platform for RDF data management on top of Postgres v12.7, in which we stored a LUBM <ref type="bibr" target="#b6">[7]</ref> graph G of 100M edges with its summary G ≡ of 8.3M edges (7.6% G's size). For our experiments, we used a CentOs 7.5 linux server with a 2.7GHz Intel Core i7 CPU, 160GB of RAM and a fast HDD. Loading G took 2h27min and building G ≡ took 13min. We reused 26 queries from <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b0">1,</ref><ref type="bibr" target="#b3">4</ref>] with 1 to 11 joins and 0 to 20M answers. In Figure <ref type="figure" target="#fig_0">1</ref>, we can see their corresponding reformulation size (number of BGPQs in the UBGPQ) before optimization (shown as x-axis labels), and their answering times using the reformulation without and with part of or all our optimization workflow; all are compared to the saturationbased approach. We set a timeout to 10min; it is reached when answering Q22, Q20 and Q17 without optimization. We observe that our optimization workflow yields performance improvement except on very simple queries where its overhead shows; we remark performance gain up to one order of magnitude for Q20, Q04 and two orders for Q15, and quite surprisingly optimized reformulationbased query answering is faster than saturation-based query answering on Q04, Q15 and Q03.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Perspectives</head><p>A direct perspective to this work is to support fast summary maintenance upon graph updates. We will investigate this using the union-find-delete data structure, which is suited to model and update equivalence relations.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Query answering times comparison using different optimization workflows.</figDesc><graphic coords="4,134.77,115.83,345.83,132.65" type="bitmap" /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">https://ontosql.inria.fr</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>This work was partially supported by the ANR project CQFD (ANR-18-CE23-0003).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Revisiting RDF storage layouts for efficient query answering</title>
		<author>
			<persName><forename type="first">M</forename><surname>Buron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Goasdoué</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Manolescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Merabti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mugnier</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Workshop on Scalable Semantic Web Knowledge Base Systems</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Reformulation-based query answering for RDF graphs with RDFS ontologies</title>
		<author>
			<persName><forename type="first">M</forename><surname>Buron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Goasdoué</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Manolescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mugnier</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ESWC</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Ontology-based RDF integration of heterogeneous data</title>
		<author>
			<persName><forename type="first">M</forename><surname>Buron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Goasdoué</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Manolescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mugnier</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">EDBT</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Optimizing reformulation-based query answering in RDF</title>
		<author>
			<persName><forename type="first">D</forename><surname>Bursztyn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Goasdoué</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Manolescu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">EDBT</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">RDF graph summarization for firstsight structure discovery</title>
		<author>
			<persName><forename type="first">F</forename><surname>Goasdoué</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Guzewicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Manolescu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">VLDB J</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="issue">5</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Efficient query answering against dynamic RDF databases</title>
		<author>
			<persName><forename type="first">F</forename><surname>Goasdoué</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Manolescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roatis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">EDBT</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">LUBM: A benchmark for OWL knowledge base systems</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Heflin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Web Sem</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="issue">2-3</biblScope>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Sound, complete and minimal ucq-rewriting for existential rules</title>
		<author>
			<persName><forename type="first">M</forename><surname>König</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Leclère</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mugnier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Thomazo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Semantic Web Journal</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="451" to="475" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">RDFox: A highly-scalable RDF store</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Nenov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Piro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Motik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Horrocks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Banerjee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ISWC</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<ptr target="https://www.w3.org/TR/rdf11-mt/" />
		<title level="m">W3C: RDF 1.1 Semantics</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
