A Geospatial Join Optimization for Federated
GeoSPARQL Querying
Antonis Troumpoukisa , Stasinos Konstantopoulosa and
Nefeli Prokopaki-Kostopouloub
a
 Institute of Informatics and Telecommunications, NCSR ‘Demokritos’, Ag. Paraskevi 15341, Greece
b
 Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, Athens 16122,
Greece


                                         Abstract
                                         We present a join optimization method for federated linked geospatial data. This optimization targets
                                         within-distance queries in cases where the shapes that are compared are served from different sources
                                         of the federation. This operation is computationally expensive, because it cannot be answered from
                                         the spatial index of any of the individual data sources in the federation. Our optimization augments
                                         the subqueries prepared for each source with additional restrictions on topological relations that must
                                         be satisfied and that (a) can be answered from the spatial index of the individual data sources; (b) do
                                         not change the semantics of the query. We evaluate our optimization on data and queries derived
                                         from a real-world workflow over land usage data. Evaluation shows that our optimization substantially
                                         improves query processing time.

                                         Keywords
                                         Federated and distributed query processing, linked geospatial data, GeoSPARQL query processing


1. Introduction
Linked geospatial data brings into the scope of the Semantic Web and its technologies a wealth
of datasets that combine semantically-rich descriptions of resources with these resources’
geo-location. There are, however, various Semantic Web technologies where technical work is
needed in order to achieve the full integration of geospatial data, and federated query processing
is one of these technologies.
   Indexing and query optimization are key features of all database management systems, both
geospatial and relational. For instance, consider the query that retrieves “all restaurants within
1km from Syntagma Square, Athens, Greece.” A centralized geospatial database relies on its
geospatial index to exclude instances that cannot possibly be within 1km from Syntagma Square,
so that the distance only needs to be computed for a small number of candidates. A query
optimizer is aware that it is much more efficient to first retrieve the location of Syntagma Square

GeoLD 2022: 5th International Workshop on Geospatial Linked Data co-located with ESWC, May 30 2022, Hersonissos,
Greece
Envelope-Open antru@iit.demokritos.gr (A. Troumpoukis); konstant@iit.demokritos.gr (S. Konstantopoulos);
nefelipk@di.uoa.gr (N. Prokopaki-Kostopoulou)
Orcid 0000-0003-1078-8121 (A. Troumpoukis); 0000-0002-2586-1726 (S. Konstantopoulos); 0000-0002-8645-1341
(N. Prokopaki-Kostopoulou)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
and then filter for restaurants, than it would be to first retrieve the location of all restaurants in
the world. So query optimization ensures that query execution is organized optimally and not
following the ordering of the sub-queries in the original query.
   Although both indexing and query optimization are based on mature technologies, it is not
always straightforward to port these technologies to linked geospatial data and, even more
so, to federated query processing of linked geospatial data. Suppose that we are federating a
database that knows points of interest such as “Syntagma Square” with a database that knows
commercial points of interest, including restaurants. There is no single geospatial index that
can restrict the restaurants sub-query to the instances can potentially be solutions.
   In the remainder of this paper, we present some background information (Section 2), and we
describe our optimization for federated within-distance queries in detail (Section 3). We imple-
ment this technique in Semagrow and we compare the optimized version with the unoptimized
one (Section 4). Finally, we discuss relevant work (Section 5) and conclude (Section 6).


2. Background
2.1. The GeoSPARQL query language
The GeoSPARQL specification [1] defines a set of classes and properties for asserting and query-
ing geospatial information. Each feature is linked with a geometry with the geo:hasGeometry
property1 , while the geo:asWKT property is used to provide the concrete geographical shape
of a geometry as an RDF literal of the geo:wktLiteral datatype.2 Given the above, the link
between features and their concrete coordinates follows the pattern:
r geo:hasGeometry g . g geo:asWKT "coords"^^geo:wktLiteral .

    Naturally, inference about geo:wktLiteral values falls outside RDF graph entailment and
can only be performed by specialized geospatial databases. Such entailment is accessed via
filter expressions that contain geospatial functions. For example:
SELECT ?r1 ?r2 WHERE { ?r1 geo:hasGeometry ?g1 . ?r2 geo:hasGeometry ?g2 .
?g1 geo:asWKT ?w1 . ?g2 geo:asWKT ?w2 . FILTER(geof:sfIntersects(?w1,?w2)) }

uses the geof:sfIntersects function3 to access the geospatial operator that computes if the
WKT values ?w1 and ?w2 retrieved from the graph pattern intersect; such a query fetches all
features that have intersecting geometries. Notice that a geospatial join is essentially a cross
product that is being filtered by a geospatial condition comparing values from both expressions.

2.2. Join evaluation in federated query processors
Bind join [2] is a join implementation suitable for distributed and federated environments
because it is designed to drastically reduce the communication costs of joining two relations.
    1
      We use geo: for the http://www.opengis.net/ont/geosparql # namespace.
    2
      Two alternative serializations are foreseen by GeoSPARQL, geo:wktLiteral and geo:gmlLiteral , and two
datatype properties, geo:hasWKT and geo:hasGML . We restrict the discussion in this paper to the WTK serialization,
and it is straightforward to transfer this discussion to GML or any other serialization.
    3
      We use geof: for http://www.opengis.net/def/function/geosparql/
The idea, similar to the nested loop join, is that if we have a very selective outer relation of a join,
we can pass the results as bindings to the inner relation in order to filter out a large number of
tuples. The difference is that bind join can also work for remote queries since it substitutes the
results of the outer relation as bindings to the query of the inner relation.
   A naive implementation of bind join [3, 4] can be highly inefficient since it will create and
execute a different query for each result of the outer relation. A more elaborate solution for
reducing the overall number of queries produced is to group multiple bindings into a single
query [5, 6]. Fortunately, SPARQL 1.1 specification foresees a VALUES keyword as a mechanism
for passing multiple variable bindings at once. In addition, Schwarte et al. [6] proposed a
technique that can simulate this behavior by using a more complex UNION expression in order
to support legacy SPARQL 1.0 endpoints. An important component for the implementation of a
bind join evaluation strategy is a SPARQL query executor. Given a SPARQL endpoint, a SPARQL
query and a set of bindings, the query executor prepares the SPARQL query to be issued to
the endpoint (either using the VALUES expression or the UNION transformation), makes the
connection with the endpoint, issues the query, and fetches the result set to the federator.
   Filter pushdown [7] is a standard optimization technique in query processing. The basic idea
is that SPARQL filters should be “pushed” as deep in the execution plan as possible, because the
query processing time can be reduced by filtering out the data earlier during the evaluation of
the query. Especially in the context of federated query processing, by pushing the evaluation
of the filters in the source endpoints, we have a reduced query processing time because the
results are filtered before being transferred over the network. Moreover, filter pushdown can be
important in the context of GeoSPARQL filters, because each federated GeoSPARQL endpoint
has its own spatial index for evaluating geospatial functions, so it makes sense for a federated
engine to push these filters in the federated endpoints.


3. Optimization of federated within-distance queries
3.1. Preliminaries
Let 𝑎 and 𝑏 be two spatial objects, that can be points, lines and/or polygonal areas. We say that
𝑎 and 𝑏 intersect iff they share any portion of space. Moreover, we say that 𝑎 contains 𝑏 iff no
points of 𝑏 lie in the exterior of 𝑎, and at least one point of the interior of 𝑏 lies in the interior of
𝑎. Let 𝑑 be a length measurement. Then, the buffer of size 𝑑 around 𝑎 is a spatial object that
contains all points whose distances from 𝑎 are less or equal to 𝑑. Finally, the minimum bounding
box of 𝑎 is a rectangle whose sides are parallel to the 𝑥 and 𝑦 axises and minimally enclose 𝑎; it
is formed by the minimum and maximum p𝑥, 𝑦q coordinates of 𝑎.
   Let us now focus on GeoSPARQL queries. A variable binding is a pair 𝑢{𝑣, where 𝑢 is a
variable and 𝑣 is an RDF value. Each element of the result set of a GeoSPARQL query is a query
binding, which is a set of variable bindings. Let 𝑏 be a query binding. We say that 𝑢 is a bound
variable w.r.t. 𝑏 if 𝑏 contains a variable binding 𝑢{𝑣. In this case, we write 𝑏p𝑢q to denote the
RDF value 𝑣. Moreover, we say that 𝑢 is a free variable w.r.t. 𝑏 iff 𝑢 is not bound w.r.t. 𝑏. Finally,
let 𝑆 be a set of query bindings and 𝑢 be a variable. We say that 𝑢 is a free (resp. bound) variable
w.r.t. 𝑆 if it is a free (resp. bound) w.r.t. to all query bindings 𝑏 ∈ 𝑆. For the rest of the paper,
when we will refer to a binding we will mean a query binding.
3.2. Intuition and example
A federated within-distance query is a query that involves fetching pairs of shapes that are
served from different federated sources and their distance is less or equal than a given length
measurement. Since the GeoSPARQL specification does not offer any specialized function for
the within-distance operator (contrary to e.g., PostGIS) such queries contain filters of the form:
FILTER( geof:distance(?x, ?y, uom) < d ) .

where uom is a unit of measure URI, d is a numeric literal, and the bindings of ?x and ?y are
retrieved from different sources of the federation.
   We design our optimization technique under the general assumption that the federation
engine evaluates all federated joins in a bind join fashion with a filter pushdown optimization
(cf. Subsection 2.2 for details). In this case, the federator operates by fetching “left-hand” WKTs
to partially bind two-variable functions and then it pushes the filter to the “right-hand” endpoint.
Notice that the filter pushed in “right-hand” endpoint contains a single free variable, because
the other variable is bound with WKTs that have been already retrieved.
   Recall that in the naive implementation of bind join, for each binding of the left part of the
join, the federator issues one query to the endpoint of the right part of the join. Assume, without
loss of generality, that the bindings of ?x are retrieved from the left part of the join. Thus, a
query issued by the federator to the right-hand endpoint should contain a filter of the form:
FILTER( geof:distance(WKT_LITERAL, ?y, uom) < d ) .

where WKT_LITERAL is obtained by the bindings of the left part of the join. Such queries are
usually slow, because the evaluation of this filter cannot use the spatial index of the source.
  The evaluation of the source query in the remote endpoint can be sped-up if the federator
adds an additional geospatial filter for filtering out all irrelevant candidate shapes for ?y :
FILTER( geof:sfIntersects(?y, BBOX_BUF_LITERAL) ) .

where BBOX_BUF_LITERAL is the minimum bounding box of the buffer of size 𝑑 around
WKT_LITERAL . This additional filter is evaluated by the geospatial source using its spatial
index, thus giving fast access to a subset of shapes, in which the exact distance filter is then
applied to. Notice that this optimization can be performed during query execution time (and
not before evaluating the query), because the WKT_LITERAL is not known at the planning time.
   In the following, we will extend this approach so as to apply in more elaborate implementa-
tions of bind join, where multiple bindings can be passed in a single query.

3.3. Join optimization algorithm
We assume the existence of some helper routines that their implementations are not included
here as separate algorithms. Given a WKT literal 𝑎, a numeric literal 𝑑 and a unit of measure 𝑢,
the routine Bufferp𝑎, 𝑑, 𝑢q returns a WKT literal that represents the buffer of size 𝑑 around 𝑎
measured in units of type 𝑢. Given a WKT literal 𝑎, the routine Mbbp𝑎q returns a WKT literal
that represents the minimum bounding box of 𝑎. Given an endpoint 𝐸, a query 𝑄 and a set
of bindings 𝐵, the routine EvaluateRemoteQueryp𝐸, 𝑄, 𝐵q implements the behavior of the
query executor (cf. Subsection 2.2) and returns a set of bindings obtained from the source. Our
Algorithm 1 EvaluateRemoteQueryOpt
Input: a GeoSPARQL endpoint 𝐸, a GeoSPARQL query 𝑄, and a set of bindings 𝐵.
Output: a set of bindings 𝑅
 1: if 𝑄 contains a filter of the form FILTER( geof:distance(?w1,?w2, 𝑢) < 𝑑 )
           where 𝑢 is a unit of measure URI, 𝑑 is a numeric literal,
           one of ?w1 , ?w2 is a free variable w.r.t 𝐵 (denoted as 𝑥), and
           one of ?w1 , ?w2 is a bound variable w.r.t 𝐵 (denoted as 𝑦) then
 2:     𝑧 ≔ a fresh variable
 3:     Obtain 𝑄 ′ by augmenting 𝑄 with the filter FILTER( geof:sfIntersects( 𝑥, 𝑧) )
 4:     𝐵′ ≔ t𝑏 ∪ t𝑧{MbbpBufferp𝑏p𝑦q, 𝑑, 𝑢qqu ∶ 𝑏 ∈ 𝐵u
 5:     𝑅′ ≔ EvaluateRemoteQueryp𝐸, 𝑄 ′ , 𝐵′ q , 𝑅 ≔ t𝑟 ´ t𝑧{𝑣u ∶ 𝑟 ∈ 𝑅′ u
 6:     return 𝑅
 7: else
 8:     return EvaluateRemoteQueryp𝐸, 𝑄, 𝐵q
 9: end if


goal is to extend EvaluateRemoteQuery with the optimization sketched in Subsection 3.2,
adapted for the efficient bind join implementation that groups multiple query bindings from
the left-hand endpoint in a single query for the right-hand endpoint. Thus, in Algorithm 1, we
define EvaluateRemoteQueryOpt, which behaves as a wrapper for the original executor.
   The first step (line 1) is to identify whether the input query is in the form of our interest.
We consider within-distance joins (i.e., a filter of the form geof:distance(?w1,?w2, 𝑢) < 𝑑),
and the join is a federated one (i.e., only one of the arguments ?w1 , ?w2 is bound from the
bindings that came from the left part of the join). When the optimization is not applicable,
EvaluateRemoteQueryOpt falls back to the original EvaluateRemoteQuery behavior.
   The next step (lines 2-4) is to rewrite the query. After we identify that the free variable is
the variable 𝑥 and the bound variable is the variable 𝑦 (line 1), we add an additional filter that
forces 𝑥 to intersect with the minimum bounding box of the buffer of size 𝑑 around 𝑦 (line 3).
Notice though that unlike Subsection 3.2, we have several bindings for the variable 𝑦, thus
we cannot hard-code the bounding box directly in the query. Instead, the additional filter is
geof:sfIntersects( 𝑥, 𝑧) , and we provide additional bindings for the new fresh variable 𝑧 (line
2) by extending each binding 𝑏 from 𝐵 with the binding 𝑧{𝑣, where 𝑣 is obtained by calculating
the minimum bounding box of the buffer of size 𝑑 around the value that 𝑏 assigns in 𝑦 (line 4).
   The final step (line 5) is to use the original EvaluateRemoteQuery with the rewritten query
and the extended bindings to obtain the result set of the query. Since the original query did
not contain the newly constructed variable 𝑧, we need to remove through a post-processing all
variable bindings that refer to 𝑧 from the result set.

3.4. Correctness of the optimization
Lemma 1. Let 𝑝, 𝑞 be WKT literals and 𝑑 be a length measurement. If the distance between 𝑝 and
𝑞 is less than 𝑑, then 𝑝 intersects with the minimum bounding box of the buffer of size 𝑑 around 𝑞.
Proof. Let 𝑝.𝑞 be geometric literals, such that the distance between 𝑝 and 𝑞 is less than 𝑑. Then,
by definition, 𝑝 intersects with the buffer of size 𝑑 around 𝑞. Since all points of a shape 𝑠 belong
also to the minimum bounding box of 𝑠, we can conclude that 𝑝 also intersects with the minimum
bounding box of the buffer of size 𝑑 around 𝑞.

  Following Pérez et al. [8] we denote by v⋅w𝐷 the evaluation of a SPARQL query over a dataset
𝐷. If a query 𝑄 is a SELECT query, then v𝑄w𝐷 is a set of bindings, which are the solutions that
satisfy 𝑄 over 𝐷. Moreover, given a query 𝑄 and a query binding 𝑏, we write 𝑄𝑏 to denote
the query obtained by substituting each bound variable 𝑥 of 𝑏 in 𝑄 with 𝑏p𝑥q. We assume set
semantics of GeoSPARQL.
Lemma 2. Let 𝑏 be a binding, 𝑥 be a free variable w.r.t. 𝑏, 𝑦 be a bound variable w.r.t. 𝑏, 𝑢 be a unit
of measure URI, and 𝑑 be a numeric literal. We define 𝐹1 “ (geof:distance( 𝑥, 𝑦, 𝑢) < 𝑑) , 𝐹2 “
geof:sfIntersects( 𝑥, 𝑧) , where 𝑧 “ MbbpBufferp𝑏p𝑦q, 𝑑, 𝑢q. Let 𝑄1 , 𝑄2 be two GeoSPARQL
queries, s.t. 𝑄2 contains 𝐹1 and 𝑄1 is obtained by augmenting 𝑄2 with 𝐹2 . Then, v𝑄1 𝑏w𝐷 “ v𝑄2 𝑏w𝐷 .
Proof. Suppose that the lemma does not hold. Since 𝑄1 is essentially 𝑄2 augmented with an
additional filter, it must be v𝑄1 𝑏w𝐷 ⊆ v𝑄2 𝑏w𝐷 , i.e., there must exist some binding 𝑏 ˚ such that
𝑏 ˚ ∉ v𝑄1 𝑏w𝐷 and 𝑏 ˚ ∈ v𝑄2 𝑏w𝐷 . According to the meaning of 𝐹1 , 𝐹2 , this means that (a) the
distance between 𝑏 ˚ p𝑥q and 𝑏p𝑦q is less than 𝑑 and (b) 𝑏 ˚ p𝑥q does not intersect with the minimum
bounding box of a buffer of size 𝑑 around 𝑏p𝑦q. Contradiction (Lemma 1).

Theorem 1. Let 𝐸 be a GeoSPARQL endpoint, 𝑄 be a GeoSPARQL query, and 𝐵 be a set of bindings.
Then, EvaluateRemoteQueryOptp𝐸, 𝑄, 𝐵q “ EvaluateRemoteQueryp𝐸, 𝑄, 𝐵q.
Proof. If 𝑄 is not of the form of the query in Line 1 of Algorithm 1, then the equation
holds (cf. Line 8). Otherwise, notice that it holds ℰ𝑜𝑝𝑡 p𝐸, 𝑄𝑏, ∅q “ ℰp𝐸, 𝑄𝑏, ∅q for all
𝑏 ∈ 𝐵 (Lemma 2) (for brevity, we write ℰ𝑜𝑝𝑡 and ℰ instead of EvaluateRemoteQueryOpt
and EvaluateRemoteQuery), respectively). Therefore, notice that for all 𝑏 ∈ 𝐵 it holds
ℰ𝑜𝑝𝑡 p𝐸, 𝑄, t𝑏uq “ ℰ𝑜𝑝𝑡 p𝐸, 𝑄𝑏, ∅q “ ℰp𝐸, 𝑄𝑏, ∅q “ ℰp𝐸, 𝑄, t𝑏uq, and thus ℰ𝑜𝑝𝑡 p𝐸, 𝑄, 𝐵q “
⋃𝑏∈𝐵 ℰ𝑜𝑝𝑡 p𝐸, 𝑄, t𝑏uq “ ⋃𝑏∈𝐵 ℰp𝐸, 𝑄, t𝑏uq “ ℰp𝐸, 𝑄, 𝐵q.

3.5. Implementation
We provide an implementation of our geospatial join optimization integrated in the Semagrow
SPARQL federation engine.4 Our implementation extends the existing query execution mecha-
nism with a GeoSPARQL query executor that uses this optimization.


4. Evaluation
4.1. Experimental Setup
We evaluate our technique using datasets and queries from the GeoFedBench benchmark [9],
which was derived from practical use cases in the agro-environmental domain; that is, linking
land usage data with ground observations for the purpose of estimating crop type accuracy.
    4
        Cf. https://github.com/semagrow/semagrow
Table 1
Dataset statistics.
  dataset         # shapes      # all triples   # geospatial triples   # thematic triples   # properties
  INVEKOS        2,008,137       14,056,959               4,016,274           10,040,685                  7
  LUCAS              4,325           30,379                   8,650               21,729                 11


Datasets For the experimental evaluation, we use the following publicly available data sources:
   1. The Austrian Land Parcel Identification System (INVEKOS)5 , which contains the geo-
      locations of all crop parcels in Austria and the owners’ self-declaration about the crops
      grown in each parcel.
   2. The EUROSTAT’s Land Use and Cover Area frame Survey (LUCAS)6 , which contains
      agro-environmental and soil data by field observation of geographically referenced points.
Table 1 gives more details about these datasets. Each dataset is loaded on a different SPARQL
endpoint; we use the Strabon geospatial RDF store [10] for serving the data. Strabon encapsulates
PostGIS for performing spatial operations, and uses a spatial index to optimize query time.

Federations We use two federations of the above datasets; one Semagrow federation in
which the execution engine uses technique discussed in Section 3 (semagrow-opt) and one that
does not use it (semagrow-std). Regarding the evaluation of federated geospatial joins, each
Semagrow federation uses a bind join evaluation strategy with a filter push-down optimization.

Queries In Table 2 we summarize the queries of the experiment. Each row of the table
corresponds with a query template with a specific parameter. Q1-3 are used to estimate the crop-
type reliability of the INVEKOS dataset. For each LUCAS instance URI, we check if it provides
a positive validation (Q1), a negative validation (Q2), or it is irrelevant for the analysis (Q3). In
order to get more information about the instances, each query either returns a single result (if
the LUCAS URI has the desired property) or an empty result set (otherwise). These queries
have several characteristics apart from federated within-distance geospatial joins (cf. [9] for a
discussion on the complexity of the query set). In order to focus on the behavior of the within-
distance geospatial join, we have also included Q4. This query returns all INVEKOS instances
that are within a given distance from a specific LUCAS instance. The query is parameterized
with various distances ranging from 10 meters to 100 kilometers. As for the LUCAS instance,
we used the one that is closest to the center of the minimum bounding box of Austria.

Experiment deployment and execution We use a Kubernetes 1.14 cluster with 1 master
node and 8 worker nodes with a total of 120 cores and 264GB RAM. All queries are executed three
times; the execution times reported are the average of the second and third run. Experiment
deployment and execution is done through the KOBE benchmarking engine [11], and the KOBE
configurations for reproducing the experiments are publicly available.7
    5
      cf. http://www.data.gv.at/katalog/dataset/f7691988-e57c-4ee9-bbd0-e361d3811641
    6
      cf. https://esdac.jrc.ec.europa.eu/projects/lucas
    7
      The experiment specifications can be found in https://github.com/semagrow/benchmark-geofedbench.
Table 2
Queries used in the experiment.
             parameter   query                                    #tp    characteristics
                         return the nearest INVEKOS instance if
             LUCAS                                                       Subquery, ORDER,
     Q1                  it is within 10 meters and their crop     10
             URI                                                         LIMIT 1
                         types match
                         return the nearest INVEKOS instance if          Subquery, ORDER,
             LUCAS
     Q2                  it is within 10 meters and their crop     10    LIMIT 1, FILTER
             URI
                         types do not match                              NOT EXISTS
                                                                         Subquery, ORDER,
             LUCAS       return the LUCAS instance if there is
     Q3                                                            10    LIMIT 1, FILTER
             URI         no INVEKOS instance within 10 meters
                                                                         NOT EXISTS
                         return all INVEKOS instances within 𝐷
     Q4      distance                                              5     -
                         meters from a LUCAS instance


4.2. Experimental Results
4.2.1. Data Validation Query Set (Q1-3)
In the first part of the experimental study, we compare the optimized (semagrow-opt) version
of Semagrow over the unoptimized one (semagrow-std) using the query load obtained by the
data validation task. Table 3 contains the experimental results. For every query template, we
illustrate: the number of queries for each template (#queries), the number of queries that return
result (#results), and the query processing time. For each federation, we display the total time
to evaluate the query load and the average time for each query of the query load.
   We notice that the queries are much faster if we use the optimized version of Semagrow. The
unoptimized version would require several days for the evaluation, while with our optimization
technique the task reduces to several hours.
   Even though the query set contains several complex characteristics, the bottleneck for the
query evaluation is the calculation of the within-distance operation. This can be verified if
we analyze the execution plan of each query. For Q1 and Q2, Semagrow retrieves the WKT of
the given point from LUCAS (which is a fast operation since it requires fetching information
from a single instance), and then it retrieves all parcels that are within 10 meters from this
WKT (together with their distance and crop type) from INVEKOS. Then, Semagrow sorts the
INVEKOS parcels according to their distance and takes the first one. Since the distance is only
10 meters, the parcels are very small in number, therefore the sorting operation is very fast as
well. Then, Semagrow retrieves the point crop type from LUCAS and compares it with the crop
type of the parcels. Again, this operation is fast because it requires retrieving information from
a single instance. Similarly, for Q3, Semagrow retrieves the crop type and WKT of the given
point from LUCAS (a fast retrieval of information from a single instance), and checks if there is
any point within 10 meters from this WKT from INVEKOS.
   The previous discussion suggests that the operation “retrieve all WKTs from INVEKOS within
10m distance from a specific WKT (obtained from LUCAS)” is the costliest operation in the data
Table 3
Experimental results for Q1-3.
                                                         query processing time
                                              semagrow-std                  semagrow-opt
              #queries     #results        total       average           total       average
     Q1         2488           1650      83 hours        120 sec       106 mins         2.6 sec
     Q2         2488            396      82 hours        119 sec        99 mins         2.4 sec
     Q3         2488            400      81 hours        117 sec        74 mins         1.8 sec


Table 4
Experimental results for Q4.
                             semagrow-std                            semagrow-opt
                         query proc.                  query proc.      shapes pruned by
 query    distance                    #results                                              #results
                            time                         time            optimization
   Q4        10 m           58 sec            2         0.1 sec      2,008,134    (ą99%)          2
   Q4       100 m           57 sec            7         0.1 sec      2,008,129    (ą99%)          7
   Q4        1 km           58 sec           70         0.1 sec      2,008,004    (ą99%)         70
   Q4       10 km           57 sec        4,739         1.2 sec      1,996,169      (99%)     4,739
   Q4       50 km           72 sec      141,973         26 sec       1,702,032      (84%)   141,973
   Q4      100 km          110 sec      528,026         86 sec       1,212,393      (60%)   528,026


validation task. As a result, it is safe to conclude that our optimization technique which targets
such queries is the reason for the extreme speed-up of the INVEKOS validation task.

4.2.2. Within-distance operation from a given LUCAS point (Q4)
In the second part of the experimental study, we compare the optimized (semagrow-opt) over
the unoptimized (semagrow-std) version of Semagrow using a query of fetching all INVEKOS
parcels that are found within increasing distances from a specific LUCAS point. Table 4 contains
the experimental results. For every instance of the query template Q4, we illustrate: the distance
parameter of the within-distance operation, the query processing time and the number of results
for semagrow-std and semagrow-opt. Moreover, we display the number and percentage of
shapes that are pruned from INVEKOS by the additional filter that the optimization process
places in the source query that corresponds to the right part of the federated join.
   First, we notice that in both cases, Semagrow returns the same number of results. This
validates that the implementation of the optimization is correct, i.e., the bounding box in the
additional filter is constructed correctly, thus the filter does not prune any correct shapes.
   Regarding semagrow-std, which issues the unoptimized query in INVEKOS, we notice that
every query requires at least 57 seconds. For a parameterized distance between 10 meters and
10 kilometers, the query time is 57 – 58 seconds, even if the result is either very small (2 results)
or moderate (4,700 results). Only for distances greater than 50 kilometers, where we have a very
large result set (greater than 140,000 results), the query time seems to increase as we increase
the distance parameter. This behavior can be explained if we consider that the unoptimized
query cannot be answered using the spatial index of the source, and the source has to check for
every shape in INVEKOS if it is within a specific distance. Thus, the experimental results can
be explained as follows; the time needed to check potential candidates within a given distance
is around 1 minute, and the remaining time is used for passing the results back to the clients.
Notice that in our case, the transfer cost can be relatively high (e.g., 50 seconds for 500,000
results), because the result set includes WKT values which are in general long strings.
   Regarding semagrow-opt, which issues the optimized query in INVEKOS, we notice that
the query processing time is analogous to the size of the distance parameter, i.e., for smaller
distances we have a faster query processing time. Recall that the optimized source query has an
additional filter expression, which is used to prune all shapes that are too far away from the
given LUCAS WKT. Indeed, we observe that there exists a connection between the amount of
the pruning of the additional filter and the query processing time. Unlike previously, where
the source had to consider all shapes from INVEKOS, the source here computes the distance
for a reduced set of candidate shapes, which is relevant to the distance parameter. Moreover,
since the geof:sfIntersects function of the additional filter can be evaluated using the spatial
index, this filter does not introduce any time overhead in the evaluation of the query.
   Comparing the two approaches, we observe that the optimized version reduces the query
processing time by 3 orders of magnitude for distances less than 1 km and by 2 orders of magni-
tude for distances around 10 km. For larger distances, the time difference is less pronounced,
but in any case, we can safely conclude that the optimization technique can be effective for
federated within-distance queries for any distance length.


5. Related Work
The literature on federated geospatial query processing is very sparse, in either the Semantic
Web community, the geographical information systems community, or the wider databases
community. Recent studies [12, Section 5] find that there is no mature federated GeoSPARQL
query processing system. Recent work on data integration methods cites systems that collect
and integrate distributed geospatial data into a single store as well dynamic federation of non-
geospatial data sources, but also does not include systems that are both federated and support
geospatial operations [13, Sect. 2].
   PostGIS already has a build-in function calculating the geometries that are within a given
distance, called ST_DWithin [14], that GeoSPARQL does not support. ST_DWithin function
includes a bounding box comparison that makes use of any indexes that are available on the
geometries, which is what we are implementing with our geospatial join optimization method.
   Similarly, this is kind of how PostGIS proposes to handle the nearest neighbour search [15].
Traditionally, the naive way to find the nearest neighbour is to force the database to calculate
the distance between the query geometry and every candidate geometry and then sort them all.
For a large table of candidate geometries, it is not an efficient approach. One way to improve
performance is to add an index constraint to the search; first find which candidate geometries
overlap or touch with an arbitrary box around the query geometry and only calculate the actual
distance with them. The above introduces the problem of somehow choosing the smallest box,
and that is why PostGIS introduces KNN, a pure index based nearest neighbour search. PostGIS
uses an R-Tree index implemented on top of GiST (Generalized Search Tree) to index GIS data.
The KNN system evaluates distances between bounding boxes inside the PostGIS GiST index.
So, in the case of within-distance queries the computations will be between the bounding boxes
of geometries; they will not be precise on the exact geometries.
   Another optimization can be performed by creating on-the-fly spatial indexing for spatial
objects [16]. The queries will be more efficient due to the indexing being dependent specifically
on the spatial predicate and the geometry type of each spatial object. Geospatial joins can also
be optimized by rewriting the GeoSPARQL queries into simpler more primitive sub-queries [16]
and parallelizing them [17]. For example, a query using the nearby function can be divided to
hasGeometry and within-distance (or distance) sub-queries. Alternatively, the filtering within-
distance computation can be parallelized by partitioning the spatial objects into approximately
equal-sized patches based on spatial indices. The spatial sub-queries within each patch and
around the borders of the patches are then processed in parallel, and their results are combined.


6. Conclusions and Future Work
We presented a geospatial join optimization method for federated within-distance queries. In
this optimization, the query execution module of a federation engine can be extended, so that it
rewrites the query to be issued to the geospatial source endpoint by adding an extra filter to
the within-distance filter. This can help the federated source to calculate faster the result set
since the additional filter can make use of the spatial index of the source, without changing the
semantics of the original query. The implementation of our method is provided as open source,
integrated with the Semagrow federated GeoSPARQL processor.
   In the experimental setup, we showed that the optimization substantially speeds up query
execution, especially for restrictive within-distance limits. This can prove useful for many
real-world applications, as it is reasonable to expect that within-distance restrictions are used
to limit results to a local scope. In other words, the value of the proposed method might be
smaller for within-distance limits of thousands of kilometers, but it is hard to imagine use cases
where such queries would be useful.
   As future work, we plan to develop a GeoSPARQL extension where within-distance restric-
tions are expressed with a single function. This will allow the scope of application of our
optimization to be made explicit, and not depend on statically analysing the query in order to
identify queries where the optimization is applicable. Another direction for future work is to
develop similar rewriting techniques for optimizing federated queries with other GeoSPARQL
functions (such as geof:sfCrosses or geof:sfTouches ). Finally, an interesting open question
is whether (and to what extend) our approach can be effective for triple stores that do not
provide a complete implementation of the GeoSPARQL standard.

Acknowledgements The work described here has received funding from the European
Union’s Horizon 2020 research and innovation programme under grant agreement No 825258.
For more details, please visit http://earthanalytics.eu. The authors also acknowledge that the
work was supported by SKEL, NCSR ‘Demokritos’ https://www.iit.demokritos.gr/labs/skel/.
References
 [1] Open Geospatial Consortium, OGC GeoSPARQL: A geographic query language for RDF
     data, version 1.0, 2012. URL: http://www.opengis.net/doc/IS/geosparql/1.0.
 [2] L. M. Haas, D. Kossmann, E. L. Wimmers, J. Yang, Optimizing queries across diverse data
     sources, in: Proceedings of VLDB’97, 1997. URL: http://www.vldb.org/conf/1997/P276.PDF.
 [3] O. Görlitz, S. Staab, SPLENDID: SPARQL endpoint federation exploiting VOID descriptions,
     in: Proceedings of the 2nd Intl. Workshop on Consuming Linked Data COLD, 2011. URL:
     http://ceur-ws.org/Vol-782/GoerlitzAndStaab_COLD2011.pdf.
 [4] B. Quilitz, U. Leser, Querying distributed RDF data sources with SPARQL, in: Proc. of ESWC
     2008, Tenerife, Canary Islands, Spain, 2008. doi:10.1007/978- 3- 540- 68234- 9\_39 .
 [5] A. Charalambidis, A. Troumpoukis, S. Konstantopoulos, SemaGrow: Optimizing federated
     SPARQL queries, in: Proc. of SEMANTICS 2015, 2015. doi:10.1145/2814864.2814886 .
 [6] A. Schwarte, P. Haase, K. Hose, R. Schenkel, M. Schmidt, FedX: A federation layer for
     distributed query processing on linked open data, in: Proceedings of ESWC 2011, Heraklion,
     Crete, Greece, 2011. doi:10.1007/978- 3- 642- 21064- 8\_39 .
 [7] M. Schmidt, M. Meier, G. Lausen, Foundations of SPARQL query optimization, in: Pro-
     ceedings of ICDT 2010, Lausanne, Switzerland, 2010. doi:10.1145/1804669.1804675 .
 [8] J. Pérez, M. Arenas, C. Gutierrez, Semantics and complexity of SPARQL, ACM Trans.
     Database Syst. (2009). doi:10.1145/1567274.1567278 .
 [9] A. Troumpoukis, S. Konstantopoulos, G. Mouchakis, N. Prokopaki-Kostopoulou, C. Paris,
     L. Bruzzone, D.-A. Pantazi, M. Koubarakis, GeoFedBench: A benchmark for federated
     GeoSPARQL query processors, in: Posters and Demos of ISWC 2020, Virtual Event, 2020.
[10] K. Kyzirakos, M. Karpathiotakis, M. Koubarakis, Strabon: A Semantic Geospatial DBMS,
     in: Proc. of ISWC 2012, Boston, MA, USA, 2012. doi:10.1007/978- 3- 642- 35176- 1\_19 .
[11] C. Kostopoulos, G. Mouchakis, A. Troumpoukis, N. Prokopaki-Kostopoulou, A. Charalam-
     bidis, S. Konstantopoulos, KOBE: Cloud-native open benchmarking engine for federated
     query processors, in: Proc. of ESWC 2021, 2021. doi:10.1007/978- 3- 030- 77385- 4\_40 .
[12] K. Bereta, H. Caumont, U. Daniels, E. Goor, M. Koubarakis, D. Pantazi, G. Stamoulis, S. Ubels,
     V. Venus, F. Wahyudi, The Copernicus App Lab project: Easy access to Copernicus data,
     in: Proceedings of EDBT 2019, Lisbon, Portugal, 2019. doi:10.5441/002/edbt.2019.46 .
[13] M. Masmoudi, S. B. A. B. Lamine, H. B. Zghal, B. Archimède, M. Karray, Knowledge
     hypergraph-based approach for data integration and querying: Application to Earth
     Observation, Future Gener. Comput. Syst. (2021). doi:10.1016/j.future.2020.09.029 .
[14] PostGIS Manual, ST_DWithin, 2021. URL: https://postgis.net/docs/ST_DWithin.html.
[15] P. Ramsey, M. Leslie, Nearest-neighbour searching, PostGIS, 2012. URL: https://www.
     postgis.net/workshops/postgis-intro/knn.html, accessed: 2021-05-19.
[16] T. Zhao, C. Zhang, L. Anselin, W. Li, K. Chen, A parallel approach for improving geosparql
     query performance, Int. J. Digit. Earth (2015). doi:10.1080/17538947.2014.904012 .
[17] C. Zhang, T. Zhao, L. Anselin, W. Li, K. Chen, A map-reduce based parallel approach for
     improving query performance in a geospatial semantic web for disaster response, Earth
     Sci. Informatics (2015). doi:10.1007/s12145- 014- 0179- x .