<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>GeoFedBench: A Benchmark for Federated GeoSPARQL Query Processors</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antonis Troumpoukis</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stasinos Konstantopoulos</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giannis Mouchakis</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nefeli Prokopaki-Kostopoulou</string-name>
          <email>nefelipk@iit.demokritos.gr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudia Paris</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorenzo Bruzzone</string-name>
          <email>lorenzo.bruzzone@unitn.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Despina-Athanasia Pantazi</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manolis Koubarakis</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Information Engineering and Computer Science, University of Trento</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Informatics and Telecommunications, NCSR "Demokritos"</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>National and Kapodistrian University of Athens</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Performance benchmarks are invaluable for evaluating and comparing federated query processing systems, but it is hard to design benchmarks that are both realistic and informative about the systems being tested. In this paper we present GeoFedBench, a benchmark that has been obtained from an actual, practical application of geospatial and linked data querying and uses GeoSPARQL constructs that challenge all phases of federated query processing. The benchmark is publicly available as part of the Kobe suite.</p>
      </abstract>
      <kwd-group>
        <kwd>Benchmarking</kwd>
        <kwd>GeoSPARQL</kwd>
        <kwd>Federated querying</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Performance benchmarks are invaluable for evaluating and comparing systems,
but designing benchmarks is subject to considerations that are difficult to satisfy
simultaneously. One such tension is between creating a realistic benchmark
that accurately reflects how the benchmarked systems will behave in real-world
use cases and designing a benchmark that is informative about the
system characteristics we know in advance that we need to test and measure.</p>
      <p>Given the above, we are excited to present a benchmark that has been
obtained from an actual, practical application of geospatial and linked data
querying. The benchmark federates a database of Earth Observation data about land
usage and a database of ground observations about land usage, to search for
pairs between them that simultaneously satisfy geospatial and thematic (land
usage) constraints (Section 2).</p>
      <p>Besides being extracted from a real workflow in the Earth Observation
domain, the benchmark queries use GeoSPARQL constructs that challenge all
phases of federated query processing, from source selection to query planning
and execution. In addition to a detailed analysis of the queries, we also present
empirical tests demonstrating that the benchmark is both challenging and feasible
(Section 3). Finally, we recap and conclude (Section 4).</p>
      <p>Copyright (c) 2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>Use case: Validating land usage data</title>
      <p>Detailed land usage data is crucial in many applications, ranging from
formulating agricultural policy and monitoring its execution, to conducting research on
climate change resilience and future food security. Land usage can be inferred
from Earth Observation images or collected through self-declaration, but in
either case needs to be validated against land surveys. The standard approach for
this validation is to match each instance in the land survey dataset (GPS points)
with the nearest land parcel (a GIS shape) and compare the crops observed in
the survey against the crops declared or inferred for the matching parcel.</p>
      <p>Although conceptually straightforward, in operational scenarios this rule can
be misleading. Ground observations are geo-referenced to a point on the road
adjacent to the field, which is often ambiguous in agricultural areas with several
adjacent parcels, an ambiguity further exacerbated by limited GPS accuracy. However, a more
sophisticated (and also more computationally demanding) approach can estimate the error
rate of the land usage data: for every survey point there must be at least one
parcel with the same label in reasonable proximity; otherwise at least one nearby
parcel is mislabelled (although we cannot automatically infer which one).</p>
      <p>For our benchmark, we use the Invekos dataset, the Austrian administration's
Land Parcel Identification System with owners' self-declaration about the crops
grown in each parcel, compared against the observations from the 2018 Land
use and cover area frame statistical survey (Lucas). Table 1 gives more details
about these datasets.</p>
      <p>Besides geospatial processing, using these datasets also introduces a data
integration aspect to our benchmark. Specifically, Lucas annotations follow the
Land Cover Classification System (LCCS) whereas Invekos follows its own codelist
of 212 crop types. There is no one-to-one mapping between instance labels (e.g.,
Invekos grassland can be Lucas E10, B55, or E30, while Lucas B13 can be Invekos
spring barley or winter barley).</p>
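      <p>The many-to-many correspondence between the two codelists can be pictured as a lookup from each Invekos label to a set of admissible Lucas codes. The following Python sketch is illustrative only: the dictionary holds just the examples mentioned above, and the names CROSSWALK, labels_match and invekos_labels_for are hypothetical and not part of the benchmark.</p>
      <preformat><![CDATA[
```python
# Hypothetical crosswalk holding only the examples given in the text;
# the real Invekos codelist has 212 crop types.
CROSSWALK = {
    "grassland": {"E10", "B55", "E30"},
    "spring barley": {"B13"},
    "winter barley": {"B13"},
}


def labels_match(invekos_label, lucas_code):
    """Return True if the Lucas code is admissible for the Invekos label."""
    return lucas_code in CROSSWALK.get(invekos_label, set())


def invekos_labels_for(lucas_code):
    """Return, sorted, all Invekos labels that may correspond to a Lucas code."""
    return sorted(k for k, codes in CROSSWALK.items() if lucas_code in codes)
```
]]></preformat>
      <p>Under these assumptions, a label comparison becomes a set-membership test rather than an equality test, which is why one Lucas code may agree with several Invekos crop types.</p>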
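      <table-wrap id="tab-1">
        <label>Table 1</label>
        <caption>
          <p>Number of triples in the Lucas and Invekos datasets.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Dataset</th>
              <th>All triples</th>
              <th>Geospatial triples</th>
              <th>Thematic triples</th>
              <th>Source</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Lucas</td>
              <td>30,379</td>
              <td>4,325</td>
              <td>26,054</td>
              <td>https://esdac.jrc.ec.europa.eu/projects/lucas</td>
            </tr>
            <tr>
              <td>Invekos</td>
              <td>14,036,799</td>
              <td>2,005,257</td>
              <td>12,031,542</td>
              <td>https://www.data.gv.at/katalog/dataset/e21a731f-9e08-4dd3-b9e5-cd460438a5d9</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>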
    </sec>
    <sec id="sec-3">
      <title>Benchmark queries</title>
      <p>In order to estimate the reliability of the Invekos dataset, we used queries that,
for each given Lucas instance, check if: (Q1) the closest Invekos instance is under
10 meters away and their crop labels match; (Q2) the closest Invekos instance
is under 10 meters away and their crop labels do not match; or (Q3) there is no
Invekos instance within 10 meters.</p>
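      <p>The three cases can be sketched outside SPARQL as a nearest-neighbour classification. The Python sketch below is a simplification under stated assumptions and not the benchmark queries themselves: coordinates are assumed to be planar and in meters, each parcel is reduced to a single representative point instead of a polygon, labels are assumed to be already expressed in a common vocabulary, and the names classify and THRESHOLD_M are hypothetical.</p>
      <preformat><![CDATA[
```python
import math

THRESHOLD_M = 10.0  # distance threshold used by Q1-Q3


def classify(survey_point, parcels, threshold=THRESHOLD_M):
    """Classify one survey point as 'Q1', 'Q2' or 'Q3'.

    survey_point: (x, y, label); parcels: list of (x, y, label).
    Assumes planar coordinates in meters and point-like parcels.
    """
    sx, sy, survey_label = survey_point
    nearest_label = None
    nearest_dist = math.inf
    for px, py, parcel_label in parcels:
        dist = math.hypot(px - sx, py - sy)
        if dist < nearest_dist:
            nearest_label, nearest_dist = parcel_label, dist
    if nearest_dist >= threshold:
        return "Q3"  # no parcel within the threshold
    # The closest parcel is under the threshold: compare labels.
    return "Q1" if nearest_label == survey_label else "Q2"
```
]]></preformat>
      <p>The loop that finds the closest parcel plays the role of the subquery in Q1 and Q2, and the early return for Q3 plays the role of the negated existence check; in the federated setting both steps span two endpoints.</p>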
      <p>Since geo-linked data vocabularies link instances with a geometry object
(which then has as an attribute the actual shape), these queries (and GeoSPARQL
queries in general) challenge FILTER optimizers because they present them with
comparisons between variable groundings (as opposed to constant values), and
because these comparisons are non-standard extensions (the geospatial
extensions of GeoSPARQL).</p>
      <p>
        In most benchmarks, filters are either not present at all [LUBM,
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] or only
have unary functions or comparisons against constants [FedBench,
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] that can
always be pushed into one data source. LargeRDFBench [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] includes multi-variable
filters that compare values from different repositories, challenging the optimizer
to select the correct strategy: to fetch the left-hand side values and push the filter
into the right-hand side source or to fetch both sides and apply the filter locally.
Both approaches are valid, but can vary dramatically in terms of efficiency. Our
benchmark presents the same challenge in a geospatial context; the federator is
tested not only on correctly selecting the best strategy but also on the efficiency
of its local implementation of the GeoSPARQL extension.
      </p>
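      <p>The difference between the two strategies can be illustrated with a toy cost model. The Python sketch below is only an illustration under simplifying assumptions: the two sources are plain in-memory lists, cost is measured as the number of rows moved to the federator, and the per-request overhead of the push strategy is ignored. The function names fetch_both and push_filter are hypothetical and do not describe how any particular federation engine is implemented.</p>
      <preformat><![CDATA[
```python
def fetch_both(left_rows, right_rows, predicate):
    """Fetch both sides and apply the filter locally at the federator."""
    transferred = len(left_rows) + len(right_rows)
    result = [(l, r) for l in left_rows for r in right_rows if predicate(l, r)]
    return result, transferred


def push_filter(left_rows, right_rows, predicate):
    """Fetch the left side, then push each binding and the filter to the
    right-hand source, which returns only the matching rows."""
    transferred = len(left_rows)
    result = []
    for l in left_rows:
        # In a real federation this comprehension would run at the remote source.
        matches = [r for r in right_rows if predicate(l, r)]
        transferred += len(matches)
        result.extend((l, r) for r in matches)
    return result, transferred
```
]]></preformat>
      <p>For example, with two left-hand bindings and 1,000 right-hand rows of which three match each binding, fetch_both moves 1,002 rows whereas push_filter moves 8. The balance can reverse when the left-hand side is large, since this model does not charge for the additional requests that pushing the filter requires.</p>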
      <p>Properties of standard vocabularies, which can potentially appear in all sources
of a federation, present another challenge for the efficient evaluation of a query.
When evaluating a triple pattern that contains a property such as rdf:type
or owl:sameAs, the source selector is prone to overestimate the set of relevant
sources, thus increasing both network traffic and the overall query processing
time. Current benchmarks already contain such commonly used properties, but
GeoFedBench stresses source selection further in this direction by exploiting a
query characteristic that appears frequently in geospatial data: a resource ?x is
linked with its geometry representation ?wkt using chains of known properties
of the form ?x geo:hasGeometry ?g . ?g geo:asWKT ?wkt, where all members
of the chain usually appear in the same dataset. The federation engine is tested
on distinguishing which geospatial triple patterns refer to which dataset, thus
avoiding fetching redundant bindings for the variable in the middle of the chain.</p>
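      <p>One way to picture the pruning opportunity is to compare predicate-based source selection with a variant that propagates exclusive assignments along shared variables. The Python sketch below is a heuristic illustration only, valid under the assumption stated above that all members of a chain live in the same dataset; it is not the algorithm of any specific federation engine, and the function names as well as any predicate other than geo:hasGeometry and geo:asWKT are hypothetical.</p>
      <preformat><![CDATA[
```python
# A triple pattern is a (subject, predicate, object) tuple;
# variables are strings starting with "?".

def naive_selection(patterns, source_predicates):
    """Assign to each pattern every source that contains its predicate."""
    return {
        p: {s for s, preds in source_predicates.items() if p[1] in preds}
        for p in patterns
    }


def chain_aware_selection(patterns, source_predicates):
    """Narrow ambiguous patterns using neighbours bound to a single source.

    Heuristic: if a pattern has exactly one candidate source, patterns that
    share a variable with it are restricted to that source when possible.
    """
    selection = naive_selection(patterns, source_predicates)
    changed = True
    while changed:
        changed = False
        for p in patterns:
            if len(selection[p]) != 1:
                continue
            for q in patterns:
                shares_var = any(
                    t.startswith("?") and t in q for t in (p[0], p[2])
                )
                if q is not p and shares_var and len(selection[q]) > 1:
                    narrowed = selection[q] & selection[p]
                    if narrowed:
                        selection[q] = narrowed
                        changed = True
    return selection
```
]]></preformat>
      <p>With two sources that both expose geo:hasGeometry and geo:asWKT, and a pattern on ?x whose predicate occurs in only one of them, the naive selector sends both chain patterns to both sources, whereas the chain-aware variant assigns the whole chain to the single relevant source.</p>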
      <p>Finally, the complex nature of our queries challenges query planning. Current
benchmarks usually contain simple queries consisting only of joins between triple
patterns and FILTER operations, or some additional operators such as UNION,
ORDER BY, LIMIT, etc. In GeoFedBench, Q1 and Q2 use a subquery for discovering
the closest Invekos instance. Also, Q2 and Q3 use negation, in the form of the
FILTER NOT EXISTS operator, to check that there does not exist any matching
Invekos instance. Neither subqueries nor negation are present in any of the
currently existing federated SPARQL benchmarks.</p>
      <p>
        To demonstrate that the benchmark is feasible but challenging, we tested it on
Semagrow [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which is, to the best of our knowledge, the only SPARQL federation engine that
supports geospatial operators. The datasets are served by Strabon geospatial
RDF stores [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Table 2 gives the query execution time for three runs of each
query, where each run grounds the query with a different Lucas point.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>We presented GeoFedBench, a benchmark for federated geospatial query
processing. GeoFedBench is based on openly available datasets, and its queries challenge
all phases of federated query processing. The benchmark is distributed as part
of the benchmark suite of the KOBE Open Benchmark Engine, available from
https://github.com/semagrow/kobe</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgement</title>
      <p>This project has received funding from the European Union's Horizon 2020
research and innovation programme under grant agreement No 825258. Please see
http://earthanalytics.eu for more details.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Charalambidis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Troumpoukis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Konstantopoulos</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>SemaGrow: Optimizing federated SPARQL queries</article-title>
          .
          <source>In: Proceedings of the 11th International Conference on Semantic Systems (SEMANTiCS 2015)</source>
          , Vienna, Austria, 16-17 September 2015 (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heflin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>LUBM: a benchmark for OWL knowledge base systems</article-title>
          .
          <source>Web Semantics</source>
          <volume>3</volume>
          (
          <issue>2</issue>
          ) (
          <year>Jul 2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Kyzirakos</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karpathiotakis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koubarakis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Strabon: A semantic geospatial DBMS</article-title>
          . In:
          <string-name>
            <given-names>P.</given-names>
            <surname>Cudré-Mauroux</surname>
          </string-name>
          et al. (eds.)
          <source>Proceedings of ISWC 2012</source>
          , Boston, MA, USA, 11-15 November 2012. LNCS vol.
          <volume>7649</volume>
          , Springer (
          <year>2012</year>
          ), https://doi.org/10.1007/978-3-642-35176-1_19
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Saleem</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasnain</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ngomo</surname>
            ,
            <given-names>A.N.:</given-names>
          </string-name>
          <article-title>LargeRdfBench: A billion triples benchmark for SPARQL endpoint federation</article-title>
          .
          <source>J. Web Semant.</source>
          <volume>48</volume>
          ,
          <fpage>85</fpage>
          -
          <lpage>125</lpage>
          (
          <year>2018</year>
          ), https://doi.org/10.1016/j.websem.2017.12.005
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Schmidt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Görlitz</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haase</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ladwig</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwarte</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>FedBench: A benchmark suite for federated semantic data query processing</article-title>
          .
          <source>In: Proceedings of ISWC 2011</source>
          , Bonn, Germany, 23-27 October 2011. LNCS vol.
          <volume>7031</volume>
          , Springer (
          <year>2011</year>
          ), https://doi.org/10.1007/978-3-642-25073-6_37
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>