=Paper=
{{Paper
|id=Vol-1097/STIDS2013_T19
|storemode=property
|title=Supporting Evacuation Missions with Ontology-Based SPARQL Federation
|pdfUrl=https://ceur-ws.org/Vol-1097/STIDS2013_T19_StolpeEtAl.pdf
|volume=Vol-1097
|dblpUrl=https://dblp.org/rec/conf/stids/StolpeHH13
}}
==Supporting Evacuation Missions with Ontology-Based SPARQL Federation==
1
Supporting Evacuation Missions with
Ontology-Based SPARQL Federation
Audun Stolpe, Jonas Halvorsen and Bjørn Jervell Hansen
Norwegian Defence Research Establishment (FFI)
P O Box 25
2027 Kjeller, Norway
Email: {audun.stolpe | jonas.halvorsen | bjorn-jervell.hansen}@ffi.no
Abstract—We study ontology-based SPARQL federation in sup- The main contribution of this paper consists in defining a
port of coordinated action by deployed units in military oper- novel federation strategy specifically designed for domains
ations. It is presumed that bandwidth is limited and unstable. that share the general characteristics of this case. The most
Thus, we need an approach that generates few HTTP requests. important characteristics are firstly that bandwidth is limited
Existing techniques employ join-order heuristics that may cause so the total communication costs induced by the number of
requests to multiply as a factor of the number of joins in a
query. This can easily lead to an amount of traffic that exceeds
HTTP request is a non-negligible factor, and secondly that the
network capacity. We propose an approach that builds an in- network topology is dynamic, i.e. sources may come and go.
memory excerpt of the remote sources, sending one request to Our aim is thus to define a federation strategy that is sound and
each source. A query is answered against this excerpt, which is a complete with respect to query answering, issues a minimal
provably sound and complete representation of the sources wrt. number of HTTP requests, and is compatible with run-time
query answering. The paper ends with a case study involving detection of sources.
three military sources used for planning evacuation missions.
The paper is organized as follow: In Section II we identify a
list of tentative desiderata that our federation strategy should
satisfy. The desiderata points to using an ontology-based data
I. I NTRODUCTION access paradigm, which is explained in Section III. Section
IV outlines our solution, which is based on querying against
The planning of evacuation missions is a complex and impor- an excerpt, or cropping, of the remote sources relative to an
tant process in military operations. One of the most challenging incoming query. The case study is presented in Section VI, and
aspects, is making all necessary information available to the the main experiences drawn from the case study is presented
decision makers. These information fragments will typically in Section VII.
be distributed across different systems.
The paper assumes familiarity with W3C’s Semantic Web
This is particularly the case when the military force con- technology stack, in particular RDF, OWL, and SPARQL.
ducts its operations according to network-based concepts, like Readers not familiar with these technologies are referred to
NATO’s Network Enabled Capability, henceforth NNEC [1]. [2], [3], and [4] for an introduction.
The primary objective when conducting operations according
to this concept, is to support the creation of a high degree of
shared situational awareness among decision makers in order II. C HARACTERISTICS OF THE DOMAIN
to obtain increased mission effectiveness. A prerequisite for
achieving this, is extensive information sharing and a robust The NNEC concept presupposes a network-based environment
scheme for information integration, enabling decision makers in which information about own and enemy units is typically
to retrieve and utilize all relevant information when needed. distributed across several autonomous data sources contributed
So far, the emphasis of the technical work on NNEC has by coalition members. In order to support evacuation planning,
been on how to make information available throughout the these information fragments need to be integrated in order for
environment. However, in order for NNEC to be of use the decision makers to obtain the highest possible degree of sit-
to decision makers, the challenge of establishing a robust uational awareness. This involves tackling some idiosyncratic
scheme for information integration ultimately also needs to challenges:
be addressed. This is the focus of the research reported on in
this paper. • The information systems are in general semantically het-
erogeneous, especially in coalition operations, and cannot
We present an information integration approach that combines be accessed in a coherent and unified way,
query rewriting with data federation, and we study it in relation • the underlying communication network often relies on IP
to an example from military evacuation planning based on live radios, and is hampered by limited bandwith, latency, and
reporting of incidents over IP radio networks. limited range,
STIDS 2013 Proceedings Page 141
2
• the network topology is highly dynamic, meaning that To that end, sources are selected at query time based on the
information systems can appear and disappear at any time, outcome of the reasoning process, by inspecting DNS records
and that are multicasted in the network (cf. section VI.). The
• the shared information is mission-critical, which makes entire federation process can thus be seen as comprised of
it crucial that the integration scheme yields correct and two distinct steps. First, the query is rewritten into a query
exhaustive data. expressed directly in terms of the data according to the domain
model expressed by the ontology. Next, the rewritten query
These characteristics means that we want to define an infor- is decomposed into sub-queries that yield a set of mutually
mation integration approach that: exhaustive partial answers extracted from each of the selected
sources.
A allows a user to access available sources in a unified way,
B utilizes the available bandwith efficiently, particularly by It is not automatically the case, however, that the above men-
restricting the number of HTTP requests to the remote tioned steps are separable. That is, depending on the expressive
sources, power of the ontology language, reasoning may require data-
C allows the relevant sources to be discovered at run-time, access. This is not the case for the class of ontology languages
and that are first-order rewritable (cf. [5], [6]). This notion was first
D guarantees the soundness and completeness of query introduced by Calvanese et al. ([5]) in the context of the class
answering. of ontology languages called description logics. A description
logic L—more generally an ontology language—is first-order
III. O NTOLOGY- BASED DATA ACCESS rewritable if, for every ontology ⌃ expressed in L and a query
Q , Q can be compiled into a first-order query Q⌃ that A)
Based on the desiderata from section II, we decided to use an compiles away all concepts from the ontology ⌃, and B) is
ontology to mitigate heterogeneities and to provide uniform such that given a data repository R, Q⌃ evaluated over R
access to the data. That is, we based our approach on the yields exactly the same result as Q evaluated against R and
paradigm usually called ontology-based data access in which ⌃.
a conceptual model—the ontology—is used to express the First-order rewritable ontology languages is a crucial presuppo-
relationship between the content of the respective sources, and sition behind our approach. It is important precisely because it
to act as a single query interface towards them. ensures that the reasoning process can be decoupled from data
According to the W3C Web Ontology Working Group1 , an access. This has two hugely beneficial consequences: First,
ontology defines a set of concepts or term used to describe the complexity of reasoning remains unaffected as data size
and represent some domain of information in an abstract way increases. In other words the computation time allocated to
that gives a formal semantics to the data in question. More reasoning will not vary with changes in the network topology
specifically, an ontology gives the semantics of the data in the and/or availability of sources—recall that we do not assume
form of a set of logical axioms that explicate the relationship these to stay fixed. Secondly, since reasoning can be decoupled
between classes of data items, and it enables computers to from data access, it does not affect the federation process per
reason over the data as part of the process of answering a se. That is, source selection can be performed independently of
query. One particular form that this process can take, is that the reasoning process, which means, among other things, that
in which a query formulated in terms of the concepts of the a query which is run repeatedly will only have to be rewritten
ontology is successively refined until the query can be executed once.
directly against the data. This is usually referred to as query As always there is a price to pay, though. The property of
rewriting and forms the basis for our approach, as explained first-order rewritability imposes a serious constraint on the
in the next section. expressivity of an ontology language, and, as explained in
Ontology-based data access is useful in all scenarios in which section VII, must be very carefully selected in order to be able
accessing data in a unified and coherent way is difficult. This capture the salient aspects of our case. In particular, it turns
may happen for several reasons. The data sources may have out that none of the standard fragments of the W3C-endorsed
been developed for different purposes by different agencies ontology language OWL will do.
or institutions, may not have a coherent design, and may not As regards HTTP-minimality (by which we mean keeping the
record similar types of information in the same manner. A number of required HTTP requests as low as possible), we
well-designed ontology gives a unified view of the domain in found existing approaches to federated query processing not
terms of the concepts that are of interest to the user. to be well suited. One of the main reasons is that they all rely
on forms of join-order heuristics that tends to multiply the
IV. O UTLINE OF APPROACH number of HTTP requests as a factor of the depth and number
of joins: a standard distributed join algorithm will evaluate a
Our federation engine is designed to be suitable for a dynamic query iteratively one triple at a time, while propagating values
network topology, in accordance with our listed desiderata. in a nested loop join fashion. This multiplies HTTP request in
1 http://www.w3.org/2001/sw/WebOnt/
proportion to the result sets returned by evaluating each join
STIDS 2013 Proceedings Page 142
3
Q ⌃ the federator component which performs service discovery at
run-time (cf. Section VI) to identify live and relevant sources.
Rewrite Relevance here means signature overlap, where a signature
is understood as a set of RDF properties. The extent of the
overlap between the signature of the query and the signature
Q⌃
of a given endpoint determines a SPARQL CONSTRUCT query
which will be routed to that endpoint.
Distribute
The CONSTRUCT queries are designed to adhere to a logical
form which is sufficiently structured to enable us to guarantee
... the soundness and completeness of the query answering pro-
R1 R2 Rn cess wrt. the set of sources R, as explained in more detail in
the next sections. Taken together with the obvious minimality
of our approach wrt. the number of HTTP requests—only a
Crop single request is sent to each source—as well as the relevance-
based per-query discovery of sources, we conclude that our
approach meets all our tentative desiderata A) to D).
Evaluate
Answer
V. P ROPERTIES OF THE CROPPING
Figure 1. System overview
In this section we formally define the notion of the cropping
of a distributed set of sources R relative to a query Q, and we
state its essential properties. We shall assume familiarity with
argument.2 Admittedly, there are several improved versions of SPARQL syntax and semantics (cf. [12]).
this algorithm on offer. The bound join technique implemented Notation. We use Ri , where i is in some index set I, to denote
in FedX [9], for instance, groups several instances of a join RDF graphs—variably referred to as sources, repositories or
argument in a single subquery using the SPARQL UNION endpoints. A SPARQL SELECT query is a pair hP, ~xi, where
construct. This reduces the number of request with a factor P is a SPARQL graph pattern and ~x a vector of elements of
equivalent to the the number of instances in the grouped query variables. Similarly, a CONSTRUCT query is a tuple hT, P i,
(ibid.).3 Yet, experimental evaluation shows that the number where T is a basic graph pattern and P is a union of such.
of HTTP request can still grow quite fast in the number T will be identified with the CONSTRUCT block of the query,
of joins.45 . What is common to all these approaches is that aka. the template, and P with the WHERE block, aka. the query
the number of HTTP request varies in the number of results pattern. We shall allow ourselves the convenience of blurring
returned by the sub-queries. It is a design goal of our approach, the distinction between SPARQL queries on the one hand and
in contrast, to make a factor of the size of the query only. sets and families of triple patterns on the other. Where Q :=
To that end, we designed our federation engine to evaluate hP, ~xi is a SELECT query and G an RDF graph we denote
the query, not against the sources directly, but against an the result of evaluating Q against G as Q(G), and similarly
excerpt, or cropping as we call it, that is pulled from the for CONSTRUCT queries. The proofs of the claims that follow
sources by sending a single HTTP-request to each. Unlike can be found in technical report [13].
traditional warehousing strategies, however, our local copy is As mentioned in the previous section, our approach to federa-
not persisted, but exists only in-memory for the duration of tion is signature-based in the sense that the RDF properties that
the query execution process. It is essentially a snapshot of that are found in a query are used for routing different sub-queries
part of the remote sources which is relevant for answering to different endpoints. This is a common strategy (cf. [8], [9])
the query in question. In realistic cases, the cropping is much for which we claim no originality. Now, given a query pattern
smaller than the total amount of data that it is extracted from P the relevant subset of P in relation to a source Ri is defined
(see Section V-A). as the maximal subset of P whose signature is contained in
An overview of the resulting system, is shown in Figure 1: the the signature of Ri . We shall denote this set as ⇢(P, i).
system takes as input a SPARQL query Q, and a collection Recapitulating briefly, our federation engine is designed to
of aligned ontologies ⌃, which are used by the rewriter to be HTTP-minimal, as well as sound and complete wrt. to
produce the query Q⌃ . This rewritten query is next handed to query answering over the selected sources. A strategy that
2 DARQ [7] and SPLENDID [8] both implement a version of this algorithm.
supports all three is to execute the query against an in-memory
3 Similar techniques are considered in [8] and [10]
representation of the remote sources rather than against the
4 See e.g. the results for FedX on Life Sciences 3 query from the FedBench sources themselves. More specifically our federation engine
suite. routes a single CONSTRUCT query to each of the selected
5 Another notable optimization is the star-shaped pattern technique of [11]. sources—achieving HTTP minimality—whereas the logical
Numbers of requests are however not reported in this study. form of this construct query is defined in such a manner as
STIDS 2013 Proceedings Page 143
4
to guarantee that the answer to the query assembled from the for every ?x 2 dom( j ) \ dom( k ).
selected sources is both correct and complete with respect to
those sources. Here soundness and completeness means that if Our CONSTRUCT queries now become:
R is a set of sources selected for federation, then the answer Definition 3: For a set ofSsources R := {Ri }i2I and a query
that the federator provides to a query Q should be exactly the pattern P : C (P, i) = h f (s(P, i)), f (s(P, i))i where f is
same as the one that would be obtained were Q to be evaluated some separation function for s(P, i).
conventionally over a single repository holding the union of
the data sets in R. To the best of our knowledge, our strategy Example 2: Suppose we have two endpoints JOCWatch and
is currently the only one that guarantees that this is the case. MedWatch,6 and the following conjunctive graph pattern P :
?mission medics:missionType medics:Rescue.
The logical form in question is in turn defined by distinguish- ?mission medics:jocWatchIncident ?incident.
ing between exclusive and non-exclusive triples in a query ?incident jocw:status ?stat.
pattern P . Exclusive triples are those that are satisfied, if at Suppose further that each property prefixed by medics
all, at one endpoint only. Non-exclusive triples, on the other belongs to the signature of MedWatch, that each property
hand, may be satisfied by two or more. Exclusive triples can prefixed by jocw belongs to the signature of JOCWatch, and
safely be grouped together and executed against the source that the jocw:status property belongs to both. The queries
for which it is exclusive in as a single conjunctive pattern. that are routed to the respective endpoints are then:
Non-exclusive triples, however, must be shipped to the remote
MedWatch:
sources as separate UNION clauses. This holds even if a group CONSTRUCT {
of triple patterns are relevant to exactly the same sources since ?_1 medics:missionType medics:Evac.
?_1 medics:jocwIncident ?_2.
an answer to the original query may require joining triples ?_3 jocw:status ?_4. } WHERE {
across these sources. This gives rise to the following definition { ?_1 medics:missionType medics:Evac.
of the set of clauses induced by P and Ri : ?_1 medics:jocwIncident ?_2.}
UNION
Definition 1 (Clause set): For Ri a source and P a query { ?_3 jocw:status ?_4.}}
pattern: s(P, i) := {✏(P, i)} [ {{t} : t 2 ⇢(P, i) \ ✏(P, i)} JOCWatch:
CONSTRUCT {
Here ✏(P, i) denotes the exclusive group of a pattern P relative ?_1 jocw:instigator ?_2.
?_3 jocw:status ?_4.} WHERE {
to Ri . { ?_1 jocw:instigator ?_2.}
UNION
Now, the basic idea behind our federation strategy is to use the { ?_3 jocw:status ?_4.}}
set of clauses induced by P and Ri to define a CONSTRUCT
query that extrapolates the part of Ri that is relevant for The cropping may now be defined as follows:
answering P . The most straightforward way to do that may
seem to be to use the clause set itself as a query pattern, whilst Definition 4 (Cropping):
S Put Q := hP, ~xi and R = {Ri }i2I .
using the set-theoretic union of its elements as a template. Call Then AQR := i2I C (P, i)(Ri ).
this the naive strategy. Interestingly, the naive strategy, whilst We now have:
complete, is not sound. Consider the following rather abstract
example: Theorem 1 (Soundness/Completeness):
S Let R be any set of
sources, then Q( R) = Q(AQR ) for any SELECT query Q.
Example 1: Let G be the RDF graph containing only the
two triples s := (c1, p, d1) and t := (c2, q, d2), and assume Note that here Q is the SELECT query that is being posed
a clause-set {{(?s, p, ?o)}, {(?s, q, ?s)}}. The corresponding to the system, whereas AQR , i.e. the cropping, is the result
naive CONSTRUCT query is: of assembling the results of the CONSTRUCT queries that are
CONSTRUCT {?s q ?o. ?s p ?o.} required for providing an excerpt guaranteed to answer it.
WHERE {{?s p ?o} UNION {?s q ?o}}
As regards time complexity, since every CONSTRUCT query
Executing this query against G will produce a graph containing C (P, i) is in union normal form, Corollary 1 of [12] im-
the triple (c1, q, d1). mediately entails that the cropping can be built efficiently.
The example shows that the naive strategy may create bindings In our actual implementation, the CONSTRUCT queries that
in the resulting graph that do not exist in the graph that is are allocated to the respective endpoints are, moreover, all
queried. To counteract this effect it is necessary to standardize executed in parallel, so time is not a precarious measure.
apart the elements of the clause-sets before using taking the
union and using it as a CONSTRUCT template. To this end we
introduce the notion of a separation function: A. Restricting the size of the cropping
Definition 2 (Separation function): Let S := {c1 , . . . , cn } be Although time is not a precarious measure, the size of result
a clause set, and let i be a uniform substitution of variables sets quickly becomes an issue. The CONSTRUCT queries that
for variables in ci . A separation function f for S is a function are passed around to the remote endpoints, if not constrained,
s. t. 1) f (S) = { 1 (c1 ), . . . , n (cn )}, and 2) j (?x) 6= k (?x) 6 These systems are described in VI
STIDS 2013 Proceedings Page 144
5
may well distribute a triple pattern (?a, rdf : type, ?b) to all VI. C ASE
remote endpoints, in effect requesting huge chunks of the data
contained in each. The case we use to exemplify our approach is based upon the
following scenario: A military analyst is monitoring planned
Now, there is no need in our approach for join-ordering
medical evacuation flight missions, and is on the lookout for
heuristics in the conventional sense, since, per the approach,
missions that might be threatened by enemy activity. If this
joins are either executed remotely, or executed locally by
is the case, she is also interested in finding friendly units
a standard query processing engine after the cropping has
able to counter the particular threat. More specifically, the
been built. Rather, what we do, is to build the cropping
information requirement is the following: Find all medical
incrementally by assessing the relative selectivity of triple
evacuation missions and friendly units such that a) the mission
patterns and processing the most selective ones first. We can
can be classified as being threatened; and b) that the friendly
only describe this procedure in general outline here:
unit can handle the specific type of threat that the enemy poses.
The selectivity of a triple pattern may be assessed along several
Normally, in order to obtain an answer to this information
dimensions. For instance, studies show that a triple pattern with
requirement, the analyst has to keep an eye on several systems,
a literal in object position will usually be more selective than
as information about evacuation flights and information about
one with a URL in the same position [14]. Moreover, triple
enemy activity are usually not kept in the same system. With
patterns can be ordered in a plausible sequence of decreasing
the aid of a system as outlined so far in this paper, however,
selectivity based on the distribution, and position, of variables
the analyst can pose a query formulating what she is looking
in the pattern (? denotes a variable):
for and let the information integration system take care of the
(s, p, o) (s, ?, o) (?, p, o) (s, p, ?) (?, ?, o) rest.
(s, ?, ?) (?, p, ?) (?, ?, ?)
To evaluate the approach and the case outlined above, we
The entire set of heuristic rules that we have used in our conducted an experiment at NATO CWIX 20137 set as close
solution can be found in [14]. as possible to the dynamic and multinational environment of
NNEC. The experiment involved three operational information
Now, the idea is to build the cropping in layers by employing systems: 1) JOCWatch, information on incidents of relevance
the following three-step procedure: 1) construct the graph to the command in an event log, 2) MedWatch, a system for
corresponding to the most selective patterns pertaining to each medical mission tracking designed to support the planning,
endpoint 2) extract variable bindings from the cropping so far, logging and monitoring of medical evacuation missions, and
and 3) pass them on to the next iteration as constraints for the 3) Track Source, a unit tracking service providing times-
next round of queries. tamped geopositional information regarding friendly units in
Step 2, the extraction of variable bindings, is realized by re- the field. The information in MedWatch and JOCWatch were
using the triple patterns as SELECT queries that are evaluated made available through SPARQL endpoints by using D2R [15],
against the cropping as it exists so far, whereas the propagation while the Track Source service had a native SPARQL interface.
of values from one layer to the next is realized with the In addition, all sources were supplied with a service description
VALUES feature of SPARQL 1.1, which allows a set of according to the SPARQL 1.1 specification, and each source
bindings to be shipped with a query in order to constrain the made available its ontology at an URL described in the service
answers. description.
Our procedure is designed to treat each exclusive group as Our prototype system performed service discovery using
an atomic unit, since exclusive groups are likely to be more
• mDNS8 for broadcasting and discovering the presence of
selective as a set. For the same reason, they are given maximum
information sources,
priority, That is, the first layer of the incremental construction
• DNS-SD9 for high-level description of the source in
of the cropping consists of the result of executing the exclusive
terms of pointers to query endpoint location and content
groups as CONSTRUCT queries. The subsequent layers are then
description location, and
constructed from the non-exclusive patterns by rating them
• the SPARQL 1.1 service description and VoID10 vocabu-
according to the heuristic criteria.
laries for describing source content.
This procedure is sufficient to ensure that very unconstrained
patterns such as (?s, ?p, ?o) will be processed late, when This approach addressed the NNEC needs as outlined in
bindings are available for some of its variables. It can also section II, and had the advantage that it was independent
be tuned to give low priority to predicates from existing RDF of a central registry, thus eliminating the issue of network
vocabularies that are known to have a low selectivity rate,
7 Coalition Warrior Interoperability eXploration, eXperimentation and eX-
such as e.g. rdf : type or dcterms : title. The procedure
amination, eXercise – an annual NATO event aimed at improving alliance-wide
preserves the soundness and completeness of the cropping wrt. interoperability
the underlying sources, and, although the number of HTTP 8 http://www.multicastdns.org/
request is no longer minimal it is constant in a small factor of 9 http://www.dns-sd.org/
the number of triple patterns in the original query. 10 http://www.w3.org/TR/void/
STIDS 2013 Proceedings Page 145
6
JOCWatch Concept Definition
ThreatenedMission MedWatch missions that are related to a
ThreateningIncident
jw:Hostile ThreateningIncident All JOCWatch incidents that are related to a
jw:instigator rdf:type ThreateningEvent
rdf:type jw:incident ThreateningEvent All JOCWatch events that are both a
MilitaryOperation (from the JOCWatch
ontology) and a HostileEvent
jw:SAFIRE HostileEvent All events that has a HostileInstigator
HostileInstigator All event participants that are classified as being
hostile.
med:incident Relation Definition
MedWatch canHandle A relation between a military unit type and the
type of events those units types are equipped to
rdf:type useront:canHandle handle.
med:Mission
hasEvent If a mission involves an incident, and there exists
an event that belongs to the same incident (inverse
property), then the event is also related to the
mission
TrackSource
Table I. R ELEVANT DEFINITIONS IN THE USER ONTOLOGY
nf:unit rdf:type